Lightweight alternatives to Google Analytics
More and more web-site owners are concerned about the "all-seeing Google" tracking users as they browse around the web. Google Analytics (GA) is a full-featured web-analytics system that is available for free and, despite the privacy concerns, has become the de facto analytics tool for small and large web sites alike. However, in recent years, a growing number of alternatives are helping break Google's dominance. In this article we'll look at two of the lightweight open-source options, namely GoatCounter and Plausible. In a subsequent article, we'll look at a few of the larger tools.
GA is by far the biggest player here: BuiltWith shows that around 86% of the top 100,000 web sites use it. This figure goes down to 64% for the top one-million web sites. These figures have grown steadily for the past 15 years, since Google acquired Urchin and rebranded it as Google Analytics. In addition to privacy concerns, GA is more complex and feature-heavy than some web-site owners need; many of them just want to see how much traffic is going to the pages on their site, and where that traffic is coming from. So it's not surprising that a number of simpler, more open tools have taken off in the past few years.
It should be noted that LWN does use GA, though we are evaluating other choices. Those who turn off ads in their preferences will not be served with the GA code, however.
What Google tracks, and why it's concerning
If asked what information Google tracks, a cynic might say, "everything". Part of the problem is that this isn't too far from the truth: Google tracks and stores a huge amount of information about users.
A 2018 paper [PDF] by Douglas Schmidt highlights the extent of Google's tracking, with location tracking on Android devices as one example:
Both Android and Chrome send data to Google even in the absence of any user interaction. Our experiments show that a dormant, stationary Android phone (with Chrome active in the background) communicated location information to Google 340 times during a 24-hour period, or at an average of 14 data communications per hour.
The paper distinguishes between "active" and "passive" tracking. Active tracking is when the user directly uses or logs into a Google service, such as performing a search, logging into Gmail, and so on. In addition to recording all of a user's search keywords, Google passively tracks users as they visit web sites that use GA and other Google publisher tools. Schmidt found that in an example "day in the life" scenario, "Google collected or inferred over two-thirds of the information through passive means".
Schmidt's paper details how GA cookie tracking works, noting the difference between "1st-party" and "3rd-party" cookies — the latter of which track users and their ad clicks across multiple sites:
While a GA cookie is specific to the particular domain of the website that user visits (called a "1st-party cookie"), a DoubleClick cookie is typically associated with a common 3rd-party domain (such as doubleclick.net). Google uses such cookies to track user interaction across multiple 3rd-party websites.
When a user interacts with an advertisement on a website, DoubleClick's conversion tracking tools (e.g. Floodlight) places cookies on a user’s computer and generates a unique client ID. Thereafter, if the user visits the advertised website, the stored cookie information gets accessed by the DoubleClick server, thereby recording the visit as a valid conversion.
Because such a large percentage of web sites use Google advertising products as well as GA, this has the effect that the company knows a large fraction of users' browsing history across many web sites, both popular sites and smaller "mom and pop" sites. In short, Google knows a lot about what you like, where you are, and what you buy.
Google does provide ways to turn off features like targeted advertising and location tracking, as well as to delete the personalized profile associated with an account. However, these privacy controls are almost entirely opt-in: the tracking is enabled by default, and most users either don't know about the controls or just never bother to use them.
Of course, just switching away from GA won't eliminate all of these privacy issues (for example, it will do nothing to stop Android location tracking or search tracking), but it's one way to reduce the huge amount of data Google collects. In addition, for site owners that use a GA alternative, Google does not get a behind-the-scenes look at the site's traffic patterns — data which it could conceivably use in the future to build a competing tool.
LWN readers likely skew toward the privacy-conscious end of the spectrum: using Firefox instead of Google Chrome, turning on ad blockers, and so on. However, the users of the web sites they build may not be so privacy-conscious. For web-site developers, the analytics tools they choose can help respect their users' privacy and avoid Google knowing quite so much about their users' browsing patterns.
GoatCounter
GoatCounter is one of the more recent web-analytics tools, launched in August 2019. Created by Martin Tournoij, it has more of a "made by a single developer" feel than other tools; it's a little less slick-looking than some, but it is also developer-friendly and simple to set up.
The tool supports all of the basic analytics: page views and visits by URL, browser and operating system statistics, device screen sizes, locations, and referrer information. By default GoatCounter shows the last seven days with counts broken down by hour, but site owners can adjust the date span with simple controls.
GoatCounter has an unusual pricing model, with its source code licensed under the copyleft European Union Public License (EUPL). Companies can host the software themselves, or use GoatCounter's hosted version for a small fee (though the hosted version doesn't cost anything for "personal" projects). Tournoij has a lengthy article discussing why he chose the EUPL, noting:
I still don't really care what people do with my code, but I do care if my ability to make a living would be unreasonably impeded. Taking my MIT code and working full-time on enhancements that aren't sent back to me means my competitor has double the amount of people working on it: me (for free, from their perspective), and them. They will always have an advantage over me.
GoatCounter is written in Go, and uses vanilla JavaScript in its UI for some lightweight interactivity. JavaScript frameworks often get in the way of web accessibility, and GoatCounter's prioritization of accessibility (mentioned on its home page) struck a chord with "ctoth", who thanked Tournoij on Hacker News:
First time I've ever seen a comment about accessibility on the homepage of a mainstream product like this. As a blind developer this was just awesome, made me really feel like somebody out there is listening. Thank you for making this.
In addition to counting page views, GoatCounter tracks sessions using a hash of the browser's user agent and IP address to identify the client without storing any personal information. The salt used to generate these hashes is rotated every 4 hours with a sliding window. Tournoij has a detailed write-up about the technical aspects of session tracking, including a comparison with other solutions that have similar aims.
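To make the idea concrete, here is a minimal sketch of salted session hashing with rotation. It is not GoatCounter's actual implementation: the names `SaltStore`, `sessionHash`, and `candidateHashes`, the choice of SHA-256, and the in-memory salt store are all assumptions for illustration. Keeping the previous salt alongside the current one is one plausible reading of the "sliding window", so that a session started just before a rotation can still be matched.

```javascript
const crypto = require('node:crypto');

const ROTATE_MS = 4 * 60 * 60 * 1000; // four hours, as described in the article

// Hypothetical salt store: the current salt plus the one before it
class SaltStore {
  constructor(now = Date.now()) {
    this.current = { salt: crypto.randomBytes(16).toString('hex'), created: now };
    this.previous = null;
  }

  // Replace the current salt once it is older than ROTATE_MS
  rotateIfNeeded(now = Date.now()) {
    if (now - this.current.created >= ROTATE_MS) {
      this.previous = this.current;
      this.current = { salt: crypto.randomBytes(16).toString('hex'), created: now };
    }
  }
}

// Hash salt, user agent, and IP together; only the digest would be stored
function sessionHash(salt, userAgent, ip) {
  return crypto.createHash('sha256')
    .update(salt).update('\0')
    .update(userAgent).update('\0')
    .update(ip)
    .digest('hex');
}

// Hashes to look up for a request: current salt first, then the previous
// salt if it is still within two rotation periods
function candidateHashes(store, userAgent, ip, now = Date.now()) {
  store.rotateIfNeeded(now);
  const salts = [store.current.salt];
  if (store.previous && now - store.previous.created < 2 * ROTATE_MS) {
    salts.push(store.previous.salt);
  }
  return salts.map((salt) => sessionHash(salt, userAgent, ip));
}
```

Because old salts are discarded, a stored hash can no longer be linked back to a user agent and IP address once its salt has aged out, which is the property the article describes.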
For web-site owners who prefer to avoid JavaScript or who want analytics from users with JavaScript disabled, GoatCounter supports a non-JavaScript tracking scheme. It uses a 1x1 transparent GIF image in an "<img>" tag on the pages to be counted, though this approach will not record the referrer or screen size.
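A counting endpoint of this kind can be very small. The sketch below is a generic illustration rather than GoatCounter's code: the `/count` path, the `p` query parameter, and the in-memory `counts` map are assumptions made for the example.

```javascript
const http = require('node:http');

// The widely used 1x1 transparent GIF, base64-encoded
const PIXEL = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

const counts = new Map(); // hypothetical in-memory store: page path -> hits

// Record one hit for the page named in the `p` query parameter
function recordHit(requestUrl) {
  const url = new URL(requestUrl, 'http://localhost');
  const page = url.searchParams.get('p') || '/';
  counts.set(page, (counts.get(page) || 0) + 1);
  return page;
}

function handler(req, res) {
  if (!req.url.startsWith('/count')) {
    res.writeHead(404);
    res.end();
    return;
  }
  recordHit(req.url);
  res.writeHead(200, {
    'Content-Type': 'image/gif',
    'Content-Length': PIXEL.length,
    'Cache-Control': 'no-store', // so repeat views are not served from cache
  });
  res.end(PIXEL);
}

// Not started automatically; call start(8080) to try it locally
function start(port) {
  return http.createServer(handler).listen(port);
}
```

A page would then embed something like `<img src="http://localhost:8080/count?p=/some-page" alt="">`. The image request's `Referer` header names the page containing the image, not the site the visitor came from, and no script runs to measure the screen, which is why those two values are unavailable with this approach.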
The hosted version of GoatCounter is easy to get going: it took about five minutes to set up an account and add the one line of JavaScript to my web site. Analytics data started showing up within a few seconds. Even with the hosted version, the site owner fully owns the data, and can export the full dump or delete their account at any time.
The self-hosted version is also straightforward to set up using the Linux binaries or by building from source — it took me less than ten minutes to build from source and set it up locally with the default SQLite database configuration. In contrast to Plausible (discussed below), it was much lighter to install, didn't download anything, and started up almost instantly.
Plausible
Plausible is another relatively new analytics tool that was launched in early 2019. Soon after launching, it switched to open source, with the code licensed under the permissive MIT license. The company's business model is to charge for the hosting, with pricing aimed at small businesses. In addition to making its source code available, Plausible is one of an increasing number of companies that have a publicly visible roadmap for better transparency. It also posts informational content for potential customers on its blog.
Plausible is unique from a technology perspective, with its server code written in Elixir, which is a functional programming language that runs on the Erlang virtual machine. Its frontend UI uses a small amount of vanilla JavaScript for the interactive parts, rather than a rendering framework like React. It also boasts one of the smallest analytics scripts, with plausible.js weighing in at 781 bytes (1.2KB uncompressed) at the time of this writing. GA's analytics.js, by comparison, is almost 18KB (46KB uncompressed), while GoatCounter's count.js is 2.3KB (6.3KB uncompressed). That size can make a meaningful difference since the scripts are loaded for each page on the site.
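Readers who want to check figures like these for their own scripts can measure both sizes directly. The following is a small sketch that assumes gzip is the compression in question (the article does not say which was used); `scriptSizes` and the stand-in `sample` script are made up for the example.

```javascript
const zlib = require('node:zlib');

// Report the raw and gzip-compressed sizes (in bytes) of a script's source
function scriptSizes(source) {
  const raw = Buffer.from(source, 'utf8');
  return { raw: raw.length, gzip: zlib.gzipSync(raw, { level: 9 }).length };
}

// Stand-in script text; a real check would read the file from disk instead
const sample =
  'window.track = function (p) { navigator.sendBeacon("/count", p); };\n'.repeat(20);
const sizes = scriptSizes(sample);
console.log(`${sizes.raw} bytes raw, ${sizes.gzip} bytes gzipped`);
```

The compressed figure is the one that matters for transfer time, while the uncompressed figure is what the browser has to parse.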
In terms of user interface, Plausible is definitely more polished than GoatCounter. It is fairly minimalist, though, perhaps even more so than GoatCounter, providing total visitor counts, page-view counts per path, referrer information, map location, and devices (broken down by screen size, browser, and operating system). The tool also provides a "bounce rate" metric, though the exact definition is unclear.
Plausible's home page states that it provides "100% data ownership", and it is possible to export the CSV data for a single chart (as well as delete a Plausible.io account). However, the data dump is significantly less useful than GoatCounter's full data dump, which includes detailed information for every event.
Self-hosting Plausible is possible (even plausible), though as founder Uku Taht points out in the announcement of switching to open source:
It's worth noting that for now, there's no explicit support for self-hosting Plausible. The project is still evolving quickly and maintaining a self-hosted solution would slow product development down considerably. I would love to offer a self-hosted solution in the future once the product and the business are more stable.
That said, just a few weeks ago, Plausible added a document that describes an experimental way to self-host the system using Docker. Following those recommendations, I tried to use docker-compose to get it running locally. It was a little disconcerting how many Docker and npm packages it downloaded during the minutes-long installation process, and even when it was done, there was a hard-to-comprehend error with a PostgreSQL migration which prevented it from starting — the "experimental" label definitely fits.
Proprietary options, briefly
There are also a couple of lightweight proprietary tools with a focus on privacy worth mentioning. Obviously, these don't have the advantages of open development or self-hosting, but still provide a low-cost way out of Google's data-collection net.
One is the minimalist Simple Analytics product, which is a cloud-based tool created by solo developer Adriaan van Rossum; it has a clean-looking interface with only a few key metrics, similar to Plausible. Another is Fathom, which was open source initially, but the current version is proprietary (although the company hopes to start maintaining the open-source code base again in the future).
Summary
The last few years have seen a number of good alternatives to Google Analytics, particularly for those who only need a few basic features. Many of the recent alternatives are both open source and privacy-conscious, which means there are fewer reasons for projects and businesses to continue using proprietary analytics systems.
For site owners who just need basic traffic numbers, GoatCounter and Plausible both seem like excellent options. Those who like more visual polish and documentation might prefer Plausible; those who value a more developer-friendly tool with easy self-hosting will probably prefer GoatCounter. We will soon be publishing a second article that looks at some heavier-weight GA alternatives, as well as tools that provide analytics from web-server logs.
Index entries for this article | |
---|---|
GuestArticles | Hoyt, Ben |
Posted Jun 17, 2020 19:09 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link]
Posted Jun 17, 2020 21:13 UTC (Wed)
by dw (subscriber, #12017)
[Link] (4 responses)
Posted Jun 17, 2020 21:20 UTC (Wed)
by jake (editor, #205)
[Link]
Stay tuned :)
jake
Posted Jun 17, 2020 21:32 UTC (Wed)
by dw (subscriber, #12017)
[Link]
For completeness, the client juju is simply:
This also creates an opportunity for logging the hit data, so you get the best of both worlds: the hassle-free convenience of GA with all the raw GA data preserved should you wish to migrate to another solution in future.
Finally, since the entirety of the data received by Google is controlled, and if you're sufficiently paranoid, it's even possible to anonymize the domain being tracked.
Posted Jun 18, 2020 2:58 UTC (Thu)
by anarcat (subscriber, #66354)
[Link] (1 responses)
I think the key part of the title you might have missed is "Lightweight". I wouldn't qualify Matomo as lightweight if only because it's primarily designed as a (relatively fat) JavaScript client (~200KB) that talks to a fairly large PHP web app which does a ton of stuff. The tools evaluated here seem much more lightweight. :)
Posted Jun 18, 2020 4:41 UTC (Thu)
by dw (subscriber, #12017)
[Link]
Posted Jun 18, 2020 2:07 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (13 responses)
Posted Jun 18, 2020 2:55 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Having good alternatives to ever-present GA is a good thing.
Posted Jun 18, 2020 15:59 UTC (Thu)
by sarunas (subscriber, #33117)
[Link] (1 responses)
Posted Jun 18, 2020 17:50 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 18, 2020 3:02 UTC (Thu)
by anarcat (subscriber, #66354)
[Link] (7 responses)
I find that's a reasonable tradeoff: yes, it's better if you don't store any personally identifiable information at all. But once you start doing that, you realize it's actually incredibly difficult. Your uplink might keep track of those streams. Those IP addresses and other PII do land in your computer memory whether you like it or not, and that means it can end up on disk (thanks to swap).
So I prefer to assume that there is *some* leakage, make *some* use of it, and limit it over time. Because it *does* have some use. Maybe it's just vanity, but I do like to get feedback on which of the articles I am writing are valuable to my readers, and while comments and direct feedback are a measure of that, there are way too few of those to provide a meaningful measure. Visits, on the other hand, are a direct metric I can use.
And that's without starting on abuse control, for which IP address tracking is kind of invaluable. For example, if you have a site requiring a login and you are not rate limiting password-guessing attempts, you are doing it wrong. And I don't know how you would do that *other* than by logging *some* IP addresses...
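As a rough sketch of the kind of rate limiting described here, the following keeps login-attempt timestamps per client for a short window only, keyed by a salted hash rather than the raw address. The window length, attempt limit, and function names are assumptions for illustration, and the address is still seen in memory, which is the trade-off the comment is pointing at.

```javascript
const crypto = require('node:crypto');

const WINDOW_MS = 15 * 60 * 1000;    // assumed window: 15 minutes
const MAX_ATTEMPTS = 5;              // assumed limit per window
const SALT = crypto.randomBytes(16); // per-process salt; keys do not survive a restart

const attempts = new Map(); // hashed IP -> timestamps of recent attempts

// Derive the map key so the raw address is never stored
function keyFor(ip) {
  return crypto.createHmac('sha256', SALT).update(ip).digest('hex');
}

// Returns true if this login attempt may proceed, false if it is rate-limited
function allowAttempt(ip, now = Date.now()) {
  const key = keyFor(ip);
  // Drop timestamps that have fallen out of the window
  const recent = (attempts.get(key) || []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_ATTEMPTS) {
    attempts.set(key, recent);
    return false;
  }
  recent.push(now);
  attempts.set(key, recent);
  return true;
}
```

A long-running service would also want to prune keys whose timestamp lists have become empty, which this sketch leaves out.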
Posted Jun 18, 2020 4:21 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (5 responses)
Visit count is not a direct indicator of value. I visit plenty of websites and blog posts that I don't find useful after the fact. There are various ways to artificially inflate visit count without providing value or providing negative value (such as clickbait and/or false headlines on social media).
Password based logins should be replaced with cryptographic logins (Webauthn, TLS client certs or Tor onion client auth for eg), which presumably solves the brute-force issue too.
Posted Jun 18, 2020 16:23 UTC (Thu)
by anarcat (subscriber, #66354)
[Link] (4 responses)
I agree! I don't find my solution to be particularly interesting, especially from a privacy perspective. It was just *simple*... :)
That said, that very requirement is why I find projects like GoatCounter interesting: it makes a special effort at counting events *without* storing PII! It has a pretty elegant design from that perspective. So I would definitely consider it as an alternative to goaccess...
> Visit count is not a direct indicator of value. I visit plenty of websites and blog posts that I don't find useful after the fact. There are various ways to artificially inflate visit count without providing value or providing negative value (such as clickbait and or false headlines on social media).
Granted. I would consider this an attack vector like any other though. It doesn't mean there is *no* value in visitor count. Sure, you could count only "hits" and decide what's useful and what's not. Or you could just pretend all this stuff doesn't matter. But there are plenty of users who like to see those stats, and I believe in "harm reduction": I would rather provide safer tools by default than pretend that requirement does not exist...
> Password based logins should be replaced with cryptographic logins (Webauthn, TLS client certs or Tor onion client auth for eg), which presumably solves the brute-force issue too.
Ha! I would like to believe that too. But to break it apart: 1) WebAuthn is definitely useful right now, but it's generally used for 2FA, not as the primary factor. After all, you don't really want people to just log in with a "key" (something that you own), because once that is stolen you are totally screwed (you also need "something you know"). 2) TLS client certs would be great if clients implemented them in any meaningful way. But unfortunately, they are going more and more towards the trashbin. And they definitely have their own user-tracking concerns, at least in the current implementation, maybe even worse than regular cookies. 3) Tor is not ubiquitous (yet), so I wouldn't assume it's a good replacement for password authentication just yet.
I hear you: passwords suck. But they're still a thing and hard to get rid of! And even if you got rid of them, I would still argue for rate-limiting authentication attempts, even with public key authentication.
Posted Jun 19, 2020 1:17 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (3 responses)
Posted Jun 21, 2020 18:21 UTC (Sun)
by anarcat (subscriber, #66354)
[Link] (2 responses)
But when you're a client, you have different tradeoffs. You don't want to send that certificate everywhere all the time, because it acts as a unique token that can be used to track you across websites. Firefox has rudimentary protection against this: when I go on a site that wants access to my TLS client cert, it first prompts whether I want to actually authenticate with my cert. But that UI is terrible: it pops open all the time, at random moments, and doesn't remember the "yes I trust this site" checkbox, which seems to do nothing.
It's also not clear to me whether the server actually knows about my client cert at this point or whether the dialog is actually effective in not disclosing my identity. And that's just on firefox, which has some support for TLS client certs. I suspect the situation could be catastrophically worse on other servers.
I will also note that SSH does not have those protections *at all*. It will happily send *all* the public keys it knows about when trying to login to a random server, which is kind of disturbing when you think about it:
$ ssh -v lwn.net
No confirmation prompt whatsoever here. And they would be annoying too... I guess SSH expects you to divulge your public key identity when you connect to a server... but in the wild wild web, it seems like a delicate thing to do, so I wonder if a good usability trade-off is possible at all here.
Posted Jun 22, 2020 3:29 UTC (Mon)
by pabs (subscriber, #43278)
[Link] (1 responses)
I guess if web browsers wanted to they could easily mitigate this by pinning each cert to the domain it was created for and only ever sending it to that domain.
Also, I wonder if the client cert is in the clear in the TLS handshake, or if Encrypted Client Hello (new name for ESNI) is needed to hide them.
Posted Jun 22, 2020 14:58 UTC (Mon)
by anarcat (subscriber, #66354)
[Link]
I suspect near-absolutely no one does this...
> I guess if web browsers wanted to they could easily mitigate this by pinning each cert to the domain it was created for and only ever sending it to that domain.
Assuming they cared about client certs at all...
> Also, I wonder if the client cert is in the clear in the TLS handshake, or if Encrypted Client Hello (new name for ESNI) is needed to hide them.
I would assume the worst. ;)
Posted Jul 11, 2020 21:10 UTC (Sat)
by anarcat (subscriber, #66354)
[Link]
An interesting addition to this...
It turns out that goaccess *does* record the IP addresses of visitors after processing, when the `VISITORS` panel is enabled. It shows the per-IP top visitors and therefore does keep PII, contrary to what I first believed. I tried to disable that panel (which I do not find very interesting anyways) but then it breaks visitor tracking in the rest of the reports, so that's definitely a problem.
I am, again, really interested in trying out goatcounter, then. :)
Posted Jun 18, 2020 20:43 UTC (Thu)
by ddevault (subscriber, #99589)
[Link]
Posted Jun 19, 2020 11:29 UTC (Fri)
by Lennie (subscriber, #49641)
[Link]
We need to keep logs for the first things I mentioned to deal with abuse and spam.
The second part is very important to know how to improve things and what does and doesn't work. Now I would prefer more tools which throw more away and just keep the statistics. But more importantly, as long as it's just us running the website who have the data and not some company like Google or Facebook tracking you on multiple sites, that's a very large privacy difference.
Posted Jun 30, 2020 12:08 UTC (Tue)
by ihucos (guest, #127147)
[Link]
For me one interesting aspect for better privacy is how unique views are tracked. GoatCounter seems to do something like sessions "the right way", or is at least mitigating its effects on privacy. Plausible, on the other hand (and many others), uses the hashed user agent and IP as a session ID and stores that permanently. In my opinion that is even worse than cookies, which are more transparent, more easily controlled by users, and usually just some random ID that gets forgotten.
Simple Analytics (not to be confused with "my" product, which has a similar name; they were first) does something quite interesting, which is to simply inspect the `document.referrer`. If it's not the site being tracked, it must be a new visitor. "My" product uses the HTTP cache to ensure each user is only counted once a day, but additionally relies on `sessionStorage` for more accuracy.
From my naive understanding of the GDPR, you cannot have any session IDs (so also no fingerprinting) if you want to avoid consent banners. Which providers can be used without GDPR consent banners is, in my opinion, also an interesting but nebulous and difficult topic.
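The referrer check described in this comment can be sketched in a few lines. This is an interpretation of the comment, not Simple Analytics' code; `isNewVisit` is a made-up name, and treating an empty or unparseable referrer as a new visit is an assumption.

```javascript
// Treat a page view as a new visit when the referrer is empty or points
// at a different host than the site being tracked
function isNewVisit(referrer, siteHost) {
  if (!referrer) return true; // direct visit, or the referrer was stripped
  try {
    return new URL(referrer).hostname !== siteHost;
  } catch {
    return true; // unparseable referrer: count it as a new visit
  }
}

// In a browser this would be called as:
//   isNewVisit(document.referrer, location.hostname)
```

No identifier is stored anywhere with this approach; the cost is that someone who leaves and comes back through an external link is counted again.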
I find the absence of Matomo inexplicable, it's by far the most feature-complete (and feature-comparable) alternative to Google Analytics around
Privacy-preserving Google Analytics
It's worth mentioning the possibility of removing some of the sting from Google Analytics using the measurement protocol and a local copy of analytics.js. You host a proxy script that forwards the hit on to GA, after making any desirable privacy-preserving changes, such as lopping off some of the IP address (rather than rely on the equivalent Google setting). On the client, configuring analytics.js with a custom sendHitTask delivers data to the script.
    ga('create', 'UA-XXXXXXX-1', 'auto');
    ga(function(tracker) {
        tracker.set('sendHitTask', function(model) {
            var xhr = new XMLHttpRequest();
            xhr.open('POST', '/wrapper-script');
            xhr.send(model.get('hitPayload'));
        });
    });
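The server side of this arrangement, the script behind `/wrapper-script`, is not shown in the comment. A minimal sketch of what it might do follows; the function names are hypothetical, and it assumes the Universal Analytics Measurement Protocol's `/collect` endpoint with its `uip` (IP override) and `aip` (anonymize IP) parameters, plus Node 18 or later for the global `fetch`.

```javascript
// Zero the last octet of an IPv4 address; any other format is dropped entirely
function truncateIp(ip) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.\d{1,3}$/.exec(ip);
  return m ? `${m[1]}.${m[2]}.${m[3]}.0` : null;
}

// Build the payload to forward from the client's hitPayload and its IP address
function rewriteHit(hitPayload, clientIp) {
  const params = new URLSearchParams(hitPayload);
  params.delete('uip');            // never trust a client-supplied override
  const ip = truncateIp(clientIp);
  if (ip) params.set('uip', ip);   // pass on only the truncated address
  params.set('aip', '1');          // also ask GA to anonymize on its side
  return params.toString();
}

// Forward the rewritten hit to GA (assumes Node 18+ for the global fetch)
async function forward(hitPayload, clientIp) {
  return fetch('https://www.google-analytics.com/collect', {
    method: 'POST',
    body: rewriteHit(hitPayload, clientIp),
  });
}
```

Since every hit passes through `rewriteHit`, that function is also the natural place to log the raw payload locally or to rewrite other fields before they reach Google.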
[...]
debug1: Next authentication method: publickey
debug1: Offering public key: cardno:N RSA SHA256:XXXX agent
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Offering public key: rsa w/o comment RSA SHA256:XXXX agent
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
[...]