
Lightweight alternatives to Google Analytics

June 17, 2020

This article was contributed by Ben Hoyt

More and more web-site owners are concerned about the "all-seeing Google" tracking users as they browse around the web. Google Analytics (GA) is a full-featured web-analytics system that is available for free and, despite the privacy concerns, has become the de facto analytics tool for small and large web sites alike. However, in recent years, a growing number of alternatives are helping break Google's dominance. In this article we'll look at two of the lightweight open-source options, namely GoatCounter and Plausible. In a subsequent article, we'll look at a few of the larger tools.

GA is by far the biggest player here: BuiltWith shows that around 86% of the top 100,000 web sites use it. This figure goes down to 64% for the top one-million web sites. These figures have grown steadily for the past 15 years, since Google acquired Urchin and rebranded it as Google Analytics. In addition to privacy concerns, GA is more complex and feature-heavy than some web-site owners need; many of them just want to see how much traffic is going to the pages on their site, and where that traffic is coming from. So it's not surprising that a number of simpler, more open tools have taken off in the past few years.

It should be noted that LWN does use GA, though we are evaluating other choices. Those who turn off ads in their preferences will not be served with the GA code, however.

What Google tracks, and why it's concerning

If asked what information Google tracks, a cynic might say, "everything". Part of the problem is that this isn't too far from the truth: Google tracks and stores a huge amount of information about users.

A 2018 paper [PDF] by Douglas Schmidt highlights the extent of Google's tracking, with location tracking on Android devices as one example:

Both Android and Chrome send data to Google even in the absence of any user interaction. Our experiments show that a dormant, stationary Android phone (with Chrome active in the background) communicated location information to Google 340 times during a 24-hour period, or at an average of 14 data communications per hour.

The paper distinguishes between "active" and "passive" tracking. Active tracking is when the user directly uses or logs into a Google service, such as performing a search, logging into Gmail, and so on. In addition to recording all of a user's search keywords, Google passively tracks users as they visit web sites that use GA and other Google publisher tools. Schmidt found that in an example "day in the life" scenario, "Google collected or inferred over two-thirds of the information through passive means".

Schmidt's paper details how GA cookie tracking works, noting the difference between "1st-party" and "3rd-party" cookies — the latter of which track users and their ad clicks across multiple sites:

While a GA cookie is specific to the particular domain of the website that user visits (called a "1st-party cookie"), a DoubleClick cookie is typically associated with a common 3rd-party domain (such as doubleclick.net). Google uses such cookies to track user interaction across multiple 3rd-party websites.

When a user interacts with an advertisement on a website, DoubleClick's conversion tracking tools (e.g. Floodlight) places cookies on a user’s computer and generates a unique client ID. Thereafter, if the user visits the advertised website, the stored cookie information gets accessed by the DoubleClick server, thereby recording the visit as a valid conversion.

Because such a large percentage of web sites use Google advertising products as well as GA, this has the effect that the company knows a large fraction of users' browsing history across many web sites, both popular sites and smaller "mom and pop" sites. In short, Google knows a lot about what you like, where you are, and what you buy.

Google does provide ways to turn off features like targeted advertising and location tracking, as well as to delete the personalized profile associated with an account. However, these privacy controls are almost entirely opt-in, and most users either don't know about them or just never bother to use them.

Of course, just switching away from GA won't eliminate all of these privacy issues (for example, it will do nothing to stop Android location tracking or search tracking), but it's one way to reduce the huge amount of data Google collects. In addition, for site owners that use a GA alternative, Google does not get a behind-the-scenes look at the site's traffic patterns — data which it could conceivably use in the future to build a competing tool.

LWN readers likely skew toward privacy-conscious: using Firefox instead of Google Chrome, turning on ad blockers, and so on. However, the users of the web sites they build may not be so privacy-conscious. For web-site developers, the analytics tools they choose can help respect their users' privacy and avoid Google knowing quite so much about their users' browsing patterns.

GoatCounter

GoatCounter is one of the more recent web-analytics tools, launched in August 2019. Created by Martin Tournoij, it has more of a "made by a single developer" feel than other tools; it's a little less slick-looking than some, but it is also developer-friendly and simple to set up.

[GoatCounter UI from www.goatcounter.com]

The tool supports all of the basic analytics: page views and visits by URL, browser and operating system statistics, device screen sizes, locations, and referrer information. By default GoatCounter shows the last seven days with counts broken down by hour, but site owners can adjust the date span with simple controls.

GoatCounter has an unusual pricing model, with its source code licensed under the copyleft European Union Public License (EUPL). Companies can host the software themselves, or use GoatCounter's hosted version for a small fee (though the hosted version doesn't cost anything for "personal" projects). Tournoij has a lengthy article discussing why he chose the EUPL, noting:

I still don't really care what people do with my code, but I do care if my ability to make a living would be unreasonably impeded. Taking my MIT code and working full-time on enhancements that aren't sent back to me means my competitor has double the amount of people working on it: me (for free, from their perspective), and them. They will always have an advantage over me.

GoatCounter is written in Go, and uses vanilla JavaScript in its UI for some lightweight interactivity. JavaScript frameworks often get in the way of web accessibility, and GoatCounter's prioritization of accessibility (mentioned on its home page) struck a chord with "ctoth", who thanked Tournoij on Hacker News:

First time I've ever seen a comment about accessibility on the homepage of a mainstream product like this. As a blind developer this was just awesome, made me really feel like somebody out there is listening. Thank you for making this.

In addition to counting page views, GoatCounter tracks sessions using a hash of the browser's user agent and IP address to identify the client without storing any personal information. The salt used to generate these hashes is rotated every 4 hours with a sliding window. Tournoij has a detailed write-up about the technical aspects of session tracking, including a comparison with other solutions that have similar aims.
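To make the idea concrete, here is a minimal Node.js sketch of salted session hashing with a rotating salt. It is not GoatCounter's actual implementation: the `SaltWindow` class, the `sessionHash()` helper, the choice of SHA-256, and the "keep only the previous salt" policy are all assumptions made for illustration; only the four-hour rotation interval comes from the description above.

```javascript
const crypto = require('crypto');

// Four hours, as described in the article; everything else is assumed.
const ROTATE_MS = 4 * 60 * 60 * 1000;

// Hash the salt, IP address, and user agent into an opaque session key.
function sessionHash(salt, ip, userAgent) {
  return crypto.createHash('sha256')
    .update(`${salt}\n${ip}\n${userAgent}`)
    .digest('hex');
}

// Hypothetical helper that holds the current salt and the one before it.
class SaltWindow {
  constructor(now) {
    this.current = { salt: crypto.randomBytes(16).toString('hex'), created: now };
    this.previous = null;
  }

  // Replace the salt once it is older than ROTATE_MS, keeping the old one
  // so that a session straddling the rotation can still be matched.
  rotateIfNeeded(now) {
    if (now - this.current.created >= ROTATE_MS) {
      this.previous = this.current;
      this.current = { salt: crypto.randomBytes(16).toString('hex'), created: now };
    }
  }

  // Candidate hashes for a visitor: newest salt first, then the previous one.
  hashes(ip, userAgent) {
    return [this.current, this.previous]
      .filter(Boolean)
      .map(({ salt }) => sessionHash(salt, ip, userAgent));
  }
}
```

Because old salts are discarded rather than stored, a hash from yesterday cannot be recomputed or linked to today's visitors, which is the property that lets a scheme like this identify a session without keeping the IP address itself.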

For web-site owners who prefer to avoid JavaScript or who want analytics from users with JavaScript disabled, GoatCounter supports a non-JavaScript tracking scheme. It uses a 1x1 transparent GIF image in an "<img>" tag on the pages to be counted, though this approach will not record the referrer or screen size.
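The server side of a tracking pixel is simple enough to sketch in a few lines of Node.js. This is a generic illustration rather than GoatCounter's code: the `/count` path, the `p` query parameter, and the `handlePixel()` function are made up for the example, and the counter lives in memory only.

```javascript
// A commonly circulated 1x1 transparent GIF, base64-encoded.
const PIXEL = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

// Page-view totals keyed by path; no per-request data is kept.
const counts = new Map();

// Handle a request such as "/count?p=/some/page" and describe the
// response that should be sent back to the browser.
function handlePixel(requestUrl) {
  const url = new URL(requestUrl, 'http://localhost');
  const page = url.searchParams.get('p');
  if (url.pathname !== '/count' || !page) {
    return { status: 404, headers: {}, body: Buffer.alloc(0) };
  }
  counts.set(page, (counts.get(page) || 0) + 1);
  return {
    status: 200,
    // Ask the browser not to cache, or repeat views would go uncounted.
    headers: { 'Content-Type': 'image/gif', 'Cache-Control': 'no-store' },
    body: PIXEL,
  };
}
```

A page would then embed something like <img src="/count?p=/some/page" alt="">, and the function could be called from any HTTP server's request callback. Since an image request carries no script-collected data, details such as screen size are simply unavailable to the server.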

The hosted version of GoatCounter is easy to get going: it took me about five minutes to set up an account and add the one line of JavaScript to my web site. Analytics data started showing up within a few seconds. Even with the hosted version, the site owner fully owns the data, and can export the full dump or delete their account at any time.

The self-hosted version is also straightforward to set up using the Linux binaries or by building from source — it took me less than ten minutes to build from source and set it up locally with the default SQLite database configuration. In contrast to Plausible (discussed below), it was much lighter to install, didn't download anything, and started up almost instantly.

Plausible

Plausible is another relatively new analytics tool that was launched in early 2019. Soon after launching, it switched to open source, with the code licensed under the permissive MIT license. The company's business model is to charge for the hosting, with pricing aimed at small businesses. In addition to making its source code available, Plausible is one of an increasing number of companies that have a publicly-visible roadmap for better transparency. It also posts informational content for potential customers on its blog.

[Plausible UI from plausible.io]

Plausible is unique from a technology perspective, with its server code written in Elixir, which is a functional programming language that runs on the Erlang virtual machine. Its frontend UI uses a small amount of vanilla JavaScript for the interactive parts, rather than a rendering framework like React. It also boasts one of the smallest analytics scripts, with plausible.js weighing in at 781 bytes (1.2KB uncompressed) at the time of this writing. GA's analytics.js, by comparison, is almost 18KB (46KB uncompressed), while GoatCounter's count.js is 2.3KB (6.3KB uncompressed). That size can make a meaningful difference since the scripts are loaded for each page on the site.
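For readers who want to reproduce this kind of comparison for their own scripts, the following Node.js sketch reports raw and compressed sizes. It assumes the compressed figures refer to gzip, which the article does not state, and the `scriptSizes()` function and sample text are stand-ins rather than any of the real scripts.

```javascript
const zlib = require('zlib');

// Return the raw and gzip-compressed sizes of a script, in bytes.
function scriptSizes(source) {
  const raw = Buffer.from(source, 'utf8');
  return { raw: raw.length, gzipped: zlib.gzipSync(raw).length };
}

// Stand-in for a real tracking script; repetitive text compresses well,
// so the ratio here says nothing about real-world scripts.
const sample = 'window.hits = (window.hits || 0) + 1;\n'.repeat(50);
console.log(scriptSizes(sample));
```

Reading an actual script with fs.readFileSync() and passing its contents to scriptSizes() would give numbers comparable to the ones quoted above, subject to the compression settings the server really uses.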

In terms of user interface, Plausible is definitely more polished than GoatCounter. It is fairly minimalist, though, perhaps even more so than GoatCounter, providing total visitor counts, page-view counts per path, referrer information, map location, and devices (broken down by screen size, browser, and operating system). The tool also provides a "bounce rate" metric, though the exact definition is unclear.
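For reference, the conventional definition of bounce rate is the share of sessions that consist of a single page view. The sketch below computes it that way; whether Plausible uses this definition is exactly what is unclear, so treat `bounceRate()` and its input shape as an illustration of the common metric, not of Plausible's behavior.

```javascript
// sessions: an array of arrays, each holding the paths viewed in one session.
// Returns the fraction of sessions with exactly one page view.
function bounceRate(sessions) {
  if (sessions.length === 0) return 0;
  const bounces = sessions.filter((views) => views.length === 1).length;
  return bounces / sessions.length;
}

// Two of these four sessions are single-page visits.
console.log(bounceRate([['/'], ['/', '/about'], ['/blog'], ['/', '/a', '/b']]));
```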

Plausible's home page states that it provides "100% data ownership", and it is possible to export the CSV data for a single chart (as well as delete a Plausible.io account). However, the data dump is significantly less useful than GoatCounter's full data dump, which includes detailed information for every event.

Self-hosting Plausible is possible (even plausible), though as founder Uku Taht points out in the announcement of switching to open source:

It's worth noting that for now, there's no explicit support for self-hosting Plausible. The project is still evolving quickly and maintaining a self-hosted solution would slow product development down considerably. I would love to offer a self-hosted solution in the future once the product and the business are more stable.

That said, just a few weeks ago, Plausible added a document that describes an experimental way to self-host the system using Docker. Following those recommendations, I tried to use docker-compose to get it running locally. It was a little disconcerting how many Docker and npm packages it downloaded during the minutes-long installation process, and even when it was done, there was a hard-to-comprehend error with a PostgreSQL migration which prevented it from starting — the "experimental" label definitely fits.

Proprietary options, briefly

There are also a couple of lightweight proprietary tools with a focus on privacy worth mentioning. Obviously, these don't have the advantages of open development or self-hosting, but still provide a low-cost way out of Google's data-collection net.

One is the minimalist Simple Analytics product, which is a cloud-based tool created by solo developer Adriaan van Rossum; it has a clean-looking interface with only the few key metrics, similar to Plausible. Another is Fathom, which was open source initially, but the current version is proprietary (although the company hopes to start maintaining the open-source code base again in the future).

Summary

The last few years have seen a number of good alternatives to Google Analytics, particularly for those who only need a few basic features. Many of the recent alternatives are both open source and privacy-conscious, which means there are fewer reasons for projects and businesses to continue using proprietary analytics systems.

For site owners who just need basic traffic numbers, GoatCounter and Plausible both seem like excellent options. Those who like more visual polish and documentation might prefer Plausible; those who value a more developer-friendly tool with easy self-hosting will probably prefer GoatCounter. We will soon be publishing a second article that looks at some heavier-weight GA alternatives, as well as tools that provide analytics from web-server logs.



Lightweight alternatives to Google Analytics

Posted Jun 17, 2020 19:09 UTC (Wed) by ibukanov (subscriber, #3942) [Link]

Given how many sites WordPress runs on, it would be nice to have a review of its analytics plugins. Installation and configuration of those on a self-hosted WordPress is trivial.

Matomo

Posted Jun 17, 2020 21:13 UTC (Wed) by dw (subscriber, #12017) [Link] (4 responses)

I find the absence of Matomo inexplicable, it's by far the most feature-complete (and feature-comparable) alternative to Google Analytics around

Matomo

Posted Jun 17, 2020 21:20 UTC (Wed) by jake (editor, #205) [Link]

> I find the absence of Matomo inexplicable,

Stay tuned :)

jake

Privacy-preserving Google Analytics

Posted Jun 17, 2020 21:32 UTC (Wed) by dw (subscriber, #12017) [Link]

It's worth mentioning the possibility of removing some of the sting from Google Analytics using the measurement protocol and a local copy of analytics.js. You host a proxy script that forwards the hit on to GA, after making any desirable privacy-preserving changes, such as lopping off some of the IP address (rather than rely on the equivalent Google setting). On the client, configuring analytics.js with a custom sendHitTask delivers data to the script.

For completeness, the client juju is simply:

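    // Create the tracker as usual.
    ga('create', 'UA-XXXXXXX-1', 'auto');
    ga(function(tracker) {
        // Replace the default transport so hits go to our own proxy
        // instead of directly to Google.
        tracker.set('sendHitTask', function(model) {
            var xhr = new XMLHttpRequest();
            xhr.open('POST', '/wrapper-script');
            // hitPayload is the URL-encoded hit that GA would have sent.
            xhr.send(model.get('hitPayload'));
        });
    });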

This also creates an opportunity for logging the hit data, so you get the best of both worlds: the hassle-free operation of GA with all the raw GA data preserved should you wish to migrate to another solution in future.

Finally, since the entirety of the data received by Google is controlled, and if you're sufficiently paranoid, it's even possible to anonymize the domain being tracked.
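The proxy side of this scheme might look roughly like the following Node.js sketch. The function names are hypothetical and only IPv4 is handled; `uip` and `ua` are the Measurement Protocol's IP-override and user-agent-override parameters, which a proxy needs to set because GA would otherwise see the proxy's own address and agent.

```javascript
// Zero the last octet of an IPv4 address before it leaves the proxy.
function truncateIPv4(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4) throw new Error(`not an IPv4 address: ${ip}`);
  parts[3] = '0';
  return parts.join('.');
}

// Rewrite the URL-encoded hit posted by sendHitTask so that GA receives
// a truncated client address and the visitor's user agent.
function rewriteHit(payload, clientIp, userAgent) {
  const params = new URLSearchParams(payload);
  params.set('uip', truncateIPv4(clientIp)); // IP override
  params.set('ua', userAgent);               // user-agent override
  return params.toString();
}
```

The rewritten payload would then be POSTed to the Measurement Protocol's collect endpoint, and the same string can be appended to a local log if the raw hits are to be kept.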

Matomo

Posted Jun 18, 2020 2:58 UTC (Thu) by anarcat (subscriber, #66354) [Link] (1 responses)

> I find the absence of Matomo inexplicable, it's by far the most feature-complete (and feature-comparable) alternative to Google Analytics around

I think the key part of the title you might have missed is "Lightweight". I wouldn't qualify Matomo as lightweight if only because it's primarily designed as a (relatively fat) JavaScript client (~200KB) that talks to a fairly large PHP web app which does a ton of stuff. The tools evaluated here seem much more lightweight. :)

Matomo

Posted Jun 18, 2020 4:41 UTC (Thu) by dw (subscriber, #12017) [Link]

It would appear two decades around enterprise software has damaged my definition of lightweight ;) Stood next to the typical 342 KiB of script payload on a modern Google search home page, the 23 KiB gzipped Matomo tracker JS might at least still be considered lightweight by some reasonable standard.

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 2:07 UTC (Thu) by pabs (subscriber, #43278) [Link] (13 responses)

As a visitor to websites, I think a better alternative is to not track your visitors at all. Don't log their visits anywhere, don't record anything about them.

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 2:55 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Visitor tracking helps to find problematic areas on a website and, for commercial websites, to understand who is actually using it. It might be unimportant only for simple content websites (like blogs).

Having good alternatives to ever-present GA is a good thing.

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 15:59 UTC (Thu) by sarunas (subscriber, #33117) [Link] (1 responses)

I would lean to disagree. Apart from ethical concerns for planting tracking cookies, given the widespread use of tracker blockers and now browsers themselves starting to block trackers, data collected must be skewed to the point of being worthless or even misleading...

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 17:50 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Trackers on individual sites don't suffer from blocked cookies, as users are typically logged in and have a unique session ID. Blocked trackers are more problematic, but even server-side tracking is usually better than none.

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 3:02 UTC (Thu) by anarcat (subscriber, #66354) [Link] (7 responses)

I record people for 7 days on my site, and shove the data into https://goaccess.io/ which sends me the result in a crappy email summary, without IP addresses. So I'm effectively keeping zero in the long term, although I *am* retaining *some* data in the short term.

I find that's a reasonable tradeoff: yes, it's better if you don't store any personally identifiable information at all. But once you start doing that, you realize it's actually incredibly difficult. Your uplink might keep track of those streams. Those IP addresses and other PII do land in your computer memory whether you like it or not, and that means it can end up on disk (thanks to swap).

So I prefer to assume that there is *some* leakage, make *some* use of it, and limit it over time. Because it *does* have some use. Maybe it's just vanity, but I do like to get feedback on which of the articles I am writing are valuable to my readers, and while comments and direct feedback are a measure of that, there are way too few of those to provide a meaningful measure. Visits, on the other hand, are a direct metric I can use.

And that's without starting on abuse control, for which IP address tracking is kind of invaluable. For example, if you have a site requiring a login and you are not rate limiting password-guessing attempts, you are doing it wrong. And I don't know how you would do that *other* than by logging *some* IP addresses...
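As a rough illustration of the kind of rate limiting meant here, the Node.js sketch below keeps recent attempt timestamps per key (for example, per IP address) in memory and forgets them once they fall outside the window. The `AttemptLimiter` class and its parameters are invented for the example; a real deployment would more likely rely on an existing tool or library.

```javascript
// Sliding-window limiter: at most `limit` attempts per `windowMs` per key.
class AttemptLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.attempts = new Map(); // key (e.g. IP address) -> attempt timestamps
  }

  // Record an attempt at time `now` (in ms) and report whether it is allowed.
  allow(key, now = Date.now()) {
    // Drop timestamps that have aged out of the window.
    const recent = (this.attempts.get(key) || [])
      .filter((t) => now - t < this.windowMs);
    const ok = recent.length < this.limit;
    if (ok) recent.push(now);
    // Forget the key entirely once it has no recent attempts.
    if (recent.length > 0) this.attempts.set(key, recent);
    else this.attempts.delete(key);
    return ok;
  }
}

// Example: three login attempts per minute per address.
const limiter = new AttemptLimiter(3, 60 * 1000);
```

Addresses are only retained for the length of the window, which fits the "keep some data, but limit it over time" approach described above.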

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 4:21 UTC (Thu) by pabs (subscriber, #43278) [Link] (5 responses)

It is improbable that you can be on the internet without intermediaries, and increasingly improbable that those intermediaries are dumb pipes (and thus trustworthy), and software leaking PII is indeed hard to control, but you can at least avoid intentionally storing PII yourself, or at least calculating the visit count within the request handler without storing any info on the corresponding individual request events.

Visit count is not a direct indicator of value. I visit plenty of websites and blog posts that I don't find useful after the fact. There are various ways to artificially inflate visit count without providing value or providing negative value (such as clickbait and or false headlines on social media).

Password based logins should be replaced with cryptographic logins (Webauthn, TLS client certs or Tor onion client auth for eg), which presumably solves the brute-force issue too.

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 16:23 UTC (Thu) by anarcat (subscriber, #66354) [Link] (4 responses)

> you can at least avoid intentionally storing PII yourself, or at least calculating the visit count within the request handler without storing any info on the corresponding individual request events.

I agree! I don't find my solution to be particularly interesting, especially from a privacy perspective. It was just *simple*... :)

That said, that very requirement is why I find projects like GoatCounter interesting: it makes a special effort to count events *without* storing PII! It has a pretty elegant design from that perspective. So I would definitely consider it as an alternative to goaccess...

> Visit count is not a direct indicator of value. I visit plenty of websites and blog posts that I don't find useful after the fact. There are various ways to artificially inflate visit count without providing value or providing negative value (such as clickbait and or false headlines on social media).

Granted. I would consider this an attack vector like any other though. It doesn't mean there is *no* value in visitor count. Sure, you could count only "hits" and decide what's useful and what's not. Or you could just pretend all this stuff doesn't matter. But there are plenty of users who like to see those stats, and I believe in "harm reduction" and in providing safer tools by default rather than pretending that requirement does not exist...

> Password based logins should be replaced with cryptographic logins (Webauthn, TLS client certs or Tor onion client auth for eg), which presumably solves the brute-force issue too.

Ha! I would like to believe too. But to break it apart: 1) WebAuthn is definitely useful right now, but it's generally used for 2FA, not as the primary factor. After all, you don't really want people to just log in with a "key" (something that you own) because once that is stolen you are totally screwed (you also need "something you know"). 2) TLS client certs would be great if clients implemented them in any meaningful way. But unfortunately, they are going more and more towards the trashbin. And they definitely have their own user-tracking concerns, at least in the current implementation, maybe even worse than regular cookies. 3) Tor is not ubiquitous (yet) so I wouldn't assume it's a good replacement for password authentication just yet.

I hear you: passwords suck. But they're still a thing and hard to get rid of! And even if you would get rid of it, I would still argue for rate-limiting in authentication attempts, even with public key authentication.

Lightweight alternatives to Google Analytics

Posted Jun 19, 2020 1:17 UTC (Fri) by pabs (subscriber, #43278) [Link] (3 responses)

I'm interested in the privacy issues you mention with TLS client certs. It seems to me that they are basically the ideal auth mechanism in terms of privacy, since with the right browser implementation you could make the authentication choice on a per-request basis, allowing you to only authenticate POST requests or only authenticate URLs the user clicked on and not things the page loads.

Lightweight alternatives to Google Analytics

Posted Jun 21, 2020 18:21 UTC (Sun) by anarcat (subscriber, #66354) [Link] (2 responses)

I'm not super familiar with the details, but there's a similar problem with SSH, I believe. When you authenticate to a server with public key authentication, either the server or the client at some point needs to disclose which public keys are authorized or which ones it will try to authenticate with. When we do server authentication (i.e. regular HTTPS) this doesn't matter: the site is public and it's not trying to hide its identity, it's trying to *prove* it to the world!

But when you're a client, you have different tradeoffs. You don't want to send that certificate everywhere all the time, because it acts as a unique token that can be used to track you across websites. Firefox has rudimentary protection against this: when I go on a site that wants access to my TLS client cert, it first prompts whether I want to actually authenticate with my cert. But that UI is terrible: it pops open all the time, at random moments, and doesn't remember the "yes I trust this site" checkbox, which seems to do nothing.

It's also not clear to me whether the server actually knows about my client cert at this point or whether the dialog is actually effective in not disclosing my identity. And that's just on Firefox, which has some support for TLS client certs. I suspect the situation could be catastrophically worse on other browsers.

I will also note that SSH does not have those protections *at all*. It will happily send *all* the public keys it knows about when trying to login to a random server, which is kind of disturbing when you think about it:

$ ssh -v lwn.net
[...]
debug1: Next authentication method: publickey
debug1: Offering public key: cardno:N RSA SHA256:XXXX agent
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Offering public key: rsa w/o comment RSA SHA256:XXXX agent
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
[...]

No confirmation prompt whatsoever here. And they would be annoying too... I guess SSH expects you to divulge your public key identity when you connect to a server... but in the wild wild web, it seems like a delicate thing to do, so I wonder if a good usability trade-off is possible at all here.

Lightweight alternatives to Google Analytics

Posted Jun 22, 2020 3:29 UTC (Mon) by pabs (subscriber, #43278) [Link] (1 responses)

IIRC for SSH the solutions to this are either separate SSH agents per identity or the IdentitiesOnly option.
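For anyone wanting to try the second option, a per-host entry in the OpenSSH client configuration could look like the fragment below. `IdentityFile` and `IdentitiesOnly` are standard ssh_config options; the key file name is a placeholder, not a real path.

```
# ~/.ssh/config
Host lwn.net
    # Hypothetical key file used only for this host
    IdentityFile ~/.ssh/id_ed25519_lwn
    # Offer only the identities configured here, not everything in the agent
    IdentitiesOnly yes
```

With this in place, running "ssh -v lwn.net" should show only the configured key being offered, rather than every key the agent holds.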

I guess if web browsers wanted to they could easily mitigate this by pinning each cert to the domain it was created for and only ever sending it to that domain.

Also, I wonder if the client cert is in the clear in the TLS handshake, or if Encrypted Client Hello (new name for ESNI) is needed to hide them.

Lightweight alternatives to Google Analytics

Posted Jun 22, 2020 14:58 UTC (Mon) by anarcat (subscriber, #66354) [Link]

> IIRC for SSH the solutions to this are either separate SSH agents per identity or the IdentitiesOnly option.

I suspect near-absolutely no one does this...

> I guess if web browsers wanted to they could easily mitigate this by pinning each cert to the domain it was created for and only ever sending it to that domain.

Assuming they cared about client certs at all...

> Also, I wonder if the client cert is in the clear in the TLS handshake, or if Encrypted Client Hello (new name for ESNI) is needed to hide them.

I would assume the worst. ;)

Lightweight alternatives to Google Analytics

Posted Jul 11, 2020 21:10 UTC (Sat) by anarcat (subscriber, #66354) [Link]

> I record people for 7 days on my site, and shove the data into https://goaccess.io/ which sends me the result in a crappy email summary, without IP addresses. So I'm effectively keeping zero in the long term, although I *am* retaining *some* data in the short term.

An interesting addition to this...

It turns out that goaccess *does* record the IP addresses of visitors after processing, when the `VISITORS` panel is enabled. It shows the per-IP top visitors and therefore does keep PII, contrary to what I first believed. I tried to disable that panel (which I do not find very interesting anyways) but then it breaks visitor tracking in the rest of the reports, so that's definitely a problem.

I am, again, really interested in trying out goatcounter, then. :)

Lightweight alternatives to Google Analytics

Posted Jun 18, 2020 20:43 UTC (Thu) by ddevault (subscriber, #99589) [Link]

Thank you for saying this, I agree entirely. We should not be encouraging people to quit Google Analytics for something else, we should be encouraging them to quit analytics entirely. It's not right to spy on your users. 9 times out of 10, analytics exist only to provide a dopamine fix to the web admin - ask anyone you know to tell you exactly how their changes are informed by analytics data, and you'll likely hear crickets.

Lightweight alternatives to Google Analytics

Posted Jun 19, 2020 11:29 UTC (Fri) by Lennie (subscriber, #49641) [Link]

I run a pretty large website with sign-ups, forum comments, etc., and I need a way to see how the site is used.

We need to keep logs for the first things I mentioned to deal with abuse and spam.

The second part is very important for knowing how to improve things and what does and doesn't work. Now I would prefer more tools which throw more away and just keep the statistics. But more importantly, as long as it's just us running the website who have the data, and not some company like Google or Facebook tracking you on multiple sites, that's a very large privacy difference.

Lightweight alternatives to Google Analytics

Posted Jun 30, 2020 12:08 UTC (Tue) by ihucos (guest, #127147) [Link]

If I can make some shameless advertising for my own product: https://simple-web-analytics.com/

For me one interesting aspect for better privacy is how unique views are tracked. GoatCounter seems to have something like sessions "the right way" or is at least mitigating its effects on privacy concerns. Plausible on the other hand (and many others) uses the hashed user agent and IP as a session ID and stores that permanently. In my opinion that is even worse than cookies, which are more transparent, more easily controlled by users and usually some random ID that gets forgotten.

Simple Analytics (not to be confused with "my" product - similar naming - they were first) does something quite interesting, which is to simply inspect the `document.referrer`. If it's not the site being tracked, it must be a new visitor. "My" product uses the HTTP cache to ensure each user is only counted once a day but also additionally counts on `sessionStorage` for more accuracy.
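The referrer check described above can be sketched in a few lines. This is only an illustration of the idea, not code from Simple Analytics; the `isNewVisit()` function and the decision to count empty or unparseable referrers as new visits are assumptions.

```javascript
// Treat a page view as the start of a new visit when the referrer is empty
// or points at a different host than the site being tracked.
function isNewVisit(referrer, siteHost) {
  if (!referrer) return true; // typed URL, bookmark, or referrer stripped
  try {
    return new URL(referrer).hostname !== siteHost;
  } catch {
    return true; // unparseable referrer: count it as a new visit
  }
}
```

In a browser this would be called as isNewVisit(document.referrer, location.hostname). No identifier is stored anywhere, though a visitor who leaves and comes back via an external link is counted again.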

From my naive understanding of the GDPR, you cannot have any session IDs (so also no fingerprinting) if you want to avoid consent banners. Which providers let you skip GDPR consent banners is, in my opinion, also an interesting but nebulous and difficult topic.


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds