Django debates user tracking

November 30, 2016

This article was contributed by Antoine Beaupré

In recent years, privacy issues have become a growing concern among free-software projects and users. As more and more software tasks become web-based, surveillance and tracking of users is also on the rise. While some software may use advertising as a source of revenue, which has the side effect of monitoring users, the Django community recently got into an interesting debate surrounding a proposal to add user tracking—actually developer tracking—to the popular Python web framework.

Tracking for funding

A novel aspect of this debate is that the initiative comes from concerns of the Django Software Foundation (DSF) about funding. The proposal suggests that "relying on the free labor of volunteers is ineffective, unfair, and risky" and states that "the future of Django depends on our ability to fund its development". In fact, the DSF recently hired an engineer to help oversee Django's development, which has been quite successful in helping the project make timely releases with fewer bugs. Various fundraising efforts have resulted in major new Django features, but it is difficult to attract sponsors without some hard data on the usage of Django.

The proposed feature tries to count the number of "unique developers" and gather some metrics about their environments by using Google Analytics (GA) in Django. The actual proposal (DEP 8) takes the form of a pull request, as part of the Django Enhancement Proposal (DEP) process, which is similar in spirit to the Python Enhancement Proposal (PEP) process. DEP 8 was brought forward by longtime Django developer Jacob Kaplan-Moss.

The rationale is that "if we had clear data on the extent of Django's usage, it would be much easier to approach organizations for funding". The proposal is essentially about adding code in Django to send a certain set of metrics when "developer" commands are run. The system would be "opt-out", enabled by default unless turned off, although the developer would be warned the first time the phone-home system is used. The proposal notes that an opt-in system "severely undercounts" and is therefore not considered "substantially better than a community survey" that the DSF is already doing.

Information gathered

The reporting is specifically designed to run only in a developer's environment and not in production. The metrics identified are, at the time of writing:

an event category (the developer commands: startproject, startapp, runserver)
the HTTP User-Agent string identifying the Django, Python, and OS versions
a user-specific unique identifier (a UUID generated on first run)
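
To make the three metrics concrete, here is a minimal sketch of how such a report could be assembled. It is not code from DEP 8: the tracking ID, the Django version string, the "run" event action, and the function names are all placeholders invented for illustration. The field names (v, tid, cid, t, ec, ea, ua, aip) follow GA's Measurement Protocol as best understood here; nothing is sent over the network.

```python
import platform
import uuid
from urllib.parse import urlencode

# Hypothetical placeholders: the proposal as described here fixes neither value.
TRACKING_ID = "UA-XXXXXXX-Y"
DJANGO_VERSION = "1.10"


def build_user_agent(django_version):
    """Describe the Django, Python, and OS versions in one string."""
    return "Django/{} Python/{} {}/{}".format(
        django_version,
        platform.python_version(),
        platform.system(),
        platform.release(),
    )


def build_payload(command, client_id, django_version=DJANGO_VERSION):
    """Return a URL-encoded body in the style of GA's Measurement Protocol."""
    fields = {
        "v": "1",            # protocol version
        "tid": TRACKING_ID,  # analytics property (placeholder)
        "cid": client_id,    # the per-developer UUID
        "t": "event",        # hit type
        "ec": command,       # event category: the developer command
        "ea": "run",         # event action (made-up value for this sketch)
        "aip": "1",          # ask GA to anonymize the sender's IP
        "ua": build_user_agent(django_version),
    }
    return urlencode(fields)


client_id = str(uuid.uuid4())  # would be generated once on first run, then stored
payload = build_payload("runserver", client_id)
```

In a real implementation, the UUID would have to be persisted somewhere in the developer's environment so that repeated runs count as one developer; where that would live is not specified in the discussion.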

The proposal mentions the use of the GA aip flag which, according to GA documentation, makes "the IP address of the sender 'anonymized'". It is not quite clear how that is done at Google and, given that it is a proprietary platform, there is no way to verify that claim. The proposal says it means that "we can't see, and Google Analytics doesn't store, your actual IP". But that is not actually what Google does: GA stores IP addresses, and the documentation just says they are anonymized, without explaining how.

GA is presented as a trade-off, since "Google's track record indicates that they don't value privacy" as highly as the DSF does. The alternative, deploying its own analytics software, was presented as making sustainability problems worse. According to the proposal, Google "can't track Django users. [...] The only thing Google could do would be to lie about anonymizing IP addresses, and attempt to match users based on their IPs".

The truth is that we don't actually know what Google means when it "anonymizes" data. Jannis Leidel, a Django team member, commented that "Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services", which limits even Google's capacity to ensure its users' anonymity. Leidel also argued that the legal framework of the US may not apply elsewhere in the world: "for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option".

Furthermore, the proposal claims that "if we discovered Google was lying about this, we'd obviously stop using them immediately", but it is unclear exactly how this could be implemented if the software was already deployed. There are also concerns that an implementation could block normal operation, especially in countries (like China) where Google itself may be blocked. Finally, some expressed concerns that the information could constitute a security problem, since it would unduly expose the version number of Django that is running.
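
The concern about blocking normal operation is usually addressed by making the report best-effort. The following is a minimal sketch of that idea, not anything taken from the proposal: the function names are hypothetical, and the sender is passed in as a parameter so that the example does not touch the network. The report runs in a daemon thread and any failure is silently dropped.

```python
import threading
import time


def report_in_background(send, payload, timeout=2.0):
    """Run send(payload, timeout) in a daemon thread and return immediately.

    Any exception from the sender is swallowed so that a blocked or
    unreachable endpoint never interrupts the developer's command.
    """
    def worker():
        try:
            send(payload, timeout)
        except Exception:
            pass  # reporting is best-effort only

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def slow_sender(payload, timeout):
    """Stand-in for a network call that hangs until its timeout expires."""
    time.sleep(min(timeout, 0.5))
    raise OSError("endpoint unreachable")


start = time.monotonic()
thread = report_in_background(slow_sender, "ec=runserver", timeout=0.5)
elapsed = time.monotonic() - start  # time spent in the caller, not in the thread
```

Because the thread is a daemon, a short-lived command such as startproject could exit before the report completes; whether losing such reports is acceptable is a design choice the discussion does not settle.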

In other projects

Django is certainly not the first project to consider implementing analytics to get more information about its users. The proposal is largely inspired by a similar system implemented by the OS X Homebrew package manager, which has its own opt-out analytics.

Other projects embed GA code directly in their web pages. This is apparently the option chosen by the Oscar Django-based ecommerce solution, but the DSF saw it as less useful, since it would count Django administrators rather than developers. Wagtail, a Django-based content-management system, was incorrectly identified as using GA directly, as well. It actually uses referrer information to identify installed domains through its version-update checks, with an opt-out. Wagtail didn't use GA because the project wanted only minimal data and was worried about users' reactions.

NPM, the JavaScript package manager, also considered similar tracking extensions. Laurie Voss, the co-founder of NPM, said it decided to completely avoid phoning home, because "users would absolutely hate it". But NPM users are constantly downloading packages to rebuild applications from scratch, so it has more complete usage metrics, which are aggregated and available via a public API. NPM users seem to find this is a "reasonable utility/privacy trade". Some NPM packages do phone home and have seen "very mixed" feedback from users, Voss said.

Eric Holscher, co-founder of Read the Docs, said the project is considering using Sentry for centralized reporting, which is a different idea, but interesting considering Sentry is fully open source. So even though it is a commercial service, it may be possible to verify any anonymity claims, unlike with the closed-source Google Analytics.

Debian's response

Since Django is shipped with Debian, one concern was the reaction of the distribution to the change. Indeed, "major distros' positions would be very important for public reception" to the feature, another developer stated.

One of the current maintainers of Django in Debian, Raphaël Hertzog, explicitly stated from the start that such a system would "likely be disabled by default in Debian". There were two short discussions on Debian mailing lists where the overall consensus seemed to be that any opt-out tracking code was undesirable in Debian, especially if it was aimed at Google servers.

I have done some research to see what, exactly, was acceptable as a phone-home system in the Debian community. My research has revealed ten distinct bug reports against packages that would unexpectedly connect to the network, most of which were not directly about collecting statistics but more often about checking for new versions. In most cases I found, the feature was disabled. In the case of version checks, it seems right for Debian to disable the feature, because the package cannot upgrade itself: that task is delegated to the package manager. One of those issues was the infamous "OK Google" voice activation binary blob controversy that was previously reported here and has since then been fixed (although other issues remain in Chromium).

I have also found out that there is no clearly defined policy in Debian regarding tracking software. What I have found, however, is that there seems to be a strong consensus in Debian that any tracking is unacceptable. This is, for example, an extract of a policy that was drafted (but never formally adopted) by Ian Jackson, a longtime Debian developer:

Software in Debian should not communicate over the network except: in order to, and as necessary to, perform their function[...]; or for other purposes with explicit permission from the user.

In other words, opt-in only, period. Jackson explained that "when we originally wrote the core of the policy documents, the DFSG [Debian Free Software Guidelines], the SC [Social Contract], and so on, no-one would have considered this behaviour acceptable", which explains why no explicit formal policy has been adopted yet in the Debian project.

One of the concerns with opt-out systems (or even prompts that default to opt-in) was well explained back then by Debian developer Bas Wijnen:

It very much resembles having to click through a license for every package you install. One of the nice things about Debian is that the user doesn't need to worry about such things: Debian makes sure things are fine.

One could argue that Debian has its own tracking systems. For example, by default, Debian will "phone home" through the APT update system (though it only reports the packages requested). However, this is currently not automated by default, although there are plans to do so soon. Furthermore, Debian members do not consider APT as tracking, because it needs to connect to the network to accomplish its primary function. Since there are multiple distributed mirrors (which the user gets to choose when installing), the risk of surveillance and tracking is also greatly reduced.

A better parallel could be drawn with Debian's popcon system, which actually tracks Debian installations, including package lists. But as Barry Warsaw pointed out in that discussion, "popcon is 'opt-in' and [...] the overwhelming majority in Debian is in favour of it in contrast to 'opt-out'". It should be noted that popcon, while opt-in, ~~defaults to "yes" if users click through the install process~~. [Update: As pointed out in the comments, popcon actually defaults to "no" in Debian.] There are around 200,000 submissions at this time, which are tracked with machine-specific unique identifiers that are submitted daily. Ubuntu, which also uses the popcon software, gets around 2.8 million daily submissions, while Canonical estimates there are 40 million desktop users of Ubuntu. This would mean there is about an order of magnitude more installations than what is reported by popcon.

Policy aside, Warsaw explained that "Debian has a reputation for taking privacy issues very serious and likes to keep it".

Next steps

There are obviously disagreements within the Django project about how to handle this problem. It looks like the phone-home system may end up being implemented as a proxy system "which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves", another Django developer, Aymeric Augustin, said. Augustin also stated that the feature wouldn't "land before Django drops support for Python 2", which is currently estimated to be around 2020. It is unclear, then, how the proposal would resolve the funding issues, considering how long it would take to deploy the change and then collect the information so that it can be used to spur the funding efforts.
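
As a rough illustration of what "anonymize them ourselves" could mean in such a proxy, the sketch below truncates an address to a network prefix and drops the headers that usually carry a client address. The prefix lengths (/24 for IPv4, /48 for IPv6), the header list, and the function names are assumptions made for this example; nothing in the discussion specifies them.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host bits of an address, keeping only a network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    network = ipaddress.ip_network("{}/{}".format(ip, prefix), strict=False)
    return str(network.network_address)


def strip_ip(headers):
    """Return a copy of request headers without client-address headers."""
    dropped = {"x-forwarded-for", "x-real-ip", "forwarded"}
    return {k: v for k, v in headers.items() if k.lower() not in dropped}
```

For example, anonymize_ip("192.0.2.123") returns "192.0.2.0". A proxy that forwards the request itself already hides the developer's address from the upstream service, which then only sees the proxy's address; the truncation matters for whatever the proxy logs.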

It also seems the system may explicitly prompt the user, with an opt-out default, instead of just splashing a warning or privacy agreement without a prompt. As Shai Berger, another Django contributor, stated, "you do not get [those] kind of numbers in community surveys". Berger also made the argument that "we trust the community to give back without being forced to do so"; furthermore:

I don't believe the increase we might get in the number of reports by making it harder to opt-out, can be worth the ill-will generated for people who might feel the reporting was "sneaked" upon them, or even those who feel they were nagged into participation rather than choosing to participate.

Other options may also include gathering metrics in pip or PyPI, which was proposed by Donald Stufft. Leidel also proposed that the system could ask to opt-in only after a few times the commands are called.
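
Leidel's idea of asking only after the commands have been used a few times could look roughly like the sketch below. The threshold of five runs, the JSON state file, and the function names are all invented for illustration; the discussion does not specify any of them.

```python
import json
import os
import tempfile

PROMPT_AFTER_RUNS = 5  # hypothetical threshold; the discussion names no number


def load_state(path):
    """Read the stored state, or start fresh if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {"runs": 0, "opted_in": None}


def record_run(path, ask):
    """Count one command run; call ask() once the threshold is reached.

    Returns True only when the developer has explicitly opted in.
    """
    state = load_state(path)
    state["runs"] += 1
    if state["opted_in"] is None and state["runs"] >= PROMPT_AFTER_RUNS:
        state["opted_in"] = bool(ask())
    with open(path, "w") as f:
        json.dump(state, f)
    return state["opted_in"] is True


# Simulate six runs by a developer who answers "yes" when finally asked.
state_file = os.path.join(tempfile.mkdtemp(), "metrics.json")
results = [record_run(state_file, ask=lambda: True) for _ in range(6)]
```

Nothing is reported during the first four runs; the question is asked once on the fifth, and the answer is remembered afterward.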

It is encouraging to see that a community can discuss such issues without heating up too much; that shows great maturity on the part of the Django project. Every free-software project may be confronted with funding and sustainability issues. Django seems to be trying to address this in a transparent way. The project is willing to engage with the whole spectrum of the community, from the top leaders to downstream distributors, including individual developers. This practice should serve as a model, if not of how to do funding or tracking, at least of how to discuss those issues productively.

Everyone seems to agree the point is not to surveil users, but to improve the software. As Lars Wirzenius, a Debian developer, commented: "it's a very sad situation if free software projects have to compromise on privacy to get funded". Hopefully, Django will be able to improve its funding without compromising its principles.


Django debates user tracking

Posted Dec 1, 2016 4:09 UTC (Thu) by distinguishedcorgi (guest, #100058) [Link] (5 responses)

>Since Django is shipped with Debian

I don't understand why GNU/Linux distributions ship Python libraries at all. Rather than being installed globally to the system, Python libraries should be packaged with the application in a virtualenv and managed with pip, Python's package manager. I'm not sure who is benefiting from Debian packaging Django -- as far as I can tell it does not make things easier for developers; they likely work on multiple projects with dependencies on different versions of these libraries.

Django debates user tracking

Posted Dec 1, 2016 4:30 UTC (Thu) by pabs (subscriber, #43278) [Link]

https://enricozini.org/blog/2014/debian/debops/

Django debates user tracking

Posted Dec 1, 2016 19:28 UTC (Thu) by Felix (guest, #36445) [Link]

> I don't understand why GNU/Linux distributions ship Python libraries at all.

Well, it might be trivial to install some python-only library if you are using a virtualenv anyway. However I'm really glad that Fedora ships things like numpy, opencv, PyQT/PySide readily available. Also simplejson gets regular (security) updates which are too easy to miss if you just set up a virtualenv and leave it there.

Basically package managers like dnf/apt do a *way* better job than pip (with regards to keeping the system up-to-date). Oh, and they also ensure that you can actually trust the packages you installed. This is part of the reason why I really would like to see some combination of rpm/dnf (or deb/apt if they get there first) and virtualenv.

Django debates user tracking

Posted Dec 2, 2016 21:59 UTC (Fri) by mstone_ (subscriber, #66309) [Link]

ah, yes, the user-friendly automatic network based "install-and-never-ever-update-ever-again" model. it does have some down sides.

and it really blows for installing stuff on machines with limited network connectivity.

Django debates user tracking

Posted Dec 3, 2016 12:15 UTC (Sat) by valhalla (guest, #56634) [Link]

It may not make things easier for the developers, but surely it does make things much easier for the people (sysadmins, devops, whatever) who have to keep the project up and running and hopefully with a minimum amount of known vulnerabilities.

Also, it makes developers more confident that their machine is not running code from untrusted sources which could take control of it.

Or, if you want to read me ranting more in detail: https://www.trueelena.org/computers/articles/candy_from_s...

Django debates user tracking

Posted Dec 7, 2016 8:30 UTC (Wed) by debacle (subscriber, #7114) [Link]

I disagree completely. I never would want to have any libraries or tools installed by pip, gem, elpa, npm, and what not. My main reasons are:

I always want to have an overview about currently installed software. I want to be able to add, remove, upgrade packages. This is the case when I use one tool for all packages (here: apt), not when I have cluttered directories by different language specific tools.
I more or less trust my distribution (here: Debian) to make sure that the software has a free license, works with my system, is compatible with the rest of it, etc. From my experience as a packager, I know that not all upstreams take license issues seriously.
Most relevant here: I trust my distribution (here: Debian), that they would disable any "home phoning", even if upstream would have it enabled.

User tracking & Debian

Posted Dec 1, 2016 8:08 UTC (Thu) by robbe (guest, #16131) [Link] (5 responses)

> It should be noted that popcon, while opt-in, defaults to "yes" if users click through the
> install process.

How do you figure?
The package is installed by default, but – as far as I can see – does not do anything.
https://anonscm.debian.org/cgit/popcon/popcon.git/tree/de...
has "Default: false". The question is asked with "high" priority, so you are bound to see it (c.f. https://anonscm.debian.org/cgit/popcon/popcon.git/tree/de...) unless you go out of your way to avoid debconf questions … and even then default will kick in.

For testing, I just installed version 1.61 from stable with the whiptail debconf interface. The question /was/ asked, and the "No" answer was highlighted.

User tracking & Debian

Posted Dec 1, 2016 11:11 UTC (Thu) by lamby (subscriber, #42621) [Link] (1 responses)

This is very much the case. Perhaps the article could be updated?

User tracking & Debian

Posted Dec 1, 2016 17:21 UTC (Thu) by jake (editor, #205) [Link]

> This is very much the case. Perhaps the article could be updated?

We just did that, sorry for the confusion.

thanks,

jake

User tracking & Debian

Posted Dec 1, 2016 16:30 UTC (Thu) by ballombe (subscriber, #9523) [Link]

In Debian it defaults to NO.
On Ubuntu it (used to at least) defaults to YES, hence the confusion.
Furthermore, popcon.ubuntu.com has been broken for years, so it is unclear whether popularity-contest on Ubuntu really matters anymore except as a privacy leak.

About apt: in Debian, apt is configured right in the installer to use whatever Debian mirror you like (even your own), so this is quite different from phoning home. It phones when and where you asked it to.

User tracking & Debian

Posted Dec 1, 2016 17:21 UTC (Thu) by anarcat (subscriber, #66354) [Link] (1 responses)

Ouch: sorry, my bad - I relied on comments from the Django discussions on Github and didn't verify the claim. I have installed Debian probably hundreds of times, but most of those have been automated recently, so I haven't seen that dialog in ages. :)

Apologies to everyone, the article is being corrected as we speak.

User tracking & Debian

Posted Dec 1, 2016 21:46 UTC (Thu) by ballombe (subscriber, #9523) [Link]

Also, the claim that Ubuntu popcon gets 2.8 million submissions daily is completely overblown.
First, popcon only submits once per week, not per day, and second, while Debian popcon removes submissions that have not been updated in the last 20 days, Ubuntu popcon does not do so.
This means 2.8 million is the grand total of Ubuntu installations that reported successfully at least once since the start of popcon.ubuntu.com about ten years ago.

Debian popcon struggles to handle 200 000 submissions per week already. It is unclear how Ubuntu popcon handles it, because the website is not regularly updated (see the "Last generated on Wed Jun 22 15:09:05 2016 UTC."), which suggests that the daily processing either fails or is not done at all.

Django debates user tracking

Posted Dec 1, 2016 9:19 UTC (Thu) by misc (subscriber, #73730) [Link]

That's interesting, because I was discussing that with Ralph Bean (of Fedora fame) on the bus going to Devconf.cz back in February 2016, and we ended up exploring an approach to do the counting based on DNS, a public/private key scheme and some kind of public log. I.e., people use public-key crypto to encode the payload, convert it to something that can be added to a DNS name, and make a request to a central shared DNS domain that publishes its log.

In fact, first, we wanted to do that over a tor hidden service (now called onion services), but this would trigger too much IDS alert, and likely be blocked in some countries.

We need a specific domain, and a dns server. Each thing we want to count would trigger a json or whatever encoded blob to be sent over dns (in the host part), encrypted with a public/private key scheme.
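
To illustrate the "blob in the host part" step, here is a minimal sketch that packs opaque bytes into DNS labels and back. It is not code from the bus discussion: the zone name and function names are made up, and the encryption step is deliberately left out, so the blob is assumed to be already encrypted. The limits of 63 characters per label and 253 per name are the standard DNS ones.

```python
import base64

MAX_LABEL = 63   # maximum length of one DNS label
MAX_NAME = 253   # maximum length of a full domain name in text form


def blob_to_hostname(blob, zone):
    """Encode opaque bytes as DNS labels under the given zone."""
    text = base64.b32encode(blob).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [zone])
    if len(name) > MAX_NAME:
        raise ValueError("payload too large for a single DNS query")
    return name


def hostname_to_blob(name, zone):
    """Reverse blob_to_hostname: recover the bytes from a logged query name."""
    text = name[: -len(zone) - 1].replace(".", "").upper()
    text += "=" * (-len(text) % 8)  # restore base32 padding
    return base64.b32decode(text)


# "metrics.example.org" is a placeholder zone, not a real service.
blob = b'{"event": "runserver"}'
name = blob_to_hostname(blob, "metrics.example.org")
```

Base32 is used here because DNS names are case-insensitive; the size check shows how quickly the name-length limit is reached, since roughly 140 bytes of payload is about all that fits under a zone name of this length.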

The tracking party never sees the IP, because the request is relayed by the DNS resolver at the ISP level (or Google DNS, or other public DNS), which hides it. The log of DNS requests is supposed to be public, hence the encryption. Given the limited value of the data over time, we think that this should be safe enough. And because we envisioned the system being shared (long term) among projects that wish to have user counting, an indication of something making a request to that DNS domain wouldn't reveal anything to anyone, since it could come from Homebrew or Django or whoever uses it. The idea is that someone wishing to count would take all the strings, attempt to decrypt them using their key, see whether that yields something useful (i.e., proper JSON if encoding JSON, etc.), and discard all the rest as "not for me".

And the central DNS server would just purge data on a regular basis. This doesn't prevent someone from making a copy, of course, but does prevent opportunistic attacks in case of key compromise.

But we didn't publish the notes of our discussion. We did discuss it with Remy Decausemaker right after the bus ride, but Ralph changed roles inside the company, and Remy left RH, so this didn't go anywhere.

There are a few shortcomings in the proposal, such as "handling the infra for that", and likely crypto and security "details" such as getting it right and not hitting DNS limits. And the usual downsides of counting downloads, as I also tried to explain in the past in http://community.redhat.com/blog/2015/09/lies-damned-lies...

And all the things that 2 engineers in a post FOSDEM week at the back of a Czech bus for 3h couldn't think about, of course.

But I truly think that something can be done, and so far, this proposal would prevent user identification by IP tracking, using various approaches inspired by Tor. The infra could be operated by the LF or any other org. And there are a bunch of DNS resolvers that do not track people ( https://diyisp.org/dokuwiki/doku.php?id=technical:dnsreso... , https://servers.opennicproject.org/ ), some handled by non-profits in different jurisdictions to avoid the issue of "governments are asking for broad access".

Then, the main remaining issue is the uuid handling as central to have a proper count, and see the difference between CI job and long term usage, etc, etc.

I also suspect there is things to do with homomorphic encryption, but I am not knowledgeable enough to tell what :)

(and the proposal only focus on "getting info in a database", the whole "prepare a proper UI" is left as a exercise to the user)

Django debates user tracking

Posted Dec 1, 2016 18:44 UTC (Thu) by dnaber (guest, #56178) [Link] (1 responses)

> The alternative, deploying its own analytics software, was presented as making sustainability problems worse.

I'm not sure, you can install Piwik, which is OpenSource, but you can also use it as a service. For example: http://www.mysnip-solutions.de/en/hosting/piwik.html or https://piwik.pro

Piwik and other analytics

Posted Dec 12, 2016 18:57 UTC (Mon) by Nemo_bis (guest, #88187) [Link]

Indeed there isn't any reason to automatically discard FLOSS analytics, Piwik is used by many projects with little headaches. I know http://stats.kiwix.org/ for instance.

I don't know what's the default retention policy for IP addresses and other personal identifying information, but cryptolog can probably help.

Django debates user tracking

Posted Dec 7, 2016 8:41 UTC (Wed) by debacle (subscriber, #7114) [Link]

A note about Debian's APT: Updating and upgrading via APT is not considered home-phoning, because APT does not connect to a unique server, but to whatever server you chose as repository. This might be one of the hundreds of public Debian mirrors (https://www.debian.org/mirror/list) or this might be your private APT repository or cache server. APT might even use a local CD/DVD/USB drive or a directory updated via rsync etc. By default APT uses the channel you installed your system from. So there is really no home phoning in APT and in no way any global collection of the installation data can take place via APT.