Django debates user tracking
In recent years, privacy issues have become a growing concern among free-software projects and users. As more and more software tasks become web-based, surveillance and tracking of users is also on the rise. While some software may use advertising as a source of revenue, which has the side effect of monitoring users, the Django community recently got into an interesting debate surrounding a proposal to add user tracking—actually developer tracking—to the popular Python web framework.
Tracking for funding
A novel aspect of this debate is that the initiative comes from concerns of the Django Software Foundation (DSF) about funding. The proposal suggests that "relying on the free labor of volunteers is ineffective, unfair, and risky" and states that "the future of Django depends on our ability to fund its development". In fact, the DSF recently hired an engineer to help oversee Django's development, which has been quite successful in helping the project make timely releases with fewer bugs. Various fundraising efforts have resulted in major new Django features, but it is difficult to attract sponsors without some hard data on the usage of Django.
The proposed feature tries to count the number of "unique developers" and gather some metrics of their environments by using Google Analytics (GA) in Django. The actual proposal (DEP 8) is done as a pull request, which is part of the Django Enhancement Proposal (DEP) process that is similar in spirit to the Python Enhancement Proposal (PEP) process. DEP 8 was brought forward by longtime Django developer Jacob Kaplan-Moss.
The rationale is that "if we had clear data on the extent of Django's usage, it would be much easier to approach organizations for funding". The proposal is essentially about adding code in Django to send a certain set of metrics when "developer" commands are run. The system would be "opt-out", enabled by default unless turned off, although the developer would be warned the first time the phone-home system is used. The proposal notes that an opt-in system "severely undercounts" and is therefore not considered "substantially better than a community survey" that the DSF is already doing.
Information gathered
The reporting is specifically designed to run only in a developer's environment and not in production. The metrics identified are, at the time of writing:
- an event category (the developer commands: startproject, startapp, runserver)
- the HTTP User-Agent string identifying the Django, Python, and OS versions
- a user-specific unique identifier (a UUID generated on first run)
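To make the list above concrete, here is a minimal sketch of what one such report might contain. It only builds the request body and headers and sends nothing. The parameter names (v, tid, cid, t, ec, ea) follow GA's Measurement Protocol, but the function name, the placeholder property ID, the "run" action value, and the User-Agent layout are assumptions for illustration, not taken from DEP 8.

```python
import platform
import uuid
from urllib.parse import urlencode


def build_event(command, client_id, django_version, tracking_id="UA-XXXXX-Y"):
    """Return (body, headers) for one hypothetical analytics event.

    Nothing is sent; this only shows which pieces of data would travel.
    """
    params = {
        "v": "1",            # Measurement Protocol version
        "tid": tracking_id,  # placeholder property ID (assumption)
        "cid": client_id,    # per-user UUID generated on first run
        "t": "event",        # hit type
        "ec": command,       # event category: the developer command
        "ea": "run",         # event action; the value is an assumption
    }
    # Assumed layout: Django, Python, and OS identification in one string.
    user_agent = "Django/{} Python/{} {}".format(
        django_version, platform.python_version(), platform.system()
    )
    return urlencode(params), {"User-Agent": user_agent}


if __name__ == "__main__":
    body, headers = build_event("runserver", str(uuid.uuid4()), "1.10")
    print(body)
    print(headers)
```

Note that the UUID is the only field that distinguishes one developer from another; the rest describes the environment.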
The proposal mentions the use of the GA aip flag which, according to GA documentation, makes "the IP address of the sender 'anonymized'". It is not quite clear how that is done at Google and, given that it is a proprietary platform, there is no way to verify that claim. The proposal says it means that "we can't see, and Google Analytics doesn't store, your actual IP". But that is not actually what Google does: GA stores IP addresses; the documentation just says they are anonymized, without explaining how.
GA is presented as a trade-off, since "Google's track record indicates that they don't value privacy" as highly as the DSF does. The alternative, deploying its own analytics software, was presented as making sustainability problems worse. According to the proposal, Google "can't track Django users. [...] The only thing Google could do would be to lie about anonymizing IP addresses, and attempt to match users based on their IPs".
The truth is that we don't actually know what Google means when it "anonymizes" data: Jannis Leidel, a Django team member, commented that "Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services" that limit even Google's capacity to ensure its users' anonymity. Leidel also argued that the legal framework of the US may not apply elsewhere in the world: "for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option".
Furthermore, the proposal claims that "if we discovered Google was lying about this, we'd obviously stop using them immediately", but it is unclear exactly how this could be implemented if the software was already deployed. There are also concerns that an implementation could block normal operation, especially in countries (like China) where Google itself may be blocked. Finally, some expressed concerns that the information could constitute a security problem, since it would unduly expose the version number of Django that is running.
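One way an implementation might avoid blocking normal operation is to make reporting strictly best-effort. The following is a minimal sketch under that assumption: the function name and the injected `send` callable are hypothetical, and nothing here comes from DEP 8 itself.

```python
import threading


def send_in_background(send, payload, timeout=1.0):
    """Call send(payload, timeout) in a daemon thread.

    `send` is any callable that performs the actual network request
    (hypothetical here). The caller never waits for it, and failures
    are swallowed so an unreachable endpoint cannot break a command.
    """
    def worker():
        try:
            send(payload, timeout)
        except Exception:
            # Best-effort reporting: a blocked or slow endpoint is ignored.
            pass

    # daemon=True means the thread will not keep the process alive on exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The short timeout and the daemon thread are the design points here: a command such as runserver would start immediately whether or not the analytics host can be reached.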
In other projects
Django is certainly not the first project to consider implementing analytics to get more information about its users. The proposal is largely inspired by a similar system implemented by the OS X Homebrew package manager, which has its own opt-out analytics.
Other projects embed GA code directly in their web pages. This is apparently the option chosen by Oscar, the Django-based ecommerce solution, but the DSF saw that as less useful, since it would count Django administrators rather than developers. Wagtail, a Django-based content-management system, was also incorrectly identified as using GA directly. It actually uses referrer information to identify installed domains through its version-update checks, with an opt-out. Wagtail didn't use GA because the project wanted only minimal data and was worried about users' reactions.
NPM, the JavaScript package manager, also considered similar tracking extensions. Laurie Voss, the co-founder of NPM, said it decided to completely avoid phoning home, because "users would absolutely hate it". But NPM users are constantly downloading packages to rebuild applications from scratch, so it has more complete usage metrics, which are aggregated and available via a public API. NPM users seem to find this is a "reasonable utility/privacy trade". Some NPM packages do phone home and have seen "very mixed" feedback from users, Voss said.
Eric Holscher, co-founder of Read the Docs, said the project is considering using Sentry for centralized reporting, which is a different idea, but an interesting one considering that Sentry is fully open source. So even though it is a commercial service, it may be possible to verify any anonymity claims, unlike with the closed-source Google Analytics.
Debian's response
Since Django is shipped with Debian, one concern was the reaction of the distribution to the change. Indeed, "major distros' positions would be very important for public reception" to the feature, another developer stated.
One of the current maintainers of Django in Debian, Raphaël Hertzog, explicitly stated from the start that such a system would "likely be disabled by default in Debian". There were two short discussions on Debian mailing lists where the overall consensus seemed to be that any opt-out tracking code was undesirable in Debian, especially if it was aimed at Google servers.
I have done some research to see what, exactly, was acceptable as a phone-home system in the Debian community. My research has revealed ten distinct bug reports against packages that would unexpectedly connect to the network, most of which were not directly about collecting statistics but more often about checking for new versions. In most cases I found, the feature was disabled. In the case of version checks, it seems right for Debian to disable the feature, because the package cannot upgrade itself: that task is delegated to the package manager. One of those issues was the infamous "OK Google" voice activation binary blob controversy that was previously reported here and has since then been fixed (although other issues remain in Chromium).
I have also found out that there is no clearly defined policy in Debian regarding tracking software. What I have found, however, is that there seems to be a strong consensus in Debian that any tracking is unacceptable. This is, for example, an extract of a policy that was drafted (but never formally adopted) by Ian Jackson, a longtime Debian developer:
In other words, opt-in only, period. Jackson explained that "when we originally wrote the core of the policy documents, the DFSG [Debian Free Software Guidelines], the SC [Social Contract], and so on, no-one would have considered this behaviour acceptable", which explains why no explicit formal policy has been adopted yet in the Debian project.
One of the concerns with opt-out systems (or even prompts that default to opt-in) was well explained back then by Debian developer Bas Wijnen:
One could argue that Debian has its own tracking systems. For example, by default, Debian will "phone home" through the APT update system (though it only reports the packages requested). However, this is currently not automated by default, although there are plans to do so soon. Furthermore, Debian members do not consider APT as tracking, because it needs to connect to the network to accomplish its primary function. Since there are multiple distributed mirrors (which the user gets to choose when installing), the risk of surveillance and tracking is also greatly reduced.
A better parallel could be drawn with Debian's popcon system, which actually tracks Debian installations, including package lists. But as Barry Warsaw pointed out in that discussion, "popcon is 'opt-in' and [...] the overwhelming majority in Debian is in favour of it in contrast to 'opt-out'". It should be noted that popcon, while opt-in, defaults to "yes" if users click through the install process. [Update: As pointed out in the comments, popcon actually defaults to "no" in Debian.] There are around 200,000 submissions at this time, which are tracked with machine-specific unique identifiers that are submitted daily. Ubuntu, which also uses the popcon software, gets around 2.8 million daily submissions, while Canonical estimates there are 40 million desktop users of Ubuntu. This would mean there are about an order of magnitude more installations than are reported by popcon.
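The "order of magnitude" estimate can be checked with a quick calculation. This sketch simply takes the two Ubuntu figures quoted above at face value; it says nothing about how reliable either number is.

```python
import math

popcon_submissions = 2_800_000   # Ubuntu popcon figure quoted above
estimated_desktops = 40_000_000  # Canonical's estimate quoted above

# How many estimated installations exist per reported one.
ratio = estimated_desktops / popcon_submissions
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.1f}x, about {orders_of_magnitude:.2f} orders of magnitude")
```

The ratio comes out a little above 14, so popcon would be seeing roughly one installation in fourteen.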
Policy aside, Warsaw explained that "Debian has a reputation for taking privacy issues very serious and likes to keep it".
Next steps
There are obviously disagreements within the Django project about how to
handle this problem.
It looks like the phone-home system may end up being implemented as a proxy system "which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves", another Django developer, Aymeric Augustin, said. Augustin also stated that the feature wouldn't "land before Django drops support for Python 2", which is currently estimated to be around 2020. It is unclear, then, how the proposal would resolve the funding issues, considering how long it would take to deploy the change and then collect the information so that it can be used to spur the funding efforts.
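As an illustration of what such a proxy could do before forwarding a report, here is a minimal sketch of the two options Augustin mentions: dropping the client address entirely, or truncating it. The function names, the prefix lengths (/24 for IPv4, /48 for IPv6), and the list of headers are assumptions, not part of any Django proposal.

```python
import ipaddress

# Headers commonly used to pass the client address upstream (assumed list).
ADDRESS_HEADERS = {"x-forwarded-for", "x-real-ip", "forwarded"}


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host bits of an address, keeping only a network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets ip_network() accept an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def strip_ip(headers):
    """Return a copy of the headers without any client-address header."""
    return {k: v for k, v in headers.items() if k.lower() not in ADDRESS_HEADERS}


if __name__ == "__main__":
    print(anonymize_ip("203.0.113.42"))
    print(strip_ip({"X-Forwarded-For": "203.0.113.42", "User-Agent": "Django/1.10"}))
```

Either way, the upstream analytics service would only ever see the proxy's own address, which is the point of running the proxy under the project's control.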
It also seems the system may explicitly prompt the user, with an opt-out default, instead of just splashing a warning or privacy agreement without a prompt. As Shai Berger, another Django contributor, stated, "you do not get [those] kind of numbers in community surveys". Berger also made the argument that "we trust the community to give back without being forced to do so"; furthermore:
Other options may also include gathering metrics in pip or PyPI, which was proposed by Donald Stufft. Leidel also proposed that the system could ask users to opt in only after the commands have been called a few times.
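Leidel's delayed-prompt idea could be sketched roughly as follows. The state file, its JSON layout, and the threshold of three runs are all hypothetical; the discussion did not settle on any of these details.

```python
import json
from pathlib import Path

PROMPT_AFTER = 3  # hypothetical threshold, not from the discussion


def record_run(state_file, prompt_after=PROMPT_AFTER):
    """Count one command run; return True when the opt-in prompt is due.

    The prompt is due exactly once, on the first run at or past the
    threshold. State is kept in a small JSON file (assumed layout).
    """
    path = Path(state_file)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"runs": 0, "asked": False}
    state["runs"] += 1
    due = not state["asked"] and state["runs"] >= prompt_after
    if due:
        state["asked"] = True
    path.write_text(json.dumps(state))
    return due
```

Nothing would be sent unless the user answers "yes" to that one prompt, which keeps the scheme opt-in while reaching developers who actually use the commands.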
It is encouraging to see that a community can discuss such issues without heating up too much; it shows great maturity on the part of the Django project. Every free-software project may be confronted with funding and sustainability issues. Django seems to be trying to address this in a transparent way. The project is willing to engage with the whole spectrum of the community, from the top leaders to downstream distributors, including individual developers. This practice should serve as a model, if not of how to do funding or tracking, at least of how to discuss those issues productively.
Everyone seems to agree the point is not to surveil users, but to improve the software. As Lars Wirzenius, a Debian developer, commented: "it's a very sad situation if free software projects have to compromise on privacy to get funded". Hopefully, Django will be able to improve its funding without compromising its principles.
Index entries for this article | |
---|---|
Security | Privacy |
GuestArticles | Beaupré, Antoine |
Posted Dec 1, 2016 4:09 UTC (Thu) by distinguishedcorgi (guest, #100058)
I don't understand why GNU/Linux distributions ship Python libraries at all. Rather than being installed globally to the system, Python libraries should be packaged with the application in a virtualenv and managed with pip, Python's package manager. I'm not sure who is benefiting from Debian packaging Django -- as far as I can tell it does not make things easier for developers; they likely work on multiple projects with dependencies on different versions of these libraries.
Posted Dec 1, 2016 19:28 UTC (Thu) by Felix (guest, #36445)
Well, it might be trivial to install some python-only library if you are using a virtualenv anyway. However I'm really glad that Fedora ships things like numpy, opencv, PyQT/PySide readily available. Also simplejson gets regular (security) updates which are too easy to miss if you just set up a virtualenv and leave it there.
Basically package managers like dnf/apt do a *way* better job than pip (with regards to keeping the system up-to-date). Oh, and they also ensure that you can actually trust the packages you installed. This is part of the reason why I really would like to see some combination of rpm/dnf (or deb/apt if they get there first) and virtualenv.
Posted Dec 2, 2016 21:59 UTC (Fri) by mstone_ (subscriber, #66309)
and it really blows for installing stuff on machines with limited network connectivity.
Posted Dec 3, 2016 12:15 UTC (Sat) by valhalla (guest, #56634)
Also, it makes developers more confident that their machine is not running code from untrusted sources which could take control of it.
Or, if you want to read me ranting more in detail: https://www.trueelena.org/computers/articles/candy_from_s...
Posted Dec 7, 2016 8:30 UTC (Wed) by debacle (subscriber, #7114)
Posted Dec 1, 2016 8:08 UTC (Thu) by robbe (guest, #16131)
How do you figure?
For testing, I just installed version 1.61 from stable with the whiptail debconf interface. The question /was/ asked, and the "No" answer was highlighted.
Posted Dec 1, 2016 11:11 UTC (Thu) by lamby (subscriber, #42621)
Posted Dec 1, 2016 17:21 UTC (Thu) by jake (editor, #205)
We just did that, sorry for the confusion.
thanks,
jake
Posted Dec 1, 2016 16:30 UTC (Thu) by ballombe (subscriber, #9523)
About apt: in Debian, apt is configured right in the installer to use whatever Debian mirror you like (even your own), so this is quite different from phoning home. It phones when and where you asked it to.
Posted Dec 1, 2016 17:21 UTC (Thu) by anarcat (subscriber, #66354)
Apologies to everyone, the article is being corrected as we speak.
Posted Dec 1, 2016 21:46 UTC (Thu) by ballombe (subscriber, #9523)
Debian popcon already struggles to handle 200 000 submissions per week. It is unclear how Ubuntu popcon handles it because the website is not regularly updated (see the "Last generated on Wed Jun 22 15:09:05 2016 UTC."), which suggests that the daily processing either fails or is not done at all.
Posted Dec 1, 2016 9:19 UTC (Thu) by misc (subscriber, #73730)
In fact, at first we wanted to do that over a Tor hidden service (now called onion services), but this would trigger too many IDS alerts, and would likely be blocked in some countries.
We need a specific domain and a DNS server. Each thing we want to count would trigger a JSON (or whatever) encoded blob to be sent over DNS (in the host part), encrypted with a public/private key scheme.
The tracking party never sees the IP, because the request is relayed by the DNS resolver at the ISP level (or Google DNS, or other public DNS), which hides it. The log of DNS requests is supposed to be able to be public, hence the encryption. Given the limited value of the data over time, we think this should be safe enough. And because we envisioned the system being shared (long term) among projects that wish to count users, evidence of something making a request to that DNS domain wouldn't reveal anything to anyone, since it could come from Homebrew or Django or whoever uses it. The idea is that someone wishing to count would take all the strings, attempt to decrypt them using their key, see whether that yields something useful (i.e. proper JSON if encoding JSON, etc.), and discard all the rest as "not for me".
And the central DNS server would just purge data on a regular basis. This doesn't prevent someone from making a copy, of course, but it does prevent opportunistic attacks in case of key compromise.
But we didn't publish the notes of our discussion. We did discuss that with Remy Decausemaker right after the bus, but Ralph changed roles inside the company, and Remy left RH, so this didn't go anywhere.
There are a few shortcomings in the proposal, such as "handling the infra for that", and crypto and security "details" such as getting it right and not hitting DNS limits. And there is the usual downside of counting downloads, as I also tried to explain in the past in http://community.redhat.com/blog/2015/09/lies-damned-lies...
And all the things that 2 engineers in a post FOSDEM week at the back of a Czech bus for 3h couldn't think about, of course.
But I truly think that something can be done, and so far, this proposal would prevent user identification by IP tracking, in various ways inspired by Tor. The infra could be operated by the LF or any other org. And there are a bunch of DNS servers that do not track people ( https://diyisp.org/dokuwiki/doku.php?id=technical:dnsreso... , https://servers.opennicproject.org/ ), some handled by non-profits in different jurisdictions to avoid the issue of "governments are asking for broad access".
Then, the main remaining issue is the UUID handling, which is central to having a proper count and to seeing the difference between CI jobs and long-term usage, etc.
I also suspect there are things to do with homomorphic encryption, but I am not knowledgeable enough to tell what :)
(and the proposal only focuses on "getting info in a database"; the whole "prepare a proper UI" is left as an exercise to the user)
Posted Dec 1, 2016 18:44 UTC (Thu) by dnaber (guest, #56178)
I'm not sure; you can install Piwik, which is open source, but you can also use it as a service. For example: http://www.mysnip-solutions.de/en/hosting/piwik.html or https://piwik.pro
Posted Dec 12, 2016 18:57 UTC (Mon) by Nemo_bis (guest, #88187)
I don't know what's the default retention policy for IP addresses and other personal identifying information, but cryptolog can probably help.
Posted Dec 7, 2016 8:41 UTC (Wed) by debacle (subscriber, #7114)
I disagree completely. I never would want to have any libraries or tools installed by pip, gem, elpa, npm, and what not. My main reasons are:
User tracking & Debian
> install process.
The package is installed by default, but – as far as I can see – does not do anything.
https://anonscm.debian.org/cgit/popcon/popcon.git/tree/de...
has "Default: false". The question is asked with "high" priority, so you are bound to see it (c.f. https://anonscm.debian.org/cgit/popcon/popcon.git/tree/de...) unless you go out of your way to avoid debconf questions … and even then default will kick in.
On Ubuntu it defaults (or at least used to default) to YES, hence the confusion.
Furthermore, popcon.ubuntu.com has been broken for years, so it is unclear whether popularity-contest on Ubuntu really matters anymore except as a privacy leak.
First, popcon only submits once per week, not per day, and second, while Debian popcon removes submissions that have not been updated in the last 20 days, Ubuntu popcon does not do so.
This means 2.8 million is the grand total of Ubuntu installations that have successfully reported at least once since the start of popcon.ubuntu.com about ten years ago.
Piwik and other analytics