By Jonathan Corbet
July 4, 2008
Google's purchase of YouTube always seemed questionable to some observers:
it looked as if Google were buying itself a whole new source of copyright
lawsuits. One of the benefits of that purchase came through on
July 2, when a U.S. District Court ordered Google to hand over its
complete set of YouTube traffic logs, containing information about every
video viewed on the service.
See Groklaw for the full text of the order. If this order stands (and it
appears that Google will not appeal it), millions
of users worldwide will have their viewing data handed over to a litigious
entertainment industry company. There are a couple of important implications
to draw from this turn of events, so LWN will venture a little far afield
and take a look.
The data involved includes, for each video viewed, the time, which video
was involved, which YouTube user account was used, and the IP address the
request came from. Viacom claimed that the privacy of YouTube users is not
threatened by this release of data, and the court agreed. But account
names can be correlated across sites, and IP addresses (especially
time-correlated IP addresses) can easily identify exactly who was watching
a particular video. Viacom promises it would never use this data to launch
enforcement actions against individuals; the fact that the company feels
the need to make that promise suggests that Viacom feels it could
use this data to that end.
One other interesting aspect of the ruling which has been commented upon
less is this: Google has also been ordered to hand over every video which
has been removed from the site. Once again, that is a great deal of data.
It also drives home the point that, on a site like YouTube, nothing is
really removed: all of those "removed" videos are still there, waiting for
some company with enough lawyers to go after them.
All of this data is to be handed over regardless of what jurisdiction the
users thought they were in. Nobody's privacy or data retention laws apply
here. This is a worldwide compromise of personal data.
So lesson number one is obvious: attending to one's personal security
requires being very careful about the data tracks that one leaves on other
people's servers. Regardless of any site's privacy policy or any country's
data sharing laws, that data is there for the grabbing. The course of
events which led to the compromise of vast amounts of video-viewing data
can also lead to the disclosure of electronic mail, accounting data, online
chat sessions, purchase histories, software downloads, or which edgy Second
Life neighborhood one likes to hang out in. Indeed, records of video
viewing activity are more strongly protected in the U.S. than many other
types of data; other types of information may well prove easier to get.
What we leave on remote
machines seems to stay there indefinitely, and it's an open book for those with
sufficient legal power on their side.
The second lesson is for anybody running a publicly available server, as
many LWN readers do. The video activity database being grabbed by Viacom
is said to be about 12 terabytes deep - before getting into the
"removed" videos. It should not be surprising that a data stash of that
size would attract this kind of action. If you gather together that much
information on the behavior of many millions of people, somebody,
somewhere, is going to try to get their hands on it. How could it possibly
be any other way?
Not enough people are asking this question: why does Google/YouTube hold
that much data about its users? Why does it retain the ability to replay
their actions years after the fact? And why do "removed" videos not go
away? If that data did not exist in the first place, there would be no
question of disclosing it to an attacking corporation. A company which
keeps that amount of data around is prioritizing whatever commercial value
it sees in that data over the privacy and security of its users. And, by
inviting raids from corporations (which we hear about) and governments
(which we might not hear about), such companies are not helping their own
security either.
So there are strong arguments for simply not retaining all that data in the
first place. Naturally, some governments are doing their best to force
that kind of retention, but that's a different battle. In the absence of
legal constraints, a standard policy mandating short data retention periods
makes a lot of sense. It behooves all of
us to think about what kind of data we leave lying around - either through
our activities or by facilitating the activities of others - and to keep it
to a minimum. The most secure data is data which does not exist.
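
For readers who run their own servers, a short retention period can be enforced with very little code. What follows is only a minimal sketch, not a description of how any real site handles its logs: the 30-day window, the record layout, and the function name prune_old_records are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the article does not suggest a specific figure.
RETENTION = timedelta(days=30)


def prune_old_records(records, now, retention=RETENTION):
    """Return only the records whose timestamp falls inside the retention window.

    Each record is assumed to be a dict with a timezone-aware "time" field.
    Anything older than (now - retention) is dropped rather than archived.
    """
    cutoff = now - retention
    return [r for r in records if r["time"] >= cutoff]


# Illustrative data only: two made-up view records with documentation-range IPs.
now = datetime(2008, 7, 4, tzinfo=timezone.utc)
log = [
    {"time": now - timedelta(days=400), "video": "a", "user": "u1", "ip": "192.0.2.1"},
    {"time": now - timedelta(days=2), "video": "b", "user": "u2", "ip": "192.0.2.2"},
]
kept = prune_old_records(log, now)
```

Run periodically against whatever store actually holds the logs, something along these lines would mean that a demand for years of viewing history could only ever be answered with the last few weeks.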