
The perils of big data

By Jake Edge
August 29, 2012

Data about us—our habits, associates, purchases, and so on—is collected all the time. That's been true at smaller scales for hundreds or even thousands of years, but today's technology makes it much easier to gather, store, and analyze that data. While some of the results of that analysis may make (some) people's lives better—think tailored search results or Amazon's recommendations—there is a strong temptation to secretly, or at least quietly, use the collected data in other, less benign, ways.

Because the data collection and analysis is typically done without any fanfare, it often flies under the radar. So it makes sense to stop and think about what it all means from a privacy perspective. A recent essay by Alistair Croll does exactly that. He notes that we have reached a time where the constraint of "big, fast, and varied—pick any two" for databases is no longer valid. Because of that, it is common for data to be collected without any particular plan for how it will be used, under the assumption that some use will eventually be found. It doesn't cost that much to do, which leads to the rise of "big data".

There are some eye-opening things that can be done using big data. It is not difficult to determine someone's race, gender, and sexual orientation using just the words in their Twitter or Facebook feeds, for example. Much of that information is completely public, and could be mined fairly easily by banks, insurance companies, prospective employers, and so on. Those attributes that can be derived could then be used to set rates, deny coverage, choose to interview or not, and more.
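
To make the mechanism concrete, here is a minimal sketch, in Python, of the kind of bag-of-words attribute inference described above. The posts, labels, and attribute are all invented for illustration; real studies train on thousands of labeled accounts, but the machinery involved is no more exotic than this.

    # Sketch: guess a demographic attribute from nothing but the words in a
    # user's public posts. All training data below is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Public posts from users whose attribute is already known (for example,
    # self-reported somewhere on their profile).
    posts = [
        "picked the kids up from soccer practice again",
        "school lunches and homework all week",
        "late night at the office shipping the release",
        "conference talk accepted, booking flights",
    ]
    labels = ["parent", "parent", "no-kids", "no-kids"]

    vec = CountVectorizer()
    X = vec.fit_transform(posts)            # word-count features
    clf = MultinomialNB().fit(X, labels)    # simple probabilistic classifier

    # Score an account whose attribute is *not* public.
    unknown = ["soccer practice then homework help tonight"]
    print(clf.predict(vec.transform(unknown)))   # prints ['parent']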

It is easy to forget that the data collection is even happening. "Loyalty" cards that provide a discount at grocery and other stores gather an enormous amount of information about our habits, for example. Deriving race, gender, family size, and other characteristics from that data should not be very difficult. If that information is used to give discounts on other products one might be likely to buy, it may seem relatively harmless. But if it is being sold to others to help determine voting patterns, foreclosure likelihood, or credit-worthiness, things are definitely amiss. But, as Croll points out, that is exactly what is happening with that data at times.

Croll notes several different examples in his essay, but examples are not hard to come by. Almost every day, it seems, there are new abuses, or worries about abuses of big data. People in Texas are concerned about the kinds of data that would be collected by "smart" electricity meters—to the point of running off the smart meter installers. Mitt Romney's campaign for the US Presidency is using a secretive organization to analyze data to find potential donors—President Obama's campaign is certainly doing much the same.

Another example is the "anonymized" data sets that have been released for various purposes over the past few years. Those releases show that it is quite difficult to truly anonymize data. When trying to derive a signal from the data (movie recommendations for Netflix, for example), surprising correlations can be made. This shows the power of big data even when someone is trying not to reveal our secrets in a data set. A new technique may help by providing a way to release data without compromising privacy.
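
The technique is not spelled out here, but the best-known approach of this kind is differential privacy: publish aggregate answers with noise calibrated so that adding or removing any one person barely changes the result. Below is a minimal sketch of its Laplace mechanism, assuming that is the sort of technique meant and using invented records.

    import random

    def laplace_noise(scale):
        # The difference of two independent exponentials is Laplace-distributed.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def private_count(records, predicate, epsilon=0.1):
        # A counting query changes by at most 1 when one person is added or
        # removed, so Laplace noise of scale 1/epsilon hides any individual.
        true_count = sum(1 for r in records if predicate(r))
        return true_count + laplace_noise(1.0 / epsilon)

    # Hypothetical loyalty-card records: the published figure stays useful in
    # aggregate while revealing (almost) nothing about any single shopper.
    purchases = [{"item": "diapers"}, {"item": "beer"}, {"item": "diapers"}]
    print(private_count(purchases, lambda r: r["item"] == "diapers"))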

The real problems may come when these disparate data sets are combined. Truly personally identifiable information correlated from multiple sources is likely to give a distressingly accurate picture of an individual. It could be used by companies and other organizations for a wide range of purposes. Those could be relatively harmless, even helpful, or downright malicious depending on one's perspective and privacy consciousness. One organization that is likely quite interested in this kind of data is the same that some would like to turn to for protection from abuses of big data: government.
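
To sketch why combination is the dangerous step: neither data set below ties a name to a diagnosis, but joining them on mundane quasi-identifiers (ZIP code, birth date, sex) does. All records here are invented for illustration.

    # Sketch: re-identification by linking two "separate" data sets on
    # quasi-identifiers.

    # A public record, e.g. a voter roll: names plus quasi-identifiers.
    voters = [
        {"name": "Pat Smith", "zip": "02139", "dob": "1965-07-31", "sex": "F"},
        {"name": "Lee Jones", "zip": "02139", "dob": "1971-02-14", "sex": "M"},
    ]

    # An "anonymized" release: names stripped, quasi-identifiers kept.
    hospital = [
        {"zip": "02139", "dob": "1965-07-31", "sex": "F", "diagnosis": "diabetes"},
        {"zip": "02139", "dob": "1971-02-14", "sex": "M", "diagnosis": "fracture"},
    ]

    def key(rec):
        return (rec["zip"], rec["dob"], rec["sex"])

    names_by_key = {key(v): v["name"] for v in voters}
    for rec in hospital:
        name = names_by_key.get(key(rec))
        if name:
            print(name, "->", rec["diagnosis"])   # re-identified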

There are clearly good uses that such data can be put to. Croll identifies things like detecting and tracking disease outbreaks, improving learning, reducing commute times, etc. But the "Big Brother" overtones are worrisome as well. It's not at all clear how regulations would impact the collection and analysis of big data, and governments' interest in using it (for good or "bad" purposes) makes for an interesting conundrum. Until and unless a solid chunk of people are concerned about the problem—and express that concern to their governments and to other organizations in some visible way—things will continue much as they are. In that, the problem is little different than many other privacy issues; those who truly care are going to have to jealously guard their privacy themselves, as best they can.



The perils of big data

Posted Aug 30, 2012 12:11 UTC (Thu) by dps (subscriber, #5725) [Link]

The EU does have laws about this. Most databases, with a very few exceptions, require disclosure of what data is collected and why. Any sale of data about what you buy to political parties would almost certainly be illegal.

Sending data to countries without similar laws, including America, is illegal.

Any agency short of the government is likely not to have access to some significant information. My employer does know I have an annual hospital appointment, which is both free and absolutely predictable, but the details of why and what happens there are none of their business.

The perils of big data

Posted Sep 3, 2012 23:18 UTC (Mon) by smitty_one_each (subscriber, #28989) [Link]

This is really a business opportunity in disguise.
Waving my hands at implementation details, if you offer an "online agent" service that proxies all of the data-trail-producing aspects of operating in the modern economy, you might find plenty of customers.
But many don't seem too worried, short of identity theft. The alternative would seem to be joining an Amish community. But even there, you still have to interface with the government.

The perils of big data

Posted Aug 30, 2012 22:14 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

> He notes that we have reached a time where the constraint of "big, fast, and varied—pick any two" for databases is no longer valid.

it's now "big, fast, varied, and cheap": pick any three (with cheap being a relative term)

the concerns about Big Data and the uses it can be put to are all valid, but one of the worst things that could happen is to start carving out exceptions, either in terms of who is allowed to do data analysis (the government can, but nobody else), or in terms of who can be tracked (you can track everyone except celebrities or government officials)

And one thing to remember: much of this is passive work. Unless someone tells you that they are doing it, or takes action as a result of it, you may never know that it happened.

Much of the information that goes into these worrisome databases is data that is public to start with, or that we choose to deliberately make public (like twitter posts). No matter what laws are put in place, if the value of gathering this data is large enough, someone will be doing so. It may be rival governments looking for people that their spies can compromise to gather useful information, but someone will do it and all the laws you get passed cannot prevent it from happening.

All that the public can do is punish the worst abusers by not doing business with them, and make it clear that some things are not acceptable to do, even if they are legal.

This is something the public doesn't tend to be very good at, unfortunately. (or at least, they aren't good at doing it in a timeframe that deters most corporate boards)

The perils of big data

Posted Sep 4, 2012 1:35 UTC (Tue) by Baylink (subscriber, #755) [Link]

I'm a little surprised, and unpleasantly so, that this article pulled 3 comments, while "Linux.conf.au ending?" got 150....

The problem here is one I've dubbed "capability creep", by analogy to mission creep, and it's the reason why I've declined, rather strenuously, to give my SSN to anyone who isn't authorized by law to require it (that's employers, banks, and the IRS, BTW; anyone else is lying).

In that particular case, the real flaw is "using an identifier as an authenticator", but the ability to correlate otherwise unrelated databases by using SSN as a primary key is a key (sorry) contributor to capability creep.
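
(For the curious, here is a minimal sketch of that distinction, with hypothetical names and values: an identifier only locates a record and is known to every database keyed on it, while an authenticator has to be a secret that only its owner can present.)

    import hashlib
    import hmac
    import os

    accounts = {}  # ssn -> record; the SSN is only a lookup key here

    def enroll(ssn, name, password):
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        accounts[ssn] = {"name": name, "salt": salt, "hash": digest}

    def login_bad(ssn, last4):
        # Identifier as authenticator: anyone who has ever seen the SSN gets in.
        return ssn in accounts and ssn.endswith(last4)

    def login_good(ssn, password):
        # The identifier only locates the record; a secret authenticates it.
        rec = accounts.get(ssn)
        if rec is None:
            return False
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                     rec["salt"], 100_000)
        return hmac.compare_digest(digest, rec["hash"])

    enroll("123-45-6789", "Pat Smith", "correct horse battery staple")
    print(login_bad("123-45-6789", "6789"))             # True: the SSN is no secret
    print(login_good("123-45-6789", "wrong password"))  # False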

It's not necessary that you do that though; my favorite example of capability creep is automated toll pass records being subpoenaed to prove a divorce respondent was not where he claimed.

Things like this are why people like me are paranoid.

Paranoid enough that I purposely do *not* use the same handle on Twitter and Facebook, and IM, and here, and LJ, and...

The perils of big data

Posted Sep 6, 2012 8:26 UTC (Thu) by ortalo (subscriber, #4654) [Link]

In my humble opinion, even paranoia will not help you much nowadays, faced with the massive data-mining opportunities that modern connected computing devices offer, combined with the appetite of (every category of) regular users for the convenience that automatic data exchange can provide (arguably: sometimes).

Personally, both for my own peace of mind and for well-founded reasons, I am tempted to evolve away from paranoia [1]. I wonder if it would not be wiser to fight these personal security issues with trust instead. I mean: propose "new" ways of building a useful and very strong network of trust relationships upon which you could later rely to explicitly authorize data exploitation.
The problem I see currently is that it is a pretty big thing to do (something like a world-wide distributed authentication service, initially) and it is pretty difficult to bootstrap. We need to find practical and useful things to focus on first.
(I wonder if money, or maybe more precisely money accounting, may be an opportunity here.)

There are very encouraging examples of already existing trust infrastructures like this (Debian's maintainers, kernel devs, etc.) and I don't see why such things should be restricted to advanced users. (Well, in fact, I see pretty well why they are for the moment, but I think such difficulties could be solved.)

[1] ;-)

The perils of big data

Posted Sep 6, 2012 8:30 UTC (Thu) by massimiliano (subscriber, #3048) [Link]

It's not necessary that you do that though; my favorite example of capability creep is automated toll pass records being subpoenaed to prove a divorce respondent was not where he claimed.

Hmmm... I generally agree with what you said in principle, but every time I see something like the text I quoted above, I think that privacy protection risks going overboard.

There are cases where what you did has legal implications; you should be legally required to tell the truth about it, and "witnesses" should be legally required to testify about what you did, again telling the truth.

In those cases I see nothing wrong with looking at the trail you left behind paying with an automated toll pass, or switching on your cell phone, or touching a glass without gloves and leaving your fingerprints on it. And I would even say that it is a good thing that those trails are there, as long as they are only used when they are legally relevant.

Now, I understand that ensuring this "correctness of use" is not easy. But I also don't like it ending up like where I live, where the government is reducing the ability of the police to intercept phone calls so that corrupt politicians can safely do their dirty business, and who cares if some other thief or murderer gets away with it in the process...

Back to the spouse case: I don't know if the laws regulating the contract of marriage require the spouses to be sincere with each other but I think they require them to be faithful, so lying in that context sounds illegal to me. And complaining that there's too much data about you that can be looked at to prove that you did something illegal looks strange to me: it's like saying "I want the right to do illegal things and get away with it".

So, again: I agree with what you were saying "in principle", but the example you chose looks plain wrong as an argument supporting privacy: if you want to retain the right to lie to your spouse, please don't get married.

The perils of big data

Posted Sep 6, 2012 9:22 UTC (Thu) by ortalo (subscriber, #4654) [Link]

The devil is in the details (and in that specific case, in romantic stories too ;-).

Reality is rarely black or white, mostly nuances of gray.
Most of the time this is used to demonstrate evidence of cheating (prior to the separation) *in front of the judge* just to lower the other party's credibility and gain more money. This is a misleading example IMHO because it seems to be related to love affairs, while it is really a very business-like money issue.

BTW, legally speaking, I think cheating on one's partner is not only not reprehensible, it's a right.
Morally speaking, that's of course very questionable.
(And technically speaking, I imagine that's an organizational nightmare...)

The perils of big data

Posted Sep 6, 2012 9:23 UTC (Thu) by ortalo (subscriber, #4654) [Link]

Important personal footnote: Mary, note I said "I imagine"!

The perils of big data

Posted Sep 6, 2012 9:53 UTC (Thu) by massimiliano (subscriber, #3048) [Link]

Reality is rarely black or white, mostly nuances of gray.

This is true, but...

Most of the time this is used to demonstrate evidence of cheating (prior to the separation) *in front of the judge* just to lower the other party's credibility and gain more money.

...there are cases where the true color of things should be assessed :-)

Specifically: when you are in front of a judge you should either tell the truth, or have your lawyer claim that the issue is irrelevant and say nothing.

In this specific case, if it is relevant that you were in place X, you should be truthful about it.

Or if it is irrelevant that you were in place X, and the other party is just trying to show that you are a liar, well: is it relevant that you are a liar? Would your soon-to-be-ex-spouse be legally entitled to more money because you, as a liar, made his or her life more painful?

Once again, if it is irrelevant, just have your lawyer say so. But if it is relevant, in my book you are required to openly state "yes, I have been lying to my spouse" and not childishly complain that there are facts that can prove it :-)

And BTW, your personal comment is priceless, it definitely made me smile (in a positive way!) :-D As a side note, you might not even have to imagine it, but "know it indirectly", because you see other people cheating, maybe your coworkers, and you see the mess it involves. So your Mary can still trust you even if you don't have to "imagine" :-)

Ok, enough LWN-offtopic for today!

The perils of big data

Posted Sep 6, 2012 13:10 UTC (Thu) by ortalo (subscriber, #4654) [Link]

That's not so off-topic.

First: Today, even LWN's comments are "worldwide publications", so... well... And that's the problem raised by the good old central database and the Internet as it is today. There is no way to have something like a random conversation somewhere that fully fades away in a short time (or other minor things that do not deserve posterity at all). That's something new with respect to privacy. Well, not so new now that it's 20 years old; but honestly I still find it difficult to adapt to the new scheme.

Second: Your reasoning is an interesting example of the potential impact of information inference, frequently forgotten.
From what I state, you can also deduce further things; and you also have to reason about the truth of what I say. And I, in turn, would have to take into account, when I write, what every reader can deduce (no offense, but well... especially for Mary ;-).
And in this case, you did not even include information from multiple sources for your deductions (employer and job position for example).

It seems to me that, confronted with these new issues in a connected world, there is nearly no way to fight this privacy invasion, except by using the same tools to build more trust as compensation.
For example: I'd like my computer to tell me more about readers of this comment who also look at my profile on a professional social network, or who look up my physical address, etc. I don't think I would invade their privacy any more than they would mine; I would be protecting myself as well as establishing a better trust network *with* them.

Very soon, we will converge: identifying liars is key to the topic.
Not only in front of a judge (though in that case, it is also legally reprehensible). Like investigators, one usually needs evidence to spot liars and to tell the difference between random guys and... trustworthy husbands. ;-)

The perils of big data

Posted Sep 8, 2012 18:02 UTC (Sat) by ccurtis (guest, #49713) [Link]

[..] SSN to anyone who isn't authorized by law to require it (that's employers, banks, and the IRS, BTW; anyone else is lying).
You forgot state DMVs, thanks to the REAL ID Act.

take a look at "Earthweb" as a possible long-term approach

Posted Sep 6, 2012 19:49 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

the book Earthweb (available free, without DRM, from the publisher at http://www.baenebooks.com/p-833-earthweb.aspx ) shows an interesting future with a pretty logical extrapolation of the current trends in online identity and privacy. Even if you don't like the entire book, it's worth reading the first few chapters as food for thought about how it may be possible to deal with big data and similar threats.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds