LWN: Comments on "Guarding personally identifiable information"

Guarding personally identifiable information

anselm — Tue, 27 Jun 2017 11:50:10 +0000

With MD5 or SHA-1, the hash is basically identical to the internal state of the hash function after it has seen the input up to that point. You can use this to hash more stuff and the result will be indistinguishable from what would have been returned if you had applied SHA-1 to the concatenation of the original input (which you technically don't know) and your stuff in one go.

SHA-3 avoids this by using an algorithm where the hash value it outputs doesn't let you infer its internal state. This means that even if you know the hash value, that doesn't tell you everything you would need to set up your own instance of the hash function in the state it had after seeing the material whose hash value you have, so you cannot perform the extension attack. (For the gory details, check Wikipedia.)
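To make the extension attack concrete, here is a minimal sketch in Python. It re-implements the SHA-1 compression step so that the internal state can be loaded from a known digest; the names `sha1_padding`, `sha1_compress` and `extend` are made up for this illustration, and the attacker is assumed to know (or guess) the length of the original input. Note that the forged input is the original, followed by the padding SHA-1 added internally, followed by the attacker's suffix.

```python
import hashlib
import struct


def _rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_padding(message_len):
    """Padding that SHA-1 appends to a message of message_len bytes."""
    pad = b"\x80" + b"\x00" * ((55 - message_len) % 64)
    return pad + struct.pack(">Q", message_len * 8)


def sha1_compress(state, block):
    """Apply one SHA-1 compression step to a 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 80):
        w.append(_rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    a, b, c, d, e = state
    for i in range(80):
        if i < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif i < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif i < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (_rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF
        a, b, c, d, e = temp, a, _rotl(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))


def extend(known_digest, original_len, suffix):
    """Forge SHA-1(original + glue + suffix) without knowing original.

    Only the digest of the original and its length are needed.
    Returns (glue, forged_digest).
    """
    # The digest IS the internal state after the original input was absorbed.
    state = struct.unpack(">5I", known_digest)
    glue = sha1_padding(original_len)
    forged_len = original_len + len(glue) + len(suffix)
    data = suffix + sha1_padding(forged_len)
    for i in range(0, len(data), 64):
        state = sha1_compress(state, data[i:i + 64])
    return glue, struct.pack(">5I", *state)


# Illustrative values only.
secret = b"server-side secret"  # not known to the attacker
message = b"amount=10"
digest = hashlib.sha1(secret + message).digest()

glue, forged = extend(digest, len(secret + message), b"&amount=1000000")
real = hashlib.sha1(secret + message + glue + b"&amount=1000000").digest()
assert forged == real
```

The same trick does not carry over to `hashlib.sha3_256`, because there the digest is only part of the internal state.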

Guarding personally identifiable information

paulj — Tue, 27 Jun 2017 11:01:55 +0000

V interesting comment. BTW, in the interests of "knowing what's going on here", could you explain how SHA-3 eliminates the 'traditional' crypto-hash extension attack?

Guarding personally identifiable information

tialaramex — Mon, 26 Jun 2017 14:42:15 +0000

The key is in your first sentence. DES has to be _brute forced_. My comment already explained that "DES key lengths were already too short in 1975". The papers go back a LOT further than eight years, the EFF DES Cracker is _last century_. But what the papers don't do is break DES algorithmically, the algorithm is still, decades later, working exactly as intended, you can't find out what the message is without just trying all the keys.

And _that_ is why, again exactly as I wrote, 3DES is still safe. It's not a new algorithm using the same name, it's just three lots of DES because the algorithm is still fine, specifically 3DES is E(key3, D(key2, E(key1, message))) so that if you set key1 = key2 = key3 you get DES as before, but if you set them differently the attacker must either brute force all 168 bits of key or they must rely on the Meet-in-the-middle attack and do 2^112 operations, which is tighter but still impractical today.
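The encrypt-decrypt-encrypt shape can be sketched in a few lines of Python. Since the standard library has no DES, the block cipher below is a toy 4-round Feistel network standing in for it (an assumption made purely for illustration, not DES and not secure); all function names are hypothetical. The point is only the structure: with three equal keys the middle decryption undoes the first encryption and the whole thing collapses to a single encryption.

```python
import hashlib


def _round(half, key, i):
    """Toy round function: 32 bits derived from key, round index and half-block."""
    data = key + bytes([i]) + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def toy_encrypt(key, block):
    """Toy Feistel cipher on a 64-bit integer block. A stand-in for DES, NOT DES."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for i in range(4):
        left, right = right, left ^ _round(right, key, i)
    return (left << 32) | right


def toy_decrypt(key, block):
    """Inverse of toy_encrypt: run the rounds backwards."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for i in reversed(range(4)):
        left, right = right ^ _round(left, key, i), left
    return (left << 32) | right


def ede_encrypt(k1, k2, k3, block):
    """E(k3, D(k2, E(k1, block))), the 3DES construction."""
    return toy_encrypt(k3, toy_decrypt(k2, toy_encrypt(k1, block)))


def ede_decrypt(k1, k2, k3, block):
    """Inverse of ede_encrypt."""
    return toy_decrypt(k1, toy_encrypt(k2, toy_decrypt(k3, block)))


block = 0x0123456789ABCDEF
k = b"same key"

# Equal keys: EDE degenerates to a single encryption (backward compatibility).
assert ede_encrypt(k, k, k, block) == toy_encrypt(k, block)

# Distinct keys: still invertible, but now three independent keys are in play.
k1, k2, k3 = b"key one", b"key two", b"key three"
assert ede_decrypt(k1, k2, k3, ede_encrypt(k1, k2, k3, block)) == block

# Key-size arithmetic from the comment: three 56-bit DES keys.
total_key_bits = 3 * 56                  # 168
meet_in_the_middle_work = 2 ** 112       # operations
```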

Also Val's page is a member of the set of things which assume relatively short past trends will continue in order to predict the future. Her warning that you should plan on being _able_ to replace the hash in your shiny new thing is a sensible one, but the thing about such trends is that they're only notable while they stay true. Nobody is going to make a web page called "To our astonishment SHA-2 is still fine after 75 years". I always want to point to Disco Stu's graph of Disco record sales here...

And finally, while Val's advice is all very well, probably even _more_ useful would be to take the extra hour and learn more about what these things are. On the other side of the fence lots of effort has gone into making modern algorithms and libraries have fewer "sharp edges", such as SHA-3's elimination of length extension, but the edges are only sharp if you have no idea what these crypto algorithms do and do not promise for you. People writing SHA2(m1) and being surprised an adversary can use that to produce SHA2(m1 | chosen suffix) correctly without knowing m1 are protected by using SHA3() instead where their adversary can't pull that off, but they'd _also_ be protected, and better, by knowing what's going on here so they wouldn't fall for that mistake in the first place.

Guarding personally identifiable information

nybble41 — Mon, 19 Jun 2017 15:01:21 +0000

> merchants and targeted advertisers HAVE access to the credit cards database

Agreed, but the point still stands. You can't get anyone's shopping habits or location from just a credit card number. When combined with additional information, sure, but not from the number alone. It's not the number itself which is personal, but rather the web of connections linking the number to other (likewise individually non-personal) pieces of data.

Guarding personally identifiable information

hummassa — Mon, 19 Jun 2017 12:07:06 +0000

Actually, you are making my point: merchants and targeted advertisers HAVE access to the credit cards database, be it the original financial ones or some database they can collect along the way.

Guarding personally identifiable information

farnz — Mon, 19 Jun 2017 09:14:09 +0000

Have you read up on Data Protection legislation (which is about to be beefed up by the GDPR)? It actually implements the sorts of protections we're talking about.

Guarding personally identifiable information

nix — Sat, 17 Jun 2017 12:02:42 +0000

Look up what happened to Jeremy Clarkson when he said the same in a newspaper article.

Guarding personally identifiable information

jtc — Sat, 17 Jun 2017 05:08:05 +0000

"But that doesn't mean that the individual pieces aren't worth protecting on their own, on general principles."

I don't think that's particularly useful or practical, if you're talking about, e.g., protecting an individual CC#, street address, etc. I could, for example, take a walk around my neighborhood and write down a house's address, the license plate # of a car parked on the road, or look in the phone book and write down a phone number, etc. I could then publish this information (with no other associated data), legally, on the internet and, of course, anyone else could do the same. That can't be prevented, which shows why it's not practical.

Furthermore, publishing such info without any other data to go with it (such as a name, or, worse [whether true or false] an accusation that the person owning the car/house/etc. committed a felony or a particular crime) is extremely unlikely to cause any harm to the person associated with that data (house owner, car owner, ...).

To extend my example to the point of absurdity, I could write in a blog: "somebody has heart disease and his or her doctor has recommended heart-bypass surgery" (As a matter of fact, I've just done that!). This is certain to be true for more than one person in the world right now. But since there's no identifying data to go along with this claim, it does no harm whatsoever.

Maybe this is not what you meant, but if so, what you meant is not at all clear, IMO.

Guarding personally identifiable information

Wol — Fri, 16 Jun 2017 16:59:23 +0000

> > Although people viscerally fear the collection of personal data

> I don't. I don't really care about it.

Not even when it places your livelihood (or life!) at risk?

Someone leaked my name, phone number, and the fact that I'd had an accident. It ended up on a list being sold to ambulance chasers. I really do NOT appreciate receiving a *flood* of nuisance calls to my mobile, seeing as they mostly arrive while I'm driving and it's not only illegal, but also dangerous, to talk on a mobile while driving.

Cheers,
Wol

Guarding personally identifiable information

Wol — Fri, 16 Jun 2017 16:04:24 +0000

Neutrality is over-rated - and impossible! What matters is that the author's position is open and discernible. I really don't like "objective" reporting, because it rarely is ...

Cheers,
Wol

Guarding personally identifiable information

Wol — Fri, 16 Jun 2017 15:50:18 +0000

> Of course. But personal data is a puzzle. Every little piece matters, and a lot of the pieces in the puzzles leak. Considering the semi-frequent announcement of "Company X hacked--millions of accounts leaked".

We had a classic case a few years back. Rape victims are supposed to be kept anonymous. But one newspaper printed a story about "a vicar's daughter" while another said "from Ealing". Both bits, in isolation, could refer to many thousands of people. Put together, the victim's identity was revealed almost instantly.

Cheers,
Wol

Guarding personally identifiable information

bronson — Thu, 15 Jun 2017 05:28:42 +0000

If people are working this hard, it can't have been TOO bad. :)

Guarding personally identifiable information

bronson — Thu, 15 Jun 2017 04:31:32 +0000

With 200 GPUs, DES is brute forced in a matter of days. And 200 GPUs are available today on AWS. Papers describing this sort of attack go back at least 8 years. It's folly to claim that DES is still safe.

Here is a nice picture of the treadmill, for hash functions anyway: http://valerieaurora.org/hash.html

Guarding personally identifiable information

andyo — Thu, 15 Jun 2017 01:56:14 +0000

As the author, I appreciate the thoughtful comments about my analogy between de-identification and encryption. I think the analogy is better than gdt and frostsnow think, but perhaps they are right. If you don't like the analogy, just take the general principle that the perfect is the enemy of the good, and that something is better than nothing. I'm also glad to see that the main controversy in the comments concerns a relatively minor point within the larger article.

Guarding personally identifiable information

anselm — Wed, 14 Jun 2017 14:38:31 +0000

But that doesn't mean that the individual pieces aren't worth protecting on their own, on general principles.

In any case, in practice if a cracker steals company XYZ's customer database, chances are that it will already come with people's names, street addresses, e-mail addresses, and credit card numbers nicely prepackaged.

Guarding personally identifiable information

nybble41 — Wed, 14 Jun 2017 14:21:00 +0000

> Individual pieces of data are almost never personal information. A city, a gender, a CC number, a film. Even a license plate. But as soon as you can correlate the data "X lives in city Y", "X is of gender Y", "X owns credit card Y", "X likes film Y", "X has at some point driven car Y".

Exactly. It's those connections which are personal, not the individual pieces of data.

Guarding personally identifiable information

nybble41 — Wed, 14 Jun 2017 14:17:17 +0000

> If I have the access to the Visa/Master/AmEx database...

You're making my point for me. To get any personal information you need those databases, not just the CC#.

Guarding personally identifiable information

paulj — Wed, 14 Jun 2017 12:58:52 +0000

The last two digits of your CC# + your gender certainly could help narrow down your identity. It just needs a few more "non-identifying, but narrowing" dimensions of information potentially to uniquely identify you.
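A rough way to quantify "narrowing" is to count bits of identifying information. The sketch below assumes each attribute is uniformly distributed and independent of the others, which real data never quite is, and uses a round world-population figure; both are assumptions for illustration only.

```python
import math

WORLD_POPULATION = 7_500_000_000  # rough 2017 figure (assumption)


def bits(equally_likely_values):
    """Bits of information carried by one value out of N equally likely ones."""
    return math.log2(equally_likely_values)


needed = bits(WORLD_POPULATION)   # bits required to single out one person
last_two_digits = bits(100)       # two decimal digits of a card number
gender = bits(2)                  # treated as a binary attribute here

remaining = needed - last_two_digits - gender
candidates_left = WORLD_POPULATION / (100 * 2)

print(f"needed: {needed:.1f} bits")
print(f"last two CC digits + gender: {last_two_digits + gender:.1f} bits")
print(f"still missing: {remaining:.1f} bits ({candidates_left:,.0f} candidates)")
```

Each further "non-identifying" attribute subtracts a few more bits, and it does not take many of them to use up what is left.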

Guarding personally identifiable information

hummassa — Wed, 14 Jun 2017 12:20:14 +0000

> Here's a credit card number: 4024007129431648. Just from the number, what can you tell me about the owner's shopping habits or location?

If I have the access to the Visa/Master/AmEx database (even hacked, dated versions of it), you bet I can. Try posting your real CC# in this forum and you'll see.

Guarding personally identifiable information

tao — Wed, 14 Jun 2017 10:13:13 +0000

Of course. But personal data is a puzzle. Every little piece matters, and a lot of the pieces in the puzzles leak. Considering the semi-frequent announcement of "Company X hacked--millions of accounts leaked".

The information doesn't even have to be recent. Let's say I find 5-year old account info, with a CC used at two different sites. The former being "Explicit gay porn!" and the latter being "Reactionary Bible-thumpers united".

Now simply obtaining the CC:NAME tuple would yield pretty damn good material for blackmailing.

So yes, CC might not be a breach of personal integrity, but as soon as you have the CC:NAME tuple you're well on your way towards nasty integrity violations. This goes for all kinds of tuples of data; NAME:EMAIL, NAME:NICKNAME, NICKNAME:EMAIL, NAME:FAVOURITE RESTAURANT, etc. A typical tuple would be <just about anything>:IP ADDRESS.

I always read the same set of webpages in the morning. I open them all at once. If you could track this + the IP-address you could easily find out that "Oh, tao is at the airport today" no matter which of my laptops I use, even if I use a brand new one, simply by recognising the pattern + the IP-address ("This address belongs to airport X").

Individual pieces of data are almost never personal information. A city, a gender, a CC number, a film. Even a license plate. But as soon as you can correlate the data "X lives in city Y", "X is of gender Y", "X owns credit card Y", "X likes film Y", "X has at some point driven car Y".

Enough data points will tell a story. Whether the story is the right one or not isn't always clear ("X has driven car Y" doesn't discern between "X owns car Y", "X borrowed car Y" or "X rented car Y", but finding out more facts about car Y might be enough to clear that up, without finding out anything else about X).

Some information might seem trivial; "X is male", for instance. But if you combine that with "X regularly buys women's underwear"?

Guarding personally identifiable information

Cyberax — Wed, 14 Jun 2017 09:39:43 +0000

Why would you want to limit the data of CC#s and create a dataset with them? It serves no useful purpose whatsoever.

Guarding personally identifiable information

dgm — Wed, 14 Jun 2017 09:33:03 +0000

It's not the same. Your gender cannot identify you; your (complete) CC# can. If you limited the data to the last two digits, for example, then that would not be the case.

Guarding personally identifiable information

nybble41 — Tue, 13 Jun 2017 21:43:19 +0000

> My specific CC# ... can be used to identify shopping habits, and, at times, physical location.

Here's a credit card number: 4024007129431648. Just from the number, what can you tell me about the owner's shopping habits or location?

Of course that number was fake, but the point remains: on its own a CC# says very little. To get shopping habits or location you would need to correlate it with other data about where and how the card was used. It is the connection between the CC# and this other data (e.g. order history) which is "personal", not the CC# itself.

Guarding personally identifiable information

ssmith32 — Tue, 13 Jun 2017 15:57:10 +0000

My specific CC# is unique to me, and only to me, and can be used to identify shopping habits, and, at times, physical location.

If that's not personal information, what do you define as personal information?

Guarding personally identifiable information

flussence — Mon, 12 Jun 2017 22:42:06 +0000

>"There exists a person with this birthday." "There exists a person with this ethnicity." "There exists a person with this common first name."

People think they're safe leaving this information in public, and then things like this happen: https://medium.com/@CodyBrown/how-to-lose-8k-worth-of-bit...

Guarding personally identifiable information

nybble41 — Mon, 12 Jun 2017 20:18:47 +0000

> Basically, you can define "about" as absurdly as you want to the point that no information is personal.

That's the thing, very little information _is_ personal on its own. "There exists a person with this credit card number." "There exists a person with this birthday." "There exists a person with this ethnicity." "There exists a person with this common first name." None of that is particularly personal; many other individuals share the same characteristics. It's only when multiple facts are aggregated together that one can start to draw conclusions about specific individuals—and that remains true even if the individual facts do not appear to be the least bit "personal".

We should not be looking at this as a question of whether a particular bit of information is or is not "personal data". The question is what conclusions can be drawn from a complete data set, not one isolated fact. Trying to classify types of information as either "personal" or "non-personal" leads to the equally absurd position that _all_ information is personal, because _any_ form of information can potentially be used for that purpose.

Guarding personally identifiable information

dskoll — Mon, 12 Jun 2017 19:33:48 +0000

What do you mean, it doesn't say anything "about" you? How do you define "about"?

Your name doesn't say anything about you. It's just an identifier that (probably) your parents assigned to you. It might help someone guess at your sex, but even that's not foolproof.

Your marital status doesn't say anything about you. Plenty of people are single. Plenty are married.

Basically, you can define "about" as absurdly as you want to the point that no information is personal.

Guarding personally identifiable information

nybble41 — Mon, 12 Jun 2017 15:45:47 +0000

> A credit card number _does_ say something about you. At the very least it says that you've got a credit card -- which implies that you are rich enough for a bank to give you one.

First, something you have in common with the vast majority of the population is hardly "personal information". Simply guessing that a given individual has a credit card would be correct most of the time. Second, the card number doesn't say anything about its owner by itself; if all you have is a card number then all you can say is that _someone_ has a credit card, which isn't personal at all.

> And besides, the number also contains the Issuer Identification Number, i.e. it identifies the bank or other provider that gave you the card -- which will narrow down where you live, etc.

That is a bit closer to personal data, but the same caveat applies: by itself that doesn't say anything about any particular individual, only the issuing bank. To make inferences about "where you live" one would first need to link the card to _you_. Otherwise all they can say is that _someone_ has a card from that bank.

The problem isn't a special class of "personal data", with a few obvious exceptions like name and address which are always filtered out anyway. Even a credit card number is not an issue in isolation (or wouldn't be given a reasonable minimum standard of security in payments). The problem is data sets which allow one to correlate _multiple_ types of otherwise _non-personal_ data in order to identify specific individuals. The data becomes personal only when aggregated together: the Latino Netflix subscriber, age 18-25, with a zip code starting with 407 and a credit card from Springfield Credit Union. No one part of that data is "personal", but taken together it can potentially single out a specific individual. The key is that almost any sort of data can be used for that sort of "fingerprinting", even data which no one considers personal.
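The aggregation effect is easy to see on a toy data set. The records below are entirely made up, and the field names and the helper functions `matching` and `smallest_group` are hypothetical; the sketch only shows that every single attribute matches several people while the combination matches exactly one.

```python
from collections import Counter

# Entirely synthetic records, for illustration only.
records = [
    {"ethnicity": "Latino", "age": "18-25", "zip3": "407", "issuer": "Springfield CU"},
    {"ethnicity": "Latino", "age": "26-35", "zip3": "407", "issuer": "First National"},
    {"ethnicity": "Latino", "age": "18-25", "zip3": "112", "issuer": "First National"},
    {"ethnicity": "White",  "age": "18-25", "zip3": "407", "issuer": "First National"},
    {"ethnicity": "White",  "age": "26-35", "zip3": "112", "issuer": "Springfield CU"},
    {"ethnicity": "Asian",  "age": "18-25", "zip3": "407", "issuer": "Springfield CU"},
]


def matching(rows, **criteria):
    """Return the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def smallest_group(rows, fields):
    """Size of the smallest group sharing the same values for `fields`.

    A result of 1 means at least one record is unique on those fields.
    """
    groups = Counter(tuple(r[f] for f in fields) for r in rows)
    return min(groups.values())


target = {"ethnicity": "Latino", "age": "18-25", "zip3": "407", "issuer": "Springfield CU"}

# Each attribute on its own is shared by several records...
for field, value in target.items():
    print(field, value, len(matching(records, **{field: value})))

# ...but all four together single out one record.
print("combined:", len(matching(records, **target)))
print("smallest group:", smallest_group(records, list(target)))
```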

Guarding personally identifiable information

nijhof — Mon, 12 Jun 2017 13:11:07 +0000

A credit card number _does_ say something about you. At the very least it says that you've got a credit card -- which implies that you are rich enough for a bank to give you one. And besides, the number also contains the Issuer Identification Number, i.e. it identifies the bank or other provider that gave you the card -- which will narrow down where you live, etc.
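For what can be read off a bare card number, here is a small sketch. It checks the Luhn checksum and splits off the leading digits; traditionally the first six digits are treated as the IIN, which is the assumption made here. Mapping an IIN to an actual issuer needs a lookup table that is not part of this example, and the function names are made up.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def describe(number: str) -> dict:
    """Facts derivable from the number alone, with no external database."""
    return {
        "first_digit": number[0],   # a leading 4 indicates the Visa network
        "iin": number[:6],          # assumes the traditional six-digit IIN
        "luhn_valid": luhn_valid(number),
    }


# The fake number quoted elsewhere in this thread.
print(describe("4024007129431648"))
```

Everything beyond this (who the cardholder is, where they shop) requires joining the number against some other data set.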

Guarding personally identifiable information

Cyberax — Sun, 11 Jun 2017 22:40:17 +0000

No, it's not personal data. A credit card number can be used to identify me, but it doesn't say anything _about_ me.

If we had payment systems with more security than a wet towel, I wouldn't even care if my credit card numbers leaked. It's also absolutely not a problem to secure CC numbers - simply replace them with unique tokens and keep the mapping secret. Or, if you're feeling fancy, you can use encryption instead.
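The token idea can be sketched as follows. `TokenVault` is a hypothetical in-memory class for illustration; a real deployment would keep the mapping in a separately secured store.

```python
import secrets


class TokenVault:
    """Maps card numbers to random tokens; the mapping itself is the secret."""

    def __init__(self):
        self._token_for = {}
        self._number_for = {}

    def tokenize(self, card_number: str) -> str:
        """Return the token for a card number, creating one on first use."""
        if card_number not in self._token_for:
            token = secrets.token_hex(16)   # random, so it reveals nothing about the number
            self._token_for[card_number] = token
            self._number_for[token] = card_number
        return self._token_for[card_number]

    def detokenize(self, token: str) -> str:
        """Recover the card number; only the vault holder can do this."""
        return self._number_for[token]


vault = TokenVault()
token = vault.tokenize("4024007129431648")
assert vault.detokenize(token) == "4024007129431648"
assert vault.tokenize("4024007129431648") == token   # stable per card
```

Because the token is stable per card, records carrying the same token can still be linked to each other; tokenization protects the number, not the pattern of use.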

Anonymization solves a completely different problem and I'm amazed that people don't understand that.

Guarding personally identifiable information

jaromil — Sun, 11 Jun 2017 10:32:36 +0000

Well said kfiles. This is a great article, well written and covering a very relevant topic on which we are busy for instance with the https://decodeproject.eu . Applause to the LWN editors for this great content, especially considering how hard it is to catch up in this research field that has relatively high noise. An article like this alone is worth more than a year's subscription for me. Thanks.

Guarding personally identifiable information

paulj — Sat, 10 Jun 2017 15:46:05 +0000

It does to someone who has some retailer's database dump, and can look up your card number to perhaps find the rest of your details (name, email, address, etc.).

Guarding personally identifiable information

frostsnow — Fri, 09 Jun 2017 20:40:24 +0000

It's not so much that there are *50* ways that encryption and de-identification are different, but that they are *fundamentally* different.

The *security* of a strong encryption algorithm rests in the *secrecy* of a key. A *perfect* encryption scheme from a security perspective does exist, it's called a One-Time Pad (https://en.wikipedia.org/wiki/One-time_pad). The point about a perfect encryption scheme is that, given a ciphertext, *any* plaintext of the same size is *equally* likely. This is a fundamental point of encryption's security. To quote Schneier's "red book":

>A random key sequence added to a nonrandom plaintext message produces a completely random ciphertext message and no amount of computing power can change that.
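The "any plaintext is equally likely" property can be shown directly. In this minimal sketch (messages and the helper name `xor_bytes` are made up), a key exists that "decrypts" the same ciphertext to any other message of the same length, so the ciphertext alone cannot favour one candidate over another.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


plaintext = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(plaintext))   # truly random, used once, as long as the message
ciphertext = xor_bytes(plaintext, key)

# The legitimate key recovers the real message.
assert xor_bytes(ciphertext, key) == plaintext

# For any other same-length message there is a key that yields it instead.
decoy = b"RETREAT AT TEN"
decoy_key = xor_bytes(ciphertext, decoy)
assert xor_bytes(ciphertext, decoy_key) == decoy
```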

Thus to compare de-identification to encryption on the notion that an increase in computing power will allow users to "break our encryption" and that "nothing better exists" shows a fundamental lack of understanding of what encryption is. It's not a good comparison.

THAT BEING SAID, I understand *why* the author wrote what they did, and what they *meant*. They are looking at the security that we expect from commonly-implemented block and public/private key ciphers, which make security/time/space trade-offs that (hopefully) give us a security that lasts relative to a certain growth in computing power that is projected to take place over the next couple of decades. They then take this idea and claim that it would be useful to look at de-identification with regards to this concept that we only need security over a certain time period, after which it becomes drastically less useful, and this point may be true.

However, it is *not* clear to me that the security provided by commonly-implemented encryption algorithms and the "security" provided by de-identification are in *any* way equivalent whatsoever. Encryption has a theoretical component that allows *perfect* security, de-identification is entirely based on obscurity. Again, with encryption, we have purposefully chosen weaker algorithms for pragmatic reasons, but it is not clear that such a choice even *exists* for de-identification algorithms.

Encryption and de-identification aren't analogous.

Disclaimer: I am not a cryptography expert and may have gotten something wrong. Also: https://www.xkcd.com/386/

Guarding personally identifiable information

hkario — Fri, 09 Jun 2017 15:32:20 +0000

You only don't care because you don't understand what personal data actually is ("I don't have anything important on my phone", that is, until you lose it):
https://www.youtube.com/watch?v=XEVlyP4_11M

Guarding personally identifiable information

tialaramex — Fri, 09 Jun 2017 08:49:52 +0000

The article talks about essentially the encryption treadmill, this idea that incremental advances will inevitably obsolete your encryption. But it's not really like that. The main reason crypto people assume secrets have a finite lifetime is that secrets are kept by people, and people leak - regardless of the technology involved. In practice a finite lifetime is acceptable.

Mechanically we're probably past the point where you can expect incremental technical advances to have any effect on symmetric encryption. DES key lengths were already too short in 1975; you should be able to find contemporary writing that backs up this criticism. Sure enough the only attack that's actually been successfully used on DES is a brute force attack on the key. Rijndael increases the minimum key size to 128 bits, which puts a brute force attack likely permanently out of reach, but even the original DES algorithm - in the back-to-back-to-back 3DES construction so as to use longer keys - is still safe today if you don't mind it being slow and awkward.
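A back-of-the-envelope calculation shows why key length dominates here. The attacker throughput below is a made-up round number, not a measurement of any real hardware; only the ratio between the key sizes matters.

```python
KEYS_PER_SECOND = 10 ** 12              # hypothetical attacker throughput (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def worst_case_seconds(key_bits: int, rate: float = KEYS_PER_SECOND) -> float:
    """Time to try every key of the given length at the given rate."""
    return 2 ** key_bits / rate


for key_bits in (56, 112, 128):
    seconds = worst_case_seconds(key_bits)
    print(f"{key_bits:3d}-bit key: {seconds:.3g} s = {seconds / SECONDS_PER_YEAR:.3g} years")
```

At this assumed rate a 56-bit key space is exhausted in under a day, while each additional bit doubles the work, so 112 or 128 bits are out of reach by many orders of magnitude.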

We have tended to abandon encryption algorithms once someone demonstrates that even in theory they can sometimes successfully break it with practical resources. In contrast anonymization techniques are _always_ theoretically broken, it's just that sometimes nobody bothers to break them in practice. If we're going to compare to something, how about the Yale lock on a farmhouse back door. Probably the door isn't locked anyway, and if it is anyone who spends a few minutes learning how online can break the lock. That's where we are with anonymization. Our best hope is that nobody even _wants_ to de-anonymize the data, not that they can't.

Guarding personally identifiable information

bronson — Thu, 08 Jun 2017 22:54:00 +0000

> The parallel with encryption doesn't hold up

Care to say more? The article did a good job of describing the similarities. You can probably find 50 ways that they're different but that's not important to literary comparison -- it doesn't affect the ways that they're similar.

Guarding personally identifiable information

kfiles — Thu, 08 Jun 2017 19:19:01 +0000

I strongly disagree, gdt. The editorial voice of LWN is one of the things that makes it special. The editors here do not merely report on press releases or pass on the conference minutes, they provide their perspectives on its importance to the Linux community. As they are informed by years of work in the distro and kernel community, I find their perspectives very useful.

Guarding personally identifiable information

dskoll — Thu, 08 Jun 2017 18:22:37 +0000

You are splitting hairs. Of course it's "personal data". It's data that belongs to you and only you. And it can be used by someone to impersonate you, or at least as part of such impersonation.

Guarding personally identifiable information

pizza — Thu, 08 Jun 2017 15:23:17 +0000

The point is that Personally Identifiable Information can be used to impersonate you in ways that, while not outright legally binding, can utterly screw you over or at least inconvenience you for years down the line.

And there is quite a lot of "PII" in "Personal Data"

Guarding personally identifiable information

Cyberax — Thu, 08 Jun 2017 15:06:41 +0000

It's PII, but it's not personal data. They don't say anything about me.