LWN: Comments on "In-band deduplication for Btrfs"
https://lwn.net/Articles/679031/
Comments posted to the LWN article "In-band deduplication for Btrfs".

JoeyUnknown (https://lwn.net/Articles/696406/), Thu, 04 Aug 2016 15:24:43 +0000:

A collision is unlikely, but it still should not be that difficult to read the data back and compare both copies. For those who don't want the performance hit, the check could be turned off, which would remove a dimension from the data structure (btree[hashkey]->bucket->blocks becomes btree[hashkey]->block).

It should be a performance option. I can think of plenty of cases where a hash alone is fine for me. In some cases, however, I really don't want to play dice with my data. Beyond that, while the probability of a collision is low right now, things could change in the future.

In some scenarios I would rather have a system that performs worse than accept the possibility of a bizarre, hidden integrity failure that can make a real mess. If a hash collision ever did occur, chances are it would not be detected; the data would simply have to be rebuilt and repaired somehow. Full comparison is just one less vector to worry about when dealing with big data where integrity is sensitive.

JoeyUnknown (https://lwn.net/Articles/696361/), Thu, 04 Aug 2016 15:08:19 +0000:

SHA is more relevant if you have a major security concern, or if you are using an implementation that isn't collision safe (SHA tends to avoid collisions better than more basic hash algorithms).

I think this should be made to work safely first; it should not start out as an incomplete, optimised solution.

A basic system doesn't need a cryptographically secure hash, but it does need to have no risk of undetected collisions, no matter how unlikely they are.

The hash can really be anything from a CRC to the most significant bits of the block. Something that avoids excessive bucketing would be ideal (i.e. all input bits contribute more or less equally). That property is similar to what you get from cryptographic functions, but it doesn't mean you need one; for just making it work, the choice doesn't really matter.

Then, when you have a match or matches, you do a full content comparison to confirm before using the match. You need to be able to bucket, because one key could match multiple blocks.

Once you have something that works and is safe, you can start optimising it as much as possible while keeping it entirely reliable. When you can go no further with that, you can add options that trade safety for performance, and options for greater security. Such options might include skipping the read-back check, using a more secure or larger hash, or writing blocks immediately and deferring deduplication to a background task.

Partial, opportunistic deduplication is another option. When a block being written is the same as an existing one, then if A[n] == B[n] the probability that A[n +- 1] == B[n +- 1] is very high, so you could deduplicate where it is cheap and easy and let a scheduled task clean up the rest later. Another option might be hinting, that is, turning on dedupe in some manner during a big operation. You could also prioritise the most common data; a zeroed-out block is one of the obvious cases.

You can also do direct content indexing with various tree structures (per byte or word), but that does not work well at all for random data. Such trees more or less resemble compression trees (and are also similar to deduplication). Random data does not compress well at all, so that would be an attack vector for, or a risk of relying on, such a key-compression-based system.

This isn't a simple problem, but I can say this: I would not feel comfortable using a solution that is vulnerable to hash collisions, no matter how unlikely they are, especially one that can't detect them. Maybe I'm overly sensitive, but I would prefer a flawless system.

There may be other systems out there worth looking at that do basically the same thing outside the realm of filesystems. Hash tables have collisions and need to detect them, so that might be a place to start.
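The scheme described in the two comments above (a cheap key, a bucket of blocks per key, and a byte-for-byte check before reusing anything) is easy to picture in miniature. The following Python sketch is purely hypothetical and is not based on the Btrfs patches; the class name, the use of CRC-32 as the cheap key, and the in-memory dictionaries standing in for on-disk structures are all illustrative assumptions.

    # Hypothetical sketch of a dedup index: cheap key, buckets per key,
    # and byte-for-byte verification before any block is reused.
    import zlib

    BLOCK_SIZE = 4096

    class DedupIndex:
        def __init__(self, verify=True):
            self.buckets = {}      # cheap key -> ids of stored blocks sharing that key
            self.blocks = {}       # block id -> block bytes (stands in for disk reads)
            self.next_id = 0
            self.verify = verify   # set False to trust the cheap key alone (risky)

        @staticmethod
        def key(block):
            # Any cheap, well-mixed function will do for candidate lookup;
            # CRC-32 is deliberately weak here, which is why verification matters.
            return zlib.crc32(block)

        def write(self, block):
            """Return the id of an existing identical block, or store a new one."""
            k = self.key(block)
            for block_id in self.buckets.get(k, []):
                if not self.verify or self.blocks[block_id] == block:
                    return block_id             # deduplicated
            block_id = self.next_id
            self.next_id += 1
            self.blocks[block_id] = block
            self.buckets.setdefault(k, []).append(block_id)
            return block_id                     # new physical block

    idx = DedupIndex()
    first = idx.write(b"\x00" * BLOCK_SIZE)
    second = idx.write(b"\x00" * BLOCK_SIZE)
    assert first == second                      # the zeroed block is stored once

Setting verify=False gives the hash-only behaviour the thread is debating: whatever block first claimed a key is reused for every later block with the same key, so a key collision silently aliases two different blocks.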
quwenruo (https://lwn.net/Articles/681521/), Mon, 28 Mar 2016 01:56:28 +0000:

Thanks for bringing this feature into the spotlight.

I see that most of the comments are concerned with SHA-256 collisions. That's understandable: if a collision is possible, it will happen one day. Personally I'm quite confident in the current "strong" hash algorithms, but since users have this concern we need to address it and, just like ZFS, provide a byte-by-byte comparison option for dedupe.

What we can do right now is add SHA-512 to reduce the chance of collision, although SHA-512 was originally planned to improve hashing speed on modern 64-bit machines.

For now, though, we're focusing on polishing the ioctl interface and the on-disk format with the maintainers (Chris and David), hoping that the in-memory backend can be merged in 4.7 first.

Byte-by-byte comparison may be scheduled after all the current backends are merged. But since so many users worry about SHA-256 collisions, we will keep it in mind and investigate from now on.

Thanks,
Qu
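Qu's aside about SHA-512 being chosen for speed reflects a general property of the SHA-2 family: on 64-bit CPUs without dedicated SHA-256 instructions, SHA-512 often hashes large buffers faster because it processes 128-byte blocks with 64-bit arithmetic. A quick way to check on a given machine is sketched below; the buffer size and iteration count are arbitrary choices, and results will vary with the CPU and crypto library build.

    # Rough, machine-local throughput comparison of SHA-256 vs SHA-512.
    # Hardware SHA-256 extensions can reverse the usual outcome, so treat
    # the numbers as a measurement, not a universal ranking.
    import hashlib
    import timeit

    buf = b"\xab" * (128 * 1024)   # 128 KiB test buffer (size chosen arbitrarily)
    rounds = 2000

    for name in ("sha256", "sha512"):
        seconds = timeit.timeit(lambda: hashlib.new(name, buf).digest(), number=rounds)
        mib_per_s = len(buf) * rounds / (1024 * 1024) / seconds
        print(f"{name}: {mib_per_s:.0f} MiB/s")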
nye (https://lwn.net/Articles/681275/), Thu, 24 Mar 2016 16:18:49 +0000:

> It is far, far more likely that the filesystem will suffer catastrophic failure from some other cause (for example, a freak surge in cosmic gamma radiation sufficient to wipe out human civilization) than that you will ever see a SHA-256 hash collision occur by random chance in the context of a single filesystem

Somewhat more to the point: if you have a function that checks for duplicate data, whether by comparing hashes or by comparing the data itself, the chance of a hash collision is smaller than the chance that random memory errors cause that function to return the wrong result. That is to say, the byte-for-byte function is no more reliable than the compare-the-hashes function, at least to within an order of magnitude.

In practical terms, if you have a pipeline that relies on every step operating correctly, there is not much point in paying an ongoing cost to improve the reliability of one component when others are orders of magnitude more likely to fail. Sure, technically it makes a difference, but only in the sense that you can make the ocean bigger by tipping in a bucket of water.

robbe (https://lwn.net/Articles/680221/), Tue, 15 Mar 2016 21:11:42 +0000:

I, for one, welcome this SHA2@Home project trying to find a collision, however unlikely.

But the willingness to dedicate CPU cycles to pure science is waning; maybe we should promise the person who finds the first collision some riches. Let's call it btrcoin.

nybble41 (https://lwn.net/Articles/680201/), Tue, 15 Mar 2016 17:26:15 +0000:

> > However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about
> I disagree, that actually is a very scary failure mode for a file system.

I wasn't trying to downplay the problems it would create for a filesystem, and I agree with everything else you said. However, SHA-256 hashes are used for more than just identifying blocks within filesystems for deduplication; the ability to create SHA-256 hash collisions would undermine the entire digital signature system, for example. Implementing a workaround is also easier in the context of deduplication: in the short term you can just turn it off while you reindex the drive with a different hash function. Unless the content of your filesystem is *really* important, the odds of anyone wasting a SHA-256 collision 0-day attack on it are vanishingly small, and even a major issue with the algorithm that cut the effective bit length in half would not represent an immediate practical threat.

nix (https://lwn.net/Articles/680200/), Tue, 15 Mar 2016 17:08:55 +0000:

I've only heard of it as a possibility. Nobody has ever mentioned encountering a real collision, and frankly I'm not worried about one turning up in the foreseeable future.

intgr (https://lwn.net/Articles/680180/), Tue, 15 Mar 2016 16:10:33 +0000:

> However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about

I disagree; that is actually a very scary failure mode for a file system. If an attacker is allowed to influence what gets stored in a file system, then a preimage attack, or possibly a clever application of a collision attack, would allow poisoning the file system.

For instance, an attacker knows that some user wants to store document A on the system. The attacker can prepare a colliding document B and upload it before the user gets the chance to upload A. When document A is written later, the file system will throw away A and keep the tampered document B instead.

Consider that document A could be, for example, a system package update that the system administrator installs. Lulz ensues.
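That failure mode is easy to demonstrate with a toy model. The sketch below is hypothetical and deliberately uses a truncated 8-bit key so that "collisions" are trivial to brute-force; against the real SHA-256 used by the patches, preparing a colliding block is not currently feasible. All names and parameters here are illustrative assumptions.

    # Toy model of the poisoning scenario: a dedup store keyed on a weak,
    # truncated hash with no byte-for-byte verification, so whichever block
    # arrives first under a key wins.
    import hashlib

    store = {}                      # weak key -> first block stored under that key

    def weak_key(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()[:1]   # keep 8 bits: collisions are cheap

    def write(block: bytes) -> bytes:
        """Store a block, deduplicating purely on the weak key."""
        k = weak_key(block)
        if k not in store:
            store[k] = block
        return store[k]             # what a later read of this "file" returns

    victim_a = b"legitimate package update"

    # The attacker, knowing A in advance, brute-forces a colliding payload B.
    attacker_b = next(
        b"evil payload #" + i.to_bytes(4, "big")
        for i in range(100_000)
        if weak_key(b"evil payload #" + i.to_bytes(4, "big")) == weak_key(victim_a)
    )

    write(attacker_b)               # attacker uploads B first
    stored = write(victim_a)        # the victim writes A later...
    assert stored == attacker_b     # ...but the store hands back B instead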
mathstuf (https://lwn.net/Articles/679980/), Mon, 14 Mar 2016 15:29:02 +0000:

I thought I read somewhere that someone had an object of 40 zeros and hit some collision locally. Or was that just a theoretical possibility?

nix (https://lwn.net/Articles/679899/), Sun, 13 Mar 2016 19:21:47 +0000:

And because those considerations apply to git, they apply to bup as well. Bup actually provides a tool to check the probability of such trouble. Let's run it over my backups for the last few years:

    56 matching prefix bits
    2.02 bits per doubling
    104 bits (51.38 doublings) remaining
    2.93716e+15 times larger is possible

    Everyone on earth could have 438382 data sets like yours, all in one
    repository, and we would expect 1 object collision.

I am not scared of a collision.

oever (https://lwn.net/Articles/679790/), Fri, 11 Mar 2016 21:56:01 +0000:

git uses the SHA-1 hash of a file as its identifier. If there is a collision, the git repository breaks. Other threats to data integrity are much higher than the chance of an accidental collision. Git is also a deduplicated file system of sorts, so the same considerations apply.

ikm (https://lwn.net/Articles/679777/), Fri, 11 Mar 2016 17:56:52 +0000:

> You can't play probability games with filesystems

You always play probability games, whether you like it or not. Any digital circuit is a system whose signal-to-noise ratios are high enough to quantize levels reliably, but it can never be 100% reliable. Data redundancy can be layered on top to make it more reliable; however, there is still always some probability of state and data corruption in such a system. Once you realize that, the fear of a hash collision reduces to the question of calculating the respective probabilities and comparing them with the other probabilities involved.

nybble41 (https://lwn.net/Articles/679769/), Fri, 11 Mar 2016 16:25:37 +0000:

Sure, cryptanalysis uncovering weaknesses in the SHA-256 algorithm is a possibility. I was replying only to the "it's not mathematically impossible, ergo it must be treated as a realistic possibility" argument. However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about than some corrupted filesystem data. There are also various well-known methods to thwart attacks that depend on predictable hash results, such as seeding the hash function with a hidden per-filesystem salt, and in the event that such an attack is discovered the workaround is simple: just disable online deduplication until the hash function can be updated.
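The per-filesystem salt mentioned above maps naturally onto a keyed hash. The sketch below is a hypothetical illustration, not how the Btrfs dedup patches derive their digests: a secret key mixed into the digest makes the dedup key unpredictable to anyone who cannot read that secret, which blocks the "prepare a colliding block in advance" attack discussed earlier.

    # Hypothetical sketch of a salted/keyed dedup digest. Without knowledge of
    # fs_secret an attacker cannot predict which key a given block maps to, so
    # colliding blocks cannot be precomputed and planted ahead of time.
    import hashlib
    import os

    fs_secret = os.urandom(32)        # imagined as generated once at mkfs time and
                                      # stored where only the filesystem can read it

    def dedup_key(block: bytes) -> bytes:
        # Keyed BLAKE2b; an HMAC over SHA-256 would serve the same purpose.
        return hashlib.blake2b(block, key=fs_secret, digest_size=32).digest()

    block = b"\x00" * 4096
    print(dedup_key(block).hex())     # differs from one filesystem to the next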
micka (https://lwn.net/Articles/679762/), Fri, 11 Mar 2016 15:30:05 +0000:

So nobody does cryptanalysis on SHA-256 and tries to create attacks against it? OK, that's no longer "by random chance", but...

nybble41 (https://lwn.net/Articles/679728/), Fri, 11 Mar 2016 15:24:51 +0000:

For all practical purposes, in this case it does mean exactly that. It is far, far more likely that the filesystem will suffer catastrophic failure from some other cause (for example, a freak surge in cosmic gamma radiation sufficient to wipe out human civilization) than that you will ever see a SHA-256 hash collision occur by random chance in the context of a single filesystem. It is so far down the list of things to worry about that developers would be more productively employed working on almost anything else than on implementing bit-for-bit block comparison to supplement the SHA-256 match.

bgoglin (https://lwn.net/Articles/679723/), Fri, 11 Mar 2016 14:05:43 +0000:

Unlikely doesn't mean it won't ever happen.

nybble41 (https://lwn.net/Articles/679678/), Thu, 10 Mar 2016 23:23:44 +0000:

Even if you generated a new 256-bit hash every picosecond (10^12 hashes per second), it would be 10^19 years before the probability of a collision reached 50%, taking the birthday paradox into account. That is over 700 million times the current age of the universe, for a 50% probability of *one* collision. The probability of finding any collision is still less than 10^-9 after 500 trillion (5*10^14) years.

Even filesystems don't "roll the dice" often enough to make 256-bit hash collisions a serious consideration.
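Those figures follow directly from the birthday bound. A rough sketch of the arithmetic, using the rate assumed in the comment above (one hash per picosecond):

    # Birthday-bound arithmetic behind the figures above. For n random d-bit
    # hashes, P(collision) is approximately n^2 / 2^(d+1), and the 50% point
    # lies near sqrt(2 * ln 2) * 2^(d/2) hashes.
    from math import log, sqrt

    D = 256                       # hash width in bits
    RATE = 1e12                   # hashes per second (one per picosecond)
    YEAR = 3.156e7                # seconds per year

    n_half = sqrt(2 * log(2)) * 2 ** (D / 2)        # hashes for a 50% chance
    print(f"time to 50%: {n_half / RATE / YEAR:.1e} years")               # ~1e19

    n = RATE * 5e14 * YEAR                          # hashes in 500 trillion years
    print(f"P(collision) after 5e14 years: {n * n / 2 ** (D + 1):.1e}")   # ~1e-9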
martin.langhoff (https://lwn.net/Articles/679641/), Thu, 10 Mar 2016 19:01:20 +0000:

"Very rare" is a statement of probability. You can't play probability games with filesystems, because you roll the dice so many times that eventually you will hit the jackpot.

roblucid (https://lwn.net/Articles/679634/), Thu, 10 Mar 2016 18:41:28 +0000:

Presumably the expectation is that hardware acceleration will be available for the common SHA-256 algorithm, which is already in the kernel, so the cost of hashing an extent may be less than feared on many-core CPUs.

I would guess that the bit-for-bit comparison is avoided for performance reasons, on the pragmatic basis that collisions, even if there are really only 64 bits of entropy, are going to be very, very rare :)

jtaylor (https://lwn.net/Articles/679552/), Thu, 10 Mar 2016 14:57:25 +0000:

I guess the goal of in-band deduplication is to avoid duplicated writes entirely, before they even reach the disk.

Your suggested approach is basically batch deduplication. Btrfs already stores cheap hashes alongside its data, and a userspace program could use those to relatively quickly find candidates for the dedupe ioctl to deduplicate safely. duperemove basically does this, except that it cannot yet do incremental deduplication. That could probably be added by using the btrfs subvolume find-new command, which can list what has changed since a given filesystem generation.

martin.langhoff (https://lwn.net/Articles/679545/), Thu, 10 Mar 2016 14:00:05 +0000:

At first blush, the performance cost and collision risk of the technique described make me uneasy...

There is a known pattern of calculating a fast, cheap hash in-band during writes, which is later used by an offline process to assess dedupe candidates.

The fast hash will have some false positives; that's OK. It is merely a low-cost guide for the batch or idle process to pick candidates. The offline process performs a full comparison, and might apply additional rules, such as skipping files with high rates of change.

Perhaps that's a better approach?
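The offline pattern described in the last comment, which duperemove already approximates according to the comment above it, can be modelled in a few lines. The sketch below is hypothetical (file-based rather than extent-based, with CRC-32 standing in for whatever cheap checksum the filesystem records): the write path only logs a cheap hash, and a later batch pass groups candidates and byte-compares them before treating anything as a duplicate.

    # Hypothetical sketch of offline dedupe: writes record a cheap hash only;
    # a batch pass groups candidates by hash and verifies byte-for-byte before
    # reporting duplicates. This models the idea, not the real duperemove or
    # ioctl machinery.
    import zlib
    from collections import defaultdict

    write_log = []                            # (path, cheap_hash) recorded in-band
    contents = {}                             # path -> data, standing in for the disk

    def write(path: str, data: bytes) -> None:
        contents[path] = data
        write_log.append((path, zlib.crc32(data)))   # cheap, collision-prone

    def offline_dedupe():
        """Return groups of paths whose contents are verified identical."""
        candidates = defaultdict(list)
        for path, h in write_log:
            candidates[h].append(path)               # false positives allowed here
        groups = []
        for paths in candidates.values():
            if len(paths) < 2:
                continue
            by_content = defaultdict(list)
            for p in paths:
                by_content[contents[p]].append(p)    # full comparison resolves them
            groups.extend(g for g in by_content.values() if len(g) > 1)
        return groups

    write("a", b"same bytes")
    write("b", b"same bytes")
    write("c", b"different")
    print(offline_dedupe())                          # [['a', 'b']]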