
Desmond: Out-Tridging Tridge

Posted Aug 28, 2013 20:39 UTC (Wed) by ikm (subscriber, #493)
Parent article: Desmond: Out-Tridging Tridge

Since this article talks about backups and rsync, I welcome people to check out zbackup, a globally deduplicating tool I wrote specifically for backups. I back up all my data with it nowadays and feel cozy.


Desmond: Out-Tridging Tridge

Posted Aug 28, 2013 21:44 UTC (Wed) by khc (guest, #45209)

Nice! What's the performance like for zbackup? There's also s3ql, which seems more mature.

Having previously worked on a commercial deduplication product, I find it always interesting to see open source catching up.

Desmond: Out-Tridging Tridge

Posted Aug 28, 2013 22:42 UTC (Wed) by ikm (subscriber, #493)

They are quite different. I believe s3ql does not do rolling hashes, so it only features block-level deduplication, comparable to what is found on some filesystems (e.g. ZFS). As far as performance goes, it's easier to just test. LZMA takes most of the time, but that only really applies to the first backup, where most of the data is new. As long as incrementality is high, subsequent backups should be fast, and there's room for micro-optimizations too.

Desmond: Out-Tridging Tridge

Posted Aug 28, 2013 23:03 UTC (Wed) by khc (guest, #45209)

Indeed, I misread; their FAQ seems to imply that they use 10 MB as the block size.

I haven't looked at your code, but from the description it sounds like you are checking the hash for each rolling fingerprint, and the max chunk size is 64K. I take it this means that for each byte, you see if the rolling hash is already defined, and if not, keep going until you hit 64K and create a new chunk? Wouldn't that only ever create 64K chunks? Or are you doing other things to limit the chunk size?

Desmond: Out-Tridging Tridge

Posted Aug 29, 2013 7:04 UTC (Thu) by ikm (subscriber, #493)

Yes, deduplication normally works on full chunks. There are cases where smaller chunks are used: to fill gaps between the full ones, or at the end of input.

Desmond: Out-Tridging Tridge

Posted Aug 29, 2013 8:05 UTC (Thu) by khc (guest, #45209)

What do you mean by "fill gaps between the full ones"?

Desmond: Out-Tridging Tridge

Posted Aug 29, 2013 22:54 UTC (Thu) by mathstuf (subscriber, #69389)

I imagine that it's likely something like this:

    AAAAACBBBBB  ->  [AAAAA][C][BBBBB]

where C is a smaller block to "fill gaps between the full ones".

Desmond: Out-Tridging Tridge

Posted Aug 30, 2013 1:04 UTC (Fri) by khc (guest, #45209)

Right. It's a little strange though. Consider how the rolling hash would work (disclaimer: I actually haven't looked at zbackup's code):

We'll simplify things by saying the max chunk size is 5 bytes instead. With your example, it's easy to see how AAAAA would be a match. Next we get to C, which is not a match. Then we get CB, CBB, CBBB, CBBBB; none of these match, and we hit the max chunk size. This would normally cause us to define a new chunk, with a chunk size of 5 bytes.

I guess it may work if the lookup window is twice the max chunk size though.

Desmond: Out-Tridging Tridge

Posted Aug 30, 2013 9:03 UTC (Fri) by ikm (subscriber, #493)

Once you get to CBBBB, the hash starts to roll. The first roll would get it to BBBBB, leaving C between the deduplicated chunks AAAAA and BBBBB (this whole example assumed that both AAAAA and BBBBB were encountered before so they got deduplicated here).

In a nutshell: after each deduplication match, the window starts growing from a single byte to its full width. After that, the window rolls forward, looking for a match. If a match is found, the data left behind the window is output as a new short chunk (or stored in place if it's too small). If no match is found by the point where the data left behind is enough for a new full-sized chunk, that data is output as a new full-sized chunk. The window then continues to roll.
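[The scheme described above can be sketched roughly like this. This is my own toy illustration, not zbackup's actual code: a direct set lookup stands in for the real rolling-hash index, and the chunk size is shrunk to 5 bytes to match the example in this thread.]

```python
# Toy sketch of the chunking scheme described above (NOT zbackup's code):
# a window slides over the input looking for previously seen full chunks;
# unmatched bytes left behind the window become new (possibly short) chunks.

WINDOW = 5  # toy "full chunk" size; real tools use something like 64K


def chunk(data: bytes, known: set) -> list:
    """Split `data` into chunks, reusing chunks already in `known`."""
    out = []
    pending = b""  # unmatched bytes left behind the window
    i = 0
    while i < len(data):
        win = data[i:i + WINDOW]
        if len(win) == WINDOW and win in known:
            if pending:              # emit the gap as a short chunk
                out.append(pending)
                known.add(pending)
                pending = b""
            out.append(win)          # reuse the matched full chunk
            i += WINDOW
        else:
            pending += data[i:i + 1]  # window rolls forward one byte
            i += 1
            if len(pending) == WINDOW:  # enough left behind for a full chunk
                out.append(pending)
                known.add(pending)
                pending = b""
    if pending:                       # short chunk at the end of input
        out.append(pending)
        known.add(pending)
    return out


known = {b"AAAAA", b"BBBBB"}
print(chunk(b"AAAAACBBBBB", known))  # [b'AAAAA', b'C', b'BBBBB']
```

With AAAAA and BBBBB already known, the lone C between them comes out as exactly the kind of short gap-filling chunk discussed above.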

Desmond: Out-Tridging Tridge

Posted Sep 2, 2013 19:53 UTC (Mon) by ttonino (subscriber, #4073)

When I heard about deduplicating in a way that is not block-oriented, I immediately thought about cutting the source into variable-length blocks at a specific marker or set of markers. If the file is cut at a 2-byte sequence such as 0x3456, the average chunk size will be 64 KB.

That is obviously not random enough (binary and text behave differently). So I would propose having perhaps 512 different markers of 3 bytes' length, for an average block size of 32 KB. These blocks can then be deduplicated with little processing by computing a hash and perhaps the length.
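[A minimal sketch of this marker-based cutting idea, with my own function name; the arithmetic follows the comment: a 2-byte marker matches a random position with probability 1/65536, giving ~64 KB average chunks, while 512 distinct 3-byte markers match with probability 512/2^24 = 1/32768, giving ~32 KB.]

```python
# Toy content-defined chunking by marker (illustration only): cut the
# stream immediately after every occurrence of a fixed marker sequence.

MARKER = b"\x34\x56"  # 2-byte marker: matches with probability 1/65536,
                      # so random data yields chunks averaging ~64 KB


def cut_at_marker(data: bytes, marker: bytes = MARKER) -> list:
    """Split `data` into chunks, each ending just after a marker."""
    chunks, start = [], 0
    i = data.find(marker)
    while i != -1:
        end = i + len(marker)
        chunks.append(data[start:end])  # cut just after the marker
        start = end
        i = data.find(marker, start)
    if start < len(data):
        chunks.append(data[start:])     # trailing remainder, no marker
    return chunks


# The 512-markers-of-3-bytes variant hits with probability 512/2**24,
# i.e. one cut every 32768 bytes on average:
print(2**24 // 512)  # 32768
```

Because the cut points depend only on local content, inserting bytes near the start of a file shifts at most one or two chunk boundaries, and everything after the next marker deduplicates as before.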

Desmond: Out-Tridging Tridge

Posted Sep 3, 2013 4:28 UTC (Tue) by khc (guest, #45209)

In practice it works well and is fast (although you usually want to compare a full word, not just 2 bytes).

Desmond: Out-Tridging Tridge

Posted Aug 29, 2013 14:06 UTC (Thu) by LenZ (subscriber, #1051)

FWIW, I just recently added zbackup support to mylvmbackup (the patch was contributed by Ivan Korjavin). The code is in trunk, but has not been released yet. Feedback is welcome!

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds