Desmond: Out-Tridging Tridge
Posted Aug 28, 2013 21:44 UTC (Wed) by khc (guest, #45209)
Having previously worked on a commercial deduplication product, I always find it interesting to see open source catching up.
Posted Aug 28, 2013 22:42 UTC (Wed) by ikm (subscriber, #493)
Posted Aug 28, 2013 23:03 UTC (Wed) by khc (guest, #45209)
I haven't looked at your code, but from the description it sounds like you are checking the hash for each rolling fingerprint, and the max chunk size is 64K. I take it that means that for each byte you see if the rolling hash is already defined and, if not, keep going until you hit 64K and create a new chunk? Wouldn't that only ever create 64K-sized chunks? Or are you doing other things to limit the chunk size?
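For what it's worth, the per-byte check is only affordable because a rolling fingerprint can be slid forward in O(1). A minimal sketch of an rsync-style rolling checksum (my own toy illustration, not the project's actual fingerprint):

```python
class RollingSum:
    """rsync-style rolling checksum: a is the byte sum over the window,
    b is the position-weighted sum; both update in O(1) per byte."""

    def __init__(self, window):
        self.window = window
        self.a = 0
        self.b = 0

    def init(self, data):
        # Prime the checksum over one full window of bytes.
        assert len(data) == self.window
        self.a = sum(data) & 0xFFFF
        self.b = sum((self.window - i) * c for i, c in enumerate(data)) & 0xFFFF

    def roll(self, out_byte, in_byte):
        # Slide the window one byte: drop out_byte, take in_byte.
        self.a = (self.a - out_byte + in_byte) & 0xFFFF
        self.b = (self.b - self.window * out_byte + self.a) & 0xFFFF

    def digest(self):
        return (self.b << 16) | self.a
```

Sliding the window and recomputing from scratch give the same digest, which is what makes checking a fingerprint at every byte offset practical.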
Posted Aug 29, 2013 7:04 UTC (Thu) by ikm (subscriber, #493)
Posted Aug 29, 2013 8:05 UTC (Thu) by khc (guest, #45209)
Posted Aug 29, 2013 22:54 UTC (Thu) by mathstuf (subscriber, #69389)
where C is a smaller block to "fill gaps between the full ones".
Posted Aug 30, 2013 1:04 UTC (Fri) by khc (guest, #45209)
we will simplify things by saying the max chunk size is 5 bytes instead. With your example, it's easy to see how AAAAA would be a match. Next we get to C, which is not a match. Then we get to CB, CBB, CBBB, CBBBB, none of which would match, and we hit the max chunk size. This would normally cause us to define a new chunk of 5 bytes.
I guess it may work if the lookup window is twice the max chunk size though.
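To make the trace concrete, here is a toy greedy chunker (entirely my own construction; the real code would use rolling hashes rather than substring lookups), run on that example with BBBBB pre-seeded as a known chunk to show the missed match:

```python
MAX_CHUNK = 5  # toy value standing in for the article's 64K

def chunk(data, known):
    """Greedy chunker: extend the current region one byte at a time until
    it matches a known chunk, else force a cut at MAX_CHUNK."""
    out = []
    start = 0
    while start < len(data):
        cut = None
        for end in range(start + 1, min(start + MAX_CHUNK, len(data)) + 1):
            if data[start:end] in known:  # stand-in for a hash lookup
                cut = end
                break
        if cut is None:  # no match: forced cut at the max chunk size
            cut = min(start + MAX_CHUNK, len(data))
        out.append(data[start:cut])
        known.add(data[start:cut])  # define the new chunk
        start = cut
    return out

# AAAAA matches, then C/CB/CBB/CBBB/CBBBB all miss, so CBBBB becomes a
# new 5-byte chunk -- and the known BBBBB chunk is never matched.
print(chunk(b"AAAAACBBBBB", {b"AAAAA", b"BBBBB"}))
# -> [b'AAAAA', b'CBBBB', b'B']
```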
Posted Aug 30, 2013 9:03 UTC (Fri) by ikm (subscriber, #493)
In a nutshell: after each deduplication match, the window grows from a single byte to its full width. After that, the window rolls forward, looking for a match. If a match is found, the data left behind the window is output as a new short chunk (or even emitted in place if there is too little of it). If no match is found by the point where the data left behind is enough for a new full-sized chunk, that data is output as a new full-sized chunk. The window then continues to roll.
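A sketch of how I read that description (toy sizes and all names are mine; the real code uses a rolling hash, not set lookups, and its sizes are far larger, e.g. a 64K max chunk):

```python
WINDOW = 4  # rolling window width (toy value)
CHUNK = 4   # full chunk size; kept equal to WINDOW to simplify the sketch

def dedup(data, known):
    out = []     # list of (kind, bytes) records
    behind = 0   # start of the data "left behind" the window
    pos = 0      # left edge of the rolling window
    i = 0        # right edge of the window (exclusive)
    while i < len(data):
        i += 1                       # grow (or roll) by one byte
        if i - pos > WINDOW:
            pos += 1                 # full width reached: start rolling
        win = data[pos:i]
        if win in known:             # stand-in for a rolling-hash lookup
            if pos > behind:         # emit what the window left behind
                out.append(("literal", data[behind:pos]))
            out.append(("match", win))
            behind = pos = i         # restart: window grows again
        elif pos - behind >= CHUNK:  # enough left behind for a full chunk
            out.append(("chunk", data[behind:behind + CHUNK]))
            known.add(data[behind:behind + CHUNK])
            behind += CHUNK          # the window itself keeps rolling
    if behind < len(data):
        out.append(("tail", data[behind:]))
    return out
```

Here the "literal" record stands in for the "output as a new short chunk (or in place)" case, and "chunk" for the forced full-sized chunk.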
Posted Sep 2, 2013 19:53 UTC (Mon) by ttonino (subscriber, #4073)
This is obviously not random enough (binary and text behave differently). So I would propose having perhaps 512 different 3-byte markers, for an average block size of 32 KB. These blocks can then be deduplicated with little processing by computing their hash and perhaps length.
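The 32 KB figure checks out: choosing 512 of the 2^24 possible 3-byte values gives a boundary probability of 512/2^24 = 1/32768 per position. A toy version of the idea (my own construction; the hash trick below merely simulates picking 512 fixed markers):

```python
import hashlib

def is_marker(triple):
    # Stand-in for "triple is one of the 512 chosen markers": accept a
    # triple when 15 bits of its hash are zero, which selects
    # 512 / 2**24 of all triples on average.
    h = int.from_bytes(hashlib.blake2b(triple, digest_size=4).digest(), "big")
    return h & 0x7FFF == 0

def marker_chunks(data):
    """Cut a block after every 3-byte marker occurrence."""
    chunks, start = [], 0
    for i in range(len(data) - 2):
        if i < start:            # skip triples straddling the last cut
            continue
        if is_marker(data[i:i + 3]):
            chunks.append(data[start:i + 3])
            start = i + 3
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Each resulting block can then be identified by a strong hash (plus length) for the dedup lookup, as suggested above.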
Posted Sep 3, 2013 4:28 UTC (Tue) by khc (guest, #45209)
Posted Aug 29, 2013 14:06 UTC (Thu) by LenZ (subscriber, #1051)
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds