LWN.net Logo

Changing Hash Function without wholesale rehashing/rebuilding

Changing Hash Function without wholesale rehashing/rebuilding

Posted Apr 29, 2011 1:17 UTC (Fri) by orcmid (guest, #74478)
In reply to: Linear Hashing Work-Around Potential - LoadFactor clarification by orcmid
Parent article: Google Linux servers hit with $5m patent infringement verdict (The Register)

I thought about the need to retire hash(key) in favor of an incompatible hash'(key) some more (in the shower naturally), and I realized that even if hash'(key) doesn't produce the same low order bits, it is easier to migrate than wholesale rebuilding.

Here is a brief sketch with lots of important details left out.

I assume that N=1 and we use Mmin, M, LFmin and LFmax as I have come to be using them in this series of comments. This is all for simplicity.

When it is necessary to migrate the current hash-index base to another hash-indexed base, do it this way.

1. Arrange, for the to-be-retired base, for a level to be exactly completed, either by shrinking or splitting. So M is at some 2^level - 1.

2. Set up the new (initially-empty) base, base'. If this is a linear hashing base too, then that's cool because it can be set to some Mmin' initial number of (empty) buckets and we will let it grow naturally. The technique we use will allow base' to grow gradually as the old base shrinks.

3. Stop inserting new records in the old base. The procedure will be such that new records not found in either base will be added to base' only.

4. Adjust the hashing bucket(key,level) algorithm in the old base so that whenever hash(key) > M, we don't use the bf(key, level-1), instead we go to base' and look for it there. If it is not there, we insert it there. Likewise, if we do use bf(key, level-1) and we don't find the record, we go to the new base and insert it there. (It shouldn't be there already.)

5. When we are ready to decrease M (probably fixing LFmin to some large value and keeping LFmax always larger than that and what the current LF is, to encourage that, e.g., LFmax > LFmin > LF), instead of merging the same-hash values in bucket [M] back to the split buddy of [M], we instead insert those records into base' and then reduce M by 1. We will never look for those records in the old base any longer.

6. Now, at some point we will want to stop searching first in the old base but start searching in the new base. If the record is not in the new base, we should search the old base for it (hashing with the old hash of course). If it is in the old base, we are done looking although it *might* be worth removing it from the old base and inserting it in the new base. I don't have any intuition on when that trade-off is important. The easy way is simply to end the query. If the query in the old base doesn't find the record though, then the old base sends the record back to the new base to be inserted.

7. The bottom line is that the new base never returns a record/query that comes to it from the old base, but when it is accepting queries directly, it will forward to the old base if it can't find the record in its base.

8. This continues until, at some point, the old base is compacted enough that it is simply time to move everything to the new base. And it is simple enough to have that be when M = Mmin = 0 and old-base bucket [0] is all that's left (the mask has gone to 0, in effect).


(Log in to post comments)

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds