
# Linear Hashing Work-Around Potential - LoadFactor clarification


Posted Apr 25, 2011 18:01 UTC (Mon) by orcmid (guest, #74478)
In reply to: Linear Hashing Work-Around Potential - LoadFactor clarification by Wol
Parent article: Google Linux servers hit with \$5m patent infringement verdict (The Register)

Thanks for this and the previous expansion on the linear hashing implementations you're experienced with.

The load-factor definition I use is the average number of keys that hash to the same value over the set of possible hashes (more precisely, that land in the same bucket).

I should clarify what I mean by same-hash. Typically, the adjustable bucket function that works with linear hashing is something like

bf(key,level) = hash(key) mod (N*2^level)

where N is the minimum (starting) number of buckets and level is initially 0.

Once splitting starts, there are at any time at most two of these functions being used.

Let M (not less than N-1) be the highest-numbered bucket currently in use, and let N*2^level be the least such value greater than M.

Then the rule for finding the correct bucket is,

bucket(key,level) = if bf(key,level) > M then bf(key,level-1) else bf(key,level)
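A minimal sketch of this two-function rule (the parameterization and `hash_fn` argument are illustrative assumptions, not part of any particular implementation):

```python
def bf(key, level, N, hash_fn):
    """The adjustable bucket function: hash(key) mod (N * 2^level)."""
    return hash_fn(key) % (N * 2 ** level)

def bucket(key, level, M, N, hash_fn):
    """Fall back to the previous level whenever bf points past the last bucket M."""
    b = bf(key, level, N, hash_fn)
    return bf(key, level - 1, N, hash_fn) if b > M else b
```

For example, with N=4, level=1, and M=5 (buckets 0..5 in use, the 8-bucket level not yet complete), a key hashing to 6 gives bf=6, which exceeds M, so the level-0 function 6 mod 4 = 2 is used instead.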

Note that for the pathological case, I had hash(key) be what was constant. It doesn't matter what constant, but 0 is easiest to visualize.

In the simplest linked-list in-memory case, a bucket holds one record directly (in the head entry of the linked list) and anything else is linked from that. An empty bucket has some sort of special NULL entry so it can be distinguished from a bucket holding exactly one record. Note that if N is a power of 2, the bucket(key, level) algorithm becomes very simple, since masking can be used instead of division, but more attention must then be paid to the quality of hash(key). I prefer that case because masking can dramatically speed up the entire process of getting to a specific bucket through whatever the bucket directory structure is.
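The head-entry layout described above might be sketched like this (the `Cell` and `insert` names are made up for illustration):

```python
class Cell:
    """A linked-list cell; the bucket slot itself is the head cell."""
    def __init__(self, record=None, link=None):
        self.record = record   # None plays the role of the special NULL entry
        self.link = link

def insert(buckets, i, record):
    head = buckets[i]
    if head.record is None:            # empty bucket: store directly in the head
        head.record = record
    else:                              # otherwise chain a new cell behind the head
        head.link = Cell(record, head.link)

buckets = [Cell() for _ in range(4)]   # four empty buckets
insert(buckets, 1, "a")
insert(buckets, 1, "b")
```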

If we are talking about disk spill, and on-disk buckets, the overflow case is different and different load factor values are more important. For example, one would like to know the average number of disk blocks taken for buckets corresponding to the same hash, on the assumption that one wants to minimize the number of buckets with overflows to the point where the extra hit has a negligible impact on performance. I imagine for tuning purposes, the load factor as I described it also matters because it gives a clue as to how much empty space is carried in the first/primary disk block of a bucket, since having too many disk blocks that are nearly empty has its own performance consequences. It looks like these load factors (being estimates) are derivable from each other assuming primary and overflow blocks are all of the same size.

We've wandered a distance from workarounds for '120, which I think are more easily handled without going to linear hashing if it is not already being used in the Linux kernel code that infringes an essential claim. Thanks for the exploration though.

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 25, 2011 19:48 UTC (Mon) by Wol (guest, #4433) [Link]

In UniVerse (the current implementation with which I am familiar, Prime having gone bust) you can adjust the size of the bucket. In INFORMATION it was stuck at 2K (which gave you a maximum key size of 1600 bytes because it needed to fit in one bucket).

In practice, strange as it may seem, I don't think records spilt out of the primary block that much - they couldn't have if my comment about needing only 1.05 reads average to find a record is correct ... :-)

At worst, even if all records were oversized, you'd need the 1.05 reads to locate it, and then one more read to actually get it because the primary bucket would tell you where it was.

btw, you said that "if N is a power of 2, masking can be used". I think you've got a bit confused, but you've also missed a trick. You can always use masking. First of all, I wouldn't use N at all in the hash function. It would only be used in split/merge. I had some trouble getting my head round your use of N, but it does appear to work :-)

However, what I'd do is

Given M buckets, calculate P such that 2^(P-1) <= M < 2^P, and let mask = 2^P - 1.

If (hash AND mask) < M then that's our bucket, else (hash AND rightshift(mask)) gives us our bucket.

Then

if loadfactor > 80% then split
if loadfactor < 50% AND M > N then merge
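Under two assumptions (that M here counts buckets 0..M-1, and that the masking is a bitwise AND against an all-ones mask), Wol's scheme above might be sketched as:

```python
def wol_bucket(h, M):
    """Always-mask bucket selection for M buckets numbered 0..M-1."""
    P = M.bit_length()          # satisfies 2^(P-1) <= M < 2^P
    mask = (1 << P) - 1
    b = h & mask
    return b if b < M else h & (mask >> 1)   # fall back to the narrower mask

def maybe_resize(record_count, M, N):
    """Split above 80% load; merge below 50%, but never below N buckets."""
    lf = record_count / M
    if lf > 0.8:
        return M + 1
    if lf < 0.5 and M > N:
        return M - 1
    return M
```

For instance, with M=5 buckets, P=3 and mask=7: a hash of 6 masks to 6, which is out of range, so the narrower mask 3 yields bucket 2.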

Cheers,
Wol

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 25, 2011 19:52 UTC (Mon) by Wol (guest, #4433) [Link]

Actually, that explains something I didn't understand before ... in both UniData and UniVerse, N is called MINIMUM.MODULUS. Apparently in UniData (with which I am unfamiliar), changing this triggers a complete file rebuild, and I never understood why.

Your version of linear hashing explains why ... :-)

Cheers,
Wol

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 28, 2011 17:48 UTC (Thu) by orcmid (guest, #74478) [Link]

If N, the MINIMUM.MODULUS is a power of 2, it is possible to change it without consequence. It means that level numbers might seem to change, but there is a way out of that. Instead of using N as the minimum, just set some Mmin as the minimum number of buckets. Let N be any power of 2 not greater than Mmin. It might as well be 1. Now you can adjust Mmin all you want, although making it greater than the current M forces splitting. Reducing it simply allows the splitting algorithm to shrink the database if that is what is wanted. (There are probably some games with LFmin and LFmax to coax things along.)

Now, if you decide to adjust Mmin for some practical reason, do so! In the creation of an initial hash table, one wouldn't do it by splitting on arrivals but by creating Mmin empty buckets, with the current level-being-completed set to whatever it needs to be to allocate into those M = Mmin buckets properly.

I think the bigger problem is when hash(key) needs to change because it doesn't provide enough good uniformly random bits for the level that has been reached or there are not enough bits in it to increase the level at all.

Changing hash(key) arbitrarily breaks the linearity scheme by which the hash-table grows and shrinks linearly at the end and splitting depends on rehashing of one bucket dividing things between the current bucket and the new bucket at the end.

However, you can safely change hash(key) to a new hash'(key) that still produces the same low order bits but provides more or better high-order bits, ones that aren't selected by the current biggest mask so there are no buckets that are selected by them in the current state of the system.
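A sketch of that low-bits-preserving replacement (the `hash1`, `better_hash`, and `LEVEL_BITS` names and constants below are invented stand-ins, not anyone's real hash functions):

```python
LEVEL_BITS = 10                 # bits selected by the current biggest mask

def hash1(key):
    """Stand-in for the old hash(key)."""
    return key * 2654435761 % 2**32

def better_hash(key):
    """Stand-in for a hash with better-distributed high bits."""
    return (key ^ (key >> 16)) * 0x45d9f3b % 2**32

def hash2(key):
    """hash'(key): same low-order bits as hash1, fresh high-order bits."""
    low = hash1(key) & ((1 << LEVEL_BITS) - 1)    # keep the bits in use
    high = better_hash(key) >> LEVEL_BITS          # new bits above the mask
    return (high << LEVEL_BITS) | low
```

Since the current masks only ever select the low LEVEL_BITS bits, every record already in the table hashes to the same bucket under hash2 as under hash1, so no rebuild is forced.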

The above arrangements work only if N hasn't been wired into other aspects of the implementation. It is important to make the design of the expandable hash-indexed directory to the buckets, and of any clustering of buckets, flexible enough to handle variation in N and, better, fluctuation in Mmin. It should be possible to adjust and tune that scheme independently. The scheme needs to be robust enough that the base never has to be rebuilt by rehashing everything, since that is terrible for high-availability distributed cases where we are talking about records and buckets/overflows on disk.

(If there is a problem at the level of the current mask, one can adjust LFmax and LFmin to force the system to shrink to a lower level if the performance hit while waiting to get to the desired lower completed level is tolerable. The techniques for distributed operation of linear hashed bases might be relevant if it is necessary to do anything more drastic than that in order to institute an entirely different hash(key) in an alternate base that allows the original base to be switched off at an appropriate time.)

Changing Hash Function without wholesale rehashing/rebuilding

Posted Apr 29, 2011 1:17 UTC (Fri) by orcmid (guest, #74478) [Link]

I thought about the need to retire hash(key) in favor of an incompatible hash'(key) some more (in the shower naturally), and I realized that even if hash'(key) doesn't produce the same low order bits, it is easier to migrate than wholesale rebuilding.

Here is a brief sketch with lots of important details left out.

I assume that N=1 and we use Mmin, M, LFmin and LFmax as I have come to be using them in this series of comments. This is all for simplicity.

When it is necessary to migrate the current hash-index base to another hash-indexed base, do it this way.

1. Arrange, for the to-be-retired base, for a level to be exactly completed, either by shrinking or splitting. So M is at some 2^level - 1.

2. Set up the new (initially-empty) base, base'. If this is a linear hashing base too, then that's cool because it can be set to some Mmin' initial number of (empty) buckets and we will let it grow naturally. The technique we use will allow base' to grow gradually as the old base shrinks.

3. Stop inserting new records in the old base. The procedure will be such that new records not found in either base will be added to base' only.

4. Adjust the hashing bucket(key,level) algorithm in the old base so that whenever bf(key, level) > M, we don't use bf(key, level-1); instead we go to base' and look for the record there. If it is not there, we insert it there. Likewise, if we do use bf(key, level-1) and we don't find the record, we go to the new base and insert it there. (It shouldn't be there already.)

5. When we are ready to decrease M (probably by fixing LFmin to some large value and keeping LFmax larger than both that and the current LF, so that LFmax > LFmin > LF encourages shrinking), instead of merging the same-hash values in bucket [M] back into the split buddy of [M], we insert those records into base' and then reduce M by 1. We will never look for those records in the old base again.

6. Now, at some point we will want to stop searching first in the old base but start searching in the new base. If the record is not in the new base, we should search the old base for it (hashing with the old hash of course). If it is in the old base, we are done looking although it *might* be worth removing it from the old base and inserting it in the new base. I don't have any intuition on when that trade-off is important. The easy way is simply to end the query. If the query in the old base doesn't find the record though, then the old base sends the record back to the new base to be inserted.

7. The bottom line is that the new base never returns a record/query that comes to it from the old base, but when it is accepting queries directly, it will forward to the old base if it can't find the record in its base.

8. This continues until, at some point, the old base is compacted enough that it is simply time to move everything to the new base. And it is simple enough to have that be when M = Mmin = 0 and old-base bucket [0] is all that's left (the mask has gone to 0, in effect).
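A much-simplified, single-process sketch of the later phase of this migration (steps 5-7, where the new base is searched first); the `Migration` class and dict stand-ins for the two hashed bases are illustrative only:

```python
class Migration:
    def __init__(self, old, new):
        self.old, self.new = old, new   # dict stand-ins for the two bases

    def lookup_or_insert(self, key, record=None):
        # Steps 6/7: search the new base first, fall back to the old one;
        # records not found in either are inserted in the new base only.
        if key in self.new:
            return self.new[key]
        if key in self.old:
            return self.old[key]        # optionally migrate it here
        if record is not None:
            self.new[key] = record
            return record
        return None

    def shrink_step(self, bucket_keys):
        # Step 5: rather than merging bucket [M] back into its split buddy,
        # push its records into the new base and retire the bucket.
        for k in bucket_keys:
            self.new[k] = self.old.pop(k)
```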

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 26, 2011 2:43 UTC (Tue) by orcmid (guest, #74478) [Link]

I used the formulation given in the two now-classical papers on Linear Hashing.

I was having the modulus always be a power of 2. Then the bit mask to compute it for a given bf(key,level) is 2^(level+Nbits)-1 where N = 2 ^ Nbits is the minimum number of buckets (which are then repeatedly doubled as new levels are completed). So the bucket algorithm can be optimized to use the two masks (one of which is double the other + 1) precomputed and it is easy to adjust them as level is expanded or contracted. If hash(key) is also cached in the linked-list entries for the same bucket, rehashing becomes trivial. (This could matter in-memory more than if disk accesses are involved, but there are the usual time-space trade-offs to ponder, including whatever the cost of computing hash(key) is.)
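The precomputed mask pair might look like this (Nbits, level, and M below are arbitrary illustrative values):

```python
Nbits = 2                               # N = 2^Nbits = 4 starting buckets
level = 3
mask_hi = (1 << (level + Nbits)) - 1    # 2^(level+Nbits) - 1, selects N*2^level buckets
mask_lo = mask_hi >> 1                  # the level-1 mask; mask_hi == 2*mask_lo + 1

def masked_bucket(h, M):
    """bucket(key, level) via the two masks; M is the highest live bucket."""
    b = h & mask_hi
    return b if b <= M else h & mask_lo
```

Adjusting for a level change is then just a shift on each mask rather than any recomputation of moduli.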

For the linked-list case in memory, I used a bucket size of one, meaning exactly one linked-list cell (the head) per bucket.

However, how buckets are organized into blocks/clusters of contiguous storage (or disk) and how overflows are kept near or in those blocks is also something interesting to explore, since that also matters when dealing with either disk blocks or virtual-memory allocations. Since the splitting algorithm runs sequentially through the buckets, I speculate that having clusters with location affinity can be important.

I haven't looked at the time-space trade-offs enough to have formulated a simulation and determined the sensitivity to different bucket directory and block/cluster organizations. Too many opportunities, not enough time ...

I think you and I are looking at the same sort of considerations, we are just cutting up the storage organization in different but more-or-less equivalent ways.

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 26, 2011 2:55 UTC (Tue) by orcmid (guest, #74478) [Link]

For some reason, an edit I tried to make didn't get into the immediately preceding post:

Although caching hash(key) might be useful to avoid its recomputation when splitting a bucket, it is not needed at all when removing a bucket, since we know absolutely what bucket the removed-bucket's records go into. It is the bucket number (actually, M) masked by the smaller mask at the time of the merge-back and before M is decreased.

Whether or not hash(key) is cached, it is a nice value to keep as the record itself in a fast simulation looking for where splitting doesn't catch up fast enough and what the headroom needs to be on load factor, etc. OK, that's more than enough speculation from me on one day.

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 26, 2011 13:11 UTC (Tue) by Wol (guest, #4433) [Link]

Trying to get my head round what you're saying ... but I think you're using "minimum buckets" completely differently to my MINIMUM.MODULUS. It looks to me like you're using Nbits to keep level always at 0 or 1.

MINIMUM.MODULUS cannot change on the fly. As is shown by my comment about UniData :-) The recommended use for it is where a file can shrink, grow, or be cleared regularly, so its size is very variable, but it also has a "normal" size. By declaring a minimum size equal to the typical size, you avoid a lot of unnecessary split/merge activity. The other use is where you are loading a file with a lot of data - again by setting the minimum modulus to whatever it will be once the file is loaded, you're saving a load of activity while the file is built.

Mind you, for my first implementation, to calculate the bucket I used the following pseudo-code

unsigned int mask = -1;   // all ones -- yes I know :-)
bucket = hash;
while (bucket > current_modulo) {
    mask = rightshift(mask);
    bucket = hash AND mask;
}

Assuming "int" is a short, that means I've got a valid bucket with less than 16 iterations. And the only operations I'm using are and, shift, and compare. All cheap compared with multiply, divide, and mod, I think.

Even cheaper if rather than calculating mask each time, I cache the upper value :-) Unless you're on a Z80, it's quicker to get the lower mask by right shifting the upper, rather than going the other way by doubling and incrementing. (I gather the Z80, due to a bug in the silicon, didn't have a leftshift function. Then somebody tested the instruction that should be leftshift and discovered it set the new rightmost bit to 1. So what should have been leftshift became "left shift and increment", at first informally then Zilog actually documented it because everyone was using it :-)

Cheers,
Wol

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 27, 2011 18:59 UTC (Wed) by orcmid (guest, #74478) [Link]

No, I never change the value of N which you remark is also known as MINIMUM.MODULUS. And level grows as necessary to keep the load factor in bounds. Every time the load factor gets too large, a bucket is split. That is, a new bucket M+1 shows up and it is a split of its "mirror" bucket farther back in the preceding buckets. Every time a level is filled, splitting starts over from bucket 0 moving forward through all of them to the end of the just-completed level. Each completed level has double the number of buckets that the previous level did.

It is also possible for the number of used buckets to decrease when the load factor goes too small, and I allowed for that as well.

I did some interesting back-of-the-envelope analysis and there is an amusing situation. At the time a level is completed, the expected occupancy of bucket [0] is the load factor. But the expected occupancy of the last bucket to be split in completing the level is double the load factor at that point in time. That's because it is still double the size of the already-split buckets, under the uniform-random assumption. (It is still a level-1 bucket, and splitting other buckets has no effect on its share of the load factor until it too is split.) Splitting then leaves the load-factor-expected content in both the shrunk bucket and its split partner, which is the instantaneous expectancy for all of the buckets in the just-completed level.

This saw-tooth effect may have to be taken into account when wanting to avoid bad congestion, especially if we are talking about the likelihood of some sort of disk or memory block overflow into another block at the 2*LFmax expectancy.

Of course, even for uniform randomness one needs to account for variance as well. I haven't looked at that and I have no idea how important this observation is under practical conditions.

PS: By masking I mean computing the equivalent of hash(key) mod (N*2^level) by simply extracting the low-order bits of hash(key) with a mask of level+log2(N) bits, where N is a power of 2. It is trivial to keep a pair of current mask values, one for level and one for level-1, implementing bucket(key,level) very quickly and adjusting them as level changes. And if we need the bucket number to be a multiple of a power of 2 (for pointer manipulation), we can handle that by adjusting the masks too. It's all about getting through the directory of buckets to the bucket slot as quickly as possible.

Linear Hashing Work-Around Potential - LoadFactor clarification

Posted Apr 27, 2011 19:18 UTC (Wed) by orcmid (guest, #74478) [Link]

Oh. NO, Nbits is just the number of bits needed to address the N starting buckets, for N a power of 2, and it is constant for the hash table regardless of level. From the mask formula it is exactly log2(N) (the bit-length of N itself is log2(N)+1, which is where the off-by-one temptation comes from).

Level moves up and down as needed. However, at any point, we only have bf(key, level) and bf(key, level-1) in use.

The initial conditions are level = 1, M = N-1, and buckets 0 to N-1 are ready and empty. So the LF is initially 0, but we never shrink below M = N-1 anyhow. With linear hashing, we always grow the buckets linearly at the end, so the first bucket to split is bucket [0], and it splits between itself and new bucket M+1. (M is then advanced by 1.)

At the time of the first split, presumably LF > LFmax has happened and let us assume for simplicity that it is over just a hair and we can assume LF = LFmax for convenience.

Then bucket [0] is expected to have LF records in it and after the split it will have kept an expected LF/2 of them with new bucket [N] getting the other expected LF/2. By the time the level is complete (so there are 2N buckets), each of them will again have an expected LF records in them (which is how the saw-tooth fits in). Then we start splitting again from bucket [0], M = 2N-1, and we work on building out the next level, which will end up with a total of 4N buckets if it is completed. Rinse. Repeat.
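This walkthrough can be exercised with a small simulation; it is only a sketch under the uniform-random assumption, and N, the load-factor threshold, and the record count are arbitrary choices:

```python
import random

random.seed(7)
N = 4                          # MINIMUM.MODULUS, a power of 2
level, M = 1, N - 1            # the initial conditions above
lf_max = 3.0                   # split when average occupancy exceeds this
buckets = [[] for _ in range(N)]

def bucket_of(h):
    b = h % (N * 2 ** level)
    return b if b <= M else h % (N * 2 ** (level - 1))

for h in (random.getrandbits(32) for _ in range(300)):
    buckets[bucket_of(h)].append(h)
    if sum(map(len, buckets)) / (M + 1) > lf_max:
        s = M + 1 - N * 2 ** (level - 1)    # the next bucket to split
        M += 1
        top = N * 2 ** level
        old, buckets[s] = buckets[s], []
        buckets.append([])
        for x in old:                        # rehash bucket [s] between
            (buckets[M] if x % top == M else buckets[s]).append(x)
        if M + 1 == top:                     # level completed: double the
            level += 1                       # modulus, splitting restarts at 0
```

Instrumenting this loop (e.g. comparing len(buckets[s]) against the already-split buckets just before each split) is one way to watch the saw-tooth occupancy appear.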