Trying to get my head round what you're saying ... but I think you're using "minimum buckets" completely differently to my MINIMUM.MODULUS. It looks to me like you're using Nbits to keep level always at 0 or 1.
MINIMUM.MODULUS cannot change on the fly. As is shown by my comment about UniData :-) The recommended use for it is where a file can shrink, grow, or be cleared regularly, so its size is very variable, but it also has a "normal" size. By declaring a minimum size equal to the typical size, you avoid a lot of unnecessary split/merge activity. The other use is where you are loading a file with a lot of data - again by setting the minimum modulus to whatever it will be once the file is loaded, you're saving a load of activity while the file is built.
Mind you, for my first implementation, to calculate the bucket I used the following pseudo-code
unsigned int mask = -1 // yes I know :-)
bucket = hash;
while (bucket gt current-modulo)
mask = rightshift( mask)
bucket = and( bucket, mask)
repeat
Assuming "int" is a short, that means I've got a valid bucket with less than 16 iterations. And the only operations I'm using are and, shift, and compare. All cheap compared with multiply, divide, and mod, I think.
Even cheaper if rather than calculating mask each time, I cache the upper value :-) Unless you're on a Z80, it's quicker to get the lower mask by right shifting the upper, rather than going the other way by doubling and incrementing. (I gather the Z80, due to a bug in the silicon, didn't have a leftshift function. Then somebody tested the instruction that should be leftshift and discovered it set the new rightmost bit to 1. So what should have been leftshift became "left shift and increment", at first informally then Zilog actually documented it because everyone was using it :-)
Linear Hashing Work-Around Potential - LoadFactor clarification
Posted Apr 27, 2011 18:59 UTC (Wed) by orcmid (guest, #74478)
[Link]
No, I never change the value of N which you remark is also known as MINIMUM.MODULUS. And level grows as necessary to keep the load factor in bounds. Every time the load factor gets too large, a bucket is split. That is, a new bucket M+1 shows up and it is a split of its "mirror" bucket farther back in the preceding buckets. Every time a level is filled, splitting starts over from bucket 0 moving forward through all of them to the end of the just-completed level. Each completed level has double the number of buckets that the previous level did.
It is also possible for the number of used buckets to decrease when the load factor goes too small, and I allowed for that as well.
I did some interesting back-of-the-envelope analysis and there is an amusing situation. At the time a level is completed, the expected occupancy of bucket [0] is the load factor. But the expected occupancy of the last bucket to be split in completing the level is double the load factor at that point in time. That's because it is still double the size of the already-split buckets, under the uniform random assumption. (It is still a (level)-1 bucket and splitting other buckets has no effect on its share of the load factor until it too is split.) Splitting then has there be load-factor-expected content in the shrunk bucket and its split partner, which is the instantaneous expectancy for all of the buckets in the just completed level.
This saw-tooth effect may have to be taken into account when wanting to avoid bad congestion, especially if we are talking about the likelihood of some sort of disk or memory block overflow into another block at the 2*LFmax expectancy.
Of course, even for uniform randomness one needs to account for variance as well. I haven't looked at that and I have no idea how important this observation is under practical conditions.
PS: By masking I mean computing the equivalent of hash(key) mod (N*2**level) by simply extracting the low-order bits of hash(key) by a mask of level+log2(N) bits where N is a power of 2. It is trivial to keep a pair of current mask values, one for level, one for level-1 and implementing bucket(key,level) very quickly, adjusting them as level changes. And if we need the bucket number to be a multiple of a power of 2 (for pointer manipulation), we can handle that by adjusting the masks too. It's all about getting through the directory of buckets to the bucket slot as quickly as possible.
Linear Hashing Work-Around Potential - LoadFactor clarification
Posted Apr 27, 2011 19:18 UTC (Wed) by orcmid (guest, #74478)
[Link]
Oh. NO, Nbits is just the number of bits in N, for N a power of 2, and which is constant for the hash table, regardless of level. (I may be off by one here). I think it is log2(N)+1.
Level moves up and down as needed. However, at any point, we only have bf(key, level) and bf(key, level-1) in use.
The initial conditions are level = 1, M = N-1, and buckets 0 to N-1 are ready and empty. So the LF is initially 0 but we never shrink below N-1 anyhow. With linear hashing, we always grow the buckets linearly at the end, so the first bucket to split is bucket [0] and it splits between itself and new bucket M+1. (Then M is then advanced by 1).
At the time of the first split, presumably LF > LFmax has happened and let us assume for simplicity that it is over just a hair and we can assume LF = LFmax for convenience.
Then bucket [0] is expected to have LF records in it and after the split it will have kept an expected LF/2 of them with new bucket [N] getting the other expected LF/2. By the time the level is complete (so there are 2N buckets), each of them will again have an expected LF records in them (which is how the saw-tooth fits in). Then we start splitting again from bucket [0], M = 2N-1, and we work on building out the next level, which will end up with a total of 4N buckets if it is completed. Rinse. Repeat.