Thanks for this and the previous expansion on the linear hashing implementations you're experienced with.
The load factor definition I use has to do with the average number of keys that share the same hash over the set of possible hashes (more precisely, the same bucket).
I should clarify what I mean by same-hash. Typically, the adjustable bucket function that works with linear hashing is something like
bf(key,level) = hash(key) mod (N*2^level)
where N is the minimum (starting) number of buckets and level is initially 0.
Once splitting starts, there are at any time at most two of these functions being used.
Let M (not less than N-1) be the highest-numbered bucket currently in use and let N*2^level be the least value of that form greater than M.
Then the rule for finding the correct bucket is,
bucket(key,level) = if bf(key,level) > M then bf(key,level-1) else bf(key,level)
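The addressing rule above can be sketched in C roughly as follows. This is only an illustration under my own assumptions: hash() here is a stand-in mixer, and the names bf/bucket mirror the pseudocode rather than any real implementation. Note that at level 0 the fallback branch can never be taken (bf(key,0) is at most N-1, which is at most M), so bf(key, level-1) is only ever evaluated when level is at least 1.

```c
#include <stdint.h>

/* Stand-in for whatever hash function is actually in use. */
static uint64_t hash(uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL; /* multiplicative mixer */
}

/* bf(key, level) = hash(key) mod (N * 2^level) */
static uint64_t bf(uint64_t key, unsigned level, uint64_t N) {
    return hash(key) % (N << level);
}

/* bucket(key, level): if the computed bucket has not been created
   yet (index > M), fall back one level.  The fallback is only
   reached when level >= 1, since bf(key, 0) <= N-1 <= M. */
static uint64_t bucket(uint64_t key, unsigned level,
                       uint64_t N, uint64_t M) {
    uint64_t b = bf(key, level, N);
    return (b > M) ? bf(key, level - 1, N) : b;
}
```

With a constant hash (the pathological case mentioned below, e.g. hash always 0), every key lands in bucket 0 no matter how far splitting has progressed.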
Note that for the pathological case, I had hash(key) be constant. It doesn't matter which constant, but 0 is easiest to visualize.
In the simplest linked-list in-memory case, a bucket holds one record directly (in the head entry of the linked list) and anything else is linked from that. An empty bucket has some sort of special NULL entry so it can be distinguished from a bucket holding a single record.

Note that if N is a power of 2, the bucket(key, level) algorithm becomes very simple, since masking can be used instead of division, but more attention must be paid to the quality of hash(key). I prefer that case because masking can be used to dramatically speed up the entire process of getting to a specific bucket through whatever the bucket directory structure is.
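The power-of-two shortcut looks something like the sketch below (again hypothetical names, with a stand-in hash). When N is a power of 2, N*2^level is also a power of 2, so the mod reduces to a mask over the low bits of hash(key), which is exactly why the quality of those low bits matters more in this case.

```c
#include <stdint.h>

/* Stand-in for whatever hash function is actually in use. */
static uint64_t hash(uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL; /* multiplicative mixer */
}

/* When N is a power of two, N << level is too, so
   hash(key) % (N << level) == hash(key) & ((N << level) - 1). */
static uint64_t bf_mask(uint64_t key, unsigned level, uint64_t N) {
    return hash(key) & ((N << level) - 1);
}
```

The mask is a single AND instead of a divide, and the same trick applies at each step of walking the bucket directory.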
If we are talking about disk spill and on-disk buckets, the overflow case is different and different load factor values matter more. For example, one would like to know the average number of disk blocks occupied by the buckets corresponding to a given hash, on the assumption that one wants to keep buckets with overflows rare enough that the extra hit has a negligible impact on performance. I imagine that for tuning purposes the load factor as I described it also matters, because it gives a clue to how much empty space is carried in the first/primary disk block of a bucket; having too many disk blocks that are nearly empty has its own performance consequences. It looks like these load factors (being estimates) are derivable from each other, assuming primary and overflow blocks are all the same size.
We've wandered a distance from workarounds for '120, which I think are more easily handled without going to linear hashing, if it is not already being used in the Linux kernel code that infringes an essential claim. Thanks for the exploration though.