
Relativistic hash tables, part 1: Algorithms

By Jonathan Corbet
September 17, 2014
Hash tables are heavily used within the kernel to speed access to objects of interest. Using a hash table will be faster than, say, a linear search through a single list, but there is always value in making accesses faster yet. Quite a bit of work has been done toward this goal over the years; for example, the use of read-copy-update (RCU) can allow the addition and removal of items from a hash bucket list without locking out readers. Some operations have proved harder to make concurrent in that manner, though, with table resizing being near the top of the list. As of 3.17, the Linux kernel has gotten around this problem with an implementation of "relativistic hash tables" that can be resized while lookups proceed concurrently. This article will describe the algorithm that is used; a companion article will look at the kernel API for these new hash tables.

One might wonder whether the resizing of hash tables is common enough to be worth optimizing. As it turns out, picking the correct size for a hash table is not easy; the kernel has many tables whose size is determined at system initialization time with a combination of heuristics and simple guesswork. But even if the initial guess is perfect, workloads can vary over time. A table that was optimally sized may, after a change, end up too small (and thus perform badly) or too big (wasting memory). Resizing the table would fix these problems, but, since that is hard to do without blocking access to the table, it tends not to happen. The longer-term performance gains are just not seen to be worth the short-term latency caused by shutting down access to the table while it is resized.

Back in 2010, Josh Triplett, Paul McKenney, and Jonathan Walpole published a paper called Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming [PDF] that described a potential solution to this problem. "Relativistic" refers to the fact that the relative timing of two events (hash table insertions, say) that are not causally related may appear different to independent observers. In other words, one CPU may see two items inserted into the table in one order, while those insertions appear to have happened in the opposite order on another CPU. Despite some interesting performance results, this technique did not find its way into the kernel until the 3.17 merge window opened, when Thomas Graf contributed an implementation for use within the networking subsystem.

For the most part, relativistic hash tables resemble the ordinary variety. They are implemented as an array of buckets and a hash function to assign objects to buckets. Each bucket contains a linked list of all items in the table that belong to that bucket. A simple table with two buckets might thus be portrayed something like this:

[Hash table]

Here, the first bucket has four items, the second bucket has three.
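
In code, such a table might be represented along the following lines. This is only an illustrative sketch (the structure and function names are invented for this article, not taken from the kernel implementation), but it shows the pieces the algorithms below manipulate: an array of bucket head pointers, a singly linked chain of objects per bucket, and a mask-based hash function. Readers traverse the chains under rcu_read_lock().

    #include <linux/hash.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Illustrative structures only; not the kernel's actual implementation. */
    struct rht_object {
        struct rht_object *next;        /* singly-linked hash chain */
        unsigned long key;
        /* ... payload ... */
    };

    struct rht_table {
        unsigned int nbuckets;          /* always a power of two */
        struct rht_object *buckets[];   /* one chain head per bucket */
    };

    /* The globally visible table pointer; updaters publish new tables here. */
    static struct rht_table __rcu *table;

    /* Map a key to a bucket; masking works because nbuckets is a power of two. */
    static inline unsigned int rht_bucket(const struct rht_table *tbl,
                                          unsigned long key)
    {
        return hash_long(key, 32) & (tbl->nbuckets - 1);
    }

    /* Reader-side lookup; the caller must hold rcu_read_lock().  Keys are
     * always compared, so a chain containing "extra" items still gives a
     * correct result. */
    static struct rht_object *rht_lookup(unsigned long key)
    {
        struct rht_table *tbl = rcu_dereference(table);
        struct rht_object *obj;

        for (obj = rcu_dereference(tbl->buckets[rht_bucket(tbl, key)]);
             obj; obj = rcu_dereference(obj->next))
            if (obj->key == key)
                return obj;
        return NULL;
    }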

Table shrinking

Now imagine that this massive hash table is deemed to be too large; it would be nice to be able to shrink it down to a single bucket. (No need to point out that a single-bucket hash table is useless; the point is to simplify the example as much as possible). The first step is to allocate the new table (without making it visible to anybody else) and to link each bucket to the first list in the old table that would hash to that bucket in the new table:

[Shrinking]

Note the assumption that all items falling into the same bucket in the old table will also fall into the same bucket in the new table. That is a requirement for this algorithm to work. In practice, it means that the size of a table can only change by an integer factor; normally that means that the size of a table can only be doubled or halved.
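
With power-of-two table sizes and mask-based bucket selection (as in the sketch above), that property comes for free when halving: masking off the top bit of the old bucket number yields the new bucket number, so items that shared an old bucket still share a new one. A tiny (again hypothetical) helper makes the point:

    /* When shrinking by a power-of-two factor, all items from one old
     * bucket land in the same new bucket. */
    static inline unsigned int rht_shrunk_bucket(unsigned int old_bucket,
                                                 unsigned int new_nbuckets)
    {
        return old_bucket & (new_nbuckets - 1);
    }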

Anyway, the new table now has some of the necessary items (the green ones) in its sole bucket, but the blue ones are still not represented there. That can be fixed by simply splicing them on to the end of the green list:

[Shrinking]

At this point, a reader traversing the green list may wander into the blue items added to the end. Since a hash lookup must always compare the keys anyway, the result will still be correct; it will just take a little longer than it did before. But that is how things will work once the operation is complete anyway. Completing simply requires pointing to the new table rather than the old:

[Shrinking]

Any new readers that come along after this point will see the new table, but there could still be concurrent readers traversing the lists in the old table. So the resize implementation must wait for an RCU grace period to pass; after that the old table can be freed and the operation is complete.
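
Putting those steps together, a shrink operation might look something like the following sketch. It reuses the hypothetical structures defined earlier and the kernel's real RCU primitives, but it is not the actual kernel code; locking against concurrent updaters is omitted.

    /* Shrink the table by an integer factor (illustrative sketch). */
    static struct rht_table *rht_shrink(struct rht_table *old)
    {
        struct rht_table *new;
        unsigned int i;

        new = kzalloc(sizeof(*new) +
                      (old->nbuckets / 2) * sizeof(new->buckets[0]),
                      GFP_KERNEL);
        if (!new)
            return NULL;
        new->nbuckets = old->nbuckets / 2;

        /* Step 1: each new bucket starts out pointing at the first old
         * chain that maps to it. */
        for (i = 0; i < new->nbuckets; i++)
            new->buckets[i] = old->buckets[i];

        /* Step 2: splice the remaining old chains onto the ends of the
         * new chains.  Readers may now wander from one old chain into
         * another, but key comparison keeps lookups correct. */
        for (i = new->nbuckets; i < old->nbuckets; i++) {
            struct rht_object **tail =
                &new->buckets[i & (new->nbuckets - 1)];

            while (*tail)
                tail = &(*tail)->next;
            rcu_assign_pointer(*tail, old->buckets[i]);
        }

        /* Step 3: publish the new table, wait out readers still using
         * the old one, then free it. */
        rcu_assign_pointer(table, new);
        synchronize_rcu();
        kfree(old);
        return new;
    }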

Table expansion

Making a table larger is a somewhat different operation, because objects appearing in a single hash bucket in the old table must be split among multiple buckets in the new table. Imagine that we start with the same two-bucket table:

[Expanding]

We now wish to expand that table to four buckets. The colors of the objects in the diagram are meant to indicate which bucket each object will land in once the resizing is complete. The first step is to allocate the new table, and, for each bucket in the new table, set a pointer to the chain containing all of the objects that belong in that bucket:

[Expanding]

Note that each pointer from the new table's hash bucket goes to the first item in the old chain that would hash to the new bucket. So, while each hash chain in the new table contains objects that do not belong there, the first object in the list is always in the correct bucket. At this point, the new table can be used for lookups, though they will be slower than they need to be, since they must pass over objects that appear to be in the wrong list.

To make the new table visible to readers, the pointer to the hash table must be changed accordingly. The expansion code must then wait for an RCU grace period to pass so that it can be sure that there are no longer any lookups running from the old table. Then the process of cleaning up ("unzipping") the lists in the new table can begin.
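
In the same sketchy style as before (hypothetical names, real RCU primitives, no concurrent-update locking), the first phase of an expansion might look like this:

    /* Expansion, phase one (illustrative sketch): set up and publish the
     * larger table, then wait for readers of the old table to finish. */
    static struct rht_table *rht_expand_start(struct rht_table *old)
    {
        struct rht_table *new;
        unsigned int i;

        new = kzalloc(sizeof(*new) +
                      (old->nbuckets * 2) * sizeof(new->buckets[0]),
                      GFP_KERNEL);
        if (!new)
            return NULL;
        new->nbuckets = old->nbuckets * 2;

        /* Point each new bucket at the first item, in the one old chain
         * that can feed it, that actually hashes to the new bucket. */
        for (i = 0; i < new->nbuckets; i++) {
            struct rht_object *obj;

            for (obj = old->buckets[i & (old->nbuckets - 1)];
                 obj; obj = obj->next)
                if (rht_bucket(new, obj->key) == i)
                    break;
            new->buckets[i] = obj;
        }

        rcu_assign_pointer(table, new); /* new readers see the new table */
        synchronize_rcu();              /* old-table readers drain away */
        return new;                     /* old table is kept for unzipping */
    }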

That process works by passing over the buckets in the old table, which has not yet been freed. For each bucket, the algorithm is somewhat like this:

  1. Determine which bucket (in the new table) the first item in the list belongs to.

  2. Advance a pointer through the list until it encounters the first item that does not belong in that bucket.

The result after processing the first bucket would look something like:

[Expanding]

Then the mismatched object, along with any like it that immediately follow, is patched out of the list by changing the "next" pointer in the previous item:

[Expanding]

At this point, no other changes can be made to this list until another RCU grace period has passed; otherwise a reader following the lists could end up in the wrong place. Remember that readers can be following these lists concurrently with the above-described change, but they will be starting from the pointers in the new table. A reader looking for a light green object may be looking at the dark green object that is patched out of the list in the first step, but, since that object's "next" pointer remains unchanged, the reader will not be derailed. Readers looking for dark green objects will have never seen the changed object to begin with, so they, too, are safe.

In other words, the algorithm works by changing pointers that are visible only to readers that are not looking for the objects that will be "unzipped" from the list. As long as only one pointer in the list is changed for each RCU grace period, this property holds and the list can be safely unzipped while readers pass through it. Note that each bucket (from the old table) can be processed in this way in each cycle, so the second bucket in this example can also be partially unzipped. So the data structure will look like this when the wait for the grace period happens after the first pass through the old table:

[Expanding]

The hash lists are now partially straightened out, but the job is not yet complete. So the unzipping process runs again, tweaking another "next" pointer in each list if needed. This process continues until it reaches the end of each of the hash lists, at which point each object appears only in its appropriate bucket:
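
The whole unzip phase, continuing the same hypothetical sketch, might be structured like the code below. The resume[] array, which remembers where each chain's next pass should start, is an invention of this sketch; the key constraint is that each pass changes at most one pointer per chain and is followed by a grace period.

    /* Expansion, phase two (illustrative sketch): unzip the chains.
     * Allocation-failure handling is omitted. */
    static void rht_unzip(struct rht_table *old, struct rht_table *new)
    {
        struct rht_object **resume;
        unsigned int i;
        bool more = true;

        resume = kcalloc(old->nbuckets, sizeof(*resume), GFP_KERNEL);
        for (i = 0; i < old->nbuckets; i++)
            resume[i] = old->buckets[i];

        while (more) {
            more = false;
            for (i = 0; i < old->nbuckets; i++) {
                struct rht_object *cur = resume[i], *next;
                unsigned int bkt;

                if (!cur)
                    continue;

                /* Steps 1 and 2: advance past the run of items sharing
                 * the current item's new bucket. */
                bkt = rht_bucket(new, cur->key);
                while (cur->next &&
                       rht_bucket(new, cur->next->key) == bkt)
                    cur = cur->next;
                if (!cur->next) {
                    resume[i] = NULL;   /* chain fully unzipped */
                    continue;
                }

                /* The next pass continues from the mismatched run. */
                resume[i] = cur->next;

                /* Patch the mismatched run out of this sub-chain by
                 * linking to the next item that does belong here. */
                for (next = cur->next->next;
                     next && rht_bucket(new, next->key) != bkt;
                     next = next->next)
                    ;
                rcu_assign_pointer(cur->next, next);
                more = true;
            }
            /* At most one pointer per chain may change per grace period. */
            synchronize_rcu();
        }
        kfree(resume);
        kfree(old);     /* the old bucket array is no longer needed */
    }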

[Finished]

If the hash lists are long, this process may take a fair amount of time. But resize operations should be relatively rare, while readers access the table frequently. So, for a sufficiently read-mostly table, the extra effort will be worth it in the end; the extra work done for the uncommon operation pays off many times over in the hot (read) path.

Other details

One topic that the above description has avoided thus far is updates to the table. A certain amount of mutual exclusion is required to update a hash table like this even in the absence of "relativistic" techniques; only one thread should be trying to update the head pointer in any given hash bucket at a time, for example. In general, updates must also be synchronized with resize operations in the same ways; a thread making a change to the table will simply have to wait until the resize operation is complete.

That said, there are some tricks that can be employed to speed up these operations while a resize is going on. For a shrinking operation, update access to each hash bucket can be allowed as soon as the splicing of the lists for that bucket is complete. Expansion is, once again, a bit more complicated. Insertion at the head of the list can be done concurrently once the new table is set up since the unzipping process will not change the head pointer. Removal, though, requires coordination with the resizer to avoid removing an item that the resizer is also operating on.
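
As a concrete (and, again, purely illustrative) example of the head-insertion case: the update side only touches the bucket's head pointer, which the unzip phase never modifies.

    /* Insert at the head of a bucket (illustrative sketch).  The caller
     * holds whatever update-side lock protects this bucket; since the
     * unzip phase never changes chain-head pointers, this insertion can
     * run while an expansion is being unzipped. */
    static void rht_insert(struct rht_table *tbl, struct rht_object *obj)
    {
        unsigned int bkt = rht_bucket(tbl, obj->key);

        obj->next = tbl->buckets[bkt];
        rcu_assign_pointer(tbl->buckets[bkt], obj);
    }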

For a lot more detail on everything described here, see the original paper.

That paper contained code implementing the relativistic hash table concept in the kernel. But, for some reason, none of this code was proposed for inclusion into the mainline. So the technique languished for nearly four years until Thomas put together his implementation, which was initially targeted at the hash table used to track netlink sockets in the kernel. Now that this infrastructure is in place, chances are that its use will expand into other areas relatively quickly.


Relativistic hash tables, part 1: Algorithms

Posted Sep 18, 2014 18:36 UTC (Thu) by ppisa (subscriber, #67307) [Link]

Nice solution.

As for hash tables, in my simple applications (no RCU) I use a solution where AVL root pointers are stored in the hash-indexed table instead of list head pointers. The items (expected to be stored in the table) include an embedded AVL node. The same thing can be done with the kernel's R-B tree. There is the overhead of an AVL/R-B node in each item, but the tree roots are cheap (a single pointer). This solution protects the hash against unfortunate or intentional hash-collision attacks, and the worst-case search complexity is O(log(n)). The size of the table can be automatically scaled to enhance lookup times. The code to test the implementation (a use example) can be found in the file ul_hashtabchk.c, with the implementation in the surrounding files:

http://sourceforge.net/p/ulan/ulut/ci/master/tree/ulut/

My two cents are not rocket science like relativistic hashes are, but the idea may be inspiring for somebody to do something better.

Relativistic hash tables, part 1: Algorithms

Posted Sep 24, 2014 15:49 UTC (Wed) by kjp (guest, #39639) [Link] (3 responses)

It would be nice to know why you can't just copy all elements and put them in the new table, then just swap the pointer to the table in one RCU cycle. I'm sure there is a reason, but it's only silently implied.

Copying all the elements

Posted Sep 24, 2014 16:04 UTC (Wed) by corbet (editor, #1) [Link]

The most likely reason is that...you would have to copy all the elements. Depending on the size of the table, that could be a lot of copying. Concurrency with modifications would also become harder.

Relativistic hash tables, part 1: Algorithms

Posted Sep 24, 2014 20:11 UTC (Wed) by josh (subscriber, #17465) [Link]

In addition to the overhead of copying all the elements, you'd break any code that holds references to those elements. In quite a few hash tables in the kernel, other code holds long-running references to hash nodes and expects them to stay alive.

When I first started working on hash table algorithms, years ago, I started out by developing an algorithm to move a single node between buckets because the key changed. My first attempt involved making a copy of the node. That works fine in a standalone hash table, but it won't work in many real kernel data structures (for instance, dcache), because it breaks other pointers to the (reference-counted) node.

That's why the algorithm shown here does *not* copy any nodes.

Relativistic hash tables, part 1: Algorithms

Posted Sep 24, 2014 23:10 UTC (Wed) by xman (subscriber, #46972) [Link]

That wouldn't get you eventual consistency, or it would make all mutations effectively single-threaded.

How would you handle the case of two simultaneous writes?

Dynamic hash tables

Posted Sep 25, 2014 10:10 UTC (Thu) by Wol (subscriber, #4433) [Link] (13 responses)

It's always puzzled me why this doesn't seem to be common. I believe it's the algorithm used in Sleepycat/Berkeley DB. It's used very extensively in the Pick world.

Basically, I can change the size of my hash table by ONE bucket at a time, at very minimal cost. Let's look at what happens to bucket no 5 (binary 101). The algorithm guarantees that, in any resize operation, the appropriate bucket number changes by the most significant bit ONLY.

So if I want to reduce my file size from a hash of 6 to 5, the highest bucket (no 5) loses its most significant bit - 101 -> 1 - and everything from bucket 5 goes into bucket 1.

Going the other way, from a hash of 13 to 14, the new bucket 13 finds all its new members were previously in bucket 5, so you just rehash that one bucket - 101 -> 1101 (leaving behind half the entries, which still map to 5 not 13, maybe they'll map to 21 next time 5 gets split - 10101).

Cheers,
Wol

Dynamic hash tables

Posted Sep 25, 2014 14:45 UTC (Thu) by perlwolf (guest, #46060) [Link] (12 responses)

Wol, that breaks the "random but tending toward equally-sized" expectation for bucket content size.

When incrementing the number of buckets by 1, a "fair" bucket distribution policy would go from k buckets, each holding 1/k of the nodes (on average with a good hashing function), to k+1 buckets each with 1/(k+1).

Your policy would go from i-j buckets of 1/(log2(i)+1) nodes plus i+2j buckets of 1/(log2(i)+2) nodes, to i-j-1 of the bigger lists and i+2j+2 of the smaller lists. (where i is the largest power of 2 <= k, and j is k-i)

So, that means there are two groups of buckets, the first group containing buckets that are twice as full as the second group.

In general, adding one bucket, or removing one bucket, would only sometimes provide any benefit depending upon whether the bucket that was split, or the buckets that were merged, were actually bigger or smaller than 1/k. Sometimes, you'll add one bucket by splitting the smallest of the large buckets; or by merging the largest of the small buckets and get no significant benefit from the change. (At worst, it would be splitting (merging) an empty bucket (pair).) Other times, it'll happen that the bucket that is split or merged will be the best candidate and there will be great benefit.

Dynamic hash tables

Posted Sep 25, 2014 16:55 UTC (Thu) by Wol (subscriber, #4433) [Link] (11 responses)

Yep, I know that, but how important is that?

Standard stats for Pick are that it splits at 80%, merges at 50%, and 95% of accesses hit on the first attempt.

You need some way of handling overflow whatever you do - how often do you need to rehash these relativistic tables? How do you handle overflow there?

It's a tradeoff - a relativistic rehash is expensive, so you need to waste disk/memory to suppress rehashes. Dynamic hashing makes much more efficient use of disk and memory, and reduces the cost of rehashing, at the price of, as you say, slightly uneven bucket filling.

But if you need to cope with buckets overfilling anyway, so what about dynamic hashing having uneven buckets ...

Cheers,
Wol

Dynamic hash tables

Posted Sep 25, 2014 17:49 UTC (Thu) by perlwolf (guest, #46060) [Link] (10 responses)

The importance is that there is probably no advantage gained by frequently splitting one entry instead of rarely splitting all of them. That's especially true here, where splitting a bucket requires waiting a number of delay times as the contents get moved to the new pair of buckets, whereas splitting all of the buckets can share a series of delay periods. Doing individual splits would mean that each one imposes a multi-delay period that requires careful processing for additions and deletions.

Consider going from 64 buckets to 128, and assume that the buckets require on the average one time delay for the initial head split and one time delay for the node splitting, (with that average coming from a number of buckets which require no extra delays, many that require one, and a few that require 2 or 3).

Doing the splits individually would have write contention for about 64 operations, each blocking for 1 to 4 delay times. Doing all of the splits at once would require a single blocking period of write contention that lasted 4 delay times.

There's also the cost of setting up the new array of bucket pointers 64 times instead of once - if that is done with an array allocation each time there is a much higher total execution cost.

For dynamic uneven buckets, one possibility might be to have two types of bucket pointer. The normal one points to a linked list, but the alternate form points to a small 2**k bucket list that uses the next bits of the hash code.

Dynamic hash tables

Posted Sep 26, 2014 0:40 UTC (Fri) by Wol (subscriber, #4433) [Link] (9 responses)

I'll need to investigate that. But, equally, it's a tradeoff again. Are you saying the big time-killer is the head-split?

Because with dynamic hashing, if you're not interested in the bucket being split, there's no delay.

What I'd do is - in my example - lock bucket 5 with a "splitting" status. Split or merge it, then free it. So any caller attempting to access that bucket knows that it needs to rehash before trying again. So the state of the head wouldn't matter, the sequence would be to lock bucket 5, split or merge it, update the head, free the lock.

If accessing any other bucket, the hash will be valid at all times. Because head is only updated AFTER the split/merge is complete. And the caller knows that when the lock is freed its original hash is invalid and needs to be recalculated.

So while I take your point that the time for a split seems to be constant for you whether you're splitting one bucket or many, you inconvenience everyone while you're splitting. I only inconvenience callers who are trying to access the actual bucket being split.

Cheers,
Wol

Dynamic hash tables

Posted Sep 26, 2014 1:54 UTC (Fri) by dlang (guest, #313) [Link] (8 responses)

If it's a good hash for the task at hand, all the buckets will be very close to equally full, so if you need to split one, you probably need to split all of them.

Also, please think carefully through all the possibilities that can come up when you have multiple threads accessing things at the same time. Not just starting from the head, but the fact that you may have them walking and updating the collision chains as well (I haven't tried to go through your explanation in enough detail to check that).

Dynamic hash tables

Posted Sep 26, 2014 10:34 UTC (Fri) by Wol (subscriber, #4433) [Link] (7 responses)

> Note the assumption that all items falling into the same bucket in the old table will also fall into the same bucket in the new table. That is a requirement for this algorithm to work. In practice, it means that the size of a table can only change by an integer factor; normally that means that the size of a table can only be doubled or halved.

Having carefully read the article (I didn't, originally), I now see that relativistic and dynamic hashing are almost exactly the same technique :-)

It's just that the above quoted statement IS WRONG (the "in practice" bit, not the other bit), and dynamic hashing takes advantage of that fact. So the question is, how is the bucket made up, and will dynamic hashing save memory. If, as it appears, each item in the list is individually allocated memory and the items are physically located randomly in memory, then dynamic hashing won't save you anything.

If, however, the bucket consists of blocks of allocated memory, that contain a linked list inside them (as is the setup with Pick, a bucket is a disk block), then you can save a lot of memory by keeping the buckets optimally full.

What I would do here (and it's why dynamic hashing won't save memory if each item is individually allocated) is to overallocate space for the table so I can grow and shrink the table without needing to lock anything. Unless I need to expand it beyond the available space at which point I hit that problem :-)

So shrinking the table is almost exactly the same - link the bucket you're getting rid of on to the end of the bucket it joins, then decrement the variable that says how big the table is.

Growing it, likewise. "Duplicate" the bucket being split into its new position, increment the variable that says how big the table is, then unzip the two buckets. With exactly the same wait periods as before.

So in other words, relativistic and dynamic hashing are almost identical! I just don't think dynamic would gain anything here because it saves on wasted storage when you store multiple items in a single block of allocated space, which doesn't seem to be the case here. The only thing it could save is on the space allocated for the table, which is peanuts in comparison to the entire list.

Thanks for the enlightenment :-)

Cheers,
Wol

Dynamic hash tables

Posted Sep 26, 2014 14:37 UTC (Fri) by perlwolf (guest, #46060) [Link] (6 responses)

What is the point of having an allocated table that is only partly used?

When you shrink that table by merging the last element with its "parent", but keep the same original table, you are occupying the same total amount of storage space, while slowing down access to one of the buckets. There is zero benefit here to offset that occasional slowdown. (Reduced memory usage from shrinking the bucket table size is the *only* benefit of reducing the number of buckets and you throw that away if you keep unused slots in the table.)

Splitting all of the buckets ensures that you split the longest/worst one(s) rather than simply splitting the last one (which might be one of the smallest buckets).

A dynamic split that can split buckets other than the last one has its own cost. The simple hash-index-listsearch becomes hash-index-maybeanotherindex-listsearch, which is a small cost (extra code and time to determine whether this is a dynamically split list) that somewhat offsets the shorter list traversals. (Taken to an extreme, you would have a binary tree, with each node being either a tree node that takes the next bit from the hash and chooses the appropriate sub-tree, or an element node that either is the desired value or means the desired value is not in the hash. But at that extreme, it has switched from an O(1) table index to an O(log n) series of tree traversals. That can be balanced by having each tree node use a variable number of hash bits to get back to O(1) behaviour.)

Dynamic hash tables

Posted Sep 26, 2014 17:38 UTC (Fri) by Wol (subscriber, #4433) [Link]

> What is the point of having an allocated table that is only partly used?

When the table points to buckets, and space is allocated at the bucket level, not the item level. So having more buckets than you need wastes a LOT of space. Which is cheaper - a few bytes for unused pointers in the table, or many kb for unused space in the buckets?

Each entry in a hash table points to a bucket. A bucket is a linked list of blocks. And a block can contain (in the generic case) any number of items. In this case, a block contains 1 item so there are no savings to be made. But in the Pick case, a block can contain maybe 5 typical items, so the difference between a file with an average 4 items per block or 2 items per block is huge.

Which is why dynamic hashing would make a lot of sense for storing i-nodes in a directory! If doubling the size of the hash table doubles the disk space used by the directory, and each block has space for, say, 10 i-nodes then that's a perfect use-case!

I initially didn't twig that memory was allocated at the item level, and dynamic hashing has been around for absolutely years (probably longer than a lot of people here have been alive!).

So this idea of dynamically splitting a hash table has been around for 40+ years, and in WIDESPREAD commercial use for over 30 of them to my personal knowledge. What's new is the trick of splitting them while they are being actively accessed - Pick databases will lock the bucket while they split it. But they only lock one bucket at a time, and reads only need to outnumber writes by a small amount before the cost of splitting is overwhelmed by the benefits of permanent near-perfect hashing. As your hash file gets bigger, the chances of any individual read tripping over a split/merge operation tends to zero ...

Cheers,
Wol

Dynamic hash tables

Posted Sep 26, 2014 17:53 UTC (Fri) by Wol (subscriber, #4433) [Link] (4 responses)

> A dynamic split that can split buckets other than the last one has its own cost. The simple hash-index-listsearch becomes hash-index-maybeanotherindex-listsearch, which is a small cost

True. But the cost is an if, a right-shift, and a mask. Probably faster, actually, than your typical mod(file modulo) operation!

For any given file modulo M, the Pick algorithm is to find N such that
2^(N-1) < M <= 2^N (or have I got my < and <= the wrong way round. Never mind.)
Then
hash = hashfunction( key)
bucket = and( hash, (2^N)-1 )
if bucket >= M then bucket = and( hash, (2^(N-1))-1 )

This is easily optimised by precomputing (2^N)-1, and right-shifting it by one for the second and. It only needs to be recalculated if M hits 2^N or 2^(N-1) on a split or merge.

Cheers,
Wol

Dynamic hash tables

Posted Sep 26, 2014 19:55 UTC (Fri) by perlwolf (guest, #46060) [Link] (3 responses)

The cost is more significant where it is some random buckets that have been split rather than having a clean break between the unsplit and the split buckets. There, you have to take the linked list pointer and add a flag to determine whether it is really pointing to a linked list or if it is actually a link to a split bucket structure. So, the code has a boolean test before traversing the list for an unsplit bucket, or before the logic to process the lower level split in some way, eventually getting down (perhaps) to a linked list to traverse.

Dynamic hash tables

Posted Sep 26, 2014 20:50 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> The cost is more significant where it is some random buckets that have been split

Which is exactly what my algorithm does NOT do.

Chances are, my algorithm converts the key to the bucket rather faster than a standard hash algorithm (it uses and(), not mod().). And there's no fancy logic past that - the bucket is the right bucket, first time, EVERY TIME. There's no such thing as a "split bucket" structure.

(And if it hits a linked list, it's a failure mode ... it's intended to identify the right disk block first time every time - a disk miss is *expensive*. It handles it - it has to - but because it's expensive it's not a good idea. The miss rate is typically 5%)

Cheers,
Wol

Dynamic hash tables

Posted Sep 26, 2014 21:17 UTC (Fri) by perlwolf (guest, #46060) [Link] (1 responses)

Ok, you are back to the structure that always splits the last of the large buckets. That is unlikely to be the bucket that has the longest chain, the one most in need of being split, so the split gains no significant benefit. A useful dynamic splitter would split a bucket when it reaches some threshold, but that is unlikely to be conveniently placed as the last large bucket. You either get a fast determination of which buckets have an extra split, or the ability to split the bucket that needs it, but not both.

Dynamic hash tables

Posted Sep 26, 2014 23:27 UTC (Fri) by Wol (subscriber, #4433) [Link]

Yes, but the thing is, it works. Statistically, the split bucket is likely to have a longer-than-average chain. And, on average, at *no* time does *any* bucket have a chain :-) So it's quite possible for me that a "split the largest bucket" would fail because it wouldn't find a "largest" bucket to split!

And in my use case (multiple items per allocation block) it doesn't seem to matter. If the average block is 80% full, I'll find the item I'm looking for in the first block I try 19 times out of 20. That's pretty good ...

It's all a tradeoff. I trade uneven clumping for a simple algorithm and a perfect modulo. I also trade multiple items per allocation block to reduce chains.

You're trading unrestricted access most of the time, and accepting that you have an imperfect modulo and every now and then you're going to get a bit of a hit as the table resizes.

I'm trading the cost of scanning a bucket, and the occasional stall as I hit a locked bucket, for the fact that most of the time I have no chains, and it's damn bad luck if a read hits a bucket that's splitting.

Both trades are appropriate for our use circumstances - a chain causes me an unwanted disk i/o stall, you're not worried about memory usage but need to prevent chains getting too long.

Cheers,
Wol

Relativistic hash tables, part 1: Algorithms

Posted Sep 29, 2014 10:25 UTC (Mon) by _xhr_ (guest, #92665) [Link] (1 responses)

@Jonathan: What are the tools you used to generate the graphics?

Diagrams

Posted Sep 29, 2014 12:34 UTC (Mon) by corbet (editor, #1) [Link]

The diagrams are all done with dia. It drives me nuts sometimes but it generally works well and the layering support makes it easy to make pictures with different combinations of arrows with minimal pain.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds