URCU-protected hash tables
The traditional way of implementing a hash table is to have an array of linked lists. However, userspace RCU instead uses a split-ordered list, which provides a lock-free RCU-protected resizable hash table. The resulting algorithms are described in the following papers:
- Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. J. ACM 53, 3 (May 2006), 379-405.
- Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures (2002), 73-82.
Readers wishing for more details on the actual implementation are encouraged to read these two papers and the source code.
In a manner similar to the linked lists, structures intended to be added to a hash table must contain a field of type cds_lfht_node, which is used internally by the hash-table implementation to link the structures together. This means that the various lookup functions return a pointer to a cds_lfht_node structure, so caa_container_of() should be used to access the enclosing structure. The actual hash function is provided by the caller: the hash-table implementation requires that key-hash pairs be supplied by the caller.
Access the RCU-protected hash table API as follows:
    /* Select RCU flavor, and then: */
    #include <urcu/rculfhash.h>
As noted in the comment, it is necessary to select the RCU flavor before including rculfhash.h. See the discussion of URCU flavors for advice on which flavor to select.
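For example, a program using the memory-barrier flavor might arrange its includes as follows (a sketch assuming liburcu's classic header scheme, in which defining RCU_MEMBARRIER before including urcu.h selects that flavor):

    #define RCU_MEMBARRIER          /* select the memory-barrier URCU flavor */
    #include <urcu.h>               /* flavor header must come before rculfhash.h */
    #include <urcu/rculfhash.h>     /* lock-free resizable hash table */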
The full API is described separately; here we cover typical usage and ordering constraints.
Typical usage
Because the cds_lfht_node structure has no provisions for user data, it is necessary to define a structure containing both the desired user data and a cds_lfht_node structure, for example, as follows:

    struct mynode {
        int value;                    /* Node content */
        struct cds_lfht_node node;    /* Chaining in hash table */
    };
In this case, the user data is a single int. We also need a pointer to the hash table itself:

    struct cds_lfht *ht;
Before using hash-table functions, each thread must register with userspace RCU:
    rcu_register_thread();
The hash table can then be allocated:
    ht = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE, NULL);
The first 1 specifies the initial number of hash buckets in the table, the second 1 specifies the minimum number of hash buckets, and the 0 specifies the maximum number of hash buckets. All three quantities must be powers of two, except for the special case of 0 for the third argument, which specifies an unlimited maximum number of hash buckets. The CDS_LFHT_AUTO_RESIZE flag indicates that the hash table may be automatically resized (a wise choice in this case, given the initial size of 1). We could also specify CDS_LFHT_ACCOUNTING to cause the table to maintain element counts, though this accounting would obviously incur some overhead. Finally, the NULL means that any worker threads will be created with default attributes. We could instead have initialized a pthread_attr_t to specify any needed pthread attributes.
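For instance, a table that starts with 64 buckets, never shrinks below 8 buckets, never grows beyond 1024 buckets, and maintains element counts might be allocated as follows (an illustrative variant, not part of the running example):

    /* 64 initial buckets, 8 minimum, 1024 maximum, with accounting. */
    ht = cds_lfht_new(64, 8, 1024,
                      CDS_LFHT_AUTO_RESIZE | CDS_LFHT_ACCOUNTING,
                      NULL);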
Assuming that the value ht is non-NULL, we are now ready to add nodes to the hash table. But first we must have a hash function, for example, the one that Bob Jenkins has graciously placed into the public domain. Mathieu created a jhash() interface to this hash function, which we will use here.
So, given a value in variable value, a pointer to a mynode structure in mp, a 32-bit randomly generated seed, and a local variable hash, we can do the following:

    hash = jhash(&value, sizeof(value), seed);
    cds_lfht_node_init(&mp->node);
    mp->value = value;
    rcu_read_lock();
    cds_lfht_add(ht, hash, &mp->node);
    rcu_read_unlock();
We have now added a node to the hash table. Please note that the RCU read-side critical section is not optional!
Quick Quiz 1: Don't we need to check for cds_lfht_add() failure?
We could allocate, initialize, and add many more nodes to the hash table, at which point we might want to iterate over them. For this we will need a cds_lfht_iter structure, in this case the local variable iter, and also a pointer to a mynode structure, here the local variable np:

    struct cds_lfht_iter iter;
    struct mynode *np;
We are then ready to iterate over the hash table under RCU protection as follows:

    rcu_read_lock();
    cds_lfht_for_each_entry(ht, &iter, np, node)
        do_something_with(np);
    rcu_read_unlock();
We can also do one-at-a-time lookups. For example, we might wish to determine whether the hash table contains a mynode structure whose value matches that of local variable value. However, all the hash table understands are the hash values passed into cds_lfht_add(), so we will first need some way to compare the values rather than the hashes. For this, we define a match() function:

    static int match(struct cds_lfht_node *ht_node, const void *_key)
    {
        struct mynode *node = caa_container_of(ht_node, struct mynode, node);
        const int *key = _key;

        return *key == node->value;
    }
Here caa_container_of() translates a pointer to the cds_lfht_node field into a pointer to the enclosing mynode structure.
Now we can call cds_lfht_lookup() to do the lookup, and then cds_lfht_iter_get_node() to extract the node pointer from the iterator:

    hash = jhash(&value, sizeof(value), seed);
    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    ht_node = cds_lfht_iter_get_node(&iter);
    if (!ht_node) {
        printf("Key %d not found\n", value);
    } else {
        mp = caa_container_of(ht_node, struct mynode, node);
        printf("Key %d found\n", mp->value);
    }
    rcu_read_unlock();
At first glance, this might seem like a lot of steps to do a hash lookup, but its great virtue is that it allows the hash table to store any kind of key. Of course, if you are using this hash table for some specific type of key, you would be wise to define access functions to automate the bookkeeping.
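For example, a hypothetical helper (not part of the liburcu API) wrapping the lookup bookkeeping for this int-keyed table might look like this:

    /* Hypothetical convenience wrapper: look up a mynode by value.
     * Must be called within an RCU read-side critical section. */
    static struct mynode *mynode_lookup(struct cds_lfht *ht, int value,
                                        uint32_t seed)
    {
        struct cds_lfht_iter iter;
        struct cds_lfht_node *ht_node;
        unsigned long hash = jhash(&value, sizeof(value), seed);

        cds_lfht_lookup(ht, hash, match, &value, &iter);
        ht_node = cds_lfht_iter_get_node(&iter);
        if (!ht_node)
            return NULL;
        return caa_container_of(ht_node, struct mynode, node);
    }

The returned pointer is safe to dereference only while the caller remains within the enclosing RCU read-side critical section.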
A small variation on the above theme removes an item from the hash table:
    hash = jhash(&value, sizeof(value), seed);
    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    ht_node = cds_lfht_iter_get_node(&iter);
    if (!ht_node) {
        printf("Key %d not found\n", value);
        rcu_read_unlock();
    } else {
        ret = cds_lfht_del(ht, ht_node);
        mp = caa_container_of(ht_node, struct mynode, node);
        if (ret) {
            printf("Key %d concurrently deleted\n", mp->value);
            rcu_read_unlock();
        } else {
            rcu_read_unlock();
            synchronize_rcu();
            printf("Key %d deleted\n", mp->value);
            free(mp);
        }
    }
Note that the lookup and the deletion (but not the free()) must both be contained within a single RCU read-side critical section.
Quick Quiz 2: Suppose that we have added mynode structures with the same ->value to the hash table. What will happen if we invoke cds_lfht_del() on that value?
Once you have removed all of the nodes from a given hash table, you may destroy it:
    ret = cds_lfht_destroy(ht, &attr);
The return value stored in ret will be non-zero if the hash table was non-empty, in which case the hash table will not have been destroyed.
Prior to a thread exiting, it should tell RCU that it is done:
    rcu_unregister_thread();
There are a number of other things that can be done to these hash tables, as can be seen from the full API, but the above example gives you the tools you will need for most situations.
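Putting these pieces together, a minimal single-threaded lifecycle might look like the following sketch (assuming the mynode type and jhash() interface above, a local copy of jhash.h, and the memory-barrier flavor; the fixed seed is for illustration only):

    #define RCU_MEMBARRIER          /* assumed flavor; see URCU flavors */
    #include <urcu.h>
    #include <urcu/rculfhash.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "jhash.h"              /* assumed local copy of the jhash() interface */

    struct mynode {
        int value;                  /* Node content */
        struct cds_lfht_node node;  /* Chaining in hash table */
    };

    int main(void)
    {
        struct cds_lfht *ht;
        struct mynode *mp;
        uint32_t seed = 42;         /* fixed seed, for illustration only */
        unsigned long hash;

        rcu_register_thread();
        ht = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE, NULL);
        if (!ht)
            abort();

        /* Allocate, initialize, and add one node. */
        mp = malloc(sizeof(*mp));
        if (!mp)
            abort();
        cds_lfht_node_init(&mp->node);
        mp->value = 5;
        hash = jhash(&mp->value, sizeof(mp->value), seed);
        rcu_read_lock();
        cds_lfht_add(ht, hash, &mp->node);
        rcu_read_unlock();

        /* ... lookups and traversals would go here ... */

        /* Remove the node, wait a grace period, then free it. */
        rcu_read_lock();
        (void)cds_lfht_del(ht, &mp->node);
        rcu_read_unlock();
        synchronize_rcu();
        free(mp);

        if (cds_lfht_destroy(ht, NULL))
            abort();
        rcu_unregister_thread();
        return 0;
    }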
Ordering Constraints
Different hash-table operations have different properties, depending on whether they are reads, read traversals, writes, or other.
Read operations include the basic cds_lfht_lookup(), cds_lfht_next_duplicate(), cds_lfht_first(), and cds_lfht_next() operations, as well as the failure case of the cds_lfht_add_unique() operation.
Read-traversal operations are any of the following groups of operations:
- cds_lfht_for_each().
- cds_lfht_for_each_duplicate().
- cds_lfht_for_each_entry().
- cds_lfht_for_each_entry_duplicate().
- cds_lfht_lookup() followed by iteration with cds_lfht_next_duplicate() (and/or cds_lfht_next(), although less common).
- cds_lfht_add_unique() (failure) followed by iteration with cds_lfht_next_duplicate() (and/or the less-common cds_lfht_next()).
- cds_lfht_first() followed by iteration with cds_lfht_next() (and/or the less-common cds_lfht_next_duplicate()).
Write operations are any of cds_lfht_add(), cds_lfht_replace(), cds_lfht_add_replace(), cds_lfht_del(), and the success case of cds_lfht_add_unique().
When cds_lfht_add_unique() succeeds (and thus returns the node passed as a parameter), it acts as a write operation. When cds_lfht_add_unique() fails (returns a node different from the one passed as a parameter), it acts as a read operation, and is equivalent to a successful cds_lfht_lookup() operation.
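For example, an insert-if-absent operation using cds_lfht_add_unique() might look like the following sketch (reusing match() and the mynode setup from the earlier examples):

    rcu_read_lock();
    ht_node = cds_lfht_add_unique(ht, hash, match, &mp->value, &mp->node);
    rcu_read_unlock();
    if (ht_node != &mp->node) {
        /* Lost the race: a node with this key was already present,
         * so our node was never published and may be freed directly. */
        free(mp);
    }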
The hash-table operations have forward-progress guarantees as follows:
- Read and read-traversal operations are wait-free. These operations always move forward in the hash table's linked list, and this list is finite and acyclic. Note, however, that a malicious and omniscient updater could in theory schedule a series of insertions and deletions that caused a reader looking for an entry with a large hash value to take a very long (but finite!) time.
- Write operations are lock-free but not wait-free: conflicting operations can force a theoretically unbounded number of retries for a given write, but some operation is always guaranteed to make forward progress.
- Other operations have no forward-progress guarantees.
The iterator filled in by cds_lfht_lookup() might be passed to cds_lfht_next(), whose resulting iterator might in turn be passed to another hash-table operation. This entire cascaded series of operations must be enclosed by a single pair of matching rcu_read_lock() and rcu_read_unlock() operations.
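For example, a read traversal over all nodes with a given key might cascade a lookup into repeated cds_lfht_next_duplicate() calls, all within one read-side critical section (a sketch reusing match() and the variables from the earlier examples):

    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    while ((ht_node = cds_lfht_iter_get_node(&iter)) != NULL) {
        mp = caa_container_of(ht_node, struct mynode, node);
        do_something_with(mp);
        cds_lfht_next_duplicate(ht, match, &value, &iter);
    }
    rcu_read_unlock();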
This hash table provides the following ordering guarantees:
- Reads after writes:
  - Read after write: If a write is ordered before a read, then the read is guaranteed to see either that write or some later write.
  - Read-traversal after write: Given that there is a dependency between consecutive reads in a read-traversal, if a write is ordered before the first read of the traversal, then the read-traversal is guaranteed to see either that write or some later write.
- Writes after reads:
  - Write after read: If a read is ordered before a write, then the read will not see the write.
  - Write after read-traversal: Given that there is a dependency between reads in a read-traversal, if the last read of the traversal is ordered before a write, then the read-traversal will not see the write.
- Read operations:
  - If a pair of lookups for a given key are ordered (e.g., by a memory barrier), then the second lookup will return the same node as the previous lookup, or some later node.
  - A read-traversal that starts after the end of a prior read-traversal (ordered by memory barriers) is guaranteed to see either the same nodes as the previous traversal, or some later nodes.
  - Concurrent read pairs: Concurrent reads are unordered. For example, if a pair of reads to the same key run concurrently with an insertion of that same key, the reads remain unordered regardless of their return values. In other words, you cannot rely on the values returned by the reads to deduce ordering.
- Write operations:
  - Write after write: If a write is ordered before another write, then the later write is guaranteed to see the effects of the earlier write.
  - Concurrent write pairs: The system will assign an arbitrary order to any pair of concurrent conflicting writes. Non-conflicting writes (for example, to different keys) are unordered.
  - Write concurrent with reads: If a write occurs concurrently with a read or read-traversal, the read might or might not see the write.
- Grace-period guarantee: If a grace period separates a deletion or replacement operation from a subsequent operation, then that subsequent operation is guaranteed not to see the removed item.
- Uniqueness guarantee: Given a hash table that contains no duplicate items for a given key, there will be at most one item per key in the hash table before, during, and after an arbitrary sequence of add-unique and/or add-replace operations. Given a series of add-replace operations on a given pre-existing key, there will never be a time when the key is missing. Note, however, that a pair of concurrent read operations might well access two different items with that key.
Alternatives
The RCU split-ordered approach has many attractive properties, but it is worthwhile to quickly review other possible approaches to resizable RCU-protected hash tables.
A key challenge is the handling of reads that run concurrently with a resize operation. The split-ordered-list approach overcomes this challenge by maintaining a single sorted list that is indexed by separate arrays of bucket pointers. Because the sorted list remains stable during a resize operation, concurrent readers traversing this list are undisturbed by the resize operation.
Another approach is to maintain two sets of list pointers in each node, so that readers traverse one set of pointers while the resize operation is updating the other set. A global index selects the set that is to be used by readers, and the update operation updates this index upon completion. Herbert Xu created a hash table using this technique, which may be found in the Linux kernel. This is arguably the simplest approach, but it does suffer from an increased memory footprint due to the extra set of list pointers.
A third approach uses new linked-list operations that allow two linked lists to share a common list tail, allowing the resize operation to "unzip" this shared list tail. This approach has been prototyped by Josh Triplett as part of his dissertation and in a USENIX paper.
The split-ordered-list approach was chosen based on its forward-progress guarantees, which should result in improved scalability and real-time response. Careful comparison of these techniques is important future work.
Answers to Quick Quizzes
Quick Quiz 1: Don't we need to check for cds_lfht_add() failure?

Answer: No, we don't. We have supplied all the memory that cds_lfht_add() needs, and duplicate keys are permitted, so there is no reason for it to fail. If you don't want duplicate keys, use cds_lfht_add_unique() instead of cds_lfht_add(). But then you will need to check for failure.
Quick Quiz 2: Suppose that we have added mynode structures with the same ->value to the hash table. What will happen if we invoke cds_lfht_del() on that value?
Answer: The trick is that cds_lfht_del() is invoked on individual nodes, not on key values. If we have multiple mynode structures with the same ->value in the hash table, we can use the cds_lfht_lookup() and cds_lfht_next_duplicate() functions to select which of the several structures having the same key to pass to cds_lfht_del().
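As a sketch (assuming a pointer victim identifying the particular mynode to remove, and reusing match() and the variables from the earlier examples):

    int deleted = 0;

    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    while ((ht_node = cds_lfht_iter_get_node(&iter)) != NULL) {
        if (ht_node == &victim->node) {
            /* Found our node; cds_lfht_del() returns 0 on success. */
            deleted = !cds_lfht_del(ht, ht_node);
            break;
        }
        cds_lfht_next_duplicate(ht, match, &value, &iter);
    }
    rcu_read_unlock();
    if (deleted) {
        synchronize_rcu();      /* wait for pre-existing readers */
        free(victim);
    }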