
URCU-protected hash tables

November 12, 2013

This article was contributed by Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan


The traditional way of implementing a hash table is to have an array of linked lists. However, user-space RCU instead uses a split-ordered list, which provides a lock-free RCU-protected resizable hash table. The resulting algorithms are described in the following papers:

  1. Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. J. ACM 53, 3 (May 2006), 379-405.

  2. Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA 2002), ACM Press, 2002, 73-82.

Readers who want more detail on the actual implementation are encouraged to read these two papers and the source code.

As with the RCU-protected linked lists, structures intended to be added to a hash table must contain a field of type cds_lfht_node, which is used internally by the hash-table implementation to link the structures together. This means that the various lookup functions return a pointer to a cds_lfht_node structure, so caa_container_of() should be used to access the enclosing structure. The hash function itself is provided by the caller; the hash-table implementation simply requires that key-hash pairs be supplied to it.

Access the RCU-protected hash table API as follows:

    /* Select RCU flavor, and then: */
    #include <urcu/rculfhash.h>

As noted in the comment, it is necessary to select the RCU flavor before including rculfhash.h. See the discussion of URCU flavors for advice on which flavor to select.
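
For example, a program content with the default flavor might write:

    #include <urcu.h>		/* Selects the default URCU flavor. */
    #include <urcu/rculfhash.h>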

The full API is described separately; typical usage and ordering constraints are covered here.

Typical usage

Because the cds_lfht_node structure has no provisions for user data, it is necessary to define a structure containing both the desired user data and a cds_lfht_node structure, for example, as follows:

    struct mynode {
	int value;			/* Node content */
	struct cds_lfht_node node;	/* Chaining in hash table */
    };

In this case, the user data is a single int. We also need a pointer to the hash table itself:

    struct cds_lfht *ht;

Before using hash-table functions, each thread must register with userspace RCU:

    rcu_register_thread();

The hash table can then be allocated:

    ht = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE, NULL);

The first 1 specifies the initial number of hash buckets in the table, the second 1 specifies the minimum number of hash buckets, and the 0 specifies the maximum number of hash buckets. All three quantities must be powers of two, except for the special case of 0 for the third argument, which specifies an unlimited maximum number of hash buckets.

The CDS_LFHT_AUTO_RESIZE indicates that the hash table may be automatically resized (a wise choice in this case, given the initial size of 1). We could also specify CDS_LFHT_ACCOUNTING to cause the table to maintain element counts, though this accounting would obviously incur some overhead. Finally, the NULL means that any worker threads will be created with default attributes. We could instead have initialized a pthread_attr_t to specify any needed pthread attributes.
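
For illustration, here is a hedged variant (the sizes 1024 and 64 are arbitrary choices, and <pthread.h> is assumed to be included) that enables both automatic resizing and accounting and supplies explicit worker-thread attributes:

    pthread_attr_t attr;

    pthread_attr_init(&attr);	/* Customize attributes here if desired. */
    ht = cds_lfht_new(1024, 64, 0,
		      CDS_LFHT_AUTO_RESIZE | CDS_LFHT_ACCOUNTING, &attr);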

Assuming that the value ht is non-NULL, we are now ready to add nodes to the hash table. But first we must have a hash function, for example, the one that Bob Jenkins has graciously placed into the public domain here. Mathieu created a jhash() interface to this hash function, which we will use here.

So, given a value in the variable value, a pointer to a mynode structure in mp, a randomly generated 32-bit seed in seed, and a local variable hash, we can do the following:

    hash = jhash(&value, sizeof(value), seed);
    cds_lfht_node_init(&mp->node);
    mp->value = value;
    rcu_read_lock();
    cds_lfht_add(ht, hash, &mp->node);
    rcu_read_unlock();

We have now added a node to the hash table. Please note that the RCU read-side critical section is not optional!

Quick Quiz 1: Don't we need to check for cds_lfht_add() failure?

Answer

We could allocate, initialize, and add many more nodes to the hash table, at which point we might want to iterate over them. For this we will need a cds_lfht_iter structure, in this case the local variable iter, and also a pointer to a mynode structure, here the local variable np:

    struct cds_lfht_iter iter;
    struct mynode *np;

We are then ready to iterate over the hash table under RCU protection as follows:

    rcu_read_lock();
    cds_lfht_for_each_entry(ht, &iter, np, node)
	do_something_with(np);
    rcu_read_unlock();

We can also do one-at-a-time lookups. For example, we might wish to determine whether the hash table contained a mynode structure whose value matched that of local variable value. However, all the hash table understands are the hash values passed into cds_lfht_add(), so we will first need some way to compare the values rather than the hashes. For this, we define a match() function:

    static
    int match(struct cds_lfht_node *ht_node, const void *_key)
    {
	struct mynode *node = caa_container_of(ht_node, struct mynode, node);
	const int *key = _key;

	return *key == node->value;
    }

Here caa_container_of() translates from a pointer to the cds_lfht_node structure to a pointer to the enclosing mynode structure.

Now we can call cds_lfht_lookup() to do the lookup, and then cds_lfht_iter_get_node() to extract the node pointer from the iterator (the local variable ht_node below is a struct cds_lfht_node pointer):

    hash = jhash(&value, sizeof(value), seed);
    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    ht_node = cds_lfht_iter_get_node(&iter);
    if (!ht_node) {
	printf("Key %d not found\n", value);
    } else {
	mp = caa_container_of(ht_node, struct mynode, node);
	printf("(key %d found\n", mp->value);
    }
    rcu_read_unlock();

At first glance, this might seem like a lot of steps to do a hash lookup, but its great virtue is that it allows the hash table to store any kind of key. Of course, if you are using this hash table for some specific type of key, you would be wise to define access functions to automate the bookkeeping.
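
For example, a hypothetical helper (mynode_lookup() is our invention, not part of the URCU API) could bundle the hashing, lookup, and caa_container_of() steps:

    /* Hypothetical helper: the caller must be within an RCU read-side
     * critical section, and the returned pointer may be used only
     * within that critical section. */
    static struct mynode *mynode_lookup(struct cds_lfht *ht, int value,
					uint32_t seed)
    {
	struct cds_lfht_iter iter;
	struct cds_lfht_node *ht_node;

	cds_lfht_lookup(ht, jhash(&value, sizeof(value), seed),
			match, &value, &iter);
	ht_node = cds_lfht_iter_get_node(&iter);
	if (!ht_node)
		return NULL;
	return caa_container_of(ht_node, struct mynode, node);
    }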

A small variation on the above theme removes an item from the hash table:

    hash = jhash(&value, sizeof(value), seed);
    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    ht_node = cds_lfht_iter_get_node(&iter);
    if (!ht_node) {
	printf("Key %d not found\n", value);
	rcu_read_unlock();
    } else {
	ret = cds_lfht_del(ht, ht_node);
	mp = caa_container_of(ht_node, struct mynode, node);
	if (ret) {
	    printf("Key %d concurrently deleted\n", mp->value);
	    rcu_read_unlock();
	} else {
	    rcu_read_unlock();
	    synchronize_rcu();
	    printf("Key %d deleted\n", mp->value);
	    free(mp);
	}
    }

Note that the lookup and the deletion (but not the free()) must both be contained within a single RCU read-side critical section.

Quick Quiz 2: Suppose that we have added mynode structures with the same ->value to the hash table. What will happen if we invoke cds_lfht_del() on that value?

Answer

Once you have removed all of the nodes from a given hash table, you may destroy it:

    ret = cds_lfht_destroy(ht, NULL);

The NULL second argument says that we do not need the worker-thread attributes handed back to us. The return value stored in ret will be non-zero if the hash table was non-empty, in which case the hash table will not have been destroyed.

Prior to a thread exiting, it should tell RCU that it is done:

    rcu_unregister_thread();
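
Putting the registration requirements together, a minimal sketch of a hash-table-using thread (worker() and its contents are hypothetical) might look as follows:

    static void *worker(void *arg)
    {
	rcu_register_thread();	/* Required before any hash-table access. */

	/* ... allocate nodes, then add, look up, iterate, and delete,
	 *     each step within rcu_read_lock()/rcu_read_unlock() ... */

	rcu_unregister_thread();	/* Required before thread exit. */
	return NULL;
    }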

There are a number of other things that can be done to these hash tables, as can be seen from the full API, but the above example gives you the tools you will need for most situations.

Ordering constraints

Different hash-table operations have different properties, depending on whether they are reads, read traversals, writes, or other.

Read operations include the basic cds_lfht_lookup(), cds_lfht_next_duplicate(), cds_lfht_first(), and cds_lfht_next() operations, as well as the failure case of the cds_lfht_add_unique() operation.

Read-traversal operations are any of the following groups of operations:

  1. cds_lfht_for_each().
  2. cds_lfht_for_each_duplicate().
  3. cds_lfht_for_each_entry().
  4. cds_lfht_for_each_entry_duplicate().
  5. cds_lfht_lookup() followed by iteration with cds_lfht_next_duplicate() (and/or cds_lfht_next(), although less common); see the sketch after this list.
  6. cds_lfht_add_unique() (failure) followed by iteration with cds_lfht_next_duplicate() (and/or the less-common cds_lfht_next()).
  7. cds_lfht_first() followed by iteration with cds_lfht_next() (and/or the less-common cds_lfht_next_duplicate()).
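
As an example of the fifth form, the following sketch (reusing value, hash, iter, ht_node, and mp from the earlier examples, with hash computed from value as before) visits every node whose key matches value:

    rcu_read_lock();
    cds_lfht_lookup(ht, hash, match, &value, &iter);
    while ((ht_node = cds_lfht_iter_get_node(&iter)) != NULL) {
	mp = caa_container_of(ht_node, struct mynode, node);
	do_something_with(mp);	/* Visit this duplicate. */
	cds_lfht_next_duplicate(ht, match, &value, &iter);
    }
    rcu_read_unlock();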

Write operations are any of cds_lfht_add(), cds_lfht_replace(), cds_lfht_add_replace(), cds_lfht_del(), and the success case of cds_lfht_add_unique(). When cds_lfht_add_unique() succeeds (and thus returns the node passed as its parameter), it acts as a write operation. When cds_lfht_add_unique() fails (returns a node different from the one passed as its parameter), it acts as a read operation and is equivalent to a successful cds_lfht_lookup() operation.
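
The dual nature of cds_lfht_add_unique() is visible in its return value. Here is a minimal sketch, again reusing the variables from the earlier examples:

    struct cds_lfht_node *ret_node;

    rcu_read_lock();
    ret_node = cds_lfht_add_unique(ht, hash, match, &value, &mp->node);
    rcu_read_unlock();
    if (ret_node != &mp->node) {
	/* Add failed, acting as a read: ret_node is the pre-existing node. */
	free(mp);	/* Our node was not added; dispose of it. */
    }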

The hash-table operations have forward-progress guarantees as follows:

  • Read and read-traversal operations are wait-free. These operations always move forward in the hash table's linked list, and this list is finite and acyclic. Note, however, that a malicious and omniscient updater could in theory schedule a series of insertions and deletions that would cause a reader looking for an entry with a large hash value to take a very long (but finite!) time.

  • Writes are lock-free rather than wait-free: some write always makes progress, but conflicting operations can force a given write to retry a theoretically unbounded number of times.

  • Other operations have no forward-progress guarantees.

The hash-table operations also provide well-defined ordering guarantees. We define a prior node as one observable by reads and read traversals before a given write or sequence of writes, and a later node as one observable afterward. Hash-table operations are often cascaded; for example, the pointer returned by a cds_lfht_lookup() might be passed to a cds_lfht_next(), whose return value might in turn be passed to another hash-table operation. Any such cascaded series of operations must be enclosed by a single pair of matching rcu_read_lock() and rcu_read_unlock() calls. This hash table provides the following ordering guarantees:

  1. Reads after write:
    1. Read after write: If a write is ordered before a read, then the read is guaranteed to see either that write or some later write.
    2. Read-traversal after write: Given that there is a dependency between consecutive reads in a read-traversal, if a write is ordered before the first read of the traversal, then the read-traversal is guaranteed to see either that write or some later write.

  2. Writes after reads:
    1. Write after read: If a read is ordered before a write, then the read will not see the write.
    2. Write after read-traversal: Given that there is a dependency between reads in a read-traversal, if the last read of the traversal is ordered before a write, then the read-traversal will not see the write.

  3. Read operations:
    1. If a pair of lookups for a given key are ordered (e.g. by a memory barrier), then the second lookup will return the same node as the previous lookup, or some later node.
    2. A read-traversal that starts after the end of a prior read-traversal (ordered by memory barriers) is guaranteed to see either the same nodes as the previous traversal, or some later nodes.
    3. Concurrent read pairs: Concurrent reads are unordered. For example, if a pair of reads to the same key run concurrently with an insertion of that same key, the reads remain unordered regardless of their return values. In other words, you cannot rely on the values returned by the reads to deduce ordering.

  4. Write operations:
    1. Write after write: If a write is ordered before another write, then the later write is guaranteed to see the effects of the earlier write.
    2. Concurrent write pairs: The system will assign an arbitrary order to any pair of concurrent conflicting writes. Non-conflicting writes (for example, to different keys) are unordered.

  5. Write concurrent with reads: If a write occurs concurrently with a read or read-traversal, the read might or might not see the write.

  6. Grace-period guarantee: If a grace period separates a deletion or replacement operation and a subsequent operation, then that subsequent operation is guaranteed not to see the removed item.

  7. Uniqueness guarantee: Given a hash table that contains no duplicate items for a given key, there will be at most one item per key in the hash table before, during, and after an arbitrary sequence of add-unique and/or add-replace operations. Given a series of add-replace operations on a given pre-existing key, there will never be a time when the key is missing. Note, however, that a pair of concurrent read operations might well access two different items with that key.
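
As a hedged illustration of how add-replace preserves uniqueness, the following sketch uses cds_lfht_add_replace(), which returns the node it replaced (or NULL if the key was not present), again reusing the earlier variables:

    struct cds_lfht_node *old_node;

    rcu_read_lock();
    old_node = cds_lfht_add_replace(ht, hash, match, &value, &mp->node);
    rcu_read_unlock();
    if (old_node) {
	/* A node with a matching key was atomically replaced. */
	synchronize_rcu();	/* Wait for pre-existing readers. */
	free(caa_container_of(old_node, struct mynode, node));
    }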

Alternatives

The RCU split-ordered approach has many attractive properties, but it is worthwhile to quickly review other possible approaches to resizable RCU-protected hash tables.

A key challenge is the handling of reads that run concurrently with a resize operation. The split-ordered-list approach overcomes this challenge by maintaining a single sorted list that is indexed into by separate lists of pointers. Because the sorted list remains stable during a resize operation, concurrent readers traversing this list are undisturbed by the resize operation.

Another approach is to maintain two sets of list pointers in each node, so that readers traverse one set of pointers while the resize operation is updating the other set. A global index selects the set that is to be used by readers, and the update operation updates this index upon completion. Herbert Xu created a hash table using this technique, which may be found in the Linux kernel here. This is arguably the simplest approach, but does suffer from increased memory footprint due to the extra set of list pointers.

A third approach uses new linked-list operations that allow two linked lists to share a common list tail, allowing the resize operation to “unzip” this shared list tail. This approach has been prototyped by Josh Triplett as part of his dissertation [PDF] and in a USENIX paper [PDF].

The split-ordered-list approach was chosen based on its forward-progress guarantees, which should result in improved scalability and real-time response. Careful comparison of these techniques is important future work.

Answers to Quick Quizzes

Quick Quiz 1: Don't we need to check for cds_lfht_add() failure?

Answer: No, we don't. We have supplied all the memory that cds_lfht_add() needs, and duplicate keys are permitted, so there is no reason for it to fail.

If you don't want duplicate keys, use cds_lfht_add_unique() instead of cds_lfht_add(). But then you will need to check for failure.

Back to Quick Quiz 1.

Quick Quiz 2: Suppose that we have added mynode structures with the same ->value to the hash table. What will happen if we invoke cds_lfht_del() on that value?

Answer: The trick is that cds_lfht_del() is invoked on individual nodes, not on key values. If multiple mynode structures with the same ->value are present in the hash table, we can use the cds_lfht_lookup() and cds_lfht_next_duplicate() functions to select which of the several structures having that key should be passed to cds_lfht_del().

Back to Quick Quiz 2.

