
Kernel development

Brief items

Kernel release status

The current development kernel is 3.17-rc5, released on September 14. Linus said: "So I should probably have delayed this until Wednesday for sentimental reasons: that will be 23 years since I uploaded the 0.01 source tree. But I'm not an overly sentimental person, so screw that. I'm doing my normal Sunday release." Linus noted that this is a relatively large set of changes, so any thoughts of doing an early 3.17 release (to avoid conflicts between the merge window and his travel plans) have to be put aside.

Stable updates: the 3.16.3, 3.14.19, and 3.10.55 updates were released on September 17.

Comments (none posted)

Quotes of the week

It's pretty pointless to update kernels on pocket heaters except for bragging reasons.
Thomas Gleixner

The goal of the kernel is not to be foolproof to developer incompetence - that's a battle you can't win until you replace driver developers with software systems.
Alan Cox

It may help if you think of what hints you might give an NSA agent who is trying to contribute a Kleptographic trojan horse into an open source project. How would you do that? Bury the change in a thousand line patch. Change whitespace all over the place to further obfuscate the deadly change. Etc. Then do the exact opposite.
Ted Ts'o

Comments (none posted)

Yao: The State of ZFS on Linux

At the ClusterHQ blog, Richard Yao looks at the current status of the ZFSOnLinux (ZoL) project. He argues that ZoL is ready for production use for a number of different reasons, all of which boil down to the belief that the ZFS filesystem port to Linux has achieved the same level of data integrity, runtime stability, and features as have the other platforms where ZFS runs. "Sharing a common code base with other Open ZFS platforms has given ZFS on Linux the opportunity to rapidly implement features available on other Open ZFS platforms. At present, Illumos is the reference platform in the Open ZFS community and despite its ZFS driver having hundreds of features, ZoL is only behind on about 18 of them."

Comments (113 posted)

Garrett: ACPI, kernels and contracts with firmware

Matthew Garrett writes about the challenges faced by the developers working on ACPI-based ARM systems. "Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing."

Comments (21 posted)

Kernel development news

Relativistic hash tables, part 1: Algorithms

By Jonathan Corbet
September 17, 2014
Hash tables are heavily used within the kernel to speed access to objects of interest. Using a hash table will be faster than, say, a linear search through a single list, but there is always value in making accesses faster yet. Quite a bit of work has been done toward this goal over the years; for example, the use of read-copy-update (RCU) can allow the addition and removal of items from a hash bucket list without locking out readers. Some operations have proved harder to make concurrent in that manner, though, with table resizing being near the top of the list. As of 3.17, the Linux kernel has gotten around this problem with an implementation of "relativistic hash tables" that can be resized while lookups proceed concurrently. This article will describe the algorithm that is used; a companion article will look at the kernel API for these new hash tables.

One might wonder whether the resizing of hash tables is common enough to be worth optimizing. As it turns out, picking the correct size for a hash table is not easy; the kernel has many tables whose size is determined at system initialization time with a combination of heuristics and simple guesswork. But even if the initial guess is perfect, workloads can vary over time. A table that was optimally sized may, after a change, end up too small (and thus perform badly) or too big (wasting memory). Resizing the table would fix these problems, but, since that is hard to do without blocking access to the table, it tends not to happen. The longer-term performance gains are just not seen to be worth the short-term latency caused by shutting down access to the table while it is resized.

Back in 2010, Josh Triplett, Paul McKenney, and Jonathan Walpole published a paper called Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming [PDF] that described a potential solution to this problem. "Relativistic" refers to the fact that the relative timing of two events (hash table insertions, say) that are not causally related may appear different to independent observers. In other words, one CPU may see two items inserted into the table in one order, while those insertions appear to have happened in the opposite order on another CPU. Despite some interesting performance results, this technique did not find its way into the kernel until the 3.17 merge window opened, when Thomas Graf contributed an implementation for use within the networking subsystem.

For the most part, relativistic hash tables resemble the ordinary variety. They are implemented as an array of buckets and a hash function to assign objects to buckets. Each bucket contains a linked list of all items in the table that belong to that bucket. A simple table with two buckets might thus be portrayed something like this:

[Hash table]

Here, the first bucket holds four items, while the second holds three.
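
To make that structure concrete, a table like the one above might be declared along these lines. This is an illustrative sketch only; the type and function names are invented for the example rather than taken from the kernel, and readers are assumed to hold rcu_read_lock() while traversing the table:

    struct item {
        unsigned int key;
        struct item *next;              /* next item in the same bucket */
        /* payload */
    };

    struct table {
        unsigned int nbuckets;
        struct item **buckets;          /* array of nbuckets chain heads */
    };

    static struct table *table;         /* current table, as seen by readers */

    static unsigned int hash(unsigned int key)
    {
        return key * 2654435761U;       /* simple multiplicative hash */
    }

    /* A reader: look up a key; the caller holds rcu_read_lock(). */
    static struct item *lookup(unsigned int key)
    {
        struct table *t = rcu_dereference(table);
        struct item *he;

        for (he = rcu_dereference(t->buckets[hash(key) % t->nbuckets]);
             he != NULL; he = rcu_dereference(he->next))
            if (he->key == key)         /* keys must always be compared */
                return he;
        return NULL;
    }

The key comparison in the loop matters: a reader never trusts that an item on a chain actually belongs there, a property the resizing algorithms described below depend on.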

Table shrinking

Now imagine that this massive hash table is deemed to be too large; it would be nice to be able to shrink it down to a single bucket. (No need to point out that a single-bucket hash table is useless; the point is to simplify the example as much as possible). The first step is to allocate the new table (without making it visible to anybody else) and to link each bucket to the first list in the old table that would hash to that bucket in the new table:

[Shrinking]

Note the assumption that all items falling into the same bucket in the old table will also fall into the same bucket in the new table. That is a requirement for this algorithm to work. In practice, it means that the size of a table can only change by an integer factor; normally, the table is simply doubled or halved in size.

Anyway, the new table now has some of the necessary items (the green ones) in its sole bucket, but the blue ones are still not represented there. That can be fixed by simply splicing them on to the end of the green list:

[Shrinking]

At this point, a reader traversing the green list may wander into the blue items added to the end. Since a hash lookup must always compare the keys anyway, the result will still be correct; it will just take a little longer than before. But that is how things will work once the operation is complete anyway. Completing the operation simply requires pointing readers to the new table rather than the old:

[Shrinking]

Any new readers that come along after this point will see the new table, but there could still be concurrent readers traversing the lists in the old table. So the resize implementation must wait for an RCU grace period to pass; after that the old table can be freed and the operation is complete.
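
In code, the shrink sequence for the two-buckets-to-one example might look roughly like the following; again a sketch built on the illustrative types above, with alloc_table() standing in for an assumed helper that allocates a table with empty buckets:

    static struct table *shrink_to_one_bucket(struct table *old)
    {
        struct table *new = alloc_table(1);
        struct item **tail;

        /* Link the new bucket to the first old chain; the new table is
         * not yet visible, so no special care is needed here. */
        new->buckets[0] = old->buckets[0];

        /* Splice the second old chain onto the end of the first.  Readers
         * traversing the first chain may now wander into these items. */
        tail = &new->buckets[0];
        while (*tail)
            tail = &(*tail)->next;
        rcu_assign_pointer(*tail, old->buckets[1]);

        /* Publish the new table; readers arriving after this point will
         * see the single-bucket table. */
        rcu_assign_pointer(table, new);

        /* Wait until no reader can still be using the old table. */
        synchronize_rcu();
        kfree(old->buckets);
        kfree(old);
        return new;
    }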

Table expansion

Making a table larger is a somewhat different operation, because objects appearing in a single hash bucket in the old table must be split among multiple buckets in the new table. Imagine that we start with the same two-bucket table:

[Expanding]

We now wish to expand that table to four buckets. The colors of the objects in the diagram are meant to indicate which bucket each object will land in once the resizing is complete. The first step is to allocate the new table, and, for each bucket in the new table, set a pointer to the chain containing all of the objects that belong in that bucket:

[Expanding]

Note that each pointer from the new table's hash bucket goes to the first item in the old chain that would hash to the new bucket. So, while each hash chain in the new table contains objects that do not belong there, the first object in each list is always in the correct bucket. At this point, the new table can be used for lookups, though they will be slower than necessary, since they must pass over objects that appear to be in the wrong list.

To make the new table visible to readers, the pointer to the hash table must be changed accordingly. The expansion code must then wait for an RCU grace period to pass so that it can be sure that there are no longer any lookups running from the old table. Then the process of cleaning up ("unzipping") the lists in the new table can begin.
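
Before moving on to unzipping, the steps so far (allocating the doubled table, pointing each new bucket at the first matching item in the corresponding old chain, publishing the table, and waiting out a grace period) might be sketched like this, still using the illustrative types and helpers from the earlier examples:

    static struct table *expand_table(struct table *old)
    {
        struct table *new = alloc_table(old->nbuckets * 2);
        struct item *he;
        unsigned int i;

        /* New bucket i draws its items from old bucket i % (old size);
         * point it at the first item in that chain that hashes to i. */
        for (i = 0; i < new->nbuckets; i++) {
            for (he = old->buckets[i % old->nbuckets]; he; he = he->next)
                if (hash(he->key) % new->nbuckets == i) {
                    new->buckets[i] = he;
                    break;
                }
        }

        /* Publish the new table, then wait so that no reader can still
         * be starting a traversal from the old one. */
        rcu_assign_pointer(table, new);
        synchronize_rcu();
        return new;             /* the old table is kept for unzipping */
    }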

The unzipping process works by passing over the buckets in the old table, which has not yet been freed. For each bucket, the algorithm works roughly as follows:

  1. Determine which bucket (in the new table) the first item in the list belongs to.

  2. Advance the list-head pointer through the list until it encounters the first item that does not belong in that bucket.

The result after processing the first bucket would look something like:

[Expanding]

Then the mismatched object (and any like it that immediately follow) is patched out of the list by changing the "next" pointer in the preceding item:

[Expanding]

At this point, no other changes can be made to this list until another RCU grace period has passed, otherwise a reader following the lists could end up in the wrong place. Remember that readers can be following these lists concurrently with the above-described change, but they will be starting from the pointers in the new table. A reader looking for a light green object may be looking at the dark green object that is patched out of the list in the first step, but, since that object's "next" pointer remains unchanged, the reader will not be derailed. Readers looking for dark green objects will have never seen the changed object to begin with, so they, too, are safe.

In other words, the algorithm works by changing pointers that are visible only to readers that are not looking for the objects that will be "unzipped" from the list. As long as only one pointer in the list is changed for each RCU grace period, this property holds and the list can be safely unzipped while readers pass through it. Note that each bucket (from the old table) can be processed in this way in each cycle, so the second bucket in this example can also be partially unzipped. So the data structure will look like this when the wait for the grace period happens after the first pass through the old table:

[Expanding]

The hash lists are now partially straightened out, but the job is not yet complete. So the unzipping process runs again, tweaking another "next" pointer in each list if needed. This process continues until it reaches the end of each of the hash lists, at which point each object appears only in its appropriate bucket:

[Finished]

If the hash lists are long, this process may take a fair amount of time. But resize operations should be relatively rare, while readers access the table frequently. So, for a sufficiently read-mostly table, the extra work done for the uncommon resize operation pays off many times over in the hot (read) path.
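
Putting the unzip procedure into code form gives something like the sketch below, once more using the illustrative types from the earlier examples rather than the kernel's real implementation. The old bucket pointers, which readers no longer look at, serve as per-bucket progress cursors, and only one reader-visible pointer is changed per bucket in each pass:

    /* One unzip pass over old bucket n; returns true if more work remains. */
    static bool unzip_one_bucket(struct table *old, struct table *new,
                                 unsigned int n)
    {
        struct item *p = old->buckets[n];
        struct item *he, *next = NULL;
        unsigned int bkt;

        if (!p)
            return false;

        /* Step 1: which new bucket does the first item belong to? */
        bkt = hash(p->key) % new->nbuckets;

        /* Step 2: advance until the first item that belongs elsewhere. */
        for (he = p->next; he; he = he->next) {
            if (hash(he->key) % new->nbuckets != bkt)
                break;
            p = he;
        }

        /* Record where the next pass over this bucket should start. */
        old->buckets[n] = p->next;
        if (!he)
            return false;               /* this chain is fully unzipped */

        /* Find the next item that does belong in p's bucket, if any. */
        for (he = he->next; he; he = he->next) {
            if (hash(he->key) % new->nbuckets == bkt) {
                next = he;
                break;
            }
        }

        /* The single reader-visible change for this pass: splice the
         * mismatched run out of p's chain. */
        rcu_assign_pointer(p->next, next);
        return true;
    }

    /* Repeat over all old buckets, one grace period per pass. */
    static void unzip(struct table *old, struct table *new)
    {
        unsigned int n;
        bool more;

        do {
            more = false;
            for (n = 0; n < old->nbuckets; n++)
                if (unzip_one_bucket(old, new, n))
                    more = true;
            if (more)
                synchronize_rcu();
        } while (more);
    }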

Other details

One topic that the above description has avoided thus far is updates to the table. A certain amount of mutual exclusion is required to update a hash table like this even in the absence of "relativistic" techniques; only one thread should be trying to update the head pointer in any given hash bucket at a time, for example. In general, updates must also be synchronized with resize operations in the same way; a thread making a change to the table will simply have to wait until the resize operation is complete.

That said, there are some tricks that can be employed to speed up these operations while a resize is going on. For a shrinking operation, update access to each hash bucket can be allowed as soon as the splicing of the lists for that bucket is complete. Expansion is, once again, a bit more complicated. Insertion at the head of the list can be done concurrently once the new table is set up since the unzipping process will not change the head pointer. Removal, though, requires coordination with the resizer to avoid removing an item that the resizer is also operating on.

For a lot more detail on everything described here, see the original paper.

That paper contained code implementing the relativistic hash table concept in the kernel. But, for some reason, none of this code was proposed for inclusion into the mainline. So the technique languished for nearly four years until Thomas put together his implementation, which was initially targeted at the hash table used to track netlink sockets in the kernel. Now that this infrastructure is in place, chances are that its use will expand into other areas relatively quickly.

Comments (21 posted)

Relativistic hash tables, part 2: Implementation

By Jonathan Corbet
September 17, 2014
The kernel's "relativistic" (or "resizable") hash table implementation came into the mainline during the 3.17 merge window. These tables can be resized while still allowing concurrent access by readers and (to an extent) updaters. See the first article in this series for a description of the algorithms used to implement this functionality. This article, instead, will describe the in-kernel API for these "rhashtables" as seen in 3.17, with an eye toward changes that are likely to arrive in the near future.

Rhashtables were initially created for use in the networking subsystem, but their implementer, Thomas Graf, understood that they could be more widely used within the kernel. As a result, the API is fairly well generalized. One implication of this design is that a certain amount of setup work is needed to put an rhashtable into use. Once that setup is done, though, operations on the table are relatively straightforward.

The first step is to define the object that will be stored in the hash table. Such objects will typically look something like this:

    #include <linux/rhashtable.h>

    struct hashed_object {
	int key;
	struct rhash_head node;
	/* Stuff of actual interest */
    };

Here, an integer key is used for identifying objects in the table; the implementation can handle just about any key type, though, as will be seen below.

The next step is to fill in a structure describing how the hash table will work. This structure is defined as:

    typedef u32 (*rht_hashfn_t)(const void *data, u32 len, u32 seed);
    typedef u32 (*rht_obj_hashfn_t)(const void *data, u32 seed);

    struct rhashtable_params {
	size_t		  nelem_hint;
	size_t		  key_len;
	size_t		  key_offset;
	size_t		  head_offset;
	u32		  hash_rnd;
	size_t		  max_shift;
	rht_hashfn_t	  hashfn;
	rht_obj_hashfn_t  obj_hashfn;
	bool		  (*grow_decision)(const struct rhashtable *ht,
					   size_t new_size);
	bool		  (*shrink_decision)(const struct rhashtable *ht,
					     size_t new_size);
	int		  (*mutex_is_held)(void);
    };

Here, nelem_hint is a guess at how many elements will be stored in the table (it is used to calculate the initial size). key_len and key_offset describe the key — its size in bytes and its offset into the structure (which should be obtained with offsetof()). Similarly, head_offset is the offset of the rhash_head structure within the hashed object. hash_rnd is a random seed to be used in the hash function; if it is zero, the rhashtable code will generate a random seed. The maximum size of the table can be optionally set with max_shift; its value is the base-two logarithm of that size. hashfn is the function that performs the hashing; normally it can be set to arch_fast_hash(), defined in <linux/hash.h>. obj_hashfn(), instead, is a function to hash the entire object. If key_len and key_offset have been provided, there is no need for obj_hashfn(). If calculating the hash requires something more complicated than looking at a few contiguous bytes in the object, this function can be provided to do the work (and key_len should be set to zero).

Control over automatic resizing is provided by the grow_decision() and shrink_decision() functions. Despite being called new_size, the parameter passed to these functions actually holds the current number of buckets in the table; the number of elements can be found in ht->nelems. If these functions return a true value, the size of the table will be doubled or halved as appropriate. The rhashtable implementation offers two functions that can be used in this role: rht_grow_above_75() and rht_shrink_below_30(). They seek to keep the number of elements in the table between 30% and 75% of the number of buckets.

Finally, mutex_is_held() must be provided; it returns true if the lock serializing changes to the table is currently held. It is used purely for debugging purposes, ensuring that the lock is held when functions that modify the table are called.

Once that structure is set up, it's just a matter of allocating a struct rhashtable instance and initializing it with:

    int rhashtable_init(struct rhashtable *ht, 
			struct rhashtable_params *params);

There is also an rhashtable_destroy() to free a hash table's resources.
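
Pulling the pieces together, a minimal setup for the hashed_object type shown earlier might look like the following sketch. The table and mutex names here are invented for the example, and using mutex_is_locked() is just one possible way to satisfy the mutex_is_held() requirement:

    static struct rhashtable my_table;
    static DEFINE_MUTEX(my_table_mutex);        /* serializes table changes */

    static int my_table_mutex_is_held(void)
    {
        return mutex_is_locked(&my_table_mutex);
    }

    static int my_table_setup(void)
    {
        struct rhashtable_params params = {
            .nelem_hint      = 64,
            .key_len         = sizeof(int),
            .key_offset      = offsetof(struct hashed_object, key),
            .head_offset     = offsetof(struct hashed_object, node),
            /* hash_rnd left zero: a random seed will be generated */
            .hashfn          = arch_fast_hash,
            .grow_decision   = rht_grow_above_75,
            .shrink_decision = rht_shrink_below_30,
            .mutex_is_held   = my_table_mutex_is_held,
        };

        return rhashtable_init(&my_table, &params);
    }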

Insertion and removal are handled with:

    void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node, 
			   gfp_t gfp_flags);
    bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node, 
			   gfp_t gfp_flags);
    void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
			         struct rhash_head __rcu **pprev, 
				 gfp_t gfp_flags);

The table lock must be held when any of the above functions are called. In all of these calls, ht refers to the hash table, node points to the rhash_head structure within the relevant object, and gfp_flags is used for memory allocation should the table need to be resized. If the caller happens to have a pointer to the object immediately prior to the one that is to be removed, it can speed up the removal process by calling rhashtable_remove_pprev().
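
As a brief usage sketch, reusing the hypothetical my_table and my_table_mutex from the setup example above:

    static void add_object(struct hashed_object *obj)
    {
        mutex_lock(&my_table_mutex);
        rhashtable_insert(&my_table, &obj->node, GFP_KERNEL);
        mutex_unlock(&my_table_mutex);
    }

    static bool del_object(struct hashed_object *obj)
    {
        bool removed;

        mutex_lock(&my_table_mutex);
        removed = rhashtable_remove(&my_table, &obj->node, GFP_KERNEL);
        mutex_unlock(&my_table_mutex);
        return removed;
    }

As with any RCU-protected structure, an object removed this way should not be freed immediately; the usual approach is to defer the free with kfree_rcu() or an explicit grace period so that concurrent readers cannot be left holding a pointer to freed memory.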

Note that these functions are likely to change in the near future. A patch set currently in review moves resizing operations into a separate thread, so there is no longer a need to allocate memory in the above functions. Accordingly, the gfp_flags argument is removed. This change makes it easier to perform hash table operations while running in atomic context.

Should there be a reason to explicitly change the size of an rhashtable, that can be done with:

    int rhashtable_expand(struct rhashtable *ht, gfp_t gfp_flags);
    int rhashtable_shrink(struct rhashtable *ht, gfp_t gfp_flags);

The table lock must be acquired before calling these functions. A call to rhashtable_expand() will double the size of the table, while rhashtable_shrink() will halve its size. Again, the gfp_flags argument is likely to be removed in 3.18 or shortly thereafter.

Lookups are performed with either of:

    void *rhashtable_lookup(const struct rhashtable *ht, const void *key);
    void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
				    bool (*compare)(void *, void *), 
				    void *arg);

The first will simply return the object in the table matching key, or NULL if no such object exists. In the second form, an explicit comparison function may be passed to determine whether a given object matches or not. In this case, hash is a precomputed hash for the desired object, usually obtained from rhashtable_hashfn(). The comparison function is compare(); its arguments are the object to be tested and arg.

The latter function is clearly meant for code that delves more deeply into the structure of the hash tables themselves. For cases where that is necessary, the interface provides a whole set of functions for traversing the hash chains; see <linux/rhashtable.h> for the full list and net/netlink/af_netlink.c for an example of their use.
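
A simple lookup using the hypothetical my_table from the earlier sketches might look like this; since the table and its chains are RCU-protected, the lookup runs under rcu_read_lock(), and the returned object is only safe to use while that lock is held unless the caller takes its own reference:

    static bool object_exists(int key)
    {
        bool found;

        rcu_read_lock();
        found = rhashtable_lookup(&my_table, &key) != NULL;
        rcu_read_unlock();
        return found;
    }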

Future plans call for converting the hash chain into a "nulls list," where the NULL pointer at the end of the list contains the hash of the first item in the bucket. Such lists are used heavily in the networking subsystem, where things can change on the fly and an object may move between lists while it is being traversed. This change won't affect relatively casual users of rhashtables, but code that follows the hash chains itself will need to adapt.

For those wanting more details on how this code works, it can be found in lib/rhashtable.c. The code is relatively clear, well documented, and easy to follow, especially in its 3.17 incarnation; the changes to come will necessarily add complexity.

Comments (2 posted)

The kernel address sanitizer

By Jake Edge
September 17, 2014

Finding places where the kernel accesses memory that it shouldn't is the goal of the kernel address sanitizer (KASan). It uses a combination of a new GCC feature to instrument memory accesses and "shadow memory" to track which addresses are legitimate, so that it can complain loudly when the kernel reads or writes anywhere else. While KASan shares some attributes with other kernel debugging features, it has its own advantages, and it has already been used to find real kernel bugs.

The KASan patch set comes from Andrey Ryabinin, though it is based on work by Andrey Konovalov and by the AddressSanitizer project. While the patches are still at the RFC stage, this is the second version of them. Many improvements have been made since the original RFC was posted back in July.

The basic idea behind KASan is to allocate a map (aka "shadow region") that represents each eight bytes of kernel address space with one byte in the map. For x86_64 (which is the only architecture supported), that means setting aside (but not allocating) 16TB of address space to handle the entire 128TB that the kernel can address. Each byte in the map encodes the legality of a kernel access to the corresponding bytes in the full address space.

The encoding is fairly straightforward. A 0 means that all eight bytes are legal for access, while 1–7 indicate the number of consecutive bytes at the beginning of the eight-byte region that are valid (so 2 means that the first two bytes are valid, the last six are not). Negative values are for different types of non-accessible memory (free memory, redzones of various sorts, etc.). It would seem that other positive values could be used to encode consecutive valid bytes at the end of the eight-byte region, but perhaps that does not occur in practice.

Kernel memory accesses can then be checked against the shadow map to see if they are valid. GCC 5.0 (the next major release after 4.9, which is expected in 2015) will introduce a new feature that allows runtime checking of kernel addresses. The -fsanitize=kernel-address flag will cause GCC to instrument each memory load and store operation: for each one, it inserts a call to a function that examines the target address and complains in the kernel log if the access is invalid. The calls are of the form __asan_loadN() and __asan_storeN(), where N corresponds to the width of the access (1, 2, 4, 8, or 16 bytes). There are GCC patches available to turn these out-of-line checks into inline checks, where GCC will directly insert the code to check the shadow map rather than making function calls to do so.
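
Conceptually, each inserted call performs a check along the following lines. This is an illustrative sketch rather than KASan's actual code: the shadow_offset variable stands in for wherever the shadow region begins, and only an access that fits within a single eight-byte region is handled here:

    static unsigned long shadow_offset;         /* start of the shadow map */

    static bool access_is_invalid(unsigned long addr, size_t size)
    {
        /* One shadow byte describes eight bytes of address space. */
        s8 shadow = *(s8 *)((addr >> 3) + shadow_offset);

        if (shadow == 0)
            return false;               /* all eight bytes are accessible */
        if (shadow < 0)
            return true;                /* freed memory, redzone, ... */
        /* 1-7: only the first 'shadow' bytes of the region are valid. */
        return ((addr & 7) + size) > shadow;
    }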

So, the bulk of the work is done by the code inserted by the compiler. But there are still some details to be handled by the patches. Implementing the checking and reporting infrastructure is step one (that patch also includes a Documentation/kasan.txt file). Then, the initial shadow region needs to be populated. Early in the boot process, each page table entry in the shadow region is set to the zero page. Later, when the physical memory has been mapped, the zero pages corresponding to that memory are unmapped, and real pages are allocated and mapped for tracking the kernel's memory.

As pages are allocated by the kernel, they are marked as accessible in the shadow region; likewise, as pages are freed, they are marked inaccessible. The SLUB allocator has been modified to update the shadow map for its allocations and deallocations.

In the patch set's first message, Ryabinin outlined the differences between KASan and a few other kernel memory-debugging tools. Since KASan uses compile-time instrumentation, it is much faster than kmemcheck, but it cannot detect reads of uninitialized memory as kmemcheck does. While both DEBUG_PAGEALLOC and SLUB_DEBUG are faster than KASan, neither can detect all of the illegal accesses that KASan does (DEBUG_PAGEALLOC only has page-level granularity and SLUB_DEBUG is unable to detect some bad reads).

KASan is enabled for the kernel with the KASAN Kconfig parameter. It requires the SLUB allocator and a GCC version >= 5.0. Better reports will be generated if stack traces and SLUB_DEBUG are enabled. Ryabinin has also added a test module that can be used to cause out-of-bounds accesses and use-after-free errors for testing KASan or other memory debuggers.

The discussion on the patches has been constructive, with just a few suggestions for relatively minor changes. In addition, Sasha Levin listed several kernel bugs that he had found using the Trinity fuzzer with the first version of the KASan patch set. It would seem that there is an unfilled niche in the kernel's memory debugging that KASan could help fill.

Comments (1 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.17-rc5
Greg KH Linux 3.16.3
Greg KH Linux 3.14.19
Greg KH Linux 3.10.55
Ben Hutchings Linux 3.2.63

Architecture-specific

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Networking

Rostislav Lisovy 802.11p OCB mode
richard.alpe@ericsson.com tipc: new netlink API
John Fastabend net/sched rcu filters

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds