Kernel development
Brief items
Kernel release status
The current development kernel is 4.3-rc2, released on September 20. Linus said: "As has been the trend for a while now, rc2 tends to be reasonably small, probably because it takes a while for regression reports to start trickling in (and some people probably actively wait for rc2 to even start testing - you scaredy-cats, you)."
Stable updates: 4.2.1, 4.1.8, 3.14.53, and 3.10.89 were released on September 21.
Quotes of the week
Kernel development news
Some kernel memory-allocation improvements
The kernel's memory allocator must work within a large set of constraints if it is to perform well under most workloads. Over time, these constraints have led to the addition of significant complexity to the low-level allocation code. But problems remain, and, as it turns out, sometimes the best solution is to get rid of some of that complexity and take a simpler approach. This patch set from Mel Gorman has been making the rounds for some time and appears to be reaching a mature stage; it's worth a look as an example of what it takes to perform well in current kernels.
The allocator in question here is the low-level page allocator (or "buddy allocator"). The smallest unit of memory it deals with is a full page (4096 bytes on most systems). The slab allocators (including kmalloc()) are built on top of the page allocator; they have their own complexities that we'll not be concerned with here.
The page allocator is the ultimate source of memory in the system; if that allocator can't fulfill a request, the memory simply cannot be had. So considerable effort is put into ensuring that memory is available at all times, especially for high-priority callers that cannot wait for some memory to be reclaimed from elsewhere. High-order allocations (those requiring more than one page of physically contiguous memory) complicate the problem; memory tends to fragment over time, making contiguous chunks hard to find. Balancing memory use across NUMA systems adds yet another twist. All of these constraints (and more) must be managed without slowing down the system in the allocator itself. Solving this problem involves a significant amount of complex code, scary heuristics, and more; it is not surprising that memory-management changes can be hard to merge into the mainline.
The zone cache
The page allocator divides physical memory into "zones," each of which corresponds to memory with specific characteristics. ZONE_DMA contains memory at the bottom of the address range for use by severely challenged devices, for example, while ZONE_NORMAL may contain most memory on the system. 32-bit systems have a ZONE_HIGHMEM for memory that is not directly mapped into the kernel's address space. Each NUMA node has its own set of zones as well. Depending on the characteristics of any given allocation request, the page allocator will search the available zones in a specific priority order. For the curious, /proc/zoneinfo gives a lot of information about the zones in use on any given system.
Checking a zone to see whether it has the memory to satisfy a given request can be more work than one might expect. Except for the highest-priority requests, a given zone shouldn't be drawn below specific low "watermarks," and comparing a zone's available memory against a watermark can require a significant calculation. If the "zone reclaim" feature is enabled, checking a nearly-empty zone can cause the memory-management subsystem to try to reclaim memory in that zone. For these reasons and more, the "zonelist cache" was added to the 2.6.20 kernel in 2006. This cache simply tries to remember which zones have been found to be full in the recent past, allowing allocation requests to avoid checking full zones.
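The cost of that calculation is easier to see in code form. What follows is a greatly simplified sketch, loosely modeled on the logic of the kernel's __zone_watermark_ok() but with illustrative names and none of the real function's special cases:

    #define MAX_ORDER_SKETCH 11

    struct zone_stats {
        long free_pages;                    /* total free pages in the zone */
        long free_count[MAX_ORDER_SKETCH];  /* free blocks at each order */
    };

    static int watermark_ok(const struct zone_stats *z, unsigned int order,
                            long watermark)
    {
        long free = z->free_pages - (1L << order);
        unsigned int o;

        if (free <= watermark)
            return 0;
        /*
         * Free blocks of lower order cannot satisfy this request, so
         * subtract them, halving the watermark at each step; this
         * per-order loop is part of what makes the check expensive.
         */
        for (o = 0; o < order; o++) {
            free -= z->free_count[o] << o;
            watermark >>= 1;
            if (free <= watermark)
                return 0;
        }
        return 1;
    }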
The case for the zonelist cache has been weakening for some time, and Mel's patch set weakens it further by making the watermark checks cheaper. Zone reclaim has been fingered as a performance problem for a number of workloads (including PostgreSQL server workloads) and is now disabled by default. But the biggest problem seems to be that, if a zone is unable to satisfy a high-order allocation, it will be marked as "full" even if single pages abound there. Subsequent single-page allocations will then avoid the zone, even though that zone is entirely capable of satisfying those allocations. That can cause allocations to be needlessly performed on remote NUMA nodes, worsening performance.
Mel notes that this problem "could be addressed with additional complexity but as the benefit of zlc [the zonelist cache] is questionable, it is better to remove it". This part of the patch series removes nearly 300 lines of complex memory-management code and improves some benchmarks in the process. The fact that zones are always checked has some other effects; the most notable is apparently that more direct reclaim work (attempts to reclaim memory by the process performing an allocation) results from checking zones that would have previously been skipped.
The atomic high-order reserve
Within a zone, memory is grouped into "page blocks," each of which can be marked with a "migration type" describing how the block should be allocated. In current kernels, one of those types is MIGRATE_RESERVE; it marks memory that simply will not be allocated at all unless the alternative is to fail an allocation request entirely. Since a physically contiguous range of blocks is so marked, the effect of this policy is to maintain a minimum number of high-order pages in the system. That, in turn, means that high-order requests (within reason) can be satisfied even when memory is generally fragmented.
Mel added the migration reserve during the 2.6.24 development cycle in 2007. The reserve improved the situation at the time but, in the end, it relied on an accidental property of the minimum watermark implemented in the page allocator many years before. The reserve does not actively keep high-order pages around; it simply steers requests away from a specific range of memory unless there is no alternative, in the hope that said range will remain contiguous. The reserve also predates the current memory-management code, which does a far better job of avoiding fragmentation and performing compaction when fragmentation does occur. Mel's current patch set implements the conclusion that this reserve has done its time and removes it.
There is still value in reserving blocks of memory for high-order allocations, though; fragmentation is still a concern in current kernels. So another part of Mel's patch set creates a new MIGRATE_HIGHATOMIC reserve that serves this purpose, but in a different way. Initially, this reserve contains no page blocks at all. If a high-order allocation cannot be satisfied without breaking up a previously whole page block, that block will be marked as being part of the high-order atomic reserve; thereafter, only higher-order allocations (and only high-priority ones at that) can be satisfied from that page block.
The kernel will limit the size of this reserve to about 1% of memory, so it cannot grow overly large. Page blocks remain in this reserve until memory pressure reaches a point where a single-page allocation is about to fail; at this point, the kernel will take a page block out of the reserve to be able to satisfy that request. The end result is a high-order page reserve that is more flexible, growing or shrinking in response to the demands of the current workload. Since the demand for high-order pages can vary significantly from one system (and one workload) to the next, it makes sense to tune the reserve to what is actually running; the result should be more flexible allocations and higher-reliability access to high-order pages.
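The policy can be sketched in a few lines of hypothetical code; all of the names below are illustrative rather than taken from the patch set:

    #define PAGES_PER_BLOCK 1024    /* order-10 page blocks, as on many systems */

    enum block_type { BLOCK_OTHER, BLOCK_HIGHATOMIC };

    struct zone_sketch {
        unsigned long managed_pages;        /* pages in the zone */
        unsigned long reserved_highatomic;  /* pages currently in the reserve */
    };

    /* Called when a high-order allocation breaks up a whole page block. */
    static void maybe_reserve_block(struct zone_sketch *z, enum block_type *block)
    {
        /* Cap the reserve at roughly 1% of the zone's pages. */
        if (z->reserved_highatomic + PAGES_PER_BLOCK > z->managed_pages / 100)
            return;
        if (*block != BLOCK_HIGHATOMIC) {
            *block = BLOCK_HIGHATOMIC;
            z->reserved_highatomic += PAGES_PER_BLOCK;
        }
    }

    /* Called when an ordinary single-page allocation is about to fail. */
    static void unreserve_block(struct zone_sketch *z, enum block_type *block)
    {
        if (*block == BLOCK_HIGHATOMIC) {
            *block = BLOCK_OTHER;
            z->reserved_highatomic -= PAGES_PER_BLOCK;
        }
    }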
Lest kernel developers think that they can be more relaxed about high-order allocations in the future, though, Mel notes that, as a result of the limited size of the reserve, "callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail". He gives no indication of just who he thinks those callers are, though. There is one other potential pitfall with this reserve that bears keeping in mind: since the first page block doesn't enter the reserve until a high-order allocation is made, the reserve may remain empty until the system has been running for a long time. By that point, memory may be so fragmented that the reserve can no longer be populated. Should such situations arise in real-world use, they could be addressed by proactively putting a minimum amount of memory into the reserve at boot time.
The high-order reserve also makes it possible to remove the separate watermarks for high-order pages. These watermarks try to ensure that each zone has a minimal number of pages available at each order; the allocator will fail allocations that drive the level below the relevant watermark for all but the highest-priority allocations. These watermarks are relatively expensive to implement and can cause normal-priority allocations to fail even though suitable pages are available. With the patch set applied, the code continues to enforce the single-page watermark, but, for higher-order allocations, it merely checks that a suitable page is available, counting on the high-order reserve to ensure that pages will be kept available for high-priority allocations.
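A hedged sketch of what that cheaper high-order check might look like (this is illustrative code, reusing the zone_stats structure from the watermark sketch above, not the patch set's implementation):

    /* With the per-order watermarks gone, the allocator only needs to
     * see that a block of at least the requested order exists. */
    static int highorder_available(const struct zone_stats *z, unsigned int order)
    {
        unsigned int o;

        for (o = order; o < MAX_ORDER_SKETCH; o++)
            if (z->free_count[o] > 0)
                return 1;
        return 0;
    }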
Flag day
Memory-allocation requests in the kernel are always qualified by a set of "GFP flags" ("GFP" initially came from "get free page") describing what can and cannot be done to satisfy the request. The most commonly used flags are GFP_ATOMIC and GFP_KERNEL, though they are actually built up from lower-level flags. GFP_ATOMIC is the highest-priority request; it can dip into reserves and is not allowed to sleep. Underneath the hood, GFP_ATOMIC is defined as the single bit __GFP_HIGH, marking a high-priority request. GFP_KERNEL cannot use the reserves but can sleep; it is the combination of __GFP_WAIT (can sleep), __GFP_IO (can start low-level I/O), and __GFP_FS (can invoke filesystem operations). The full set of flags is huge; they can be found in include/linux/gfp.h.
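For reference, the compositions described above look roughly like this; the individual bit values shown are illustrative, not the actual numbers from gfp.h:

    #define __GFP_HIGH  0x01u   /* high priority: may dip into reserves */
    #define __GFP_WAIT  0x02u   /* allowed to sleep */
    #define __GFP_IO    0x04u   /* may start low-level I/O */
    #define __GFP_FS    0x08u   /* may invoke filesystem operations */

    #define GFP_ATOMIC  (__GFP_HIGH)
    #define GFP_KERNEL  (__GFP_WAIT | __GFP_IO | __GFP_FS)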
Interestingly, the highest-priority requests are not marked with __GFP_HIGH; instead, they are marked by the absence of __GFP_WAIT. Requests with __GFP_HIGH can push memory below the watermarks, but only non-__GFP_WAIT requests can access the atomic reserves. This mechanism doesn't work as well as it could in current kernels: many subsystems make allocations that they don't want to wait for (often because there is a fallback mechanism available if the allocation fails) but that do not need access to the deepest reserves. By leaving off __GFP_WAIT, though, such code will access those reserves anyway.
This problem, along with a general desire to have more explicit control over how memory-allocation requests are satisfied, has led Mel to rework the set of GFP flags somewhat. To do so, he has added some new flags to the (already long) list:
- __GFP_ATOMIC identifies requests that truly come from atomic context, where no sort of blocking or delay is acceptable and there is no fallback in case of failure. If an allocation is being made with __GFP_ATOMIC, it may mean, for example, that spinlocks are currently held. These requests are the ones that get access to the atomic reserves.
- __GFP_DIRECT_RECLAIM indicates that the caller is willing to go into direct reclaim. Doing so implies that the request can block if need be. This flag does not imply __GFP_FS or __GFP_IO; they must be specified separately if they are applicable (though in such cases it probably makes sense to just use GFP_KERNEL).
- __GFP_KSWAPD_RECLAIM says that the kswapd kernel thread can be woken up to perform reclaim. Poking kswapd does not imply blocking, but it may start activity in the system that can affect performance in general. As an example, consider a driver that would very much like to allocate a high-order chunk of memory, but it can get by with a bunch of single pages if that chunk is not available. The high-order allocation may be best tried without __GFP_KSWAPD_RECLAIM, since things will still work fine if that allocation fails and there is no real need to start aggressively reclaiming memory.
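That driver pattern might look something like the following sketch. This is hypothetical code written against the flags as described here, not code from the patch set, and error cleanup is elided:

    #define CHUNK_ORDER 4               /* one 16-page contiguous chunk */
    #define NPAGES      (1 << CHUNK_ORDER)

    static unsigned long chunk;
    static struct page *pages[NPAGES];

    static int grab_buffer(void)
    {
        int i;

        /* Leave out __GFP_KSWAPD_RECLAIM: failure is cheap here, so
         * there is no reason to start background reclaim over it. */
        chunk = __get_free_pages(GFP_KERNEL & ~__GFP_KSWAPD_RECLAIM,
                                 CHUNK_ORDER);
        if (chunk)
            return 0;

        /* Fall back to single pages with the usual reclaim behavior. */
        for (i = 0; i < NPAGES; i++) {
            pages[i] = alloc_page(GFP_KERNEL);
            if (!pages[i])
                return -ENOMEM;         /* cleanup elided */
        }
        return 0;
    }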
With this set of flags, code can express the difference between being absolutely unable to sleep and not wanting to sleep. The "must succeed" nature of a request has been separated from the "don't sleep" aspect, eliminating situations where allocations dip unnecessarily into the atomic reserves. For users of the basic GFP_ATOMIC and GFP_KERNEL flag sets, little will change, but Mel's patch set makes changes to several dozen call sites that deal with GFP flags at a lower level.
As a whole, this patch set touches 101 files and removes a net 240 lines of code. With luck, a number of core memory-management algorithms have been simplified while improving performance and making the system more reliable. Mel's strongly benchmark-focused approach will help to build confidence in this work, but it's still a set of significant changes to a complex kernel subsystem, so it's not surprising that the patches have had to go through a number of revisions and extensive review. It seems likely that this process is coming to an end, and that this work will find its way into the mainline in the next development cycle or two.
The kernel connection multiplexer
As the introduction to Tom Herbert's kernel connection multiplexer (KCM) patch set notes, TCP is often used for message-oriented communication protocols even though, as a streaming transport, it has no native support for message-oriented communications. KCM is an effort to make it easier to send and receive messages over TCP; it adds a couple of other interesting features as well.
KCM functionality
In its simplest form, a KCM socket can be thought of as a sort of wrapper attached to an ordinary TCP socket.
Within the kernel, an object called a "psock" sits between the application (which is using the KCM socket) and the actual TCP connection to the world. Outgoing messages are sent as formatted by the application. Incoming data is buffered until the kernel sees (via a mechanism to be discussed momentarily) that a full message has arrived; that message is then made available to the application via the KCM socket. The psock module ensures that messages are sent and received as atomic units, freeing the application from having to manage that aspect of the protocol.
Applications are often split into multiple processes, though, each of which can deal with incoming messages in the same way. Multiplexers to dispatch messages are thus built into many applications. But, while the kernel is handling message receipt, it can also deal with the multiplexing, so the actual architecture of KCM places a multiplexer between a group of KCM sockets and the underlying connection.
Each of the KCM sockets is seen as equivalent by the kernel — an incoming message can just as well be passed to any one of them. What actually happens, of course, is that the kernel chooses one of the processes that are actually waiting for a message when one arrives; if there are no waiting processes, the message will be queued.
Things get more complicated on the other side of the multiplexer as well, in that there could be value in having multiple TCP connections to the same destination. A phone handset, for example, might connect to a service over both broadband and WiFi. In such a situation, the multiplexer could choose between those connections for outgoing messages, and accept incoming messages on any of them. And, indeed, that is just what KCM does.
Thus, KCM can be thought of as implementing a sort of poor hacker's multipath TCP, where the application is charged with setting up the connections over the various paths.
There is one final detail: how is the TCP data stream broken up into discrete messages? One could certainly envision building some sort of framing mechanism into KCM, as has been done in the kernel's reliable datagram sockets layer, but that would limit its flexibility when it comes to implementing existing protocols. If KCM is to be usable for an existing message-oriented mechanism, there must be a way to tell KCM how the framing of messages is done.
The solution here is perhaps predictable, since it is increasingly being used as the way to extend kernel functionality: use the Berkeley packet filter (BPF) subsystem. Whenever some data shows up at the psock level, it is passed to a BPF program for evaluation. Normally, the program will examine the data, determine the length of the message, and return that value. A return value of zero indicates that the length cannot be determined yet — more data must be received before trying again. A negative number indicates some sort of protocol error; when that happens, KCM will stop servicing the affected connection and signal an error to user space.
API details
An application wanting to use KCM starts by creating a new multiplexer; this is done with a socket() call:
#include <linux/kcm.h>
kcm_socket = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
If additional KCM sockets (attached to the same multiplexer) are needed, they can be created by calling accept() on the initial KCM socket. Normally, accept() is not applicable to datagram sockets, so this usage, while perhaps a little surprising, is arguably a reasonable way of overloading this system call.
The application must also take care of establishing one or more TCP connections to the remote side and attaching them to the multiplexer. Once a socket is open, it should be placed into a kcm_attach structure:
    struct kcm_attach {
        int fd;
        int bpf_type;
        union {
            int bpf_fd;
            struct sock_fprog fprog;
        };
    };
Here, fd is the file descriptor for the open socket. The rest of this structure exists so that the application can supply the BPF program that will help split the incoming data stream into messages. If bpf_type is KCM_BPF_TYPE_FD, then bpf_fd contains a file descriptor corresponding to a BPF program that has been loaded with the bpf() system call. Otherwise, if bpf_type is KCM_BPF_TYPE_PROG, then fprog points to a program to be loaded directly at this time.
The provision of two ways to load the BPF program may look a bit strange. It is, in fact, a combination of new and old ways of solving this particular problem. The bpf() approach is the newer way of doing things, while the sock_fprog approach has been used to load BPF programs into the network subsystem until now. It would not be entirely surprising to see a request from reviewers to narrow the interface down to just one of the two, most likely the bpf() method.
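As an illustration of the sock_fprog variant, here is a hypothetical framing program in classic BPF form, for an assumed protocol whose messages begin with a 32-bit big-endian payload length. Per the semantics described earlier, it returns the total frame length (header plus payload) to KCM:

    #include <linux/filter.h>

    static struct sock_filter framing[] = {
        BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, 0),   /* A = payload length  */
        BPF_STMT(BPF_ALU | BPF_ADD | BPF_K,   4),   /* A += header size    */
        BPF_STMT(BPF_RET | BPF_A, 0),               /* return frame length */
    };

    static struct sock_fprog fprog = {
        .len    = sizeof(framing) / sizeof(framing[0]),
        .filter = framing,
    };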
In any case, once the structure has been filled in, it is passed to a new ioctl() command (SIOCKCMATTACH) to connect the socket to the multiplexer. There is also a SIOCKCMUNATTACH operation to disconnect a socket.
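Putting the pieces together, attaching a connected socket might look like the sketch below, using the framing program shown above; tcp_fd is assumed to be an established TCP connection, and error handling is minimal:

    struct kcm_attach attach = {
        .fd       = tcp_fd,             /* an established TCP connection */
        .bpf_type = KCM_BPF_TYPE_PROG,
        .fprog    = fprog,              /* the framing program above */
    };

    if (ioctl(kcm_socket, SIOCKCMATTACH, &attach) < 0)
        perror("SIOCKCMATTACH");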
Once the pieces are connected together, the application can use the KCM socket(s) like any other datagram socket. Messages can be sent and received with interfaces like sendmsg() and recvmsg() (or write() and read()), poll() can be used to wait for a message to arrive, and so on. The whole structure will persist until the last KCM socket is closed, at which point it is torn down.
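A minimal exchange could then look like the following sketch (payload and reply are assumed buffers); note that, since outgoing messages are sent as the application formats them, the sender must include the length header that the framing program expects:

    uint32_t hdr = htonl(sizeof(payload));          /* framing header */
    struct iovec iov[2] = {
        { .iov_base = &hdr,    .iov_len = sizeof(hdr)     },
        { .iov_base = payload, .iov_len = sizeof(payload) },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    sendmsg(kcm_socket, &msg, 0);                   /* one complete message out */
    recv(kcm_socket, reply, sizeof(reply), 0);      /* one complete message in  */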
One might wonder why this mechanism is being proposed for the kernel, given
that applications have been solving this problem in user space for years.
There is no explicit justification offered in the patch set, but one can
imagine that it would involve performance improvements (avoiding copying or
retransmission of data), the ability to do smart load balancing, and having
a single implementation used by all. The "to do" list in the patch
posting, which says that "TLS-in-kernel is a separate initiative", suggests that there may be efforts to push other functionality into the kernel in the future. Whether these efforts will succeed remains to be seen, but there is clearly a lot of interest in adding interesting functionality to the Linux networking stack.
Porting Linux to a new processor architecture, part 3: To the finish line
This series of articles provides an overview of the procedure one
can follow when porting the Linux kernel to a new processor architecture. Part
1 and part 2 focused on the non-code-related
groundwork and the early code, from the assembly boot code to the creation of
the first kernel thread. Following on from those, the series concludes by
looking at the last portion of the procedure. As will be seen, most of the
remaining work for launching the
init
process deals with thread and process management.
Spawning kernel threads
When start_kernel() performs its last function call (to rest_init()), the memory-management
subsystem is
fully operational, the boot processor is running and able to process both
exceptions and interrupts, and the system has a notion of time.
While the execution flow has so far been sequential and mono-threaded, the main
job handled by rest_init() before turning into the boot idle
thread is to
create two kernel threads: kernel_init, which will be discussed in the next
section, and kthreadd. As one can imagine, creating these kernel
threads (and any other kinds of threads for that matter, from user threads
within the same process to actual processes) implies the existence of a complex
process-management infrastructure. Most of the
infrastructure to create a new thread is not architecture-specific: operations
such as copying the task_struct structure or the credentials, setting up the
scheduler, and so on do not usually need any architecture-specific code.
However, the process-management code must define a few architecture-specific parts, mainly for setting up the
stack for each new thread and for switching between threads.
Linux always avoids creating new resources from scratch, especially new threads.
With the exception of the initial thread (the one that has so far been booting
the system and that we have implicitly been discussing), the kernel always
duplicates an
existing thread and modifies the copy to make it into the desired new thread.
The same principle applies after thread creation, when the new thread's
execution begins for
the first time, as it is easier to resume the execution of a thread than to
start it from scratch. This mainly means that the newly allocated stack must be
initialized such that when switching to the new thread for the first
time, the thread looks like it is resuming its execution—as if it had simply
been stopped earlier.
To further understand this mechanism, delving a bit into the
thread-switching mechanism and more specifically into the switch of
execution flow
implemented by the architecture-specific context-switching routine
switch_to() is required. This
routine, which is always written in assembly language, is always
called by the current (soon to be previous) thread while returning as the next
(future current) thread. Part of this trick is achieved by saving the current
context in the stack of the current thread, switching stack pointers to use the
stack of the next thread, and restoring the saved context from it. As with a
typical function, switch_to() finally returns to the "calling"
function using
the instruction address that had been saved on the stack of the newly current
thread.
In the case that the next thread had previously been running and was
temporarily
removed from the processor, returning to the calling function would be a
normal event that
would eventually lead the thread to resume the execution of its own code.
However, for a brand new thread, there would not have been any function to call
switch_to() in order to save the thread's context. This is why the stack of a
new thread must be initialized to pretend that there has been a previous
function call, enabling switch_to() to return after restoring this new thread.
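As a concrete (but entirely hypothetical) illustration, a port's thread-setup code might fabricate that previous call roughly as follows; every name here is invented for the sketch, not taken from a real port:

    #define NR_CALLEE_SAVED 8

    struct switch_frame {
        unsigned long callee_saved[NR_CALLEE_SAVED];
        unsigned long return_address;   /* consumed by switch_to()'s return */
    };

    extern void first_switch_trampoline(void); /* asm lines that call entry(arg) */

    struct thread_sketch {
        unsigned long stack_top;        /* top of the new kernel stack */
        unsigned long ksp;              /* saved kernel stack pointer */
    };

    static void setup_new_thread(struct thread_sketch *t,
                                 int (*entry)(void *), void *arg)
    {
        struct switch_frame *frame = (struct switch_frame *)t->stack_top - 1;

        /* Pretend that a call to switch_to() already saved a context here. */
        frame->return_address  = (unsigned long)first_switch_trampoline;
        frame->callee_saved[0] = (unsigned long)entry;  /* recovered by the */
        frame->callee_saved[1] = (unsigned long)arg;    /* trampoline code  */

        t->ksp = (unsigned long)frame;
    }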
Such a function is usually set up to be a few assembly lines acting as a trampoline to the thread's code.
Note that switching to a kernel thread does not generally involve switching to
another page table since the kernel address space, in which all kernel
threads run,
is defined in every page table structure. For user processes, the
switch to their own page table is performed by the architecture-specific routine
switch_mm().
The first kernel thread
As explained in the source
code, the only reason the
kernel thread kernel_init is created first is that it must obtain PID 1.
This is the PID that the init process (i.e. the first user space process born
from kernel_init) traditionally inherits.
Interestingly, the first task of kernel_init is to wait for the
second
kernel thread, kthreadd, to be ready. kthreadd is the kernel thread daemon
in charge of asynchronously spawning new kernel threads whenever requested.
Once kthreadd is started, kernel_init proceeds with the
second phase of booting, which includes
a few architecture-specific initializations.
In the case of a multiprocessor system, kernel_init begins by
starting the
other processors before initializing the various subsystems composing the driver
model (e.g. devtmpfs, devices, buses, etc.) and, later, using the defined
initialization calls
to bring up the
actual device drivers for the underlying hardware system. Before getting into
the "fancy" device drivers (e.g. block device, framebuffer, etc.), it is
probably
a good idea to focus on having at least an operational terminal (by implementing
the corresponding driver if necessary), especially since the early console
set up
by early_printk() is supposed to be replaced by a real,
full-featured console
shortly after.
It is also through these initialization calls that the initramfs is unpacked
and the
initial root filesystem (rootfs) is mounted. There are a few options
for mounting an initial rootfs but I have found initramfs to be the
simplest when porting Linux. Basically this means that the rootfs is statically
built at compilation time and integrated into the kernel binary image.
After being mounted, the rootfs can give access to the mandatory
/init and
/dev/console.
Finally, the init memory is freed (i.e. the memory containing code and data
that were used only during the initialization phase and that are no longer
needed) and the init process that has been found on the rootfs is launched.
Executing init
At this point, launching init will probably result in an immediate fault when
trying to fetch the first instruction. This is because, as with creating
threads, being able to execute the init process (and actually any user-space
application) first involves a bit of groundwork.
The function that needs to be implemented in order to solve the instruction-fetching issue is the page fault handler. Linux is lazy, particularly when it comes to user applications and, by default, does not pre-load the text and data of applications into memory. Instead, it only sets up all of the kernel structures that are strictly required and lets applications fault at their first instruction because the pages containing their text segment have usually not been loaded yet. This is actually perfectly intentional behavior since it is expected that such a memory fault will be caught and fixed by the page fault handler. This handler can be seen as an intricate switch statement that is able to treat every fault related to memory: from vmalloc() faults that necessitate a synchronization with the reference page table to stack expansions in user applications. In this case, the handler will determine that the page fault corresponds to a valid virtual memory area (VMA) of the application and will consequently load the missing page in memory before retrying to run the application.
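In skeletal form, the user-space side of such a handler might look like the sketch below, which leans on the generic memory-management helpers of the 4.2-era kernel and elides the many special cases (kernel-mode faults, vmalloc() faults, stack expansion, and handle_mm_fault() error returns):

    #include <linux/mm.h>
    #include <linux/sched.h>

    void sketch_do_page_fault(unsigned long address)
    {
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, address);
        if (vma && address >= vma->vm_start) {
            /* A valid mapping: load the missing page, then retry. */
            handle_mm_fault(mm, vma, address, FAULT_FLAG_USER);
            up_read(&mm->mmap_sem);
            return;
        }
        up_read(&mm->mmap_sem);

        /* No valid mapping: the access was a genuine error. */
        force_sig(SIGSEGV, current);
    }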
Once the page fault handler is able to catch memory faults, it is likely that an extremely simple init process can be executed. However, it will not be able to do much as it cannot yet request any service from the kernel through system calls, such as printing to the terminal. To this end, the system-call infrastructure must be completed with a few architecture-specific parts. System calls are treated as software interrupts since they are accessed by a user instruction that makes the processor automatically switch to kernel mode, like hardware interrupts do. Besides defining the list of system calls supported by the port, handling system calls involves enhancing the interrupt and exception handler with the additional ability to receive them.
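The dispatch itself is usually a small amount of code. Here is a hypothetical sketch in which the pt_regs layout, the register names, and the table size are all invented for illustration:

    #include <linux/errno.h>

    #define NR_SYSCALLS_SKETCH 300

    struct pt_regs_sketch {
        unsigned long syscall_nr;   /* filled in by the trap entry code */
        unsigned long args[6];
        unsigned long retval;
    };

    typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
                               unsigned long, unsigned long, unsigned long);

    extern const syscall_fn syscall_table[NR_SYSCALLS_SKETCH];

    void handle_syscall(struct pt_regs_sketch *regs)
    {
        if (regs->syscall_nr >= NR_SYSCALLS_SKETCH) {
            regs->retval = -ENOSYS;     /* not supported by this port */
            return;
        }
        regs->retval = syscall_table[regs->syscall_nr](
            regs->args[0], regs->args[1], regs->args[2],
            regs->args[3], regs->args[4], regs->args[5]);
    }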
execute a "hello
world" init that is able to open the main console and write a
message. But
there are still missing pieces in order to have a full-featured
init that is able to
start other applications and communicate with them as well as exchange data with
the kernel. The first step toward this goal concerns the management of signals and, more
particularly, signal delivery (either from another process or from
the kernel itself). If a process has defined a handler for a specific signal,
then this handler must be called whenever the given signal is pending. Such an
event occurs when the targeted process is about to get scheduled again. More
specifically, this means that when resuming the process, right at the moment of
the next transition back to user mode, the execution flow of the process must be
altered in order to execute the handler instead. Some space must also be made on
the application's stack for the execution of the handler. Once the handler has
finished its execution and has returned to the kernel (via a system call
that had
been previously injected into the handler's context), the context of the
process is restored so that it can resume its normal execution.
The second and last step for fully running user-space applications deals with user-space memory access: when the kernel wants to copy data from or to user-space pages. Such an operation can be quite dangerous if, for example, the application gives a bogus pointer, which would potentially result in kernel panics (or security vulnerabilities) if it is not checked properly. To circumvent this problem, it is necessary to write architecture-specific routines that use some assembly magic to register the addresses of all of the instructions performing the actual accesses to the user-space memory in an exception table. As explained in this LWN article from 2001, "if ever a fault happens in kernel mode, the fault handler scans through the exception table trying to match the address of the faulting instruction with a table entry. If a match is found, a special error exit is taken, the copy operation fails gracefully, and the system call returns a segmentation fault error."
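In code form, the kernel-mode branch of the fault handler might implement that behavior roughly as follows; this is a sketch assuming the classic exception-table format with absolute addresses, with "pc" standing in for whatever register holds the faulting instruction address on the port:

    #include <linux/module.h>   /* search_exception_tables() */

    static int fixup_kernel_fault(unsigned long pc, unsigned long *new_pc)
    {
        const struct exception_table_entry *fix;

        fix = search_exception_tables(pc);
        if (fix) {
            /* Take the registered error exit; the interrupted copy
             * routine will then fail gracefully with -EFAULT. */
            *new_pc = fix->fixup;
            return 1;
        }
        return 0;   /* a genuine kernel bug: time to oops */
    }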
Conclusion
Once a full-featured init process is able to run and give access to a shell, it probably signals the end of the porting process. But it is most likely only the beginning of the adventure, as the port now needs to be maintained (as the internal APIs sometimes change quickly), and can also be enhanced in numerous ways: adding support for multiprocessor and NUMA systems, implementing more device drivers, etc.
By describing the long journey of porting Linux to a new processor architecture,
I hope that this series of articles will contribute to remedying the lack of
documentation in this area and will help the next brave programmer who one day
embarks upon this challenging, but ultimately rewarding, experience.
[The author would like to thank Ena Lupine for her help in
writing and publishing these articles.]
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
