Kernel development
Brief items
Kernel release status
The current development kernel is 4.1-rc4, released on May 18. "So here it is, last-minute fix and all. The -rc4 patch is a bit bigger than the previous ones, but that seems to be mainly due to normal random timing - just the fluctuation of when submaintainer trees get pushed."
Stable updates: 4.0.4, 3.14.43, and 3.10.79 were released on May 17.
Quote of the week
Kernel development news
Delay-gradient congestion control
Network congestion-control algorithms have a difficult task to perform. They must moderate each endpoint's outgoing traffic to keep the Internet from being overwhelmed by packet congestion (as happened in 1986 before these algorithms were introduced). But, at the same time, the algorithm is expected to allow a machine to make full use of the bandwidth available to it, sharing that bandwidth with other systems without any sort of central control mechanism. Err on one side, and the network as a whole suffers; err on the other, and performance will suffer. So it is not surprising that, long after workable solutions to the congestion-control problem were developed, research continues in this area. One relatively new algorithm is called "CAIA delay gradient" (or CDG); it is named after the Centre for Advanced Internet Architectures where it was first developed for FreeBSD. A patch adding CDG to the kernel was recently posted for review.

Most congestion-control algorithms are based on the use of packet loss as a signal that indicates congestion somewhere along the path used by the connection. When all goes well, a connection should not experience packet loss. If a router's buffers fill, though, that router will start to drop packets; a congestion-control algorithm will respond to that packet loss by reducing the number of packets that can be outstanding (the "congestion window") at any given time. Typically, the congestion window will then be allowed to creep back up until the next packet loss occurs, indicating that the limit has once again been reached.
Loss-based congestion control has served the net well for the better part of thirty years, but it is not without its drawbacks. By its nature, it will cause packets to be dropped occasionally; that will necessarily introduce latency into the connection which, perhaps, can ill afford it. Loss-based algorithms can only find the limit for a given connection by pushing the slowest link to its limit, meaning that it forces a router buffer somewhere to overflow; this behavior can also worsen bufferbloat-related problems. There are also problems when packets are lost for reasons other than congestion, as can happen with wireless links, for example. The congestion-control code will interpret that loss as a congestion signal, slowing transmission unnecessarily.
The alternative, as implemented by CDG, is to try to infer the state of a connection by looking at the variation in the round-trip time (RTT) — the time it takes for a packet to transit the connection in both directions. Using RTT to estimate congestion is easy if one knows the actual characteristics of the connection in use; one need only look at how much "extra" time is currently needed to get a packet through the link. That is why, for example, smartphone driving-time estimates for a well-known commute can be a useful indication of road congestion. But that knowledge is not available on the Internet as a whole, so some other approach will be required.
The CDG approach is to look at the minimum and maximum RTTs observed for a given connection over a period of time. The minimum is called Τmin, while the maximum is Τmax. From subsequent observations of Τmin and Τmax, the algorithm calculates the rate of change of each. The rate at which the minimum RTT is changing is δmin, while the rate at which the maximum RTT is changing is δmax. These two parameters are then used in a couple of interesting ways.
The key use, of course, is to try to come up with an estimate of how congested the link is. To simplify a bit: if Τmin is growing (δmin is positive), chances are that the link is getting more congested. Every RTT interval, CDG calculates a "probability of backoff" based on δmin; as δmin grows, that probability approaches one. That probability is then compared against a random number to determine whether the congestion window should be decreased; should a decrease be decided upon, the number of packets in the congestion window will be reduced by a configurable factor (0.3 by default).
In cycles where the algorithm decides not to decrease the congestion window, it will, instead, increase it by one packet. That allows the window to creep upward and continually test the limits of the connection. In theory, the delay-gradient should detect that limit without pushing the connection to the point of packet loss.
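The per-interval decision described above can be sketched in Python. This is a deliberate simplification: the smoothing of the gradient signal is omitted, and the scaling parameter `G` and the tolerances used here are illustrative values, not the ones from the CDG paper or the kernel patch.

```python
import math
import random

BACKOFF_BETA = 0.7   # window is reduced by a factor of 0.3 by default, i.e. multiplied by 0.7
G = 3.0              # gradient scaling parameter; the value here is illustrative

def backoff_probability(dmin):
    """Probability of backing off, given the minimum-RTT gradient.

    A non-positive gradient never triggers backoff; as the gradient
    grows, the probability approaches one.
    """
    if dmin <= 0:
        return 0.0
    return 1.0 - math.exp(-dmin / G)

def update_cwnd(cwnd, dmin):
    """One RTT interval: probabilistic backoff, otherwise grow by one packet."""
    if random.random() < backoff_probability(dmin):
        return max(2, int(cwnd * BACKOFF_BETA))
    return cwnd + 1
```

The additive increase in the "no backoff" branch is what lets the window keep probing the connection's limit even while the gradient signal is quiet.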
There is an interesting question that comes up at about this point, though: imagine a situation where a system using CDG is sharing a slow link with another system using traditional loss-based congestion control. As RTT increases, the CDG system will back off, but the loss-based system will continue to pump out packets until the router starts dropping things. Indeed, it may increase its transmission rate to soak up the bandwidth that the CDG system is no longer using. If CDG allows itself to be muscled out of a contended link in that manner, one can predict with high confidence that it will not find many users.
To deal with this problem, the CDG authors developed a heuristic to detect situations where competition with a loss-based system is occurring. If CDG is working properly, a decision to slow down the transmission rate should result in the RTT values getting smaller — δmin and δmax should go negative. If that fails to happen after a few backoff operations, the algorithm will conclude that somebody else is playing by different rules and stop backing off for a while. CDG also remembers the previous maximum value of the congestion window (as the "shadow window"); this value can be used to quickly restore the congestion window in the event of a packet drop.
CDG's handling of packet drops is interesting. The motivations for using delay gradients are to avoid depending on packet loss as a signal and to avoid slowing transmission in response to packet losses that do not result from congestion. But congestion-related packet loss will still happen when CDG is in use, and the algorithm should respond accordingly. Backing off in response to packet loss is easy; the tricky part is determining whether that loss is a congestion signal or not.
As a general rule, congestion manifests itself as an overflow of the packet queue for the slowest link used by a connection. If that queue is known to be full, then a packet loss is likely to be a result of the congestion there; if, instead, it is known to be relatively empty, congestion is probably not the problem. The heuristic used to determine the state of the queue is this: when a queue fills, Τmax will reach its largest possible value and stop increasing (because the queue cannot get any longer), but Τmin will continue to increase. CDG will only treat a packet loss as a congestion signal when a full queue has been detected in this manner.
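That full-queue heuristic can be expressed as a small predicate. The tolerance `eps` below is an assumption made for illustration; the real algorithm's thresholds differ.

```python
def queue_is_full(dmin, dmax, eps=0.05):
    """Infer a full bottleneck queue from the RTT gradients.

    When the queue saturates, the maximum RTT stops growing (dmax is
    roughly zero) while the minimum RTT keeps rising (dmin > 0).
    """
    return abs(dmax) <= eps and dmin > eps

def loss_is_congestion(dmin, dmax):
    # Only a packet loss that coincides with a full queue is treated
    # as a congestion signal; other losses are ignored.
    return queue_is_full(dmin, dmax)
```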
At least, that is how the algorithm was designed; the current Linux patch does not quite work that way. In the patch posting, Kenneth Klette Jonassen notes that: "We decided to disable the loss tolerance heuristic by default due to concerns about its safety outside closed environments." So that aspect of CDG will have to wait until the algorithm's behavior on the full Internet is better understood.
Even so, networking developer Eric Dumazet said that the new congestion-control module "looks awesome". Its true level of awesomeness can only be determined via years of real-world experience. But getting CDG into the Linux kernel is a good first step toward the acquisition of that experience. Should things go well, loss-based congestion control might end up backing off in its role on the net.
(See this paper [PDF] for the full details on how the CDG algorithm works.)
Kernel support for SYN packet fingerprinting
The initial packet of a TCP connection (i.e. the SYN packet) contains information that can be used to detect attributes of the remote system through TCP/IP fingerprinting. But that data is contained in the headers of the packets, which means it is only accessible to the kernel. A patch set that was recently merged into the net-next tree would change that to allow user-space servers to request the header information on connections they have accepted.
Eric Munson started the conversation when he posted a patch that would allow a program to request that the SYN packets be saved by using setsockopt() on a listening socket. The SYN headers could then be retrieved, once, via a getsockopt() call on the socket returned by accept(). That would allow user space to examine the TCP and IP headers to identify (or at least narrow down) the operating system of the remote host that made the connection.
Munson's patch simply stored the SKB (i.e. struct sk_buff) that contained the SYN packet, which could be rather large (up to 4KB), as Eric Dumazet pointed out. For millions of client connections, that memory can add up, he said.
Instead, Dumazet suggested, a 2012 patch from Tom Herbert (or one based on that) should be used. That code has been used internally at Google for around two years, he said, without any problems handling large numbers of simultaneous connections. Instead of storing the SKB, it allocates space just for the headers with kmalloc()—usually less than 128 bytes per connection.
When Herbert posted his patch, there were concerns about adding eight bytes to each SKB for a "very fringe feature" (in the words of network maintainer David Miller). Herbert's original patch also stored the SKB like Munson's does. The patch was never merged, but Dumazet modified it to kmalloc() space for the headers and it was put into production at Google.
Munson was not particularly tied to his implementation; he said that he was happy to back Dumazet's patch if it met his needs. That patch was posted on May 3. It adds two new socket options that are used to request and retrieve the SYN headers. Servers request that the kernel save the headers by calling setsockopt() with TCP_SAVE_SYN either before or after the listen() call; the kernel will save the headers for subsequent connection requests. IP and TCP headers can be retrieved, once, by calling getsockopt() with TCP_SAVED_SYN on the socket returned from accept().
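The two-option flow can be demonstrated over the loopback interface. The numeric values below are the Linux definitions of TCP_SAVE_SYN (27) and TCP_SAVED_SYN (28), which Python's socket module does not export by name; the example needs a kernel with this feature (4.2 or later).

```python
import socket
import struct

TCP_SAVE_SYN = 27    # Linux value; not in the socket module's constants
TCP_SAVED_SYN = 28

# Listening socket: ask the kernel to retain SYN headers for
# connections accepted from here on.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.setsockopt(socket.IPPROTO_TCP, TCP_SAVE_SYN, 1)
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# Retrieve the saved IP+TCP headers; this works only once per socket.
syn = conn.getsockopt(socket.IPPROTO_TCP, TCP_SAVED_SYN, 512)

version = syn[0] >> 4               # 4 for an IPv4 SYN
ttl = syn[8]                        # IPv4 time-to-live field
ihl = (syn[0] & 0x0f) * 4           # IP header length in bytes
window = struct.unpack("!H", syn[ihl + 14:ihl + 16])[0]  # TCP window size
print(version, ttl, window)

conn.close()
cli.close()
srv.close()
```

Fields such as the initial TTL, window size, and the TCP options present are exactly the kind of raw material that passive-fingerprinting tools use to guess the remote operating system.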
Michael Kerrisk complained that the option names were too similar, while also asking about how the interface would be used. Dumazet disagreed about the names, but provided a test program used by Google to demonstrate how the new options work for user space.
Andy Lutomirski wondered if too much information was being returned to user space with Munson's patch. It turned out that Ethernet headers were also being returned, which Munson agreed was probably not needed. John Heffner asked a related question: "Are there conditions where, for security purposes, you don't want an application to have access to the raw SYNs?" Dumazet indicated that it was believed to be safe to provide the IP and TCP headers.
The patch was applied by Miller on May 5, though he noted that the behavior when a too-small buffer was passed to getsockopt() should be rethought. The original patch simply copied as much data as it could into the user-space buffer, but that gave no indication that the SYN headers were not complete. Miller suggested that it should return an error and indicate the proper length so that the program could allocate more space if needed. Munson subsequently posted a patch to do just that.
The feature seems like it will be useful; it appears that it already has been for Google. It is interesting to note that the company has been collecting these fingerprints on (at least) some portion of its vast server farm, though it is not clear what it is doing with all of that information. Soon, though, others will be able to do so too—once 4.2 is released.
An introduction to Clear Containers
Containers are hot. Everyone loves them. Developers love the ease of creating a "bundle" of something that users can consume; DevOps and information-technology departments love the ease of management and deployment. To a large degree, containers entered the spotlight when Docker changed the application-development industry on the server side in a way that resembles how the iPhone changed the client application landscape.

The word "container" is not just used for applications, though; it is also used to describe a technology that can run a piece of software in an isolated way. Such containers are about using control groups to manage resources and kernel namespaces to limit the visibility and reach of your container app. For the typical LWN reader, this is likely what one thinks about when encountering the word "container."
Many people who advocate for containers start by saying that virtual machines are expensive and slow to start, and that containers provide a more efficient alternative. The usual counterpoint is about how secure kernel containers really are against adversarial users with an arsenal of exploits in their pockets. Reasonable people can argue for hours on this topic, but the reality is that quite a few potential users of containers see this as a showstopper. There are many efforts underway to improve the security of containers and namespaces in both open-source projects and startup companies.
We (the Intel Clear Containers group) are taking a little bit of a different tack on the security of containers by going back to the basic question: how expensive is virtual-machine technology, really? Performance in this regard is primarily measured using two metrics: startup time and memory overhead. The first is about how quickly your data center can respond to an incoming request (say a user logs into your email system); the second is about how many containers you can pack on a single server.
We set out to build a system (which we call "Clear Containers") where one can use the isolation of virtual-machine technology along with the deployment benefits of containers. As part of this, we let go of the "machine" notion traditionally associated with virtual machines; we're not going to pretend to be a standard PC that is compatible with just about any OS on the planet.
To provide a preview of the results: we can launch such a secured container that uses virtualization technology in under 150 milliseconds, and the per-container memory overhead is roughly 18 to 20MB (this means you can run over 3500 of these on a server with 128GB of RAM). While this is not quite as fast as the fastest Docker startup using kernel namespaces, for many applications this is likely going to be good enough. And we aren't finished optimizing yet.
So how did we do this?
Hypervisor
With KVM as the hypervisor of choice, we looked at the QEMU layer. QEMU is great for running Windows or legacy Linux guests, but that flexibility comes at a hefty price. Not only does all of the emulation consume memory, it also requires some form of low-level firmware in the guest as well. All of this adds quite a bit to virtual-machine startup times (500 to 700 milliseconds is not unusual).
However, we have the kvmtool mini-hypervisor at our disposal (LWN has covered kvmtool in the past). With kvmtool, we no longer need a BIOS or UEFI; instead we can jump directly into the Linux kernel. Kvmtool is not cost-free, of course; starting kvmtool and creating the CPU contexts takes approximately 30 milliseconds. We have enhanced kvmtool to support execute-in-place on the kernel to avoid having to decompress the kernel image; we just mmap() the vmlinux file and jump into it, saving both memory and time.
Kernel
A Linux kernel boots pretty fast. On a real machine, most of the boot time in the kernel is spent initializing some piece of hardware. However, in a virtual machine, none of these hardware delays are there—it's all fake, after all—and, in practice, one uses only the virtio class of devices that are pretty much free to set up. We had to optimize away a few early-boot CPU initialization delays; but otherwise, booting a kernel in a virtual-machine context takes about 32 milliseconds, with a lot of room left for optimization.
We also had to fix several bugs in the kernel. Some fixes are upstream already and others will go upstream in the coming weeks.
User space
In 2008 we talked about the 5-second boot at the Plumbers Conference, and, since then, many things have changed—with systemd being at the top of the list. Systemd makes it trivial to create a user-space environment that boots quickly. I would love to write a long essay here about how we had to optimize user space, but the reality is—with some minor tweaks and just putting the OS together properly—user space boots pretty quickly (less than 75 milliseconds) already. (When recording bootcharts with high-resolution sampling, the measured time is a bit longer, but that is all measurement overhead.)
Memory consumption
A key feature to help with memory consumption is DAX, which the 4.0 kernel now supports in the ext4 filesystem. If your storage is visible as regular memory to the host CPU, DAX enables the system to do execute-in-place of files stored there. In other words, when using DAX, you bypass the page cache and virtual-memory subsystem completely. For applications that use mmap(), this means a true zero-copy approach, and for code that uses the read() system call (or equivalent) you will have only one copy of the data. DAX was originally designed for fast flash-like storage that shows up as memory to the CPU; but in a virtual-machine environment, this type of storage is easy to emulate. All we need to do on the host is map the disk image file into the guest's physical memory, and use a small device driver in the guest kernel that exposes this memory region to the kernel as a DAX-ready block device.
What this DAX solution provides is a zero-copy, no-memory-cost solution for getting all operating-system code and data into the guest's user space. In addition, when the MAP_PRIVATE flag is used in the hypervisor, the storage becomes copy-on-write for free; writes in the guest to the filesystem are not persistent, so they will go away when the guest container terminates. This MAP_PRIVATE solution makes it trivial to share the same disk image between all the containers, and also means that even if one container is compromised and mucks with the operating-system image, these changes do not persist in future containers.
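The copy-on-write behavior of MAP_PRIVATE can be illustrated at the file level (the real mechanism maps the image into guest physical memory, but the semantics are the same): writes through a private mapping fault in a private copy of the page, and the backing file never changes.

```python
import mmap
import os
import tempfile

# Create a small stand-in for a shared, read-only OS image.
fd, path = tempfile.mkstemp()
os.write(fd, b"original image data")

size = os.path.getsize(path)
view = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE)
view[0:8] = b"MODIFIED"      # the "guest" scribbles on its mapping

# The write hit a private copy-on-write page; the backing file
# (and thus every other container's view of the image) is untouched.
with open(path, "rb") as f:
    on_disk = f.read()
print(on_disk)

view.close()
os.close(fd)
os.unlink(path)
```

This is why one disk image can be shared safely among all containers: each one's modifications live only in its own private pages and vanish when the container exits.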
A second key feature to reduce memory cost is kernel same-page merging (KSM) on the host. KSM is a way to deduplicate memory within and between processes and KVM guests.
Finally, we optimized our core user space for minimal memory consumption. This mostly consists of calling the glibc malloc_trim() function at the end of the initialization of resident daemons, causing them to give back to the kernel any malloc() buffers that glibc held onto. Glibc by default implements a type of hysteresis where it holds on to some amount of freed memory as an optimization in case memory is needed again soon.
Next steps
We have this working as a proof of concept with rkt (implementing the appc spec that LWN wrote about recently). Once this work is a bit more mature, we will investigate adding support into Docker as well. More information on how to get started and get code can be found at clearlinux.org, which we will update as we make progress with our integration and optimization efforts.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet