
Kernel development

Brief items

Kernel release status

The 3.6 merge window remains open, so there is no current development kernel release. Changes continue to move into the mainline; see the separate article below for details.

Stable updates: 3.2.24 was released on July 26, 3.4.7 came out on July 30, and 3.0.39 was released on August 1. In addition to the usual fixes, 3.0.39 includes a significant set of backported memory management performance patches.

The 3.2.25 update is in the review process as of this writing; it can be expected on or after August 2.

Comments (none posted)

Quotes of the week

If someone has to read the code to find out what the driver is, your help text probably sucks.
Dave Jones

The number of underscores for the original rcu_dereference()'s local variable was the outcome of an argument about how obfuscated that variable's name should be in order to avoid possible collisions with names in the enclosing scope. Nine leading underscores might seem excessive, or even as you say, insane, but on the other hand no name collisions have ever come to my attention.
Paul McKenney

For cases like that, I will do the merge myself, but I'll actually double-check my merge against the maintainer merge. And it's happened more than once that my merge has differed, and _my_ merge is the correct one. The maintainer may know his code better, but I know my merging. I do a ton of them.
Linus Torvalds

Comments (none posted)

RIP Andre Hedrick (Register)

The Register has an article on the life and death of Andre Hedrick, the former kernel IDE maintainer who passed away on July 13. "Today, millions of people use digital restriction management systems that lock down books, songs and music - the Amazon Kindle, the BBC iPlayer and Spotify are examples - but consumers enter into the private commercial agreement knowingly. It isn't set by default in the factory, as it might have been. The PC remains open rather than becoming an appliance. Andre was never comfortable taking the credit he really deserved for this achievement." See also this weblog page where memories are being collected.

Comments (34 posted)

Garzik: An Andre To Remember

Jeff Garzik has shared his memories of Andre Hedrick on the linux-kernel mailing list; worth a read. "This is a time for grief and a time for celebration of Andre's accomplishments, but also it is a time to look around at our fellow geeks and offer our support, if similar behavioral signs appear."

Full Story (comments: 68)

Kernel development news

3.6 merge window part 2

By Jonathan Corbet
August 1, 2012
As of this writing, just over 8,200 non-merge changesets have been pulled into Linus's repository; that's nearly 4,000 since last week's summary. It seems that any hopes that 3.6 might be a relatively low-volume cycle are not meant to be fulfilled. That said, things seem to be going relatively smoothly, with only a small number of problems being reported so far.

User-visible changes merged since last week include:

  • The btrfs send/receive feature has been merged. Send/receive can calculate the differences between two btrfs subvolumes or snapshots and serialize the result; it can be used for, among other things, easy mirroring of volumes and incremental backups.

  • Btrfs has also gained the ability to apply disk quotas to subvolumes. According to btrfs maintainer Chris Mason, "This enables full tracking of how many blocks are allocated to each subvolume (and all snapshots) and you can set limits on a per-subvolume basis. You can also create quota groups and toss multiple subvolumes into a big group. It's everything you need to be a web hosting company and give each user their own subvolume."

  • The kernel has gained better EFI booting support. This should allow the removal of a lot of EFI setup code from various bootloaders, which now need only load the kernel and jump into it.

  • The new "coupled cpuidle" code enables better CPU power management on systems where CPUs cannot be powered down individually. See this commit for more information on how this feature works.

  • The LED code supports a new "oneshot" mode where applications can request a single LED blink via sysfs. See Documentation/leds/ledtrig-oneshot.txt for details.

  • A number of random number generator changes have been merged, hopefully leading to more secure random numbers, especially on embedded devices.

  • The VFIO subsystem, intended to be a safe mechanism for the creation of user-space device drivers, has been merged; see Documentation/vfio.txt for more information.

  • The swap-over-NFS patch set has been merged, making the placement of swap files on NFS-mounted filesystems a not entirely insane thing to do.

  • New hardware support includes:

    • Processors and systems: Loongson 1B CPUs.

    • Audio: Wolfson Micro "Arizona" audio controllers (WM5102 and WM5110 in particular).

    • Input: NXP LPC32XX key scanners, MELFAS MMS114 touchscreen controllers, and EDT ft5x06 based polytouch devices.

    • Miscellaneous: National Semiconductor/TI LM3533 ambient light sensors, Analog Devices AD9523 clock generators, Analog Devices ADF4350/ADF4351 wideband synthesizers, Analog Devices AD7265/AD7266 analog to digital converters, Analog Devices AD-FMCOMMS1-EBZ SPI-I2C-bridges, Microchip MCP4725 digital-to-analog converters, Maxim DS28E04-100 1-Wire EEPROMs, Vishay VCNL4000 ambient light/proximity sensors, Texas Instruments OMAP4+ temperature sensors, EXYNOS HW random number generators, Atmel AES, SHA1/SHA256, and AES crypto accelerators, Blackfin CRC accelerators, AMD 8111 GPIO controllers, TI LM3556 and LP8788 LED controllers, BlinkM I2C RGB LED controllers, Calxeda Highbank memory controllers, Maxim Semiconductor MAX77686 PMICs, Marvell 88PM800 and 88PM805 PMICs, Lantiq Falcon SPI controllers, and Broadcom BCM63xx random number generators.

    • Networking: Cambridge Silicon Radio wireless controllers.

    • USB: Freescale i.MX ci13xxx USB controllers, Marvell PXA2128 USB 3.0 controllers, and Maxim MAX77693 MUIC USB port accessory detectors.

    • Video4Linux: Realtek RTL2832 DVB-T demodulators, Analog Devices ADV7393 encoders, Griffin radioSHARK and radioSHARK2 USB radio receivers, and IguanaWorks USB IR transceivers.

    • Staging graduations: IIO digital-to-analog converter drivers.

Changes visible to kernel developers include:

  • The pstore persistent storage mechanism has improved handling of console log messages. The Android RAM buffer console mechanism has been removed, since pstore is now able to provide all of the same functionality. Pstore has also gained function tracer support, allowing the recording of function calls prior to a panic.

  • The new PWM framework eases the writing of drivers for pulse-width modulation devices, including LEDs, fans, and more. See Documentation/pwm.txt for details.

  • There is a new utility function:

         size_t memweight(const void *ptr, size_t bytes);
    

    It returns the number of bits set in the given memory region.

  • The fault injection subsystem has a new module which can inject errors into notifier call chains.

  • There is a new "flexible proportions" library allowing the calculation of proportions over a variable period. See <linux/flex_proportions.h> for the interface.

  • The new __GFP_MEMALLOC flag allows memory allocations to dip into the emergency reserves.

  • The IRQF_SAMPLE_RANDOM interrupt flag no longer does anything; it has been removed from the kernel.

Andrew Morton's big pile of patches was merged on August 1; that is usually a sign that the merge window is nearing its end. Expect a brief update after the 3.6 merge window closes, but, at this point, the feature set for this release can be expected to be nearly complete.

Comments (1 posted)

ACCESS_ONCE()

By Jonathan Corbet
August 1, 2012
Even a casual reader of the kernel source code is likely to run into invocations of the ACCESS_ONCE() macro eventually; there are well over 200 of them in the current source tree. Many such readers probably do not stop to understand just what that macro means; a recent discussion on the mailing list made it clear that even core kernel developers may not have a firm idea of what it does. Your editor was equally ignorant but decided to fix that; the result, hopefully, is a reasonable explanation of why ACCESS_ONCE() exists and when it must be used.

The functionality of this macro is actually well described by its name; its purpose is to ensure that the value passed as a parameter is accessed exactly once by the generated code. One might well wonder why that matters. It comes down to the fact that the C compiler will, if not given reasons to the contrary, assume that there is only one thread of execution in the address space of the program it is compiling. Concurrency is not built into the C language itself, so mechanisms for dealing with concurrent access must be built on top of the language; ACCESS_ONCE() is one such mechanism.

Consider, for example, the following code snippet from kernel/mutex.c:

    for (;;) {
	struct task_struct *owner;

	owner = ACCESS_ONCE(lock->owner);
	if (owner && !mutex_spin_on_owner(lock, owner))
	    break;
 	/* ... */

This is a small piece of the adaptive spinning code that hopes to quickly grab a mutex once the current owner drops it, without going to sleep. There is much more to this for loop than has been shown here, but this code is sufficient to show why ACCESS_ONCE() can be necessary.

Imagine for a second that the compiler in use is developed by fanatical developers who will optimize things in every way they can. This is not a purely hypothetical scenario; as Paul McKenney recently attested: "I have seen the glint in their eyes when they discuss optimization techniques that you would not want your children to know about!" These developers might create a compiler that concludes that, since the code in question does not actually modify lock->owner, it is not necessary to actually fetch its value each time through the loop. The compiler might then rearrange the code into something like:

    owner = ACCESS_ONCE(lock->owner);
    for (;;) {
	if (owner && !mutex_spin_on_owner(lock, owner))
	    break;

What the compiler has missed is the fact that lock->owner is being changed by another thread of execution entirely. The result is code that will fail to notice any such changes as it executes the loop multiple times, leading to unpleasant results. The ACCESS_ONCE() call prevents this optimization from happening, with the result that the code (hopefully) executes as intended.

As it happens, an optimized-out access is not the only peril that this code could encounter. Some processor architectures (x86, for example) are not richly endowed with registers; on such systems, the compiler must make careful choices regarding which values to keep in registers if it is to generate the highest-performing code. Specific values may be pushed out of the register set, then pulled back in later. Should that happen to the mutex code above, the result could be multiple references to lock->owner. And that could cause trouble; if the value of lock->owner changed in the middle of the loop, the code, which is expecting the value of its local owner variable to remain constant, could become fatally confused. Once again, the ACCESS_ONCE() invocation tells the compiler not to do that, avoiding potential problems.

The actual implementation of ACCESS_ONCE(), found in <linux/compiler.h>, is fairly straightforward:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

In other words, it works by turning the relevant variable, temporarily, into a volatile type.

Given the kinds of hazards presented by optimizing compilers, one might well wonder why this kind of situation does not come up more often. The answer is that most concurrent access to data is (or certainly should be) protected by locks. Spinlocks and mutexes both function as optimization barriers, meaning that they prevent optimizations on one side of the barrier from carrying over to the other. If code only accesses a shared variable with the relevant lock held, and if that variable can only change when the lock is released (and held by a different thread), the compiler will not create subtle problems. It is only in places where shared data is accessed without locks (or explicit barriers) that a construct like ACCESS_ONCE() is required. Scalability pressures are causing the creation of more of this type of code, but most kernel developers still should not need to worry about ACCESS_ONCE() most of the time.

Comments (76 posted)

TCP Fast Open: expediting web services

By Michael Kerrisk
August 1, 2012

Much of today's Internet traffic takes the form of short TCP data flows that consist of just a few round trips exchanging data segments before the connection is terminated. The prototypical example of this kind of short TCP conversation is the transfer of web pages over the Hypertext Transfer Protocol (HTTP).

The speed of TCP data flows is dependent on two factors: transmission delay (the width of the data pipe) and propagation delay (the time that the data takes to travel from one end of the pipe to the other). Transmission delay is dependent on network bandwidth, which has increased steadily and substantially over the life of the Internet. On the other hand, propagation delay is a function of router latencies, which have not improved to the same extent as network bandwidth, and the speed of light, which has remained stubbornly constant. (At intercontinental distances, this physical limitation means that—leaving aside router latencies—transmission through the medium alone requires several milliseconds.) The relative change in the weighting of these two factors means that over time the propagation delay has become a steadily larger component in the overall latency of web services. (This is especially so for many web pages, where a browser often opens several connections to fetch multiple small objects that compose the page.)

Reducing the number of round trips required in a TCP conversation has thus become a subject of keen interest for companies that provide web services. It is therefore unsurprising that Google should be the originator of a series of patches to the Linux networking stack to implement the TCP Fast Open (TFO) feature, which allows the elimination of one round-trip time (RTT) from certain kinds of TCP conversations. According to the implementers (in "TCP Fast Open", CoNEXT 2011 [PDF]), TFO could result in speed improvements of between 4% and 41% in the page load times on popular web sites.

We first wrote about TFO back in September 2011, when the idea was still in the development stage. Now that the TFO implementation is starting to make its way into the kernel, it's time to revisit it in more detail.

The TCP three-way handshake

To understand the optimization performed by TFO, we first need to note that each TCP conversation begins with a round trip in the form of the so-called three-way handshake. The three-way handshake is initiated when a client makes a connection request to a server. At the application level, this corresponds to a client performing a connect() system call to establish a connection with a server that has previously bound a socket to a well-known address and then called accept() to receive incoming connections. Figure 1 shows the details of the three-way handshake in diagrammatic form.

[TCP Three-Way Handshake]
Figure 1: TCP three-way handshake between a client and a server

During the three-way handshake, the two TCP end-points exchange SYN (synchronize) segments containing options that govern the subsequent TCP conversation—for example, the maximum segment size (MSS), which specifies the maximum number of data bytes that a TCP end-point can receive in a TCP segment. The SYN segments also contain the initial sequence numbers (ISNs) that each end-point selects for the conversation (labeled M and N in Figure 1).

The three-way handshake serves another purpose with respect to connection establishment: in the (unlikely) event that the initial SYN is duplicated (this may occur, for example, because underlying network protocols duplicate network packets), then the three-way handshake allows the duplication to be detected, so that only a single connection is created. If a connection were established before completion of the three-way handshake, then a duplicate SYN could cause a second connection to be created.

The problem with current TCP implementations is that data can only be exchanged on the connection after the initiator of the connection has received an ACK (acknowledge) segment from the peer TCP. In other words, data can be sent from the client to the server only in the third step of the three-way handshake (the ACK segment sent by the initiator). Thus, one full round trip time is lost before data is even exchanged between the peers. This lost RTT is a significant component of the latency of short web conversations.

Applications such as web browsers try to mitigate this problem using HTTP persistent connections, whereby the browser holds a connection open to the web server and reuses that connection for later HTTP requests. However, the effectiveness of this technique is decreased because idle connections may be closed before they are reused. For example, in order to limit resource usage, busy web servers often aggressively close idle HTTP connections. The result is that a high proportion of HTTP requests are cold, requiring a new TCP connection to be established to the web server.

Eliminating a round trip

Theoretically, the initial SYN segment could contain data sent by the initiator of the connection: RFC 793, the specification for TCP, does permit data to be included in a SYN segment. However, TCP is prohibited from delivering that data to the application until the three-way handshake completes. This is a necessary security measure to prevent various kinds of malicious attacks. For example, if a malicious client sent a SYN segment containing data and a spoofed source address, and the server TCP passed that segment to the server application before completion of the three-way handshake, then the segment would both cause resources to be consumed on the server and cause (possibly multiple) responses to be sent to the victim host whose address was spoofed.

The aim of TFO is to eliminate one round trip time from a TCP conversation by allowing data to be included as part of the SYN segment that initiates the connection. TFO is designed to do this in such a way that the security concerns described above are addressed. (T/TCP, a mechanism designed in the early 1990s, also tried to provide a way of short circuiting the three-way handshake, but fundamental security flaws in its design meant that it never gained wide use.)

On the other hand, the TFO mechanism does not detect duplicate SYN segments. (This was a deliberate choice made to simplify design of the protocol.) Consequently, servers employing TFO must be idempotent—they must tolerate the possibility of receiving duplicate initial SYN segments containing the same data and produce the same result regardless of whether one or multiple such SYN segments arrive. Many web services are idempotent, for example, web servers that serve static web pages in response to URL requests from browsers, or web services that manipulate internal state but have internal application logic to detect (and ignore) duplicate requests from the same client.

In order to prevent the aforementioned malicious attacks, TFO employs security cookies (TFO cookies). The TFO cookie is generated once by the server TCP and returned to the client TCP for later reuse. The cookie is constructed by encrypting the client IP address in a fashion that is reproducible (by the server TCP) but is difficult for an attacker to guess. Request, generation, and exchange of the TFO cookie happens entirely transparently to the application layer.

At the protocol layer, the client requests a TFO cookie by sending a SYN segment to the server that includes a special TCP option asking for a TFO cookie. The SYN segment is otherwise "normal"; that is, there is no data in the segment and establishment of the connection still requires the normal three-way handshake. In response, the server generates a TFO cookie that is returned in the SYN-ACK segment that the server sends to the client. The client caches the TFO cookie for later use. The steps in the generation and caching of the TFO cookie are shown in Figure 2.

[Generating the TFO cookie]
Figure 2: Generating the TFO cookie

At this point, the client TCP now has a token that it can use to prove to the server TCP that an earlier three-way handshake to the client's IP address completed successfully.

For subsequent conversations with the server, the client can short circuit the three-way handshake as shown in Figure 3.

[Employing the TFO cookie]
Figure 3: Employing the TFO cookie

The steps shown in Figure 3 are as follows:

  1. The client TCP sends a SYN that contains both the TFO cookie (specified as a TCP option) and data from the client application.

  2. The server TCP validates the TFO cookie by duplicating the encryption process based on the source IP address of the new SYN. If the cookie proves to be valid, then the server TCP can be confident that this SYN comes from the address it claims to come from. This means that the server TCP can immediately pass the application data to the server application.

  3. From here on, the TCP conversation proceeds as normal: the server TCP sends a SYN-ACK segment to the client, which the client TCP then acknowledges, thus completing the three-way handshake. The server TCP can also send response data segments to the client TCP before it receives the client's ACK.

In the above steps, if the TFO cookie proves not to be valid, then the server TCP discards the data and sends a segment to the client TCP that acknowledges just the SYN. At this point, the TCP conversation falls back to the normal three-way handshake. If the client TCP is authentic (not malicious), then it will (transparently to the application) retransmit the data that it sent in the SYN segment.

Comparing Figure 1 and Figure 3, we can see that a complete RTT has been saved in the conversation between the client and server. (This assumes that the client's initial request is small enough to fit inside a single TCP segment. This is true for most requests, but not all. Whether it might be technically possible to handle larger requests—for example, by transmitting multiple segments from the client before receiving the server's ACK—remains an open question.)

There are various details of TFO cookie generation that we don't cover here. For example, the algorithm for generating a suitably secure TFO cookie is implementation-dependent, and should (and can) be designed to be computable with low processor effort, so as not to slow the processing of connection requests. Furthermore, the server should periodically change the encryption key used to generate the TFO cookies, so as to prevent attackers harvesting many cookies over time to use in a coordinated attack against the server.

There is one detail of the use of TFO cookies that we will revisit below. Because the TFO mechanism allows a client that submits a valid TFO cookie to trigger resource usage on the server before completion of the three-way handshake, the server can be the target of resource-exhaustion attacks. To prevent this possibility, the server imposes a limit on the number of pending TFO connections that have not yet completed the three-way handshake. When this limit is exceeded, the server ignores TFO cookies and falls back to the normal three-way handshake for subsequent client requests until the number of pending TFO connections falls below the limit; this allows the server to employ traditional measures against SYN-flood attacks.

The user-space API

As noted above, the generation and use of TFO cookies is transparent to the application level: the TFO cookie is automatically generated during the first TCP conversation between the client and server, and then automatically reused in subsequent conversations. Nevertheless, applications that wish to use TFO must notify the system using suitable API calls. Furthermore, certain system configuration knobs need to be turned in order to enable TFO.

The changes required to a server in order to support TFO are minimal, and are highlighted in the code template below.

    sfd = socket(AF_INET, SOCK_STREAM, 0);   // Create socket

    bind(sfd, ...);                          // Bind to well known address
    
    int qlen = 5;                            // Value to be chosen by application
    setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
    
    listen(sfd, ...);                        // Mark socket to receive connections

    cfd = accept(sfd, NULL, 0);              // Accept connection on new socket

    // read and write data on connected socket cfd

    close(cfd);

Setting the TCP_FASTOPEN socket option requests the kernel to use TFO for the server's socket. By implication, this is also a statement that the server can handle duplicated SYN segments in an idempotent fashion. The option value, qlen, specifies this server's limit on the size of the queue of TFO requests that have not yet completed the three-way handshake (see the remarks on prevention of resource-exhaustion attacks above).

The changes required to a client in order to support TFO are also minor, but a little more substantial than for a TFO server. A normal TCP client uses separate system calls to initiate a connection and transmit data: connect() to initiate the connection to a specified server address and (typically) write() or send() to transmit data. Since a TFO client combines connection initiation and data transmission in a single step, it needs to employ an API that allows both the server address and the data to be specified in a single operation. For this purpose, the client can use either of two repurposed system calls: sendto() and sendmsg().

The sendto() and sendmsg() system calls are normally used with datagram (e.g., UDP) sockets: since datagram sockets are connectionless, each outgoing datagram must include both the transmitted data and the destination address. Since this is the same information that is required to initiate a TFO connection, these system calls are recycled for the purpose, with the requirement that the new MSG_FASTOPEN flag must be specified in the flags argument of the system call. A TFO client thus has the following general form:

    sfd = socket(AF_INET, SOCK_STREAM, 0);
    
    sendto(sfd, data, data_len, MSG_FASTOPEN, 
                (struct sockaddr *) &server_addr, addr_len);
        // Replaces connect() + send()/write()
    
    // read and write further data on connected socket sfd

    close(sfd);

If this is the first TCP conversation between the client and server, then the above code will result in the scenario shown in Figure 2, with the result that a TFO cookie is returned to the client TCP, which then caches the cookie. If the client TCP has already obtained a TFO cookie from a previous TCP conversation, then the scenario is as shown in Figure 3, with client data being passed in the initial SYN segment and a round trip being saved.

In addition to the above APIs, there are various knobs—in the form of files in the /proc/sys/net/ipv4 directory—that control TFO on a system-wide basis:

  • The tcp_fastopen file can be used to view or set a value that enables the operation of different parts of the TFO functionality. Setting bit 0 (i.e., the value 1) in this value enables client TFO functionality, so that applications can request TFO cookies. Setting bit 1 (i.e., the value 2) enables server TFO functionality, so that server TCPs can generate TFO cookies in response to requests from clients. (Thus, the value 3 would enable both client and server TFO functionality on the host.)

  • The tcp_fastopen_cookies file can be used to view or set a system-wide limit on the number of pending TFO connections that have not yet completed the three-way handshake. While this limit is exceeded, all incoming TFO connection attempts fall back to the normal three-way handshake.

Current state of TCP fast open

Currently, TFO is an Internet Draft with the IETF. Linux is the first operating system that is adding support for TFO. However, as yet that support remains incomplete in the mainline kernel. The client-side support has been merged for Linux 3.6. However, the server-side TFO support has not so far been merged, and from conversations with the developers it appears that this support won't be added in the current merge window. Thus, an operational TFO implementation is likely to become available only in Linux 3.7.

Once operating system support is fully available, a few further steps need to be completed to achieve wider deployment of TFO on the Internet. Among these is assignment by IANA of a dedicated TCP Option Number for TFO. (The current implementation employs the TCP Experimental Option Number facility as a placeholder for a real TCP Option Number.)

Then, of course, suitable changes must be made to both clients and servers along the lines described above. Although each client-server pair requires modification to employ TFO, it's worth noting that changes to just a small subset of applications—most notably, web servers and browsers—will likely yield most of the benefit visible to end users. During the deployment process, TFO-enabled clients may attempt connections with servers that don't understand TFO. This case is handled gracefully by the protocol: transparently to the application, the client and server will fall back to a normal three-way handshake.

There are other deployment hurdles that may be encountered. In their CoNEXT 2011 paper, the TFO developers note that a minority of middle-boxes and hosts drop TCP SYN segments containing unknown (i.e., new) TCP options or data. Such problems are likely to diminish as TFO is more widely deployed, but in the meantime a client TCP can (transparently) handle such problems by falling back to the normal three-way handshake on individual connections, or generally falling back for all connections to specific server IP addresses that show repeated failures for TFO.

Conclusion

TFO is promising technology that has the potential to make significant reductions in the latency of billions of web service transactions that take place each day. Barring any unforeseen security flaws (and the developers seem to have considered the matter quite carefully), TFO is likely to see rapid deployment in web browsers and servers, as well as in a number of other commonly used web applications.

Comments (49 posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds