Kernel development
Brief items
Kernel release status
The 3.19 merge window is open, so there is no current development kernel. Patches have begun to flow into the mainline for the 3.19 development cycle; see the separate article below for details. The 3.18 kernel was released on December 7.
Stable updates: the 3.17.5 stable kernel was released on December 7 with a comment saying "No one should use it"; instead, the immediately following 3.17.6, containing an important patch reversion, should be used. Also available are 3.14.26 and 3.10.62.
Kernel development news
The 3.19 merge window opens
As of this writing, the merge window for the 3.19 development cycle has gotten off to a bit of a slow start; fewer than 2000 non-merge changesets have been pulled into the mainline repository so far. As can be seen from the lists below, though, that is still enough for some interesting new code to make its way into the kernel.
User-visible changes for 3.19 include:
- Support for Intel's MPX technology has
been added to the kernel. MPX-enabled processors (which are still
mostly unobtainable) can perform bounds-checking on memory references,
presumably catching a lot of bugs and blocking the exploitation of
buffer-overflow vulnerabilities. Using this feature requires
providing the processor with a lot of information about the acceptable
bounds for each memory reference, though, so full adoption is likely
to take time.
- The device mapper "thin provisioning" target has seen some significant
performance improvements, mostly having to do with aggregating I/O
operations to the same block before issuing them to the underlying
device.
- The kernel now has support for Altera's "Nios II" processor.
- The arm64 architecture has gained support for the secure computing
("seccomp") subsystem.
- New hardware support includes:
- Systems and processors:
Broadcom IPROC architected systems-on-chip (SoCs),
Amlogic Meson8 SoCs,
Allwinner A80 SoCs,
Samsung Exynos4415 SoCs,
Freescale LS1021A SoCs,
Alphascale ASM9260 SoCs, and
AMD Seattle SoCs.
Additionally, dozens of new systems are supported through device
tree additions.
- Block:
Tekram DC390(T) and Am53/79C974 SCSI adapters and
Western Digital WD7193/7197/7296 SCSI adapters.
- Miscellaneous: Toshiba type A SD/MMC card interfaces, X-Powers AXP288 analog-to-digital converters, Diolan DLN2 USB-I2C/SPI/GPIO master adapters, Atmel high-end LCD controllers, Nuvoton NCT7802Y hardware monitoring chips, Richtek RT5033 regulators, and NVIDIA Tegra system memory management units.
Changes visible to kernel developers include:
- The Atmel AT91 subarchitecture has been completely converted over
to the device tree mechanism. To celebrate, the developers removed
all of the old board files for this family, reducing the size of the
kernel by 24,000 lines of code.
- Work on the year 2038 problem continues. The internal functions
  do_settimeofday(), timekeeping_inject_sleeptime(), and mktime() now
  have 2038-safe replacements. In each case, the new version adds
  "64" to the name of the function and switches to the time64_t or
  timespec64 types for time representation; see the sketch after this
  list for an example. Now the process of deprecating the old versions
  and converting code can begin.
- Hierarchical interrupt domain support has been merged into the interrupt-handling core. This support is needed to properly represent complex hardware which has multiple interrupt controllers connected in complicated ways. See the new section added to Documentation/IRQ-domain.txt for a bit more information.
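To make the year-2038 conversion pattern concrete, here is a minimal sketch; the driver, record_boot_time(), and boot_stamp are hypothetical, invented purely for illustration:

#include <linux/time.h>

/* Hypothetical driver-private timestamp. As an unsigned long filled
 * by mktime(), it would overflow in January 2038 on 32-bit systems;
 * time64_t and mktime64() avoid that. */
static time64_t boot_stamp;

static void record_boot_time(void)
{
    /* mktime64() takes the same (year, month, day, hour, minute,
     * second) arguments as mktime(), but returns a 64-bit time64_t
     * rather than an unsigned long. */
    boot_stamp = mktime64(2014, 12, 10, 0, 0, 0);
}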
If the usual pattern holds, the merge window will remain open until December 21, when the first 3.19 prepatch will be released. As usual, LWN will track the changes merged during this merge window in subsequent articles; stay tuned.
Attaching eBPF programs to sockets
Recent kernel development cycles have seen the addition of the extended Berkeley Packet Filter (eBPF) subsystem to the kernel. As of 3.18, though, a user-space program can load an eBPF program but cannot cause it to run in any useful context; programs can be loaded and verified, but then they just sit there. Needless to say, eBPF developer Alexei Starovoitov envisions a more extensive role for this subsystem. The 3.19 kernel should include a new set of patches that will, for the first time, demonstrate the sort of capabilities Alexei has in mind.
The main feature to be added in 3.19 is the ability to attach eBPF programs to sockets. The sequence of operations is to set up the eBPF program in memory, then use the new (as of 3.18) bpf() system call to load the program into the kernel and obtain a file-descriptor reference to it. Then, the program can be attached to a socket with the new SO_ATTACH_BPF option to setsockopt():
setsockopt(socket, SOL_SOCKET, SO_ATTACH_BPF, &fd, sizeof(fd));
Where socket represents the socket of interest, and fd holds the file descriptor for the loaded eBPF program.
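For a view of the whole sequence, here is a hedged user-space sketch; it assumes kernel headers new enough to provide __NR_bpf, BPF_PROG_TYPE_SOCKET_FILTER, and SO_ATTACH_BPF and, since there is no C-library wrapper for bpf() yet, it invokes the system call directly:

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Load an eBPF program (an array of struct bpf_insn) into the kernel
 * and return a file descriptor referring to it. */
static int load_bpf_prog(const struct bpf_insn *insns, unsigned int insn_cnt)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (__u64) (unsigned long) insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (__u64) (unsigned long) "GPL";
    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

/* Attach the loaded program to an already-open socket. */
static int attach_bpf_prog(int sock, const struct bpf_insn *insns,
                           unsigned int insn_cnt)
{
    int fd = load_bpf_prog(insns, insn_cnt);

    if (fd < 0)
        return -1;
    return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &fd, sizeof(fd));
}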
Once the program is loaded, it will be run on every packet that shows up on the given socket. At the moment, the available functionality is still limited in a couple of ways:
- eBPF programs have access to the data stored in the packet itself,
but not to any other information stored in the kernel's skb
data structure. Future plans call for making some of that metadata
available, but it's not yet clear which data will be reachable or how.
- Programs cannot do anything to influence the delivery or contents of the packet. So, while these programs are referred to as "filters," all they can do at the moment is store information in eBPF "maps" for consumption by user space.
The end result is that eBPF programs will be useful for statistics gathering and such in 3.19, but not a whole lot more.
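To show the user-space side of that arrangement, here is a hedged sketch of a map lookup through the bpf() system call; it assumes a file descriptor for a map with int keys and long values, like the one used by the protocol-counting examples described below:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Fetch one counter from an eBPF map; returns zero on success with
 * the result stored in *value. */
static int map_lookup(int map_fd, int key, long *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (__u64) (unsigned long) &key;
    attr.value = (__u64) (unsigned long) value;
    return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}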
Still, that is something to start with. The 3.19 kernel should include a number of examples in the samples directory to show how this functionality can be used. Two of them are versions of a simple program that obtains the low-level protocol (UDP, TCP, ICMP, ... ) from each packet and maintains a count for each protocol in an eBPF map. If one wants to write such a program directly in the eBPF virtual machine language, one ends up with something like this:
struct bpf_insn prog[] = {
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
    BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
    BPF_LD_MAP_FD(BPF_REG_1, map_fd),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
    BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
    BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
    BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
    BPF_EXIT_INSN(),
};
Needless to say, such programs are, for most of us, not particularly enlightening to read. But, as is shown in this example, the program can also be written in a restricted version of the C language:
int bpf_prog1(struct sk_buff *skb)
{
    /* The protocol byte: 14-byte Ethernet header plus offset 9 into
     * the IP header. */
    int index = load_byte(skb, 14 + 9);
    long *value;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, 1);
    return 0;
}
This program can be fed to a special version of the LLVM compiler, producing an object file for the eBPF virtual machine. For now, one must use Alexei's version of LLVM, but he says that he's working on getting the changes upstreamed into the LLVM mainline. A user-space utility can read the program from the object file and load it into the kernel in the usual way; there is no need to deal directly with the eBPF language.
The value of being able to work in a higher-level language becomes clear when one looks at the final example, which compiles to a 300-instruction eBPF program. This one does flow tracking, counting packets by IP address. The program itself may be of limited use, but it shows that some fairly complex things can be done with the eBPF virtual machine in the kernel.
Future plans call for using eBPF in a number of other places, including the secure computing ("seccomp") subsystem and for filtering tracepoint hits. Given that eBPF is becoming a general-purpose facility in the kernel, it seems likely that developers will come up with other places where it can be of use. Expect to see some interesting things happen with eBPF in the coming years.
The iov_iter interface
One of the most common tasks in the kernel is processing a buffer of data supplied by user space, possibly in several chunks. Perhaps unsurprisingly, this is a task that kernel code often gets wrong, leading to bugs and, possibly, security problems. The kernel contains a primitive (called "iov_iter") meant to make this task simpler. While iov_iter use is currently mostly confined to the memory-management and filesystem layers, it is slowly spreading out into other parts of the kernel. This interface is currently undocumented, a situation this article will attempt to remedy.
The iov_iter concept is not new; it was first added by Nick Piggin for the 2.6.24 kernel in 2007. But there has been an effort over the last year to expand this API and use it in more parts of the kernel; the 3.19 merge window should see it take its first steps into the networking subsystem, for example.
An iov_iter structure is essentially an iterator for working through an iovec structure, defined in <uapi/linux/uio.h> as:
struct iovec
{
    void __user *iov_base;
    __kernel_size_t iov_len;
};
This structure matches the user-space iovec structure defined by POSIX and used with system calls like readv(). As the "vec" portion of the name would suggest, iovec structures tend to come in arrays; as a whole, an iovec describes a buffer that may be scattered in both physical and virtual memory.
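As a reminder of how these structures are used from user space, a two-segment readv() call might look like this hypothetical fragment; fd is assumed to be an open file descriptor:

#include <unistd.h>
#include <sys/uio.h>

/* Scatter a single read across two buffers: a header first, then the
 * body. The kernel fills iov[0] before moving on to iov[1]. */
static ssize_t read_header_and_body(int fd, char *hdr, size_t hdr_len,
                                    char *body, size_t body_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    return readv(fd, iov, 2);
}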
The actual iov_iter structure is defined in <linux/uio.h>:
struct iov_iter {
    int type;
    size_t iov_offset;
    size_t count;
    const struct iovec *iov;    /* SIMPLIFIED - see below */
    unsigned long nr_segs;
};
The type field describes the type of the iterator. It is a bitmask containing, among other things, either READ or WRITE depending on whether data is being read into the iterator or written from it. The data direction, thus, refers not to the iterator itself, but to the other part of the data transaction; an iov_iter created with a type of READ will be written to.
Beyond that, iov_offset contains the offset to the first byte of interesting data in the first iovec pointed to by iov. The total amount of data pointed to by the iovec array is stored in count, while the number of iovec structures is stored in nr_segs. Note that most of these fields will change as code "iterates" through the buffer. They describe a cursor into the buffer, rather than the buffer as a whole.
Working with struct iov_iter
Before use, an iov_iter must be initialized to contain an (already populated) iovec with:
void iov_iter_init(struct iov_iter *i, int direction,
const struct iovec *iov, unsigned long nr_segs,
size_t count);
Then, for example, data can be moved between the iterator and user space with either of:
size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
The naming here can be a little confusing until one gets the hang of it. A call to copy_to_iter() will copy bytes of data from the buffer at addr to the user-space buffer indicated by the iterator. So copy_to_iter() can be thought of as being like a variant of copy_to_user() that takes an iterator rather than a single buffer. Similarly, copy_from_iter() will copy the data from the user-space buffer to addr. Unlike copy_to_user(), though, these functions return the number of bytes successfully copied, which may be less than bytes if, for example, a fault is encountered partway through.
Note that these calls will "advance" the iterator through the buffer to correspond to the amount of data transferred. In other words, the iov_offset, count, nr_segs, and iov fields of the iterator will all be changed as needed. So two calls to copy_from_iter() will copy two successive areas from user space. Among other things, this means that the code owning the iterator must remember the base address for the iovec array, since the iov value in the iov_iter structure may change.
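Putting those pieces together, here is a minimal sketch of how a hypothetical driver write path might consume a user-supplied iovec array; dev_write(), dev_buf, and DEV_BUF_SIZE are invented for illustration:

#include <linux/kernel.h>
#include <linux/uio.h>

#define DEV_BUF_SIZE 4096
static char dev_buf[DEV_BUF_SIZE];    /* hypothetical device buffer */

static ssize_t dev_write(const struct iovec *iov, unsigned long nr_segs)
{
    struct iov_iter iter;
    size_t count = iov_length(iov, nr_segs);

    /* WRITE: the user-space buffer is the source of the data, as it
     * would be in a write() call. */
    iov_iter_init(&iter, WRITE, iov, nr_segs, count);

    /* copy_from_iter() advances the iterator as it goes and returns
     * the number of bytes actually copied. */
    return copy_from_iter(dev_buf, min_t(size_t, count, DEV_BUF_SIZE),
                          &iter);
}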
Various other functions exist. To move data referenced by a page structure into or out of an iterator, use:
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
Only the single page provided will be copied to or from, so these functions should not be asked to copy data that would cross the page boundary.
Code running in atomic context can attempt to obtain data from user space with:
size_t iov_iter_copy_from_user_atomic(struct page *page, struct iov_iter *i,
unsigned long offset, size_t bytes);
Since this copy will be done in atomic mode, it will only succeed if the data is already resident in RAM; callers must thus be prepared for a higher-than-normal chance of failure.
If it is necessary to map the user-space buffer into the kernel, one of these calls can be used:
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
Either function turns into a call to get_user_pages_fast(), causing (hopefully) the pages to be brought in and their locations stored in the pages array. The difference between them is that iov_iter_get_pages() expects the pages array to be allocated by the caller, while iov_iter_get_pages_alloc() will do the allocation itself. In that case, the array returned in pages must eventually be freed with kvfree(), since it might have been allocated with either kmalloc() or vmalloc().
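As a brief illustration of the allocating variant, a sequence like the following sketch might be used; it is hypothetical, assumes the iterator was initialized elsewhere, and trims error handling:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/uio.h>

static void with_pinned_pages(struct iov_iter *iter, size_t size)
{
    struct page **pages;
    size_t start;    /* offset of the data within the first page */
    ssize_t bytes;
    int i, npages;

    /* Fault in and pin the user pages; "pages" is allocated here. */
    bytes = iov_iter_get_pages_alloc(iter, &pages, size, &start);
    if (bytes <= 0)
        return;

    npages = DIV_ROUND_UP(start + bytes, PAGE_SIZE);
    /* ... operate on pages[0..npages-1] here ... */

    for (i = 0; i < npages; i++)
        put_page(pages[i]);    /* drop the references taken above */
    kvfree(pages);             /* kmalloc() or vmalloc() memory */
}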
Advancing through the iterator without moving any data can be done with:
void iov_iter_advance(struct iov_iter *i, size_t size);
The buffer referred to by an iterator (or a portion thereof) can be cleared with:
size_t iov_iter_zero(size_t bytes, struct iov_iter *i);
Information about the iterator is available from a number of helper functions:
size_t iov_iter_single_seg_count(const struct iov_iter *i);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
size_t iov_length(const struct iovec *iov, unsigned long nr_segs);
A call to iov_iter_single_seg_count() returns the length of the data in the first segment of the buffer. iov_iter_npages() reports the number of pages occupied by the buffer in the iterator, while iov_length() returns the total data length. The latter function must be used with care, since it trusts the len field in the iovec structures. If that data comes from user space, it could cause integer overflows in the kernel.
Not just iovecs
The definition of struct iov_iter shown above does not quite match what is actually found in the kernel. Instead of a single field for the iov array, the real structure has (in 3.18):
union {
    const struct iovec *iov;
    const struct bio_vec *bvec;
};
In other words, the iov_iter structure is also set up to work with the BIO structures used by the block layer. Such iterators are marked by having ITER_BVEC included in the type field bitmask. Once such an iterator has been created, all of the above calls will work with it as if it were an "ordinary" iterator using iovec structures. Currently, the use of BIO-based iterators in the kernel is minimal; they can only be found in the swap and splice() code.
Coming in 3.19
The 3.19 kernel is likely to see a substantial rewrite of the iov_iter code aimed at reducing the vast amount of boilerplate code needed to implement all of the above-mentioned functions. The code is indeed shorter afterward, but at the cost of introducing a fair amount of mildly frightening preprocessor macro magic to generate the needed boilerplate on demand.
The iov_iter code already works if the "user-space" buffer is actually located in kernel space. In 3.19, things will be formalized and optimized a bit. Such an iterator will be created with:
void iov_iter_kvec(struct iov_iter *i, int direction,
const struct kvec *iov, unsigned long nr_segs,
size_t count);
There will also be a new kvec field added to the union shown above for this case.
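Judging from the in-tree callers, creating such an iterator might look like this sketch; make_kernel_iter() is hypothetical, and kbuf and len are assumed to come from the caller:

#include <linux/uio.h>

static void make_kernel_iter(struct iov_iter *iter, struct kvec *kv,
                             void *kbuf, size_t len)
{
    kv->iov_base = kbuf;
    kv->iov_len = len;
    /* ITER_KVEC marks the memory as kernel-space; READ means this
     * buffer is the destination of the transfer, as it would be in a
     * read() call. */
    iov_iter_kvec(iter, ITER_KVEC | READ, kv, 1, len);
}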
Finally, some functions have been added to help with the networking case; it will be possible, for example, to copy a buffer and generate a checksum in the process.
The end result is that the iov_iter interface is slowly becoming the standard way of hiding many of the complexities associated with handling user-space buffers. We can expect to see its use encouraged in more places in the future. It only took seven years or so, but iov_iter appears to be reaching a point of being an interface that most kernel developers will want to be aware of.
