Kernel development
Brief items
Kernel release status
The current development kernel is 3.11-rc4, released on August 4. "I had hoped things would start calming down, but rc4 is pretty much exactly the same size as rc3 was. That said, the patches seem a bit more spread out, and less interesting - which is a good thing. Boring is good." All told, 339 non-merge changesets were pulled into the mainline for -rc4. They are mostly fixes, but there is also a mysterious set of ARM security fixes (starting here) that showed up without prior discussion.
Stable updates: 3.10.5, 3.4.56, 3.2.50, and 3.0.89 were all released on August 4.
Also worth noting: Greg Kroah-Hartman has announced that 3.10 will be the next long-term supported kernel. "I'm picking this kernel after spending a lot of time talking about kernel releases, and product releases and development schedules from a large range of companies and development groups. I couldn't please everyone, but I think that the 3.10 kernel fits the largest common set of groups that rely on the longterm kernel releases."
flink() at last?
There has long been a desire for an flink() system call in the kernel. It would take a file descriptor and a file name as arguments and cause the name to be a new hard link to the file behind the descriptor. There have been concerns about security, though, that have kept this call out of the kernel; some see it as a way for a process to make a file name for a file descriptor that came from outside — via exec(), for example. That process may not have had a reachable path to the affected file before, so the creation of a new name could be seen as bypassing an existing security policy.

The problem with this reasoning, as noted by Andy Lutomirski in a patch merged for 3.11-rc5, is that this functionality is already available by way of the linkat() system call. All it takes is having the /proc filesystem mounted — and a system without /proc is quite rare. But the incantation needed to make a link in this way is a bit arduous:
linkat(AT_FDCWD, "/proc/self/fd/N", destdirfd, newname, AT_SYMLINK_FOLLOW);
Where "N" is the number of the relevant file descriptor. It would be a lot nicer, he said, to just allow the use of the AT_EMPTY_PATH option, which causes the link to be made to the file behind the original file descriptor:
linkat(fd, "", destdirfd, newname, AT_EMPTY_PATH);
In current kernels, though, that option is restricted to processes with the CAP_DAC_READ_SEARCH capability out of the same security concerns as described above. But, as Andy pointed out, the restriction makes no sense given that the desired functionality is available anyway. So his patch removes the check, making the second variant available to all users. This functionality is expected to be useful with files opened with the O_TMPFILE option, but other uses can be imagined as well. It will be generally available in the 3.11 kernel.
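For reference, here is a minimal sketch wrapping both variants as flink()-style helpers; the function names are made up for illustration, the descriptor must refer to a file on the same filesystem as the destination directory, and the AT_EMPTY_PATH variant only works for unprivileged processes once Andy's patch is in place.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* The /proc-based incantation, which works on current kernels as long
       as /proc is mounted. */
    static int flink_via_proc(int fd, int destdirfd, const char *newname)
    {
        char path[64];

        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        return linkat(AT_FDCWD, path, destdirfd, newname, AT_SYMLINK_FOLLOW);
    }

    /* The simpler variant that Andy's patch makes available to all users. */
    static int flink_via_empty_path(int fd, int destdirfd, const char *newname)
    {
        return linkat(fd, "", destdirfd, newname, AT_EMPTY_PATH);
    }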
Kernel development news
A survey of memory management patches
Traffic on the kernel mailing lists often seems to follow a particular theme. At the moment, one of those themes is memory management. What follows is an overview of these patches, hopefully giving an idea of what the memory management developers are up to.
MADV_WILLWRITE
Normally, developers expect that a write to file-backed memory will execute quickly. That data must eventually find its way back to persistent storage, but the kernel usually handles that in the background while the application continues running. Andy Lutomirski has discovered that things don't always work that way, though. In particular, if the memory is backed by a file that has never been written (even if it has been extended to the requisite size with fallocate()), the first write to each page of that memory can be quite slow, due to the filesystem's need to allocate on-disk blocks, mark the block as being initialized, and otherwise get ready to accept the data. If (as is the case with Andy's application) there is a need to write multiple gigabytes of data, the slowdown can be considerable.
One way to work around this problem is to write throwaway data to that memory before getting into the time-sensitive part of the application, essentially forcing the kernel to prepare the backing store. That approach works, but at the cost of writing large amounts of useless data to disk; it might be nice to have something a bit more elegant than that.
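A minimal sketch of that workaround, assuming buf and len describe a shared, file-backed mapping and page_size comes from sysconf(_SC_PAGESIZE), simply touches every page once before the timing-sensitive phase begins:

    #include <stddef.h>

    /* Force block allocation by writing throwaway data to every page; the
       stores are wasted I/O, but subsequent writes will be fast. */
    static void prefault_pages(char *buf, size_t len, long page_size)
    {
        for (size_t off = 0; off < len; off += (size_t)page_size)
            buf[off] = 0;
    }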
Andy's answer is to add a new operation, MADV_WILLWRITE, to the madvise() system call. Within the kernel, that call is passed to a new vm_operations_struct operation:
long (*willwrite)(struct vm_area_struct *vma, unsigned long start, unsigned long end);
In the current implementation, only the ext4 filesystem provides support for this operation; it responds by reserving blocks so that the upcoming write can complete quickly. Andy notes that there is a lot more that could be done to fully prepare for an upcoming write, including performing the copy-on-write needed for private mappings, actually allocating pages of memory, and so on. For the time being, though, the patch is intended as a proof of concept and a request for comments.
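From user space, the intended usage would presumably look something like the sketch below; MADV_WILLWRITE is not in any released kernel header, so the constant defined here is purely illustrative, and error checking is omitted for brevity.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MADV_WILLWRITE
    #define MADV_WILLWRITE 19    /* hypothetical value, for illustration only */
    #endif

    /* Map a freshly extended file and ask the kernel to prepare the backing
       store before the application starts writing in earnest. */
    void *prepare_buffer(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        void *buf;

        fallocate(fd, 0, 0, len);    /* extend the file to the needed size */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        madvise(buf, len, MADV_WILLWRITE);    /* reserve on-disk blocks now */
        return buf;
    }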
Controlling transparent huge pages
The transparent huge pages feature uses huge pages whenever possible, and without user-space awareness, in order to improve memory access performance. Most of the time the result is faster execution, but there are some workloads that can perform worse when transparent huge pages are enabled. The feature can be turned off globally, but what about situations where some applications benefit while others do not?
Alex Thorlton's answer is to provide an option to disable transparent huge pages on a per-process basis. It takes the form of a new operation (PR_SET_THP_DISABLED) to the prctl() system call. This operation sets a flag in the task_struct structure; setting that flag causes the memory management system to avoid using huge pages for the associated process. And that allows the creation of mixed workloads, where some processes use transparent huge pages and others do not.
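Under this proposal, turning the feature off for the current process would presumably look like the sketch below; PR_SET_THP_DISABLED is not defined in any released kernel header, so the fallback value here is purely illustrative.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLED
    #define PR_SET_THP_DISABLED 41    /* hypothetical value, for illustration only */
    #endif

    int main(void)
    {
        /* Disable transparent huge pages for this process, leaving the
           rest of the system alone. */
        if (prctl(PR_SET_THP_DISABLED, 1, 0, 0, 0) != 0)
            perror("prctl(PR_SET_THP_DISABLED)");
        /* ... run the huge-page-averse workload ... */
        return 0;
    }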
Transparent huge page cache
Since their inception, transparent huge pages have only worked with anonymous memory; there is no support for file-backed (page cache) pages. For some time now, Kirill A. Shutemov has been working on a transparent huge page cache implementation to fix that problem. The latest version, a 23-patch set, shows how complex the problem is.
In this version, Kirill's patch has a number of limitations. Unlike the anonymous page implementation, the transparent huge page cache code is unable to create huge pages by coalescing small pages. It also, crucially, is unable to create huge pages in response to page faults, so it does not currently work well with files mapped into a process's address space; that problem is slated to be fixed in a future patch set. The current implementation only works with the ramfs filesystem — not, perhaps, the filesystem that users were clamoring for most loudly. But the ramfs implementation is a good proof of concept; it also shows that, with the appropriate infrastructure in place, the amount of filesystem-specific code needed to support huge pages in the page cache is relatively small.
One thing that is still missing is a good set of benchmark results showing that the transparent huge page cache speeds things up. Since this is primarily a performance-oriented patch set, such results are important. The mmap() implementation is also important, but the patch set is already a large chunk of code in its current form.
Reliable out-of-memory handling
As was described in this June 2013 article, the kernel's out-of-memory (OOM) killer has some inherent reliability problems. A process may have called deeply into the kernel by the time it encounters an OOM condition; when that happens, it is put on hold while the kernel tries to make some memory available. That process may be holding no end of locks, possibly including locks needed to enable a process hit by the OOM killer to exit and release its memory; that means that deadlocks are relatively likely once the system goes into an OOM state.
Johannes Weiner has posted a set of patches aimed at improving this situation. Following a bunch of cleanup work, these patches make two fundamental changes to how OOM conditions are handled in the kernel. The first of those is perhaps the most visible: it causes the kernel to avoid calling the OOM killer altogether for most memory allocation failures. In particular, if the allocation is being made in response to a system call, the kernel will just cause the system call to fail with an ENOMEM error rather than trying to find a process to kill. That may cause system call failures to happen more often and in different contexts than they used to. But, naturally, that will not be a problem since all user-space code diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong.
The other change happens more deeply within the kernel. When a process incurs a page fault, the kernel really only has two choices: it must either provide a valid page at the faulting address or kill the process in question. So the OOM killer will still be invoked in response to memory shortages encountered when trying to handle a page fault. But the code has been reworked somewhat; rather than wait for the OOM killer deep within the page fault handling code, the kernel drops back out and releases all locks first. Once the OOM killer has done its thing, the page fault is restarted from the beginning. This approach should ensure reliable page fault handling while avoiding the locking problems that plague the OOM killer now.
Logging drop_caches
Writing to the magic sysctl file /proc/sys/vm/drop_caches will cause the kernel to forget about all clean objects in the page, dentry, and inode caches. That is not normally something one would want to do; those caches are maintained to improve the performance of the system. But clearing the caches can be useful for memory management testing and for the production of reproducible filesystem benchmarks. Thus, drop_caches exists primarily as a debugging and testing tool.
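For those who have not run across it, typical testing use looks like the sketch below: dirty data is written back first (only clean objects can be dropped), then a value is written to the file — 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.

    #include <fcntl.h>
    #include <unistd.h>

    /* Rough equivalent of "sync; echo 3 > /proc/sys/vm/drop_caches". */
    int drop_all_caches(void)
    {
        int fd;

        sync();    /* write dirty data back so more objects are clean */
        fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd < 0)
            return -1;
        if (write(fd, "3", 1) != 1) {
            close(fd);
            return -1;
        }
        return close(fd);
    }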
It seems, though, that some system administrators have put writes to drop_caches into various scripts over the years in the belief that it somehow helps performance. Instead, they often end up creating performance problems that would not otherwise be there. Michal Hocko, it seems, has gotten a little tired of tracking down this kind of problem, so he has revived an old patch from Dave Hansen that causes a message to be logged whenever drop_caches is used. He said:
As always, the simplest patches cause the most discussion. In this case, a number of developers expressed concern that administrators would not welcome the additional log noise, especially if they are using drop_caches frequently. But Dave expressed a hope that at least some of the affected users would get in contact with the kernel developers and explain why they feel the need to use drop_caches frequently. If it is being used to paper over memory management bugs, the thinking goes, it would be better to fix those bugs directly.
In the end, if this patch is merged, it is likely to include an option (the value written to drop_caches is already a bitmask) to suppress the log message. That led to another discussion on exactly which bit should be used, or whether the drop_caches interface should be augmented to understand keywords instead. As of this writing, the simple printk() statement still has not been added; perhaps more discussion is required.
Unreviewed code in 3.11
Kernel development, like development in most free software projects, is built around the concept of peer review. All patches should be reviewed by at least one other developer; that, it is hoped, will catch bugs before they are merged and lead to a higher-quality end result. While a lot of code review does take place in the kernel project, it is also clearly the case that a certain amount of code goes in without ever having been looked at by anybody other than the original developer. A couple of recent episodes bear a closer look; they show why the community values code review, and what can go wrong when it is skipped.
O_TMPFILE
The O_TMPFILE option to the open() system call was pulled into the mainline during the 3.11 merge window; prior to that pull, it had not been posted in any public location. There is no doubt that it provides a useful feature; it allows an application to open a file in a given filesystem with no visible name. In one stroke, it does away with a whole range of temporary file vulnerabilities, most of which are based on guessing which name will be used. O_TMPFILE can also be used with the linkat() system call to create a file and make it visible in the filesystem, with the right permissions, in a single atomic step. There can be no doubt that application developers will want to make good use of this functionality once it becomes widely available.
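On a filesystem that supports the flag, the expected pattern looks something like this sketch; the directory and final pathname are made up for illustration, and the caller is responsible for closing the descriptor.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Create an invisible file in "dir", fill it in, then give it a name
       in a single atomic step once the contents are complete. */
    int write_file_safely(const char *dir, const char *final_path)
    {
        char proc_path[64];
        int fd = open(dir, O_TMPFILE | O_RDWR, 0600);

        if (fd < 0)
            return -1;    /* kernel or filesystem lacks O_TMPFILE support */
        /* ... write the file's contents through fd ... */

        snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
        return linkat(AT_FDCWD, proc_path, AT_FDCWD, final_path,
                      AT_SYMLINK_FOLLOW);
    }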
That said, O_TMPFILE has been going through a bit of a rough start. It did not take long for Linus to express concerns about the new API; in short, there was no way for applications to determine that they were running on a system where O_TMPFILE was not supported. A couple of patches later, those issues had been addressed. Since then, a couple of bugs have been found in the implementation; one, fixed by Zheng Liu, would oops the kernel. Another, reported by Andy Lutomirski, corrupts the underlying filesystem through the creation of a bogus inode. Finally, few filesystems actually support this new option at this point, so it is not something that developers can count on having available, even on Linux systems.
Meanwhile, Christoph Hellwig has questioned the API chosen for this feature:
Christoph suggests that it would have been better to create a new tmpfile() system call rather than adding this feature to open(). In the end, he has said, O_TMPFILE needs some more time:
Neither Al Viro (the author of this feature) nor Linus has responded to Christoph's suggestions, leading one to believe that the current plan is to go ahead with the current implementation. Once the O_TMPFILE ABI is exposed in the 3.11 release, it will need to be supported indefinitely. It certainly is supportable in its current form, but it may well have come out better with a bit more discussion prior to merging.
Secret security fixes
Russell King's pre-3.11-rc4 pull request does not appear to have been sent to any public list. Based on the merge commit in the mainline, what Russell said about this request was:
Evidently, the fact that eight out of the 22 commits in that request were security fixes does not constitute a "real overall theme." The patches seem like worthwhile hardening for the ARM architecture, evidently written in response to disclosures made at the recently concluded Black Hat USA 2013 event. While most of the patches carry an Acked-by from Nicolas Pitre, none of them saw any kind of public review before heading into the mainline.
It was not long before Olof Johansson encountered a number of problems with the changes, leading to several systems that were unable to boot. LWN reader kalvdans pointed out a different obvious bug in the code. Olof suggested that, perhaps, the patches might have benefited from some time in the linux-next repository, but Russell responded:
In this case, it is far from clear that much was gained by taking these patches out of the normal review process. The list of distributors rushing to deploy these fixes to users prior to their public disclosure is likely to be quite short, and, in any case, the cure, as merged for 3.11-rc4, was worse than the disease. As of this writing, neither bug has been fixed in the mainline, though patches exist for both.
That said, one can certainly imagine scenarios where it might make sense to develop and merge a fix outside of public view. If a security vulnerability is known to be widely exploitable, one wants to get the fix as widely distributed as possible before the attackers are able to develop their exploits. In many cases, though, the vulnerabilities are not readily exploitable, or, as is the case for the bulk of deployed ARM systems, there is no way to quickly distribute an update in any case. In numerous other cases, the vulnerability in question has been known to the attacker community for a long time before it comes to the attention of a kernel developer.
For all of those cases, chances are high that the practice of developing fixes in secret does more harm than good. As has been seen here, such fixes can introduce bugs of their own; sometimes, those new bugs can be new security problems as well. In other situations, as in the O_TMPFILE case, unreviewed code also runs the risk of introducing suboptimal APIs that must then be maintained for many years. The code review practices we have developed over the years exist for a reason; bypassing those practices introduces a whole new set of risks to the kernel development process. The 3.11 development cycle has demonstrated just how real those risks can be.