Brief items
The current development kernel is 3.11-rc4, released on August 4.
"I had hoped things would start calming down, but rc4 is pretty much
exactly the same size as rc3 was. That said, the patches seem a bit
more spread out, and less interesting - which is a good thing. Boring
is good."
All told, 339 non-merge changesets were pulled into the mainline for -rc4.
They are mostly fixes, but there is also a mysterious set of ARM security
fixes (starting
here)
that showed up without prior discussion.
Stable updates:
3.10.5,
3.4.56,
3.2.50, and
3.0.89 were all released on August 4.
Also worth noting: Greg Kroah-Hartman has announced
that 3.10 will be the next long-term supported kernel. "I’m picking
this kernel after spending a lot of time talking about kernel releases, and
product releases and development schedules from a large range of companies
and development groups. I couldn’t please everyone, but I think that the
3.10 kernel fits the largest common set of groups that rely on the longterm
kernel releases."
Well, lguest is particularly expendable. It's the red shirt of the
virtualization away team.
— Rusty Russell
Don't be afraid of writing too much text - trust me, I've never
seen a changelog which was too long!
— Andrew Morton
By Jonathan Corbet
August 7, 2013
There has long been a desire for a
flink() system call in the
kernel. It would take a file descriptor and a file name as arguments
and cause the name to be a new hard link to the file behind the
descriptor. There have been concerns about security, though, that have
kept this call out of the kernel; some see it as a way for a process to
make a file name for a file descriptor that came from outside — via
exec(), for example. That process may not
have had a reachable path to the affected file before, so the creation of a
new name could be seen as bypassing an existing security policy.
The problem with this reasoning, as noted by Andy Lutomirski in a
patch merged for 3.11-rc5, is that this functionality is already
available by way of the linkat() system call. All it takes is
having the /proc filesystem mounted — and a system without
/proc is quite rare. But the incantation needed to make a link in
this way is a bit arduous:
    linkat(AT_FDCWD, "/proc/self/fd/N", destdirfd, newname, AT_SYMLINK_FOLLOW);
where "N" is the number of the relevant file descriptor.
It would be a lot nicer, he said, to just allow the use of the
AT_EMPTY_PATH option, which causes the link to be made to the file
behind the original file descriptor:
    linkat(fd, "", destdirfd, newname, AT_EMPTY_PATH);
In current kernels, though, that option is restricted to processes with the
CAP_DAC_READ_SEARCH capability out of the same security concerns
as described above. But, as Andy pointed out, the restriction makes no
sense given that the desired functionality is available anyway. So his
patch removes the check, making the second variant available to all users.
This functionality is expected to be useful with files opened with the
O_TMPFILE option, but other uses can be imagined as well. It will
be generally available in the 3.11 kernel.
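The /proc-based incantation is easily wrapped in a small helper. A minimal sketch in C; the name flink_compat() is my own invention, not a kernel or libc API, and the technique works on any kernel with /proc mounted, no special capability required:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* flink_compat() is a hypothetical helper name.  It gives the file
 * behind fd a new name under destdirfd using the /proc/self/fd
 * incantation described above, which works even on kernels that
 * restrict AT_EMPTY_PATH to CAP_DAC_READ_SEARCH. */
int flink_compat(int fd, int destdirfd, const char *newname)
{
    char path[40];

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    return linkat(AT_FDCWD, path, destdirfd, newname, AT_SYMLINK_FOLLOW);
}
```

On a 3.11 kernel with Andy's patch, the linkat() call could instead pass fd directly with an empty path and AT_EMPTY_PATH, as shown above.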
Kernel development news
By Jonathan Corbet
August 6, 2013
Traffic on the kernel mailing lists often seems to follow a particular
theme. At the moment, one of those themes is memory management. What
follows is an overview of these patches,
hopefully giving an idea of what the memory management developers are up
to.
MADV_WILLWRITE
Normally, developers expect that a write to file-backed memory will execute
quickly. That data must eventually find its way back to persistent
storage, but the kernel usually handles that in the background while the
application continues running. Andy Lutomirski has discovered that things
don't always work that way, though. In particular, if the memory is backed
by a file that has never been written (even if it has been extended to the
requisite size with fallocate()), the first write to each page of that
memory can be quite slow, due to the filesystem's need to allocate on-disk
blocks, mark the block as being initialized, and otherwise get ready to
accept the data. If (as is the case with
Andy's application) there is a need to write multiple gigabytes of data,
the slowdown can be considerable.
One way to work around this problem is to write throwaway data to that memory
before getting into the time-sensitive part of the application, essentially
forcing the kernel to prepare the backing store. That approach works, but
at the cost of writing large amounts of useless data to disk; it might be
nice to have something a bit more elegant than that.
Andy's answer is to add a new operation,
MADV_WILLWRITE, to the madvise() system call. Within the
kernel, that call is passed to a new vm_operations_struct
operation:
    long (*willwrite)(struct vm_area_struct *vma, unsigned long start,
                      unsigned long end);
In the current implementation, only the ext4 filesystem provides support
for this operation; it responds by reserving blocks so that the upcoming
write can complete quickly. Andy notes that there is a lot more that could
be done
to fully prepare for an upcoming write, including performing the
copy-on-write needed for private mappings, actually allocating pages of
memory, and so on. For the time being, though, the patch is intended as a
proof of concept and a request for comments.
Controlling transparent huge pages
The transparent huge pages feature uses
huge pages whenever possible, and without user-space awareness, in order to
improve memory access performance. Most of the time the result is faster
execution, but there are some workloads that can perform worse when
transparent huge pages are enabled. The feature can be turned off
globally, but what about situations where some applications benefit while
others do not?
Alex Thorlton's answer is to provide an
option to disable transparent huge pages on a per-process basis. It takes
the form of a new operation (PR_SET_THP_DISABLED) to the
prctl() system call. This operation sets a flag in the
task_struct structure; setting that flag causes the memory
management system to avoid using huge pages for the associated process.
And that allows the creation of mixed workloads, where some processes use
transparent huge pages and others do not.
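A process that knows it performs badly with huge pages might use the operation like this. A sketch only: the constant comes from Alex Thorlton's patch, and the numeric value used here is an assumption for illustration, not part of the 3.11 ABI:

```c
#include <errno.h>
#include <sys/prctl.h>

/* The operation from the proposed patch; the value 41 is an
 * assumption for illustration, not a 3.11 ABI number. */
#ifndef PR_SET_THP_DISABLED
#define PR_SET_THP_DISABLED 41
#endif

/* Ask the kernel to stop backing this process's memory with
 * transparent huge pages.  Returns 0 on success; on a kernel that
 * lacks the feature, prctl() fails with errno set to EINVAL. */
int thp_disable(void)
{
    return prctl(PR_SET_THP_DISABLED, 1, 0, 0, 0);
}
```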
Transparent huge page cache
Since their inception, transparent huge pages have only worked with
anonymous memory; there is no support for file-backed (page cache) pages.
For some time now, Kirill A. Shutemov has been working on a transparent huge page cache implementation to
fix that problem. The latest version, a 23-patch set, shows how complex
the problem is.
In this version, Kirill's patch has a number of limitations. Unlike the
anonymous page implementation, the transparent huge page cache code is
unable to create huge pages by coalescing small pages. It also, crucially,
is unable to create huge pages in response to page faults, so it does not
currently work well with files mapped into a process's address space; that
problem is slated to be fixed in a future patch set. The current
implementation only works with the ramfs filesystem — not, perhaps, the
filesystem that users were clamoring for most loudly. But the ramfs implementation is a good proof of
concept; it also shows that, with the appropriate infrastructure in place,
the amount of filesystem-specific code needed to support huge pages in the
page cache is relatively small.
One thing that is still missing is a good set of benchmark results showing
that the transparent huge page cache speeds things up. Since this is
primarily a performance-oriented patch set, such results are important.
The mmap() implementation is also important, but the patch set is
already a large chunk of code in its current form.
Reliable out-of-memory handling
As was described in this June 2013 article,
the kernel's out-of-memory (OOM) killer has some inherent
reliability problems. A process may have called deeply into the kernel by
the time it
encounters an OOM condition; when that happens, it is put on hold while
the kernel tries to make some memory available. That process may be
holding no end of locks, possibly including locks needed to enable a
process hit by
the OOM killer to exit and release its memory; that means that deadlocks
are relatively likely once the system goes into an OOM state.
Johannes Weiner has posted a set of patches
aimed at improving this situation. Following a bunch of cleanup work,
these patches make two fundamental changes to how OOM conditions are
handled in the kernel. The first of those is perhaps the most visible: it
causes the kernel to avoid calling the OOM killer altogether for most
memory allocation failures. In particular, if the allocation is being made
in response to a system call, the kernel will just cause the system call to
fail with an ENOMEM error rather than trying to find a process to
kill. That may cause system call failures to happen more often and in
different contexts than they used to. But, naturally, that will not be a
problem since all user-space code diligently checks the return status of
every system call and responds with well-tested error-handling code when
things go wrong.
The other change happens more deeply within the kernel. When a process
incurs a page fault, the kernel really only has two choices: it must either
provide a valid page at the faulting address or kill the process in
question. So the OOM killer will still be invoked in response to memory
shortages encountered when trying to handle a page fault. But the code has
been reworked somewhat; rather than wait for the OOM killer deep within the
page fault handling code, the kernel drops back out and releases all locks
first. Once the OOM killer has done its thing, the page fault is restarted
from the beginning. This approach should ensure reliable page fault
handling while avoiding the locking problems that plague the OOM killer
now.
Logging drop_caches
Writing to the magic sysctl file /proc/sys/vm/drop_caches will
cause the kernel to forget about all clean objects in the page, dentry, and
inode caches. That is not normally something one would want to do; those
caches are maintained to improve the performance of the system. But
clearing the caches can be useful
for memory management testing and for the production of reproducible
filesystem benchmarks. Thus, drop_caches exists primarily as a
debugging and testing tool.
It seems, though, that some system administrators have put writes to
drop_caches into various scripts over the years in the belief that
it somehow helps performance. Instead, they often end up creating
performance problems that would not otherwise be there. Michal Hocko, it
seems, has gotten a little tired of tracking down this kind of problem, so
he has revived an old patch from Dave
Hansen that causes a message to be logged whenever drop_caches
is used. He said:
I am bringing the patch up again because this has proved being
really helpful when chasing strange performance issues which
(surprise surprise) turn out to be related to artificially dropped
caches done because the admin thinks this would help... So mostly
those who support machines which are not in their hands would
benefit from such a change.
As always, the simplest patches cause the most discussion. In this case, a
number of developers expressed concern that administrators would not
welcome the additional log noise, especially if they are using
drop_caches frequently. But Dave expressed a hope that at least some of the
affected users would get in contact with the kernel developers and explain
why they feel the need to use drop_caches frequently. If it is
being used to paper over memory management bugs, the thinking goes, it
would be better to fix those bugs directly.
In the end, if this patch is merged, it is likely to include an option (the
value written to drop_caches is already a bitmask) to suppress the
log message. That led to another discussion on exactly which bit should be
used, or whether the drop_caches interface should be augmented to
understand keywords instead. As of this writing, the simple
printk() statement still has not been added; perhaps more
discussion is required.
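For reference, the interface itself is simple: the value written is a small bitmask, where 1 drops the page cache, 2 drops dentries and inodes, and 3 drops both. A minimal C wrapper, sketched here for illustration (real scripts just use "echo 3 > /proc/sys/vm/drop_caches", and the write normally requires root):

```c
#include <stdio.h>

/* Write mask to /proc/sys/vm/drop_caches: 1 frees the page cache,
 * 2 frees dentries and inodes, 3 frees both.  Returns 0 on success,
 * -1 on a bad mask or when the write fails (e.g. without root). */
int drop_caches(int mask)
{
    FILE *f;
    int wrote;

    if (mask < 1 || mask > 3)
        return -1;
    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL)
        return -1;
    wrote = fprintf(f, "%d\n", mask) > 0;
    if (fclose(f) != 0 || !wrote)
        return -1;
    return 0;
}
```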
By Jonathan Corbet
August 7, 2013
Kernel development, like development in most free software projects, is
built around the concept of peer review. All patches should be reviewed by
at least one other developer; that, it is hoped, will catch bugs before
they are merged and lead to a higher-quality end result. While a lot of
code review does take place in the kernel project, it is also clearly the
case that a certain amount of code goes in without ever having been looked
at by anybody other than the original developer. A couple of recent
episodes bear a closer look; they show why the community values code review
and the hazards of skipping it.
O_TMPFILE
The O_TMPFILE option to the open() system call was pulled
into the mainline during the 3.11 merge window; prior to that pull, it had
not been posted in any public location. There is no doubt that it provides
a useful feature; it allows an application to open a file in a given
filesystem with no visible name. In one stroke, it does away with a whole
range of temporary file vulnerabilities, most of which are based on
guessing which name will be used. O_TMPFILE can also be used with
the linkat() system call to create a file and make it visible in
the filesystem, with the right permissions, in a single atomic step. There
can be no doubt that application developers will want to make good use of
this functionality once it becomes widely available.
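The create-then-link pattern looks like this in practice. A minimal sketch with light error handling; it assumes a 3.11+ kernel and a filesystem that supports O_TMPFILE, and it uses the /proc-based linkat() form, which does not require CAP_DAC_READ_SEARCH:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback definition (x86 value) for C libraries that predate
 * O_TMPFILE; an assumption for illustration only. */
#ifndef O_TMPFILE
#define O_TMPFILE 020200000
#endif

/* Create an invisible file in dir, fill it in, then give it a name
 * in one atomic step via linkat().  The file never exists under a
 * guessable temporary name, closing the usual /tmp races. */
int write_file_atomically(const char *dir, const char *name,
                          const void *data, size_t len)
{
    char proc[40], dest[4096];
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);

    if (fd < 0)
        return -1;                /* no O_TMPFILE support here */
    if (write(fd, data, len) != (ssize_t)len)
        goto fail;
    snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
    snprintf(dest, sizeof(dest), "%s/%s", dir, name);
    if (linkat(AT_FDCWD, proc, AT_FDCWD, dest, AT_SYMLINK_FOLLOW) != 0)
        goto fail;
    close(fd);
    return 0;
fail:
    close(fd);
    return -1;
}
```

Note the open-flag detail that tripped up the early API discussion: a kernel that does not know O_TMPFILE may simply ignore the unknown bits, so callers must verify the result rather than assume support.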
That said, O_TMPFILE has been going through a bit of a rough
start. It did not take long for Linus to express concerns about the new API; in short, there
was no way for applications to determine that they were running on a system
where O_TMPFILE was not supported. A couple of patches
later, those issues had been addressed. Since then, a couple of bugs have
been found in the implementation; one, fixed
by Zheng Liu, would oops the kernel. Another, reported by Andy Lutomirski, corrupts the
underlying filesystem through the creation
of a bogus inode. Finally, few filesystems actually support this
new option at this point, so it is not something that developers can count
on having available, even on Linux systems.
Meanwhile, Christoph Hellwig has questioned the
API chosen for this feature:
Why is the useful tmpfile functionality multiplexed over open when
it has very different semantics from a normal open?
In addition to the flag problems already discussed to death it also
just leads to splattering of the code in the implementation [...]
Christoph suggests that it would have been better to create a new
tmpfile() system call rather than adding this feature to
open(). In the end, he has said,
O_TMPFILE needs some more time:
Given all the problems and very limited fs support I'd much prefer
disabling O_TMPFILE for this release. That'd give it the needed
exposure it was missing by being merged without any previous public
review.
Neither Al Viro (the author of this feature) nor Linus has responded to
Christoph's suggestions, leading one to believe that the current plan is to
go ahead with the current implementation. Once the O_TMPFILE ABI
is exposed in the 3.11 release, it will need to be supported indefinitely.
It certainly is supportable in its current form, but it may well have come
out better with a bit more discussion prior to merging.
Secret security fixes
Russell King's pre-3.11-rc4 pull request does not appear to have been
sent to any public list. Based on the
merge commit in the mainline, what Russell said about this request was:
I've thought long and hard about what to say for this pull request,
and I really can't work out anything sane to say to summarise much
of these commits. The problem is, for most of these are, yet
again, lots of small bits scattered around the place without any
real overall theme to them.
Evidently, the fact that eight out of the 22 commits in that request were
security fixes does not constitute a "real overall theme." The patches
seem like worthwhile hardening for the ARM architecture, evidently written in response to disclosures
made at the recently concluded Black Hat USA 2013 event. While
most of the patches carry an Acked-by from Nicolas Pitre, none of them saw
any kind of public review before heading into the mainline.
It was not long before Olof Johansson encountered a number of problems with the
changes, leading to several systems that were unable to boot. LWN reader
kalvdans pointed out a different obvious bug
in the code. Olof
suggested that, perhaps, the patches might have benefited from some time in
the linux-next repository, but Russell responded:
Tell me how I can put this stuff into -next _and_ keep it secret
because it's security related. The two things are totally
incompatible with each other. Sorry.
In this case, it is far from clear that much was gained by taking these
patches out of the normal review process. The list of distributors rushing
to deploy these fixes to users prior to their public disclosure is likely
to be quite short, and, in any case, the cure, as was merged for 3.11-rc4,
was worse than the disease. As of this writing, neither bug has been fixed
in the mainline, though patches exist for both.
That said, one can certainly imagine scenarios where it might make sense to
develop and merge a fix outside of public view. If a security
vulnerability is known to be widely exploitable, one wants to get the fix
as widely distributed as possible before the attackers are able to develop
their exploits. In many cases, though, the vulnerabilities are not readily
exploitable, or, as is the case for the bulk of deployed ARM systems, there
is no way to quickly distribute an update in any case. In numerous other
cases, the vulnerability in question has been known to the attacker
community for a long time before it comes to the attention of a kernel
developer.
For all of those cases, chances are high that the practice of developing
fixes in secret does more harm than good. As has been seen here, such
fixes can introduce bugs of their own; sometimes, those new bugs can be new
security problems as well. In other situations, as in the
O_TMPFILE case, unreviewed code also runs the risk of introducing
suboptimal APIs that must then be maintained for many years. The code
review practices we have developed over the years exist for a reason;
bypassing those practices introduces a whole new set of risks to the kernel
development process. The 3.11 development cycle has demonstrated just how
real those risks can be.
Page editor: Jonathan Corbet