Kernel development
Brief items
Kernel release status
The current development kernel is 4.1-rc1, released on April 26. Linus said: "No earth-shattering new features come to mind, even if initial support for ACPI on arm64 looks funny. Depending on what you care about, your notion of 'big new feature' may differ from mine, of course. There's a lot of work all over, and some of it might just make a big difference to your use cases."
Stable updates: 4.0.1, 3.19.6, 3.14.40, and 3.10.76 were all released on April 29.
Quotes of the week
Garrett: Reducing power consumption on Haswell and Broadwell systems
Matthew Garrett looked into why Linux systems consume too much power on recent Intel chipsets and wrote up his results — a reduction of idle power use on his laptop from 8.5W to 5W. "This trend is likely to continue. As systems become more integrated we're going to have to pay more attention to the interdependencies in order to obtain the best possible power consumption, and that means that distribution vendors are going to have to spend some time figuring out what these dependencies are and what the appropriate default policy is for their users."
Kernel development news
The 4.1 merge window closes
By the time Linus released 4.1-rc1 and closed the merge window for this development cycle, some 10,659 non-merge changesets had been pulled into the mainline repository. That makes 4.1 a reasonably busy development cycle, but far from the busiest; 3.19 had 11,400 changes during the merge window, and 3.15, the record holder, had just over 12,000. Even if 4.1 is not a record breaker, though, anybody who worried that 4.0 signaled a general slowdown in kernel development can rest easier now.
Only about 900 of those changesets were pulled since last week's summary, but there were some interesting changes buried in that last batch. Some of the more significant, user-visible changes include:
- The XFS filesystem has gained RENAME_WHITEOUT support,
meaning that it should now work with the overlayfs union filesystem.
Also new in XFS is support for the FALLOC_FL_INSERT_RANGE option to
fallocate(), allowing applications to insert a hole into a
file.
- The Btrfs filesystem has seen some important fixes, though they may
not be hugely relevant for many users: they apply to filesystems 20TB and
larger or to individual files that are 3TB or larger.
- The virtio subsystem has a new
virtio-input driver; its job is to collect and forward input-device
events to a virtual device.
- The arm64 architecture has gained support for the Advanced
Configuration and Power Interface, otherwise known as ACPI.
Supporting ACPI for ARM has been
controversial in the past; many
developers would rather see the device tree mechanism used universally
for hardware discovery on that architecture. The addition of ACPI
happened quietly in the end, though, and it seems likely that there
will be servers using ACPI shipping in the near future. That said,
there is still some work to do; the merge
commit notes that "we don't support any peripherals yet, so it's fairly limited in scope."
See Documentation/arm64/arm-acpi.txt for a lot of information about ACPI on ARM.
- The MD (RAID) subsystem can now manage RAID 1 arrays in a
distributed fashion
across a cluster. This code is currently marked as being
experimental, but it is evidently nearing a production-ready state.
- New hardware support includes:
- DMA:
Ingenic JZ4780 DMA controllers,
Renesas USB-DMA controllers,
Applied Micro X-Gene SoC DMA engines, and
Freescale RAID engines.
- Miscellaneous: ChromeOS embedded controllers, ChromeOS "lightbar" devices, and Dell keyboard backlights.
Changes visible to kernel developers include:
- The "exception table" lists locations in the kernel that might
generate faulting address references; in essence, the table contains a
list of every invocation of copy_*_user() and related
functions. When a fault happens in kernel mode, this table is
consulted to see whether the fault was expected or not. This
mechanism allows the kernel to safely access user-space data without
having to explicitly check each pointer before dereferencing it.
Each loadable module has an exception table of its own to mark such invocations. As of 4.1, the module loader will actively check to ensure that every exception-table entry points to a location within the module's executable text. Any entry not pointing to a known text section must be erroneous, but evidently they come up, especially in situations where a new executable section is being added to the kernel. Developers will want to watch out for this new type of failure, especially when working on the less-mainstream architectures.
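The check described above amounts to a range test on each exception-table entry. The following is only a toy model of that idea, not the module loader's actual code; the function name and the representation of text sections as (start, end) address pairs are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def bad_extable_entries(entries: Iterable[int],
                        text_ranges: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the entries that fall outside every executable text range.

    Each range is a hypothetical (start, end) pair with an exclusive end.
    """
    ranges = list(text_ranges)
    return [addr for addr in entries
            if not any(start <= addr < end for start, end in ranges)]


# Two made-up text sections and four made-up exception-table entries.
TEXT = [(0x1000, 0x2000), (0x3000, 0x3800)]
suspect = bad_extable_entries([0x1004, 0x2000, 0x37FF, 0x5000], TEXT)
```

In this model, a module that adds a new executable section without registering it as text would see all of that section's entries reported as suspect.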
One thing that did not get merged this time around was the kdbus interprocess communication system. Linus did not comment on his decision to leave it out, but it seems clear that this code was too controversial to be pulled straight into the mainline. Now both the supporters of kdbus and those who are concerned about aspects of its design have another development cycle to discuss the issues, and, hopefully, come to some sort of conclusion that allows kdbus to proceed.
Meanwhile, the 4.1 kernel is now in the stabilization phase of the development cycle. If things follow the recent pattern, the final 4.1 kernel release will happen on June 14.
Pagemap: security fixes vs. ABI compatibility
The kernel development community maintains a strong commitment to ABI compatibility; as a general rule, changes that will break existing applications are not allowed. But the community is also committed to fixing known security problems. There are times when a security issue cannot be fixed without changing the way a user-visible interface works, and that can lead to problems. One such situation has come up as the result of a change merged for the 4.0 kernel.
Back in 2008, the 2.6.25 kernel included a patch adding a new virtual file (called pagemap) to each process's /proc directory. That file contains an array of 64-bit values describing each page in the process's virtual address space. If the page is currently resident, the physical page-frame number will be given; otherwise, information on how to find the page in swap is provided. The original purpose for the pagemap file was to enable investigations into which pages were resident and which were shared with other processes. Documentation/vm/pagemap.txt has information on what can be found in this file.
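As a rough illustration of the format, a pagemap entry might be located and decoded as sketched here. The bit positions (page-frame number in bits 0-54, swap type and offset sharing those bits, "swapped" in bit 62, "present" in bit 63) reflect your editor's reading of Documentation/vm/pagemap.txt for kernels of this era and should be treated as assumptions, as should the 4KB page size and all of the names.

```python
import struct
from typing import NamedTuple

PAGE_SIZE = 4096           # assumed; os.sysconf("SC_PAGE_SIZE") gives the real value
ENTRY_SIZE = 8             # one 64-bit value per virtual page

PFN_MASK = (1 << 55) - 1   # bits 0-54 (assumed layout)
SWAPPED = 1 << 62
PRESENT = 1 << 63


class PagemapEntry(NamedTuple):
    present: bool
    swapped: bool
    pfn: int           # meaningful only when present
    swap_type: int     # meaningful only when swapped
    swap_offset: int   # meaningful only when swapped


def entry_offset(vaddr: int, page_size: int = PAGE_SIZE) -> int:
    """Byte offset in the pagemap file of the entry covering vaddr."""
    return (vaddr // page_size) * ENTRY_SIZE


def decode_entry(raw: bytes) -> PagemapEntry:
    """Decode one 8-byte entry read in the machine's native byte order."""
    (value,) = struct.unpack("=Q", raw)
    present = bool(value & PRESENT)
    swapped = bool(value & SWAPPED)
    low = value & PFN_MASK
    return PagemapEntry(
        present=present,
        swapped=swapped,
        pfn=low if present else 0,
        swap_type=(low & 0x1F) if swapped else 0,
        swap_offset=(low >> 5) if swapped else 0,
    )
```

A reader would seek to entry_offset(address) in /proc/PID/pagemap, read eight bytes, and hand them to decode_entry().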
At the time this patch was merged, there appeared to be no harm in exposing the physical page-frame information. Since then, though, sentiments have turned against disclosing internal kernel information that is not strictly needed by user space. That, alone, might have eventually inspired somebody to remove the page-frame number from the pagemap file but, as it happens, something else came along first.
That something is the "rowhammer vulnerability," wherein the contents of a memory area can be changed by repeatedly hammering on a nearby memory area. If an attacker wanted to use this technique to compromise a system, the first order of business would be to obtain access to a page of memory physically adjacent to the memory that is targeted to be changed. The contents of the pagemap file, by providing the physical location of every page mapped in the system, would obviously be most helpful in such a situation. There will probably be other ways for an attacker to determine how pages are laid out in physical memory, but pagemap is almost certainly the easiest way.
To make life harder for attackers attempting to exploit the rowhammer vulnerability, a simple patch was merged for the 4.0-rc5 release in March. The patch turned the pagemap file into a privileged interface; attempts to open it will now fail unless the process in question has the CAP_SYS_ADMIN capability. The 4.0 release came out with that restriction in place, and everybody who was paying attention slept a little easier.
But that rest appears to have come at the cost of some sleepless nights elsewhere. It turns out that the UndoDB debugger uses the pagemap file to track changes to memory. When changes need to be tracked, the debugger will fork() the process, putting all of its writable memory into copy-on-write mode. After running the operation of interest (a system call, normally), the debugger can scan the pagemap file to see which pages have changed page-frame numbers; those are the pages that were written to, and, thus, copied. Without access to pagemap, UndoDB cannot get this information and, as a result, it no longer works.
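The comparison step of that technique can be sketched as follows. This is not UndoDB's code; it simply assumes that two snapshots of the pagemap file have already been reduced to dictionaries mapping virtual page numbers to page-frame numbers, and the function name is invented.

```python
from typing import Dict, List


def copied_pages(before: Dict[int, int], after: Dict[int, int]) -> List[int]:
    """Return the virtual page numbers whose page-frame number changed.

    After a fork(), a write to a copy-on-write page gives the writer a
    fresh physical page, so a changed frame number implies a write.
    """
    return sorted(vpn for vpn, pfn in before.items()
                  if vpn in after and after[vpn] != pfn)


# Hypothetical snapshots taken before and after a system call.
snapshot_before = {0x10: 900, 0x11: 901, 0x12: 902}
snapshot_after = {0x10: 900, 0x11: 1500, 0x12: 902}
written = copied_pages(snapshot_before, snapshot_after)
```

Note that only the fact of a change matters here; the actual frame numbers are never interpreted.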
In some situations of this type, one might just argue that the tool in
question should be run as root. But that is not generally a desirable way
to run an interactive debugging tool. So some other sort of solution must
be found, or UndoDB will remain broken. There are cases where "remains
broken" may be the final outcome; as Linus said in response to the report, "the one
exception to the regression rule is 'security fixes'". But,
fortunately, there appear to be some better options available this time
around.
One possibility would be to restore access to the pagemap file but to somehow scramble the page-frame numbers before reporting them to user space. That would work for UndoDB, since it doesn't care about the actual page-frame numbers; it is only looking for changes. Linus was not convinced that this was the right way to go, though.
Andy Lutomirski also pointed out that even scrambled page-frame numbers might be enough for an attacker to obtain some memory-adjacency information. So that approach does not appear to be viable.
The alternative is to simply report the page-frame numbers as zero in the absence of CAP_SYS_ADMIN. That would make the rest of the information in pagemap available while not exposing the page-frame information. The bad news is that always-zero page-frame numbers are not helpful for UndoDB. The good news, though, is that there is something else in pagemap that is just as useful.
That "something else" is the "soft-dirty" mechanism added to the 3.11 kernel in support of the checkpoint-restore in user space (CRIU) effort. Along with the page-frame number, each pagemap entry contains a soft-dirty bit that is meant to track pages that have been written to. All of the soft-dirty bits for a process can be reset to zero by writing to the clear_refs file in that process's /proc directory. Thereafter, the soft-dirty bit will be set whenever that process writes to a given page. CRIU uses this mechanism to find pages that have been changed during the checkpoint process, but it also will work for the UndoDB case. (See Documentation/vm/soft-dirty.txt for details on the soft-dirty mechanism).
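A minimal sketch of the soft-dirty approach follows. The position of the soft-dirty bit (bit 55) and the value written to clear_refs (4) come from your editor's reading of the kernel documentation rather than from the discussion above, so both are labeled as assumptions; the function names are invented.

```python
import struct
from typing import List

SOFT_DIRTY = 1 << 55   # assumed bit position, per Documentation/vm/pagemap.txt


def clear_soft_dirty(pid: int) -> None:
    """Reset the soft-dirty bits for a process.

    Assumption: writing "4" selects the soft-dirty reset, as described
    in Documentation/vm/soft-dirty.txt.
    """
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def soft_dirty_pages(raw: bytes, first_vpn: int = 0) -> List[int]:
    """Return virtual page numbers whose entries have the soft-dirty bit set.

    raw holds consecutive 64-bit pagemap entries in native byte order,
    starting with the entry for virtual page first_vpn.
    """
    count = len(raw) // 8
    values = struct.unpack(f"={count}Q", raw[:count * 8])
    return [first_vpn + i for i, v in enumerate(values) if v & SOFT_DIRTY]
```

A debugger could call clear_soft_dirty(), run the operation of interest, then read the relevant slice of pagemap and pass it to soft_dirty_pages(); no page-frame numbers are needed at any point.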
So the probable outcome in this case is that pagemap will, once again, become globally readable. But it will contain no useful page-frame numbers unless the reading process had CAP_SYS_ADMIN when it opened the file. That will make UndoDB users happy again while preserving the security objectives of the original patch. So this story has a happy ending — unless, of course, another user who truly needs the page-frame number information steps forward.
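The proposed behavior can be modeled in a few lines. This is a sketch of the policy, not the kernel patch; it assumes the page-frame number occupies bits 0-54 of an entry and that only entries for present pages (bit 63) need masking, and the function name is hypothetical.

```python
PFN_MASK = (1 << 55) - 1   # assumed: bits 0-54 hold the page-frame number
PRESENT = 1 << 63          # assumed: bit 63 marks a present page


def visible_entry(entry: int, has_cap_sys_admin: bool) -> int:
    """Return the pagemap entry as an unprivileged or privileged reader would see it."""
    if has_cap_sys_admin or not (entry & PRESENT):
        return entry
    # Zero the frame number but keep the flag bits, including soft-dirty.
    return entry & ~PFN_MASK
```

The flag bits survive the masking, which is what lets a soft-dirty-based tool keep working without privilege.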
Tracking actual memory utilization
One might be tempted to think that an operating-system kernel should be able to answer a simple question: how much memory is a given process actually using? But, despite all the effort that has gone into providing visibility for this type of data, simple answers can be hard to come by. So the effort to provide better information continues, as can be seen by a recent patch set from Vladimir Davydov adding another way to calculate memory utilization.
A process's resident set size (RSS) is relatively easily calculated; that is the number of pages of physical memory currently owned by that process. Interested parties can get this information now from /proc or the ps command. In theory, the kernel is handling page reclaim in such a way that each process is actually using every page in its resident set, but, in the real world, things don't always work out that way.
It can be worth knowing if there is a significant difference between a process's RSS and the amount of memory actually in use; this information can be helpful when partitioning the system between containers or setting control-group limits. As it happens, the kernel contains a mechanism designed to allow an observer to determine how much of a process's resident set has actually been referenced. That information is found in a virtual file called smaps in the process's /proc directory. For example, the following fragment comes from the smaps file corresponding to the X.org server on your editor's desktop:
016bc000-04af4000 rw-p 00000000 00:00 0 [heap]
Size: 53472 kB
Rss: 51936 kB
Pss: 51936 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 51936 kB
Referenced: 45384 kB
Anonymous: 51936 kB
AnonHugePages: 38912 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac
This entry describes an anonymous memory area that occupies 53,472KB of memory; 51,936KB of that area is currently resident (the Rss field), and 45,384KB have been referenced (the Referenced field) since tracking was last reset. Since nothing is monitoring memory use on this system, that number has never been reset and thus counts every page referenced since the X.org server started.
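The arithmetic is easy to automate. The sketch below parses the size fields of one smaps entry and computes how much resident memory has not been referenced; the function name is invented and the sample is an abbreviated copy of the fragment above.

```python
from typing import Dict


def parse_smaps_entry(text: str) -> Dict[str, int]:
    """Collect the 'Name: <number> kB' fields of one smaps entry, in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip the header line and non-size lines such as VmFlags.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


SAMPLE = """\
016bc000-04af4000 rw-p 00000000 00:00 0 [heap]
Size:              53472 kB
Rss:               51936 kB
Referenced:        45384 kB
VmFlags: rd wr mr mw me ac
"""

fields = parse_smaps_entry(SAMPLE)
unreferenced_kb = fields["Rss"] - fields["Referenced"]
```

For this entry, the difference between Rss and Referenced is the portion of the heap that is resident but has not been touched since the last reset.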
If one wants to track usage over a specific period, it is necessary to reset the "referenced" count at the beginning, let the process run for a bit, then look in smaps to see how much memory was actually touched. That reset is done by writing a value of 1 to the clear_refs file in the same /proc directory.
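That reset-run-read sequence might be scripted along these lines. This is a sketch only: the function name is hypothetical, the directory is passed in as a parameter (normally /proc/PID), and the "let the process run" step is left to a caller-supplied function.

```python
import os
from typing import Callable


def referenced_kb_during(proc_dir: str, wait: Callable[[], None]) -> int:
    """Reset the referenced state, wait, then total the Referenced fields.

    proc_dir is the target's /proc directory, for example "/proc/1234".
    """
    with open(os.path.join(proc_dir, "clear_refs"), "w") as f:
        f.write("1")
    wait()  # e.g. lambda: time.sleep(60), while the target process runs
    total = 0
    with open(os.path.join(proc_dir, "smaps")) as f:
        for line in f:
            if line.startswith("Referenced:"):
                total += int(line.split()[1])
    return total
```

The result is the amount of mapped memory, in kilobytes, that the process touched during the interval, summed over all of its memory areas.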
At a first look, this mechanism seems like it should be able to answer the question of how much memory a process is actually using. But it turned out to not meet Vladimir's needs for a couple of reasons. One of those is that, while the smaps entry tracks references to memory mapped into the process's address space, it does not track page-cache memory used when files are accessed with system calls like read() or write(). That memory, too, is used by the process, so there would be value in knowing how much of it there is. Perhaps more importantly, the "referenced" state of each page is used by the memory-management subsystem itself to make decisions on which pages to evict. Resetting every page to the "not referenced" state will thus perturb page reclaim, and probably not for the better. If these measurements are to be made often, it would be good to have a less invasive way to make them.
Vladimir's patch adds a new file called /proc/kpageidle; since it's in the top-level /proc directory, it's a single file that describes an aspect of the global state of the system. The file can be read like a long array of 64-bit integer values; each value corresponds to one physical page in the system, indexed by page-frame number. If a program wants to know whether physical page N has been referenced, it can seek to the appropriate location in /proc/kpageidle and read the value there; if the lowest bit is set, the page is idle. (Note that this file may change to a bitmap format in a future version of the patch set.)
Once again, one needs to be able to reset that state to make observations over a given time period; in this case, setting a page to the "idle" state is done by writing 1 to the appropriate location in /proc/kpageidle. That action will make the page inaccessible (much like the normal kernel usage tracking does) so that a fault will result whenever a process tries to read or write that page. At that point, the "idle" state can be reset and the page made accessible again. The idle state will also be reset if the page is accessed via the file-related system calls, so it will track the state of pages in the page cache as well.
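Reading and writing the proposed file could look something like this. It is a sketch against the format as described in this version of the patch set (one 64-bit value per page-frame number), which may change; the function names are invented, and the code works on any seekable binary file object so that it can be tried without the patch applied.

```python
import io
import struct
from typing import BinaryIO

ENTRY_SIZE = 8  # one 64-bit value per physical page, per the description above


def mark_page_idle(kpageidle: BinaryIO, pfn: int) -> None:
    """Set the page with the given page-frame number to the idle state."""
    kpageidle.seek(pfn * ENTRY_SIZE)
    kpageidle.write(struct.pack("=Q", 1))


def page_is_idle(kpageidle: BinaryIO, pfn: int) -> bool:
    """Return True if the lowest bit of the page's entry is set."""
    kpageidle.seek(pfn * ENTRY_SIZE)
    raw = kpageidle.read(ENTRY_SIZE)
    if len(raw) != ENTRY_SIZE:
        raise ValueError(f"no entry for page-frame number {pfn}")
    (value,) = struct.unpack("=Q", raw)
    return bool(value & 1)
```

On a patched kernel, the file object would come from opening /proc/kpageidle in unbuffered binary read-write mode; an io.BytesIO buffer stands in for it during experiments.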
To track the idle state, the patch set adds a new "idle" page flag that is set whenever a page is marked idle. That flag is then passed back to user space whenever a given page's entry in /proc/kpageidle is read. As it turns out, there is a need for a second page flag as well, though. As mentioned above, making a page inaccessible is a technique already used within the memory-management subsystem; when a write to /proc/kpageidle causes that to happen, it makes the page appear to have never been accessed. To avoid that, Vladimir adds a second flag called "young"; whenever a write to /proc/kpageidle makes a page inaccessible, the "young" bit will be set as well. When the memory-management code asks whether a page has been referenced, the "young" bit is taken into account. In the end, that means that using /proc/kpageidle will not change how page reclaim is done.
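The interplay between the two flags can be shown with a toy model. This is not the kernel's logic; in particular, the model assumes that "young" is set only when marking a page idle discards reference information that was actually present, and every name in it is invented.

```python
from dataclasses import dataclass


@dataclass
class Page:
    accessed: bool = False  # stands in for the normal reference tracking
    idle: bool = False
    young: bool = False


def touch(page: Page) -> None:
    """A process reads or writes the page."""
    page.accessed = True
    page.idle = False


def mark_idle(page: Page) -> None:
    """Model of a write to the page's /proc/kpageidle entry."""
    if page.accessed:
        page.young = True   # remember the reference on reclaim's behalf
    page.accessed = False
    page.idle = True


def reclaim_sees_referenced(page: Page) -> bool:
    """Model of the memory-management code asking about references."""
    referenced = page.accessed or page.young
    page.accessed = False
    page.young = False
    return referenced


page = Page()
touch(page)
mark_idle(page)
still_referenced = reclaim_sees_referenced(page)
```

In the model, a page that was touched before being marked idle still looks referenced to reclaim, which is the property the "young" flag exists to preserve.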
There is one little problem with this approach: page flags are in short supply on 32-bit systems. To get around this problem, the code uses the "struct page extension" mechanism in the 32-bit case. This mechanism was originally created to support memory control groups (memcgs), which need to store more information about each page than can fit in the page structure. Using extensions can use quite a bit of memory in its own right, but there's little alternative on systems where shoehorning even one more bit into struct page is not an option.
Readers who have gotten this far may be wondering about one final piece of the puzzle: knowing which physical pages in the system are in use does not say much about what any specific processes are using. There are two ways of connecting the two pieces, one of which exists now and one of which is part of Vladimir's patch. In current systems, the pagemap file in any process's /proc directory can be used to see which physical pages are mapped into that process's address space. That information is only available to privileged processes as of the 4.0 release, but /proc/kpageidle is a privileged interface too.
If the task at hand is partitioning a system's resources, though, then memcgs are likely already in use to set limits on groups of processes. In that case, it is more interesting to know how much memory each memcg is using than to track this information on a per-process basis. To that end, the patch set adds yet another file (/proc/kpagecgroup) which, when read, yields the control group that owns each page. By using that file together with /proc/kpageidle, a monitoring process can determine how many pages each memcg is using — and how many it owns but is not making use of.
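Combining the two files might look like the following sketch. It assumes that /proc/kpagecgroup, like /proc/kpageidle, is an array of 64-bit values indexed by page-frame number and that each value is an opaque identifier for the owning control group; neither detail is confirmed above, and the function name is invented.

```python
import struct
from collections import defaultdict
from typing import Dict, Tuple


def idle_by_cgroup(kpagecgroup: bytes, kpageidle: bytes) -> Dict[int, Tuple[int, int]]:
    """Map each control-group identifier to (pages owned, pages idle).

    Both arguments hold consecutive 64-bit values in native byte order
    covering the same range of page-frame numbers.
    """
    count = min(len(kpagecgroup), len(kpageidle)) // 8
    owners = struct.unpack(f"={count}Q", kpagecgroup[:count * 8])
    idle = struct.unpack(f"={count}Q", kpageidle[:count * 8])
    stats = defaultdict(lambda: [0, 0])
    for owner, flag in zip(owners, idle):
        stats[owner][0] += 1
        stats[owner][1] += flag & 1
    return {cgroup: (owned, idle_pages)
            for cgroup, (owned, idle_pages) in stats.items()}
```

A monitoring process would mark pages idle, wait for the interval of interest, then read matching ranges of both files and compare each group's idle count against the number of pages it owns.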
The end result is an interface that can be used to determine how well a control group's memory limits fit its actual needs. As service providers of all types seek to run more clients on each physical system, they will likely be pleased to have this extra information available. That, of course, depends on this patch set being merged into the mainline. Given the lack of significant opposition, that seems likely to happen sooner or later — though, with memory-management patches, it's always hard to say just when that might happen.
Page editor: Jonathan Corbet
