Kernel development [LWN.net]

Kernel release status

The current development kernel is 4.1-rc1, released on April 26. Linus said: "No earth-shattering new features come to mind, even if initial support for ACPI on arm64 looks funny. Depending on what you care about, your notion of 'big new feature' may differ from mine, of course. There's a lot of work all over, and some of it might just make a big difference to your use cases."

Stable updates: 4.0.1, 3.19.6, 3.14.40, and 3.10.76 were all released on April 29.


Quotes of the week

Whee! I'm typing this email on a potato!

— Andy Lutomirski (thanks to Cesar Eduardo Barros)

Once user space is lean and mean, at that point do I believe that "ok, let's add kernel code for the last bit of performance". But as it is right now, anybody who works on kdbus and claims that _performance_ is the reason for their work is just looking at the wrong piece of the puzzle.

— Linus Torvalds


Garrett: Reducing power consumption on Haswell and Broadwell systems

Matthew Garrett looked into why Linux systems consume too much power on recent Intel chipsets and wrote up his results — a reduction of idle power use on his laptop from 8.5W to 5W. "This trend is likely to continue. As systems become more integrated we're going to have to pay more attention to the interdependencies in order to obtain the best possible power consumption, and that means that distribution vendors are going to have to spend some time figuring out what these dependencies are and what the appropriate default policy is for their users."


The 4.1 merge window closes

By Jonathan Corbet
April 28, 2015

By the time Linus released 4.1-rc1 and closed the merge window for this development cycle, some 10,659 non-merge changesets had been pulled into the mainline repository. That makes 4.1 a reasonably busy development cycle, but far from the busiest; 3.19 had 11,400 changes during the merge window, and 3.15, the record holder, had just over 12,000. Even if 4.1 is not a record breaker, though, anybody who worried that 4.0 signaled a general slowdown in kernel development can rest easier now.

Only about 900 of those changesets were pulled since last week's summary, but there were some interesting changes buried in that last batch. Some of the more significant, user-visible changes include:

- The XFS filesystem has gained RENAME_WHITEOUT support, meaning that it should now work with the overlayfs union filesystem. Also new in XFS is support for the FALLOC_FL_INSERT_RANGE option to fallocate(), allowing applications to insert a hole into a file.
- The Btrfs filesystem has seen some important fixes, though they may not be hugely relevant for many users: they apply to filesystems 20TB and larger or to individual files that are 3TB or larger.
- The virtio subsystem has a new virtio-input driver; its job is to collect and forward input-device events to a virtual device.
- The arm64 architecture has gained support for the Advanced Configuration and Power Interface, otherwise known as ACPI. Supporting ACPI for ARM has been controversial in the past; many developers would rather see the device tree mechanism used universally for hardware discovery on that architecture. The addition of ACPI happened quietly in the end, though, and it seems likely that there will be servers using ACPI shipping in the near future. That said, there is still some work to do; the merge commit notes that "we don't support any peripherals yet, so it's fairly limited in scope." See Documentation/arm64/arm-acpi.txt for a lot of information about ACPI on ARM.
- The MD (RAID) subsystem can now manage RAID 1 arrays in a distributed fashion across a cluster. This code is currently marked as being experimental, but it is evidently nearing a production-ready state.
- New hardware support includes:
    - DMA: Ingenic JZ4780 DMA controllers, Renesas USB-DMA controllers, Applied Micro X-Gene SoC DMA engines, and Freescale RAID engines.
    - Miscellaneous: ChromeOS embedded controllers, ChromeOS "lightbar" devices, and Dell keyboard backlights.

Changes visible to kernel developers include:

The "exception table" lists locations in the kernel that might generate faulting address references; in essence, the table contains a list of every invocation of copy_*_user() and related functions. When a fault happens in kernel mode, this table is consulted to see whether the fault was expected or not. This mechanism allows the kernel to safely access user-space data without having to explicitly check each pointer before dereferencing it.
Each loadable module has an exception table of its own to mark such invocations. As of 4.1, the module loader will actively check to ensure that every exception-table entry points to a location within the module's executable text. Any entry not pointing to a known text section must be erroneous, but evidently they come up, especially in situations where a new executable section is being added to the kernel. Developers will want to watch out for this new type of failure, especially when working on the less-mainstream architectures.
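
The new module-loader check boils down to a range test on each table entry. What follows is a minimal Python sketch of that logic, not the kernel's C implementation; the names TextSection and find_stray_entries, along with the addresses used, are invented for illustration.

```python
from typing import List, NamedTuple


class TextSection(NamedTuple):
    """Hypothetical description of one executable section of a module."""
    name: str
    start: int  # first address covered by the section
    size: int   # length of the section in bytes


def find_stray_entries(entries: List[int],
                       sections: List[TextSection]) -> List[int]:
    """Return exception-table addresses that lie outside every text section."""
    def in_text(addr: int) -> bool:
        return any(s.start <= addr < s.start + s.size for s in sections)

    return [addr for addr in entries if not in_text(addr)]


if __name__ == "__main__":
    sections = [TextSection(".text", 0x1000, 0x800),
                TextSection(".init.text", 0x4000, 0x100)]
    # The last address points at no known text section.
    table = [0x1010, 0x17FF, 0x4080, 0x2000]
    print([hex(a) for a in find_stray_entries(table, sections)])
```

An architecture that adds a new executable section without teaching the loader about it would, in this model, see every entry in that section reported as stray.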

One thing that did not get merged this time around was the kdbus interprocess communication system. Linus did not comment on his decision to leave it out, but it seems clear that this code was too controversial to be pulled straight into the mainline. Now both the supporters of kdbus and those who are concerned about aspects of its design have another development cycle to discuss the issues, and, hopefully, come to some sort of conclusion that allows kdbus to proceed.

Meanwhile, the 4.1 kernel is now in the stabilization phase of the development cycle. If things follow the recent pattern, the final 4.1 kernel release will happen on June 14.


Pagemap: security fixes vs. ABI compatibility

By Jonathan Corbet
April 29, 2015

The kernel development community maintains a strong commitment to ABI compatibility; as a general rule, changes that will break existing applications are not allowed. But the community is also committed to fixing known security problems. There are times when a security issue cannot be fixed without changing the way a user-visible interface works, and that can lead to problems. One such situation has come up as the result of a change merged for the 4.0 kernel.

Back in 2008, the 2.6.25 kernel included a patch adding a new virtual file (called pagemap) to each process's /proc directory. That file contains an array of 64-bit values describing each page in the process's virtual address space. If the page is currently resident, the physical page-frame number will be given; otherwise, information on how to find the page in swap is provided. The original purpose for the pagemap file was to enable investigations into which pages were resident and which were shared with other processes. Documentation/vm/pagemap.txt has information on what can be found in this file.
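
For readers who want to poke at the file, here is a rough Python sketch of decoding one pagemap entry. The bit positions (bit 63 for "present", bit 62 for "swapped", bits 0-54 for the page-frame number or swap information) are taken from your editor's reading of Documentation/vm/pagemap.txt rather than from the text above, and the helper names are made up; treat it as an illustration under those assumptions.

```python
import struct
from typing import NamedTuple, Optional

PAGE_PRESENT = 1 << 63     # page is resident in RAM
PAGE_SWAPPED = 1 << 62     # page lives in swap
LOW_MASK = (1 << 55) - 1   # bits 0-54: PFN, or swap type and offset


class PagemapEntry(NamedTuple):
    present: bool
    swapped: bool
    pfn: Optional[int]          # set only when present
    swap_type: Optional[int]    # set only when swapped
    swap_offset: Optional[int]  # set only when swapped


def decode_entry(raw: bytes) -> PagemapEntry:
    """Decode one 8-byte pagemap entry (native byte order)."""
    if len(raw) != 8:
        raise ValueError("a pagemap entry is exactly 8 bytes")
    (value,) = struct.unpack("=Q", raw)
    present = bool(value & PAGE_PRESENT)
    swapped = bool(value & PAGE_SWAPPED)
    low = value & LOW_MASK
    return PagemapEntry(
        present=present,
        swapped=swapped,
        pfn=low if present else None,
        swap_type=(low & 0x1F) if swapped else None,
        swap_offset=(low >> 5) if swapped else None,
    )


def read_entry(pid: int, vaddr: int, page_size: int = 4096) -> PagemapEntry:
    """Look up the entry describing the page that contains vaddr."""
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek((vaddr // page_size) * 8)  # one 64-bit value per virtual page
        return decode_entry(f.read(8))
```

The file is indexed by virtual page number, so the entry for a given address sits at offset (address / page size) * 8.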

At the time this patch was merged, there appeared to be no harm in exposing the physical page-frame information. Since then, though, sentiments have turned against disclosing internal kernel information that is not strictly needed by user space. That, alone, might have eventually inspired somebody to remove the page-frame number from the pagemap file but, as it happens, something else came along first.

That something is the "rowhammer vulnerability," wherein the contents of a memory area can be changed by repeatedly hammering on a nearby memory area. If an attacker wanted to use this technique to compromise a system, the first order of business would be to obtain access to a page of memory physically adjacent to the memory that is targeted to be changed. The contents of the pagemap file, by providing the physical location of every page mapped in the system, would obviously be most helpful in such a situation. There will probably be other ways for an attacker to determine how pages are laid out in physical memory, but pagemap is almost certainly the easiest way.

To make life harder for attackers attempting to exploit the rowhammer vulnerability, a simple patch was merged for the 4.0-rc5 release in March. The patch turned the pagemap file into a privileged interface; attempts to open it will now fail unless the process in question has the CAP_SYS_ADMIN capability. The 4.0 release came out with that restriction in place, and everybody who was paying attention slept a little easier.

But that rest appears to have come at the cost of some sleepless nights elsewhere. It turns out that the UndoDB debugger uses the pagemap file to track changes to memory. When changes need to be tracked, the debugger will fork() the process, putting all of its writable memory into copy-on-write mode. After running the operation of interest (a system call, normally), the debugger can scan the pagemap file to see which pages have changed page-frame numbers; those are the pages that were written to, and, thus, copied. Without access to pagemap, UndoDB cannot get this information and, as a result, it no longer works.
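
The change-detection idea described above can be sketched in a few lines of Python. This is a reconstruction of the technique as described, not UndoDB's actual code; the function names are hypothetical, and the bit layout (present flag in bit 63, page-frame number in bits 0-54) is assumed from the kernel's pagemap documentation.

```python
import struct
from typing import List

PAGE_PRESENT = 1 << 63
PFN_MASK = (1 << 55) - 1


def pfns(snapshot: bytes) -> List[int]:
    """Split a raw pagemap snapshot into per-page frame numbers (0 if absent)."""
    count = len(snapshot) // 8
    values = struct.unpack(f"={count}Q", snapshot[:count * 8])
    return [(v & PFN_MASK) if v & PAGE_PRESENT else 0 for v in values]


def copied_pages(before: bytes, after: bytes) -> List[int]:
    """Indices of pages whose frame number differs between two snapshots.

    After a fork(), a write to a copy-on-write page gives the writer a new
    physical page, so a changed frame number marks a page that was written.
    """
    pairs = zip(pfns(before), pfns(after))
    return [i for i, (old, new) in enumerate(pairs) if old != new]
```

Note that the comparison only needs the frame numbers to differ; their actual values are irrelevant to the debugger.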

In some situations of this type, one might just argue that the tool in question should be run as root. But that is not generally a desirable way to run an interactive debugging tool. So some other sort of solution must be found, or UndoDB will remain broken. There are cases where "remains broken" may be the final outcome; as Linus said in response to the report, "the one exception to the regression rule is 'security fixes'". But, fortunately, there appear to be some better options available this time around.

One possibility would be to restore access to the pagemap file but to somehow scramble the page-frame numbers before reporting them to user space. That would work for UndoDB, since it doesn't care about the actual page-frame numbers; it is only looking for changes. Linus was not convinced that this was the right way to go, though:

However, I don't believe that we have a good enough scrambling model to make that reasonable. Remember: any attacker will be able to see our scrambling code, so it would need to be both cryptographically secure *and* use a truly random per-VM secret key. Quite frankly, that's a _lot_ of effort for dubious gain...

Andy Lutomirski also pointed out that even scrambled page-frame numbers might be enough for an attacker to obtain some memory-adjacency information. So that approach does not appear to be viable.

The alternative is to simply report the page-frame numbers as zero in the absence of CAP_SYS_ADMIN. That would make the rest of the information in pagemap available while not exposing the page-frame information. The bad news is that always-zero page-frame numbers are not helpful for UndoDB. The good news, though, is that there is something else in pagemap that is just as useful.

That "something else" is the "soft-dirty" mechanism added to the 3.11 kernel in support of the checkpoint-restore in user space (CRIU) effort. Along with the page-frame number, each pagemap entry contains a soft-dirty bit that is meant to track pages that have been written to. All of the soft-dirty bits for a process can be reset to zero by writing to the clear_refs file in that process's /proc directory. Thereafter, the soft-dirty bit will be set whenever that process writes to a given page. CRIU uses this mechanism to find pages that have been changed during the checkpoint process, but it also will work for the UndoDB case. (See Documentation/vm/soft-dirty.txt for details on the soft-dirty mechanism).
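
A soft-dirty tracking cycle might look roughly like the following Python sketch. The value written to clear_refs (4) and the position of the soft-dirty bit (bit 55 of each pagemap entry) come from your editor's reading of Documentation/vm/soft-dirty.txt and pagemap.txt, not from the text above; the function names are invented.

```python
import struct
from typing import List

SOFT_DIRTY = 1 << 55  # per-entry soft-dirty bit in pagemap


def clear_soft_dirty(pid: int) -> None:
    """Reset every soft-dirty bit for the process by writing 4 to clear_refs."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def read_pagemap_range(pid: int, start: int, end: int,
                       page_size: int = 4096) -> bytes:
    """Read the raw pagemap entries covering virtual addresses [start, end)."""
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek((start // page_size) * 8)
        return f.read(((end - start) // page_size) * 8)


def soft_dirty_pages(snapshot: bytes) -> List[int]:
    """Indices (within the snapshot) of pages written since the last reset."""
    count = len(snapshot) // 8
    values = struct.unpack(f"={count}Q", snapshot[:count * 8])
    return [i for i, v in enumerate(values) if v & SOFT_DIRTY]
```

A tracker would call clear_soft_dirty(), let the target run, then pass the result of read_pagemap_range() to soft_dirty_pages(). No page-frame numbers are consulted at any point.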

So the probable outcome in this case is that pagemap will, once again, become globally readable. But it will contain no useful page-frame numbers unless the reading process had CAP_SYS_ADMIN when it opened the file. That will make UndoDB users happy again while preserving the security objectives of the original patch. So this story has a happy ending — unless, of course, another user who truly needs the page-frame number information steps forward.
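
The proposed policy can be modeled as a simple mask applied to each entry. This Python sketch is only an illustration of the idea; it assumes the page-frame number occupies bits 0-54 and that only entries for resident pages (bit 63 set) are altered, neither of which is spelled out above, and the function name is hypothetical.

```python
PAGE_PRESENT = 1 << 63
PFN_MASK = (1 << 55) - 1  # bits 0-54 hold the page-frame number


def entry_for_reader(entry: int, has_cap_sys_admin: bool) -> int:
    """Return the pagemap value a reader would see under the proposed policy."""
    if has_cap_sys_admin or not (entry & PAGE_PRESENT):
        return entry
    # Unprivileged reader: zero the frame number, keep every flag bit.
    return entry & ~PFN_MASK
```

Since the flag bits in the upper part of the entry pass through untouched, an unprivileged tool can still see which pages are present and which are soft-dirty.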


Tracking actual memory utilization

By Jonathan Corbet
April 29, 2015

One might be tempted to think that an operating-system kernel should be able to answer a simple question: how much memory is a given process actually using? But, despite all the effort that has gone into providing visibility for this type of data, simple answers can be hard to come by. So the effort to provide better information continues, as can be seen by a recent patch set from Vladimir Davydov adding another way to calculate memory utilization.

A process's resident set size (RSS) is relatively easily calculated; that is the number of pages of physical memory currently owned by that process. Interested parties can get this information now from /proc or the ps command. In theory, the kernel is handling page reclaim in such a way that each process is actually using every page in its resident set, but, in the real world, things don't always work out that way.

It can be worth knowing if there is a significant difference between a process's RSS and the amount of memory actually in use; this information can be helpful when partitioning the system between containers or setting control-group limits. As it happens, the kernel contains a mechanism designed to allow an observer to determine how much of a process's resident set has actually been referenced. That information is found in a virtual file called smaps in the process's /proc directory. For example, the following fragment comes from the smaps file corresponding to the X.org server on your editor's desktop:

    016bc000-04af4000 rw-p 00000000 00:00 0                      [heap]
    Size:              53472 kB
    Rss:               51936 kB
    Pss:               51936 kB
    Shared_Clean:          0 kB
    Shared_Dirty:          0 kB
    Private_Clean:         0 kB
    Private_Dirty:     51936 kB
    Referenced:        45384 kB
    Anonymous:         51936 kB
    AnonHugePages:     38912 kB
    Swap:                  0 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Locked:                0 kB
    VmFlags: rd wr mr mw me ac

This entry describes an anonymous memory area that spans 53,472KB of address space; 51,936KB of that area is currently resident (the Rss field), and 45,384KB have been referenced (the Referenced field) since tracking was last reset. Since nothing is monitoring memory use on this system, that number has never been reset and thus counts every page referenced since the X.org server started.

If one wants to track usage over a specific period, it is necessary to reset the "referenced" count at the beginning, let the process run for a bit, then look in smaps to see how much memory was actually touched. That reset is done by writing a value of 1 to the clear_refs file in the same /proc directory.
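
The reset-and-measure cycle is easy to script. Here is a small Python sketch; the function names are made up, and the parsing assumes the "Name: value kB" layout shown in the fragment above.

```python
import re
from typing import Dict, Iterable

FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def smaps_totals(text: str,
                 fields: Iterable[str] = ("Rss", "Referenced")) -> Dict[str, int]:
    """Sum the requested kB fields over every mapping in smaps output."""
    totals = {name: 0 for name in fields}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match and match.group(1) in totals:
            totals[match.group(1)] += int(match.group(2))
    return totals


def unreferenced_kb(text: str) -> int:
    """Resident memory that has not been touched since the last reset."""
    totals = smaps_totals(text)
    return totals["Rss"] - totals["Referenced"]


def reset_referenced(pid: int) -> None:
    """Clear the referenced state of the process's pages via clear_refs."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("1")


def read_smaps(pid: int) -> str:
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()
```

A monitor would call reset_referenced(), sleep for the interval of interest, then feed read_smaps() to unreferenced_kb(). For the heap entry shown above, the difference between Rss and Referenced is 6,552KB.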

At a first look, this mechanism seems like it should be able to answer the question of how much memory a process is actually using. But it turned out to not meet Vladimir's needs for a couple of reasons. One of those is that, while the smaps entry tracks references to memory mapped into the process's address space, it does not track page-cache memory used when files are accessed with system calls like read() or write(). That memory, too, is used by the process, so there would be value in knowing how much of it there is. Perhaps more importantly, the "referenced" state of each page is used by the memory-management subsystem itself to make decisions on which pages to evict. Resetting every page to the "not referenced" state will thus perturb page reclaim, and probably not for the better. If these measurements are to be made often, it would be good to have a less invasive way to make them.

Vladimir's patch adds a new file called /proc/kpageidle; since it's in the top-level /proc directory, it's a single file that describes an aspect of the global state of the system. The file can be read like a long array of 64-bit integer values; each value corresponds to one physical page in the system, indexed by page-frame number. If a program wants to know whether physical page N has been referenced, it can seek to the appropriate location in /proc/kpageidle and read the value there; if the lowest bit is set, the page is idle. (Note that this file may change to a bitmap format in a future version of the patch set).

Once again, one needs to be able to reset that state to make observations over a given time period; in this case, setting a page to the "idle" state is done by writing 1 to the appropriate location in /proc/kpageidle. That action will make the page inaccessible (much like the normal kernel usage tracking does) so that a fault will result whenever a process tries to read or write that page. At that point, the "idle" state can be reset and the page made accessible again. The idle state will also be reset if the page is accessed via the file-related system calls, so it will track the state of pages in the page cache as well.
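
Access to the proposed file could be sketched in Python as follows. This follows the format described above (one 64-bit value per page-frame number, lowest bit meaning "idle"); since the interface is still under review and may become a bitmap, the layout should be seen as provisional, and the helper names are hypothetical. The functions take any seekable binary file so that they can be tried against an in-memory buffer.

```python
import io
import struct
from typing import BinaryIO

ENTRY_SIZE = 8  # one 64-bit value per physical page


def page_is_idle(f: BinaryIO, pfn: int) -> bool:
    """Report whether the page with this frame number is still marked idle."""
    f.seek(pfn * ENTRY_SIZE)
    raw = f.read(ENTRY_SIZE)
    if len(raw) != ENTRY_SIZE:
        raise ValueError(f"no entry for page frame {pfn}")
    (value,) = struct.unpack("=Q", raw)
    return bool(value & 1)


def mark_idle(f: BinaryIO, pfn: int) -> None:
    """Put the page into the idle state by writing 1 to its slot."""
    f.seek(pfn * ENTRY_SIZE)
    f.write(struct.pack("=Q", 1))


if __name__ == "__main__":
    # Stand-in for open("/proc/kpageidle", "r+b", buffering=0) on a
    # kernel that carries the patch set.
    fake = io.BytesIO(bytes(4 * ENTRY_SIZE))
    mark_idle(fake, 2)
    print(page_is_idle(fake, 2), page_is_idle(fake, 1))
```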

To track the idle state, the patch set adds a new "idle" page flag that is set whenever a page is marked idle. That flag is then passed back to user space whenever a given page's entry in /proc/kpageidle is read. As it turns out, there is a need for a second page flag as well, though. As mentioned above, making a page inaccessible is a technique already used within the memory-management subsystem; when a write to /proc/kpageidle causes that to happen, it makes the page appear to have never been accessed. To avoid that, Vladimir adds a second flag called "young"; whenever a write to /proc/kpageidle makes a page inaccessible, the "young" bit will be set as well. When the memory-management code asks whether a page has been referenced, the "young" bit is taken into account. In the end, that means that using /proc/kpageidle will not change how page reclaim is done.
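
The interplay of the two flags can be illustrated with a toy model. This Python sketch is not the patch's code; it assumes that the "young" flag is set only when the write actually hides an earlier reference, and that page reclaim consumes the flag when it checks a page. The class and method names are invented.

```python
from dataclasses import dataclass


@dataclass
class Page:
    accessed: bool = False  # the "referenced" state that reclaim relies on
    idle: bool = False      # new flag: set by a write to /proc/kpageidle
    young: bool = False     # new flag: remembers a reference hidden by that write

    def touch(self) -> None:
        """A process reads or writes the page."""
        self.accessed = True
        self.idle = False

    def mark_idle(self) -> None:
        """User space writes 1 to this page's slot in /proc/kpageidle."""
        if self.accessed:
            self.young = True  # keep the fact visible to page reclaim
        self.accessed = False
        self.idle = True

    def reclaim_sees_referenced(self) -> bool:
        """What page reclaim learns; the recorded state is consumed."""
        referenced = self.accessed or self.young
        self.accessed = False
        self.young = False
        return referenced
```

In this model, a page that was touched and then marked idle still looks referenced to reclaim, which is the property the second flag exists to preserve.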

There is one little problem with this approach: page flags are in short supply on 32-bit systems. To get around this problem, the code uses the "struct page extension" mechanism in the 32-bit case. This mechanism was originally created to support memory control groups (memcgs), which need to store more information about each page than can fit in the page structure. Using extensions can use quite a bit of memory in its own right, but there's little alternative on systems where shoehorning even one more bit into struct page is not an option.

Readers who have gotten this far may be wondering about one final piece of the puzzle: knowing which physical pages in the system are in use does not say much about what any specific processes are using. There are two ways of connecting the two pieces, one of which exists now and one of which is part of Vladimir's patch. In current systems, the pagemap file in any process's /proc directory can be used to see which physical pages are mapped into that process's address space. That information is only available to privileged processes as of the 4.0 release, but /proc/kpageidle is a privileged interface too.

If the task at hand is partitioning a system's resources, though, then memcgs are likely already in use to set limits on groups of processes. In that case, it is more interesting to know how much memory each memcg is using than to track this information on a per-process basis. To that end, the patch set adds yet another file (/proc/kpagecgroup) which, when read, yields the control group that owns each page. By using that file together with /proc/kpageidle, a monitoring process can determine how many pages each memcg is using — and how many it owns but is not making use of.
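
Combining the two files might look like the Python sketch below. The format of /proc/kpagecgroup is not described above; the code assumes, purely for illustration, that it mirrors /proc/kpageidle with one 64-bit value per page-frame number, and that each value is a number identifying the owning control group. The function name is hypothetical.

```python
import io
import struct
from collections import Counter
from typing import BinaryIO, Dict, Tuple


def idle_by_cgroup(idle_f: BinaryIO, cgroup_f: BinaryIO,
                   num_pages: int) -> Dict[int, Tuple[int, int]]:
    """Return {cgroup id: (idle pages, total pages)} for the first num_pages."""
    idle_raw = idle_f.read(num_pages * 8)
    cgroup_raw = cgroup_f.read(num_pages * 8)
    count = min(len(idle_raw), len(cgroup_raw)) // 8
    idle_vals = struct.unpack(f"={count}Q", idle_raw[:count * 8])
    cgroup_vals = struct.unpack(f"={count}Q", cgroup_raw[:count * 8])

    total: Counter = Counter()
    idle: Counter = Counter()
    for idle_word, cgroup_id in zip(idle_vals, cgroup_vals):
        total[cgroup_id] += 1
        if idle_word & 1:  # lowest bit set means the page is idle
            idle[cgroup_id] += 1
    return {cg: (idle[cg], total[cg]) for cg in total}
```

A group that owns many pages but shows a high idle count after a measurement interval is a candidate for a tighter memory limit.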

The end result is an interface that can be used to determine how well a control group's memory limits fit its actual needs. As service providers of all types seek to run more clients on each physical system, they will likely be pleased to have this extra information available. That, of course, depends on this patch set being merged into the mainline. Given the lack of significant opposition, that seems likely to happen sooner or later — though, with memory-management patches, it's always hard to say just when that might happen.


Patches and updates

- Linus Torvalds: Linux 4.1-rc1
- Greg KH: Linux 4.0.1
- Greg KH: Linux 3.19.6
- Luis Henriques: Linux 3.16.7-ckt10
- Greg KH: Linux 3.14.40
- Greg KH: Linux 3.10.76
- Masahiro Yamada: ARM: SoC: add a new platform, UniPhier (arch/arm/mach-uniphier)
- Paul Burton: JZ4780 & CI20 support
- Joachim Eastwood: Add support for NXP LPC18xx family
- Jun Nie: Basic support to ZTE ZX296702
- Frank.Li@freescale.com: Add Freescale i.mx7d support
- AKASHI Takahiro: arm64: add kdump support
- AKASHI Takahiro: arm64: add livepatch support
- Yoshinori Sato: Re-introduce h8300 architecture
- Alexey Kardashevskiy: powerpc/iommu/vfio: Enable Dynamic DMA windows
- Waiman Long: qspinlock: a 4-byte queue spinlock with PV support
- Baolin Wang: Convert the posix_clock_operations and k_clock structure to ready for 2038
- Tejun Heo: printk: implement extended console support
- Steven Rostedt: tracing: Add new hwlat_detector tracer
- Feng Kan: APM X-Gene Mailbox driver
- Mathieu Poirier: Support for coresight ETMv4 tracer
- Brian Norris: AHCI and SATA PHY support for Broadcom STB SoCs
- LABBE Corentin: crypto: Add Allwinner Security System crypto accelerator
- Ingi Kim: Add ktd2692 Flash LED driver using LED Flash class
- Irina Tirdea: Add support for BMC150 magnetometer
- Anda-Maria Nicolae: power_supply: Add support for Richtek rt9455 battery charger
- Kevin Cernekee: tas571x amplifier driver
- Jonathan Richardson: Add DTE driver for Cygnus
- Álvaro Fernández Rojas: BCM6328 LED driver
- Irina Tirdea: Add support for BMC150 magnetometer
- Fabien Dessenne: Add media bdisp driver for stihxxx platforms
- Eddie Huang: Add Mediatek SoC RTC driver
- Ramakrishna Pallala: extcon-axp288: Add axp288 extcon driver support
- Dan Williams: libnd: non-volatile memory device support
- Sudeep Dutt: misc: mic: SCIF driver
- micky_ching@realsil.com.cn: mmc: core: add SD4.0 support
- Pankaj Dubey: Add support for Exynos SROM Controller driver
- Pali Rohár: Dell Airplane Mode Switch driver
- Richard Fitzgerald: Add support for Wolfson WM8998 and WM1814 codecs
- Tomeu Vizoso: On-demand device registration
- Kamil Debski: HDMI CEC framework
- Heikki Krogerus: usb: ulpi bus
- Andreas Gruenbacher: Richacls
- Mike Kravetz: hugetlbfs: add fallocate support
- Beata Michalska: fs: Add generic file system event notifications
- Li Xi: ext4: add project quota support
- Ming Lin: simplify block layer based on immutable biovecs
- David Howells: [RFC][PATCH 00/13] Convert bitop funcs to bool return type and propagate to various callers/users
- Mel Gorman: Parallel struct page initialisation v3
- Kirill A. Shutemov: THP refcounting redesign
- Anisse Astier: Sanitizing freed pages
- Sergey Senozhatsky: introduce on-demand zram device creation
- Johannes Weiner: mm: improve OOM mechanism v2
- Vladimir Davydov: idle memory tracking
- Jiri Pirko: introduce programable flow dissector and cls_flower
- Pablo Neira Ayuso: Netfilter ingress support (v2)
- Stephan Mueller: Seeding DRBG with more entropy
- Hajime Tazaki: an introduction of Linux library operating system (LibOS)
- Mathieu Desnoyers: Userspace RCU 0.7.14 and 0.8.7
- David Sterba: Btrfs progs release 3.19
