
Kernel development

Brief items

Kernel release status

The current development kernel is 4.1-rc8, released on June 14. "So I'm on vacation, but time doesn't stop for that, and it's Sunday, so time for a hopefully final rc."

Stable updates: none have been released in the last week.


Quotes of the week

    +		/*
    +		 * That should not fail at boot due to OOM, and it'll
    +		 * already warn if we somehow get two identical names,
    +		 * but this one line should quiet both gcc and lkml.
    +		 */
Rusty Russell

This scared the hell out of me as I'm thinking that I have got some kind of NSA backdoor hooked into my server and it is monitoring my plans to smuggle Kinder Überraschung into the USA from Germany. I panicked!
Steven Rostedt


Kernel development news

Greybus

By Jonathan Corbet
June 17, 2015

LinuxCon Japan
A little while back, Greg Kroah-Hartman was given an opportunity to work on an interesting problem. Project Ara is developing a phone handset that can be assembled (and customized) from a range of components; these components include speakers, displays, cameras, interesting sensors, and more. Making this work requires an internal bus that can handle devices that may come and go at any time — something that cellphones have not needed up to this point. Greg's job was to help to bring that bus into being; the result was "Greybus," which was the subject of Greg's LinuxCon Japan talk.

Project Ara imposes some interesting requirements on its internal bus. Beyond being dynamic, the bus must be routable and secure; in other words, any two components on the bus must be able to communicate directly with each other, without any other component being able to listen in. As a result, standard buses like USB and PCIe are not suitable. After some searching, Greg and company came across the UniPro bus, which seems to fit the bill.

UniPro has its origins in Nokia, which wanted a way to be able to easily integrate cameras from any vendor. Google then picked it up and made it routable. At this point, Greg said, it is fast and mature. It promises low latency and in-order message delivery, and it has low-power modes and quality-of-service features. The standards have been out there for a while; they are driven by the MIPI Alliance.

Everything on UniPro happens by way of bidirectional connections that pass directly from one component to another; data does not pass through the processor. A connection is represented by a "CPort," which looks a lot like a network socket. There is a switch on the bus that sets up the actual routes. Messages can pass at a rate of around 10Gb/s; the bus also has message prioritization, error handling, and notification of delivery problems. What UniPro does not provide is streams or multicast delivery; Greg suggested the latter was a good non-feature, since it prevents modules from sniffing unrelated traffic.

UniPro adheres to the OSI network model, except that it has no application layer defined. So the Project Ara developers made their own, which they called Greybus. (The name evidently comes from the gray color of the original prototype device; nobody has since come up with a better one). Greybus adds device discovery; all modules on the bus are self-describing. There are, it seems, advantages when the people who have to make the software side work get a say in how the hardware is designed.

Greybus also adds a network routing layer internally and a set of class protocols for specific device types. This is something that USB and PCI got right, Greg said. When they adhere to the class protocols, devices like keyboards or WiFi adapters just work with no additional software development needed. This is important for Project Ara; it would like to see a lively market in third-party modules, and that is much more likely to happen if new modules simply work in existing systems.

Each device on the bus offers a description to the rest of the system; it includes vendor and product IDs, a serial number, the protocols used, etc. Each module (a physical thing plugged into the phone) offers one or more "interfaces," each of which is a physical connection. CPort 0 on each interface controls the interface as a whole; other CPorts offer the actual functionality and will be what the kernel normally talks to. Those CPorts support "operations," which are a way of getting a module to do something using an interface that resembles a remote procedure call API.

Currently a number of protocol classes have been implemented; these include the battery, vibrator, and near-field communications classes. Others, including audio, input, sensors, and cameras, are in progress. There are a few that have been left for later, including WiFi, Bluetooth, cellular modems, GPS receivers, lights, and the display — which, Greg suggested, will be a fair amount of work.

Also built into Greybus is the ability to bridge other protocols. So, for example, devices speaking protocols like USB, I2C, I2S, or SPI can be driven directly over the bus. There is also support for tunneling protocols like CSI (for cameras) or DSI (for displays).

Greg concluded by noting that the code is all available in the Greybus GitHub repository. It represents a piece of the Project Ara puzzle, but not the whole solution. In particular, making this device successful will require turning Android into a much more dynamic system; it is the same challenge that Linux distributors dealt with ten or fifteen years ago. It's a big job, but developers at Google are working on it. Once they are done (and the hardware is available), we will have stepped into a world where phone handsets are far more dynamic and configurable than they have been in the past.

Greg let people look at his prototype handset after the talk. It consisted of a chassis into which a number of postage-stamp-sized modules could be plugged. It was a nice-looking device, though the chassis seems like it will always force the device to be a bit fatter and heavier than less configurable devices. Suffice it to say that your editor wants one.

[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan.]


Writeback and control groups

By Jonathan Corbet
June 17, 2015

LinuxCon Japan
"Writeback" is the process of writing dirty pages in memory back to permanent storage. It is a tricky job; the kernel must arbitrate the use of limited I/O bandwidth while ensuring that the system is not overwhelmed by dirty pages. Some years ago, writeback was a perennial discussion topic at gatherings of memory-management developers; the kernel did not do as good a job as anybody would have liked. Those problems have, for the most part, been solved in recent years — until one adds control groups into the mix. A solution to that problem is in the works, though, and should be hitting the mainline in the near future.

Tejun Heo took some time to discuss the current situation during his LinuxCon Japan talk. The memory-management subsystem will, by default, try to limit dirty pages to a maximum of 15% of the memory on the system. There is a "magical function" called balance_dirty_pages() that will, if need be, throttle processes dirtying a lot of pages in order to match the rate at which pages are being dirtied and the rate at which they can be cleaned. It works reasonably well in current kernels, but it only operates globally; it is not equipped to deal with control groups.

On the control group side, the memory controller can regulate the amount of memory that is available to any given group, while the block controller is in charge of regulating I/O bandwidth use. Writeback is clearly related to both memory use and I/O bandwidth, but the control-group mechanism offers no way to enable controllers to work together — so these two controllers don't. The result, Tejun said, is a "really sad situation."

The memory controller currently tags pages in memory with owner information so that it knows which control group to charge for each page. The block controller is unable to use that information, though, so it has no way of knowing which control group to charge for writeback I/O traffic. So control groups do not use the system's global throttling mechanism at all; instead, there is a "hacky" mechanism built into the memory controller itself that, according to Tejun, "does not throttle anything effectively." It ignores the global dirty-page watermarks that control throttling and is, he said, "completely broken." There has been talk of fixing the situation for at least five years but nothing has been done, leading to a certain amount of frustration.

Fixing writeback in control groups

So Tejun set out to deal with the problem. His approach is driven by the idea that control-group features should not need completely new mechanisms for their implementation — writeback control in control groups should use the same mechanism that the system as a whole uses. The global mechanism should just be a degenerate form of the single-group case.

There are two structures involved in writeback control in the kernel. struct backing_dev_info describes a specific device to which dirty pages are being written; it tracks the device's observed I/O bandwidth and how that bandwidth is being used. The bdi_writeback structure, instead, regulates the writeback activity itself. There is currently a single bdi_writeback structure for each backing_dev_info structure, and the separation of their roles is somewhat fuzzy. (Both of these structures are defined in include/linux/backing-dev.h.)

One of the first things Tejun's control-group writeback support patch set does is to move more writeback-specific information from struct backing_dev_info into the bdi_writeback structure. That structure then goes from a single instance per device to one instance for each control group, allowing for each group to be regulated separately. balance_dirty_pages() is changed to use the per-group bdi_writeback structure, as are other pieces of the writeback-control mechanism. Tejun described it as being mostly "a giant plumbing job."

Details

The completion of that plumbing job allows the block bandwidth controller to regulate writeback I/O, but it is missing an important piece: the throttling of processes that are dirtying more memory than can be cleaned within their group's I/O bandwidth limits. Or, more precisely, while the system can throttle processes when the global dirty-page limit is reached, it cannot throttle those that have dirtied too much of the memory that is available to their specific control group. Solving that problem is the subject of a separate patch set adding per-group throttling.

This patch set adds a new structure (struct wb_domain) for the control of dirty-page throttling. There is one global domain that implements the "15% of total memory" limit that exists in current kernels. Each control group gets its own wb_domain structure as well, to enforce limits specific to that group. When the memory-management code computes the number of pages that a process within a specific group is allowed to dirty, it looks at both the global and per-group wb_domain structures and uses the more restrictive of the two. A process will never be allowed to exceed the number of dirty pages allowed to its control group, but that limit may be lowered if the system as a whole has a lot of dirty pages.

That is still not a complete solution to the problem, though. The writeback mechanism uses the inode (open file) as its fundamental unit of control, while the memory controller applies limits on a per-page basis. Tejun explained that each makes sense within its own context, but there is a mismatch between the two that makes it harder to make those mechanisms work well together.

The writeback mechanism is designed to focus on a single inode at a time; among other things, writing out all of a single file's dirty pages together tends to improve disk I/O locality. When the I/O bandwidth controller first sees writeback activity for an inode, it assigns "ownership" of the inode to the control group responsible for that activity. Thereafter, all writeback activity for that inode is charged to that control group, regardless of who actually dirtied the pages. Tejun looked into making the accounting more fine-grained but, he said, the result was far too complex and wasn't worth it. In the end, one control group is usually responsible for the majority of writeback traffic to any given file.

There is still a problem, though, that the initial assignment of responsibility for any given file might be incorrect. Or the file could move from one control group to another over time. In either case, the result could be that one group finds itself charged for large amounts of writeback created by another group entirely.

To resolve that issue, Tejun has posted yet another patch set adding "foreign cgroup inode bdi_writeback switching." This mechanism watches the ownership of the pages (as tracked by the memory controller) being written back to each inode. Using the Boyer-Moore majority vote algorithm, it decides which control group is responsible for the most I/O traffic. If most traffic originates in a group other than the owner of that inode, and that pattern holds for a period of time (two seconds, in the current patch), the ownership of the inode will be switched to the new "winner". Over time, that mechanism should ensure that writeback I/O traffic is charged correctly without adding the need to track things on a sub-inode level.

As for the status of all this work: Tejun said that it works and is currently slated for the 4.2 merge window. That said, it is still experimental and there are probably some issues to be shaken out. At the time of the talk, only the ext2 filesystem was supported; since then, ext4 support has been posted as well. Each filesystem will require changes to support the new writeback mechanism, but the changes tend to be quite small. Getting those pieces into place should not take too long; then, once this work stabilizes, another longstanding Linux memory-management shortcoming should be no more.

[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan.]


Leap-second issues, 2015 edition

By Jonathan Corbet
June 17, 2015
The leap second is an occasional ritual wherein Coordinated Universal Time (UTC) is held back for one second to account for the slowing of the Earth's rotation. The last leap second happened on June 30, 2012; the next is scheduled for June 30 of this year. Leap seconds are thus infrequent events. One might easily imagine that infrequent events involving time discontinuities would be likely to expose software problems, and, sure enough, the 2012 leap second had its share of issues. The 2015 leap second looks to be a calmer affair, but it appears that it will not be entirely problem-free.

Prarit Bhargava reported a problem at the end of May: it seems that, when the leap second hits, some timers can fire one second earlier than they should. This is not a good outcome; timers can be delayed, but they should never fire before their appointed time. It did not take long to understand the problem, but finding a proper solution was a rather slower task.

Linux handles leap seconds in the kernel. An application (typically the network time protocol (NTP) daemon) informs the kernel of an upcoming leap second via the adjtimex() system call; when the appointed time arrives, the system clock will be set back by one second. There is an important detail in all of this, though: this adjustment happens during the normal timer tick. The tick is not precisely lined up with the second boundaries as determined by UTC, so there is a window of time between the beginning of the leap second and when the kernel figures out that it needs to hold the clock back. The window is brief (a maximum of about 10ms, usually shorter), but that's enough time for timers set for just after midnight to fire.

One might argue that one second every few years is not a big problem, and that applications that really care should be using the International Atomic Time (TAI) clock anyway. But there are applications with precise timekeeping requirements that use the UTC clock rather than TAI, and they should keep working correctly if at all possible. So this seems like a problem worth fixing.

John Stultz's first attempt at providing that fix did not go entirely well, though. In particular, the patch that moved the leap-second adjustments into the timer fast-path code ran into opposition. The patch added some significant complexity to code that is already far from simple, and it threatened to slow down some of the most frequently exercised code in the kernel. The loudest opposition came from Ingo Molnar, who asked: "why do we add over 100 lines of code for something that occurs (literally) once in a blue moon?"

Ingo's suggestion was to implement leap-second smearing instead. The smearing approach does away with the time discontinuity by tweaking the speed of the clock instead; it would handle a leap second insertion by running the clock just a little slower for a number of hours prior to the event. There are no abrupt time transitions, and no weird times (like 23:59:60) that applications may not be prepared to deal with. It could, Ingo said, also be handled almost entirely from user space via adjtimex(), allowing administrators to control policy and getting the kernel out of the leap-second business entirely.

In truth, life is not so simple. Clocks that do not have leap seconds (the TAI clock in particular) should not smear, so the kernel would have to stay involved and it would have to maintain time bases running at different speeds while the smearing was happening. As John noted, that does not look like a path leading to lower levels of complexity. He argued that, while smearing looks like a worthwhile thing to add to the kernel, it does not address the immediate problem.

Ingo also argued that the leap-second code should not support leap-second deletion, where the clock is set ahead by one second. Deletion has never happened, and, when one looks at the physical processes involved in the slowing of the Earth's rotation, it seems like it probably never will ("discounting massive asteroid strikes, at which point leap seconds will be the least of our problems"). Other operating systems are unlikely to handle deletion gracefully; if they support it at all, it is with code that has never been exercised in the real world. The adjtimex() interface allows for it, though, so John argues that it should be supported; the code is already there and seems unlikely to be removed.

After the discussion calmed down, John came back with a reduced patch set limited to the essential fix. In this version of the patch, the timer code does the leap-second adjustment early enough to avoid premature timer expirations; one might suggest that the new timer code looks before it leaps. The rest of the code remains untouched, though, so the performance impact of the change is minimal. This version found a friendlier reception; it has been added to the "tip" tree for merging into the 4.2 kernel.

As of this writing, the leap-second insertion is less than two weeks away, so John's fix is clearly not destined to be deployed worldwide before that happens. But getting it into the mainline now will ensure that it's running on at least some development systems when the leap second hits, giving it a certain amount of real-world testing. That should increase confidence in the correctness of the patch and help to ensure that, when the next leap second is declared, the kernel will handle it properly.


Patches and updates

Kernel trees

Linus Torvalds Linux 4.1-rc8
Sebastian Andrzej Siewior 4.0.5-rt3
Sebastian Andrzej Siewior 4.0.5-rt4
Luis Henriques Linux 3.16.7-ckt13
Jiri Slaby Linux 3.12.44


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds