Kernel development

Brief items

Kernel release status

The current development kernel is 4.1-rc6, released on May 31. As Linus put it: "things look normal".

Stable updates: none have been released in the last week.

Linux support for digital video broadcasting

Mauro Carvalho Chehab, the maintainer of the kernel's media subsystem, has posted the first two in a series of articles on digital video broadcasting support in Linux. Part 1 gives an overview of how the devices and protocols work, while part 2 looks at digital TV network interface use. "Supporting embedded Digital TV hardware is complex, considering that such hardware generally has multiple components that can be rewired in runtime to dynamically change the stream pipelines and provide flexibility for things like recording a video stream, then tuning into another channel to see a different program. This article describes how the DVB pipelines are setup and the needs that should be addressed by the Linux Kernel."

Kernel development news

4.1 development statistics

By Jonathan Corbet
June 3, 2015

The 4.1-rc6 prepatch is out, and things look on track for an on-schedule 4.1 release — unless, of course, Linus's vacation plans get in the way. But the kernel development community doesn't stop, even when Linus does, so it seems like about time to have a look at some development statistics for this kernel cycle. 4.1 was a fairly typical development cycle with few surprises.

With 11,664 changesets merged as of 4.1-rc6, this cycle is a little bit slower than most that have been seen in the last year, though it is significantly busier than 4.0, which finished with 10,346 changesets. The number of developers involved, at 1,492, comfortably exceeds 4.0; indeed, it currently ties 3.15 for the title of the most developers ever. Chances are good that, by the time it is released, the 4.1 kernel cycle will be the first to see the participation of over 1,500 developers. Of the developers seen so far, 270 have contributed their first patch ever in this cycle, a fairly typical number for recent years.

Those developers have added 486,000 lines of code and removed 286,000 lines, for a net growth of 200,000 lines this time around. The most active developers doing this work were:

Most active 4.1 developers

By changesets

Ian Abbott 129 1.1%

Takashi Iwai 121 1.0%

Hans Verkuil 119 1.0%

Marcel Holtmann 117 1.0%

Aya Mahfouz 107 0.9%

Geert Uytterhoeven 105 0.9%

Laurent Pinchart 102 0.9%

Richard Weinberger 95 0.8%

Joe Perches 92 0.8%

Eric Dumazet 92 0.8%

Al Viro 90 0.8%

Krzysztof Kozlowski 77 0.7%

Fabian Frederick 77 0.7%

Benjamin Romer 74 0.6%

Jiri Olsa 73 0.6%

Denys Vlasenko 72 0.6%

Mauro Carvalho Chehab 67 0.6%

Nicholas Mc Guire 66 0.6%

Guenter Roeck 65 0.6%

Lars-Peter Clausen 65 0.6%

By changed lines

Jie Yang 20194 3.4%

Stephen Boyd 13536 2.3%

Sudip Mukherjee 10198 1.7%

Chanwoo Choi 8571 1.5%

Heiko Carstens 8239 1.4%

Tomeu Vizoso 7647 1.3%

Hongzhou Yang 7391 1.3%

Joe Perches 7135 1.2%

Laurent Pinchart 6589 1.1%

J. German Rivera 6359 1.1%

Takashi Iwai 6173 1.0%

Magnus Damm 6082 1.0%

Mathieu Poirier 5915 1.0%

Michael Ellerman 5874 1.0%

Ray Jui 5362 0.9%

Andy Shevchenko 4857 0.8%

Hai Li 4487 0.8%

Andrew Bresticker 4252 0.7%

Markus Stockhausen 4221 0.7%

James Hogan 4172 0.7%

Hartley Sweeten no longer tops out the list with his Comedi driver work but, never fear, Ian Abbott contributed the most 4.1 changesets by working on, yes, the Comedi drivers. Remember, though, that Hartley contributed 463 Comedi changes in 3.19, so it may well be that work on this code is finally slowing down, though there doesn't appear to be a plan to move it out of the staging tree quite yet. Takashi Iwai is the sound subsystem maintainer, so most of his work concentrates in that area; for similar reasons, it is unsurprising that Hans Verkuil's changes were all found within the media subsystem and Marcel Holtmann's patches were applied to the Bluetooth code. Aya Mahfouz, instead, is an intern in the Outreachy project's current round; she has clearly gotten off to a strong start with a lot of cleanup patches applied to staging drivers.

On the "lines changed" side, Jie Yang's work was mostly a reorganization of the Intel audio driver code. Stephen Boyd removed some old ARM drivers, making him the developer who removed the most code in this development cycle. Sudip Mukherjee did a bunch of cleanup work on a number of staging drivers, Chanwoo Choi worked mostly on the Samsung Exynos clock drivers, and Heiko Carstens removed a bunch of relatively obscure S/390 code, such as 31-bit support.

There were (at least) 215 employers supporting work on the 4.1 kernel, the most active of which were:

Most active 4.1 employers

By changesets

Intel 1308 11.2%

Red Hat 1069 9.2%

(None) 1055 9.0%

(Unknown) 950 8.1%

SUSE 437 3.7%

Linaro 387 3.3%

IBM 385 3.3%

Outreachy 381 3.3%

Google 362 3.1%

Samsung 340 2.9%

Renesas Electronics 279 2.4%

(Consultant) 258 2.2%

Texas Instruments 217 1.9%

Broadcom 162 1.4%

Oracle 155 1.3%

Imagination Technologies 151 1.3%

Cisco 150 1.3%

Freescale 134 1.1%

MEV Limited 129 1.1%

ARM 129 1.1%

By lines changed

Intel 74566 12.6%

Red Hat 41496 7.0%

(None) 40119 6.8%

IBM 39301 6.7%

(Unknown) 31558 5.3%

Linaro 29588 5.0%

Code Aurora Forum 23495 4.0%

Samsung 22175 3.8%

Google 21588 3.7%

Renesas Electronics 17548 3.0%

SUSE 16830 2.9%

Broadcom 15202 2.6%

Freescale 15156 2.6%

Imagination Technologies 10935 1.9%

VECTOR Institute 10742 1.8%

Nokia 9829 1.7%

MediaTek 9582 1.6%

Texas Instruments 8843 1.5%

Collabora Multimedia 8621 1.5%

(Consultant) 8312 1.4%

As usual, there are few surprises here, with the possible exception of the 3.3% of the changesets contributed in this cycle by current and aspiring Outreachy interns.

The Signed-off-by tags in a patch provide a picture of who handled the patch on its way into the appropriate subsystem maintainer's tree. In particular, if one looks at tags applied by developers other than the author of each patch, the result gives a view of who the gatekeepers are. For the 4.1 development cycle, the numbers look like this:

Most non-author signoffs in 4.1

By Developer

Greg Kroah-Hartman 1544 13.8%

David S. Miller 1067 9.6%

Ingo Molnar 407 3.6%

Mark Brown 405 3.6%

Andrew Morton 404 3.6%

Daniel Vetter 360 3.2%

Mauro Carvalho Chehab 342 3.1%

Ralf Baechle 263 2.4%

Arnaldo Carvalho de Melo 242 2.2%

Kalle Valo 210 1.9%

By employer

Red Hat 2249 20.2%

Linux Foundation 1568 14.1%

Intel 1327 11.9%

Linaro 981 8.8%

Google 621 5.6%

Samsung 521 4.7%

SUSE 375 3.4%

(None) 316 2.8%

(Unknown) 314 2.8%

IBM 286 2.6%

While this development cycle is the result of the work of nearly 1,500 developers and over 200 companies, at the subsystem maintainer level things are, as always, far more concentrated: over 60% of the changes going into this kernel passed through the hands of developers working for just five companies. This concentration reflects a simple fact: while many companies are willing to support developers working on specific tasks, the number of companies supporting subsystem maintainers is far smaller. Subsystem maintainership is also, increasingly, not a job for volunteer developers.
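The "over 60%" figure can be checked directly against the employer table above. The following is a small sketch of that arithmetic; the dictionary simply copies the percentages from the table, and the helper name top_share() is made up for the illustration.

```python
# Non-author signoff shares by employer, copied from the table above (percent).
signoff_share = {
    "Red Hat": 20.2,
    "Linux Foundation": 14.1,
    "Intel": 11.9,
    "Linaro": 8.8,
    "Google": 5.6,
    "Samsung": 4.7,
    "SUSE": 3.4,
    "(None)": 2.8,
    "(Unknown)": 2.8,
    "IBM": 2.6,
}


def top_share(shares, n):
    """Sum the n largest percentage shares, rounded to one decimal place."""
    largest = sorted(shares.values(), reverse=True)[:n]
    return round(sum(largest), 1)


top_five = top_share(signoff_share, 5)
print(f"top five employers: {top_five}% of non-author signoffs")
```

The five largest entries are Red Hat, the Linux Foundation, Intel, Linaro, and Google.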

And that is the story for the 4.1 development cycle. The kernel-development machine continues to hum along, integrating the work of thousands of developers and producing the kernels that run Linux systems worldwide. There are not many surprises to be seen here, but, for such an important piece of software, "not many surprises" is generally deemed to be a good thing.

Reinventing the timer wheel

By Jonathan Corbet
June 3, 2015

The kernel's "timer wheel" data structure has served it well for some time; it has changed little since it was described in this article in 2005. There are, however, some shortcomings in its original design that have become more costly over time, and the timer wheel has not adapted well to other changes in the scheduler code. So, after many years, this venerable data structure may soon be replaced with a variant that runs more efficiently, but loses some accuracy in timekeeping in the process.

The kernel maintains two types of timers with two distinct use cases. The high-resolution timer ("hrtimer") mechanism provides accurate timers for work that needs to be done in the near future; hrtimer use is relatively rare, but, when hrtimers are used, they almost always run to completion. "Timeouts," instead, are normally used to alert the kernel to an expected event that has failed to arrive — a missing network packet or I/O completion interrupt, for example. The accuracy requirements for these timers are less stringent (it doesn't matter if an I/O timeout comes a few milliseconds late), and, importantly, these timers are usually canceled before they expire. The timer wheel is used for the latter variety of timers.

Here is the 2005 diagram showing the design of the timer wheel:

This data structure is indexed by the kernel's low-resolution "jiffies" clock; one jiffy corresponds to something between 1ms and 10ms, depending on how the kernel is configured. Once every jiffy, the kernel processes any expired timers. That is done by taking the lowest eight bits of the jiffies variable and using them to index into the rightmost array in the above diagram; the result will be a linked list of timer events that expire at the current time.

Every 256 jiffies (in most configurations) the kernel will hit the end of that array; at that point it is necessary to perform a "cascade" operation. Each entry in the next higher array contains 256 jiffies worth of events; the timer code will select the correct entry (by using the next six bits of the jiffies value as an index), collect all of the timer entries found there, and distribute them across the 256 entries of the first array according to their expiration times. When the second level is exhausted, it is refilled by cascading down entries from the third level, and so on.
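The indexing described in the last two paragraphs amounts to a little bit manipulation. What follows is a rough sketch of it, assuming the common configuration with a 256-entry first array and 64-entry higher arrays; the constant and function names are illustrative labels rather than anything taken from the kernel source.

```python
# Bit widths for the common configuration described in the text.
FIRST_LEVEL_BITS = 8    # 256 entries in the first array
UPPER_LEVEL_BITS = 6    # 64 entries in each higher array


def first_level_index(jiffies):
    """Slot in the first array: the lowest eight bits of jiffies."""
    return jiffies & ((1 << FIRST_LEVEL_BITS) - 1)


def second_level_index(jiffies):
    """Slot in the second array: the next six bits of jiffies."""
    return (jiffies >> FIRST_LEVEL_BITS) & ((1 << UPPER_LEVEL_BITS) - 1)


def cascade_needed(jiffies):
    """A cascade is due each time the first-level index wraps to zero."""
    return first_level_index(jiffies) == 0


for now in (255, 256, 300, 512):
    print(now, first_level_index(now), second_level_index(now),
          cascade_needed(now))
```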

There are a number of advantages to this data structure, including the ability to immediately locate expired entries and quick addition and removal of events. But it has some downsides as well. The cascade operation can be expensive, and the time required is, to a first approximation, unpredictable; that can lead to unwanted latencies elsewhere in the system. The cascade operation is not particularly cache-friendly. There is also no way to quickly determine when the next timer expiration will happen; that requires searching through the wheel to actually find that event. The presence of deferrable timers, which do not have to expire in any sort of timely manner, makes the identification of the next event that actually does have to expire at the requested time harder yet. For these reasons and more, developers have talked about replacing the timer wheel for years.

The new timer wheel

Thomas Gleixner has now posted a first draft of a reinvented timer wheel. It does away with the costly cascade operations (almost all the time) and handles deferrable timers in a much more straightforward manner. These gains come from the realization that not all timers have to be handled with the same level of accuracy.

At a superficial level, the new data structure is quite similar to the old. There is still a hierarchy of arrays containing lists of timer events. In this case, though, the arrays are all the same size (32 entries), and there are eight levels of them. The lowest array contains events with single-jiffy resolution as before, so any new timeout expiring less than 32 jiffies in the future will be placed in this array.

The next array is a little different, though; each entry represents eight jiffies worth of future timer events. Since there are 32 entries in this level as well, it can represent events up to 256 jiffies into the future. Entries in the third level each hold 64 jiffies worth of events; in the fourth level, they hold 512 jiffies worth, and so on. So each level covers a time period eight times longer than the level below it. The numbers are different from the old implementation, but the concept is the same, so far.

The old timer wheel would, each jiffy, run any expiring timers found in the appropriate entry in the highest-resolution array. It would then cascade entries downward if need be, spreading out the timer events among the higher-resolution entries of each lower-level array. The new code also runs the timers from the highest-resolution array in the same way, but the handling of the remaining arrays is different. Every eight jiffies, the appropriate entry in the second-level array will be checked, and any events found there will be expired. The same thing happens in the third level every 64 jiffies, in the fourth level every 512 jiffies, and so on.
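One way to picture the schedule on which the new wheel's levels are visited is the sketch below. It assumes eight levels with each level eight times coarser than the one before, as described above; the function names are invented for this example and the patch itself may well organize the check differently.

```python
LEVELS = 8            # number of arrays in the proposed wheel
SHIFT_PER_LEVEL = 3   # each level is 2**3 = 8 times coarser than the last


def granularity(level):
    """Jiffies covered by one entry at the given level (0 is the finest)."""
    return 1 << (SHIFT_PER_LEVEL * level)


def levels_to_check(jiffies):
    """Levels whose current entry falls due at this jiffy."""
    return [level for level in range(LEVELS)
            if jiffies % granularity(level) == 0]


for now in (7, 8, 64, 512):
    print(now, levels_to_check(now))
```

Level 0 is examined every jiffy, level 1 every eight jiffies, level 2 every 64, and so on.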

The end result is that the timer events in the higher-level arrays will be expired in place, rather than being cascaded downward. This approach, obviously, saves all the effort of performing the cascade. But it also means that any timeout that is more than 31 jiffies in the future will be run with lower accuracy. For example, a timeout that is 36 jiffies in the future will be put in the next higher eight-jiffy slot — 40 jiffies in the future. So that event will expire four jiffies later than requested. As timeouts are placed further into the future, the accuracy of their expiration will decline accordingly. The seventh level in this scheme will hold timeouts that are at least 1,048,576 jiffies in the future with 262,144-jiffy resolution. On a 1000HZ system, that corresponds to timeouts at least 17 minutes in the future; they will expire with a resolution of four minutes.
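The numbers in that example can be reproduced with a short calculation. This sketch assumes 32 entries per level, eight levels, and a current time that sits on a slot boundary (so that rounding the relative timeout is the same as rounding the absolute expiration time); the helper names are made up for the illustration.

```python
SLOTS_PER_LEVEL = 32
LEVELS = 8


def granularity(level):
    """Jiffies per entry at a level; level 0 has single-jiffy resolution."""
    return 8 ** level


def level_for(delta):
    """Level whose range covers a timeout `delta` jiffies in the future."""
    level = 0
    while (level < LEVELS - 1
           and delta >= SLOTS_PER_LEVEL * granularity(level)):
        level += 1
    return level


def effective_delta(delta):
    """Timeout after rounding up to the granularity of its level."""
    gran = granularity(level_for(delta))
    return -(-delta // gran) * gran   # ceiling division, then scale back


print(level_for(36), effective_delta(36))          # second level, 40 jiffies
print(level_for(1_048_576), granularity(6))        # seventh level
print(1_048_576 / 1000 / 60, 262_144 / 1000 / 60)  # minutes at 1000HZ
```

Levels are counted from zero here, so the "seventh level" of the text is level 6.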

The old implementation was not subject to this loss of accuracy; even timeouts days in the future would expire at "exactly" the right time, for a one-jiffy value of "exactly." So one could argue that the replacement timer wheel does not work as well. But one should remember that (1) almost all timeouts are set for the near future, (2) almost all timeouts are canceled before expiration, and (3) timeouts indicate that something went wrong and do not need to be delivered with a high degree of accuracy. So sacrificing some of that accuracy for higher timer-wheel performance would appear to be a good tradeoff.

Thomas's patch also dispenses with the timer slack mechanism. Timer slack allows the expiration of timeouts to be deferred; it is intended to cause timeouts to be executed together and reduce the number of times the system wakes up. The new timer wheel batches things naturally for anything but the shortest of timeouts, so there is arguably no longer a need for a separate "slack" mechanism.

Deferrable timers are a bit different though; they can be deferred indefinitely if need be. They usually correspond to some sort of cleanup work that must be done eventually, but with no particular urgency. If the CPU is running in the tickless mode, those timeouts should be deferred for as long as it takes to avoid interrupting the running application. In Thomas's patch, deferrable timers are stored in a separate, parallel timer wheel; this gets them out of the way and eases the task of figuring out when the next timer interrupt should be scheduled.

The new timer wheel code maintains a bitmap with a bit corresponding to each entry in the timer arrays; if there are timeouts stored in that entry, that bit is set. Finding the first array entry with an outstanding timeout is thus a simple matter of finding the first set bit in the bitmap — a fast operation. Then, since the expiration time of each array entry is known, the time of the next expiring timeout can be calculated without actually needing to look at the timeout entry. Placing deferrable timeouts in their own array makes it easy to simply avoid looking at them when checking for this next expiring timeout, speeding the operation further.
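The bitmap search can be sketched for a single 32-entry level as follows. This is a simplification made for the sake of illustration: it treats one level in isolation, takes the current slot as an argument, and approximates the delay as a whole number of slots. None of the names come from the patch.

```python
SLOTS = 32


def slots_until_next(bitmap, current_slot):
    """Slots from current_slot to the first occupied one, or None if empty."""
    mask = (1 << SLOTS) - 1
    bitmap &= mask
    if bitmap == 0:
        return None
    # Rotate so that current_slot becomes bit 0, wrapping around the array.
    rotated = ((bitmap >> current_slot)
               | (bitmap << (SLOTS - current_slot))) & mask
    lowest_set = rotated & -rotated    # isolate the lowest set bit
    return lowest_set.bit_length() - 1


def jiffies_until_next(bitmap, current_slot, granularity):
    """Approximate delay to the next occupied slot, without touching timers."""
    offset = slots_until_next(bitmap, current_slot)
    return None if offset is None else offset * granularity


pending = (1 << 5) | (1 << 20)         # timeouts queued in slots 5 and 20
print(slots_until_next(pending, 3), jiffies_until_next(pending, 3, 8))
```

Nothing in the search looks at the queued timeouts themselves; only the bitmap and the known granularity of the level are consulted.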

This code is all new and untried; Thomas warns that "it might eat your disk, kill your cat and make your kids miss the bus". That would suggest that it is certainly not considered to be 4.2 material. But, with some time and testing, it could likely be ready for a development cycle shortly after that. Then the kernel will, at last, have a shiny, new, faster timer wheel.

A tour of /sys/devices

June 3, 2015

This article was contributed by Neil Brown

Modern Linux systems have a directory tree at /sys/devices that contains information about all of the "devices" represented in the device model. Having clarified exactly what each of "devices", "buses", and "classes" are (or maybe what they aren't), I am now in a position to address a shortcoming I had to admit in a recent article where I stated that /sys/devices "has a structure that, in all honesty, is rather hard to describe".

/sys/devices is part of the sysfs virtual filesystem; it presents devices as directories arranged in a hierarchy. It includes various files to allow details of each device to be examined and sometimes to be changed. Seeing how they all plug together can give insights into the device model that are not evident by just examining individual devices. To describe this directory structure in detail there are three particular concepts we need to examine beyond the device, bus, and class that we met last time. We need to explore parentage, attributes, and namespace management.

Parenthood

Each device (or "thing") in the Linux device model can have a "parent" device. In the examples from last week, the "tca6507" device was the parent of some "leds" devices, and itself had a parent that was an i2c adapter device, such as "i2c-1". The parent of the device representing a partition on a hard drive is the device representing the hard drive as a whole. "Workqueue" devices do not have parents. This variety leads to the question of what, exactly, is a "parent". There are several different, though related, concepts that are tied up in this idea of "parenthood".

Parent as connection point or service provider

Firstly there is the idea of "addressability" and the related idea of "connectedness". When a device represents a piece of hardware, the parent often represents a connection to that device. This parent knows how to interpret the "address" of the device, and how to send instructions to that address, or how to transfer data to or from that address. The parent doesn't know much about what those instructions or data might do; that knowledge is in the device itself (and its driver).

Closely related to connection is the idea of service provision. When a device represents a piece of hardware, the primary service it needs from a parent device is communication with the CPU. When a device represents something else, the primary service requirement might be different.

Many devices that are implemented by classes do not exactly represent hardware but instead represent an abstraction. Devices in the "disk" class represent anything that can provide read or write access to addressable blocks of storage. Any device that can facilitate that access might reasonably find itself as the parent of a "disk" device. Sometimes the thing that provides service is not a device; it might be a file or a network connection or a set of devices. In those cases the disk device won't have a parent.

I like to picture these two complementary ideas (connection and service provision) by imagining the path a request takes from user space all the way to ultimate physical reality. A request such as "turn the LED on" or "read that block of data" enters the device hierarchy at one of the leaf devices, possibly by a write() or ioctl() on a device node in /dev or by a write to a sysfs attribute file. The request then propagates up the tree, being translated at each stage as is appropriate to each level. Some levels might add extra detail to the address, some levels might wrap "protocol headers" around the message, or divide the message up into components. Another level might add a retry mechanism or might impose a media-access protocol.

Once the message gets to the top of the hierarchy it is transmitted on a core system bus (probably via writing to one or more memory addresses), at which point the request travels through the hardware from controller to controller in a rough analog of traveling back down the tree of devices and out to the physical leaves that are the hardware that will actually pass current through an LED, or read the data off some sort of media.

To make this more concrete, consider an leds device with the "default-on" trigger selected, and then consider writing "100" to the "brightness" file, which implies a request to set the brightness to 100/255 of the maximum brightness (assuming that max_brightness is 255). The full path to that file might be:

    /sys/devices/platform/omap_i2c.2/i2c-2/2-0045/leds/gta04:green:power/brightness

The trigger driver leaves that request for the leds core to handle; the core determines whether one of two possible interfaces (brightness_set() or brightness_set_sync()) is available and calls it if so; otherwise it returns an error. When an leds device is registered by a call to led_classdev_register(), a struct led_classdev must be passed; it contains various function pointers and other settings that provide access to the underlying mechanism. So the leds driver turns the "set brightness" request into one of two possible function calls into the parent device.
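The dispatch step can be modeled loosely as follows. This is only a sketch: the LedClassDev class stands in for the function pointers carried by struct led_classdev, the order in which the two callbacks are tried is an assumption, and the error value returned when neither is present is chosen purely for illustration.

```python
import errno


class LedClassDev:
    """Loose stand-in for the callbacks carried by struct led_classdev."""

    def __init__(self, brightness_set=None, brightness_set_sync=None):
        self.brightness_set = brightness_set
        self.brightness_set_sync = brightness_set_sync


def set_brightness(led, value):
    """Call whichever interface the parent's driver supplied."""
    if led.brightness_set is not None:
        led.brightness_set(value)
        return 0
    if led.brightness_set_sync is not None:
        return led.brightness_set_sync(value)
    return -errno.EINVAL    # illustrative error code, not the kernel's choice


log = []
async_led = LedClassDev(brightness_set=lambda v: log.append(("set", v)))
sync_led = LedClassDev(
    brightness_set_sync=lambda v: log.append(("sync", v)) or 0)
bare_led = LedClassDev()
results = [set_brightness(led, 100)
           for led in (async_led, sync_led, bare_led)]
print(results, log)
```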

If that parent device is the TCA6507 we discussed previously (named "2-0045" above) then tca6507_brightness_set() is called. The TCA6507 does not provide completely independent control of the seven output pins. Apart from fully on and fully off, there are at most three distinct brightness levels from a set of 16 possible levels. If the requested level matches a level already configured, that can be assigned to the appropriate pin. If not, and there are fewer than three in use, then the new level can be programmed to a free slot. If all brightness slots are in use, the closest to the requested value is used.
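That selection policy is easy to express in a few lines. The sketch below works on abstract level numbers (0 to 15) and a plain list of the levels currently programmed; it is a simplified model of the decision described above, not the driver's actual code, and assign_level() is a name invented for the example.

```python
MAX_SLOTS = 3    # distinct intermediate brightness levels the chip can hold


def assign_level(requested, in_use):
    """Return (level actually used, updated list of programmed levels)."""
    if requested in in_use:
        # An existing slot already holds this level; share it.
        return requested, in_use
    if len(in_use) < MAX_SLOTS:
        # A slot is free; program the new level into it.
        return requested, in_use + [requested]
    # Every slot is taken; settle for the nearest programmed level.
    closest = min(in_use, key=lambda level: abs(level - requested))
    return closest, in_use


print(assign_level(9, [2, 9]))        # matches an existing slot
print(assign_level(5, [2, 9]))        # takes the free slot
print(assign_level(6, [2, 9, 14]))    # falls back to the closest level
```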

In any case, the decisions need to be sent to the TCA6507 which, like most simple I2C devices, appears to contain a set of registers. So the leds-tca6507 driver converts the brightness_set() request into one or more "write to register" requests. Those requests must, in turn, be passed to the device via an I2C bus.

When an I2C driver is attached to a device, its probe() function is called with a struct i2c_client parameter. When the driver wants to write to a register it calls i2c_smbus_write_byte_data() or a similar function, giving it the same i2c_client handle, the register address, and the data. "smbus" refers to "System Management Bus" which is a subset of the I2C protocol — drivers that only need that subset are encouraged to use the smbus interfaces.

The i2c/smbus code (implementing device "i2c-2") takes the "register write" request and converts that into a message to be transmitted on the i2c bus; it contains a command, an address, and data. The underlying I2C adapter driver can provide a couple of different functions, such as smbus_xfer() or master_xfer(), the latter being a lower-level function that needs more support in the I2C bus code. These functions may be called repeatedly to effect retry on certain failure modes, and are only called after i2c_lock_adapter() has successfully gained exclusive access to the bus.

So now those "write to register" requests have become "message transfer" calls with locking and retry.
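The shape of that "locking and retry" wrapper might look something like the sketch below. Everything here is a stand-in: the Adapter class, the TransientBusError exception, the retry count, and the message layout are assumptions made for the example, with a threading.Lock playing the role of the adapter lock.

```python
import threading


class TransientBusError(Exception):
    """Hypothetical stand-in for a failure mode that is worth retrying."""


class Adapter:
    def __init__(self, xfer, retries=3):
        self.lock = threading.Lock()   # plays the role of the adapter lock
        self.xfer = xfer               # the adapter driver's transfer function
        self.retries = retries


def transfer(adapter, message):
    """Send one message with exclusive bus access, retrying on failure."""
    with adapter.lock:
        for attempt in range(adapter.retries + 1):
            try:
                return adapter.xfer(message)
            except TransientBusError:
                if attempt == adapter.retries:
                    raise


calls = []


def flaky_xfer(message):
    """Pretend adapter that fails twice before succeeding."""
    calls.append(message)
    if len(calls) < 3:
        raise TransientBusError("bus busy")
    return len(message["data"])


# A register write: device address, command (register), and data.
write = {"addr": 0x45, "command": 0x03, "data": bytes([100])}
result = transfer(Adapter(flaky_xfer), write)
print(result, len(calls))
```

The address 0x45 matches the "2-0045" device name used earlier; the register number and data byte are arbitrary.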

If that underlying I2C adapter is the OMAP I2C adapter ("omap_i2c.2"), it will ensure the clocks for the adapter are enabled, make sure it isn't being held in reset, and wait to make sure the bus isn't being used by some other bus master. The driver will then load the size and address of each message into the DMA engine of the adapter so that it can copy the messages out onto the bus. This is all done by writing directly to registers in the I2C adapter using the system memory bus.

At this point the message hasn't quite reached the top of the /sys/devices hierarchy; that step brings in the "platform" device, which represents the whole platform and the main system memory bus, used by the I2C driver to talk to the I2C adapter. The memory bus driver doesn't get involved in copying bytes out, but it was involved when the devices were registered in that it effectively told the I2C driver what address to use to talk to its hardware.

The memory bus carries these writes to the I2C adapter, which fetches the messages from memory and serializes them out onto the I2C bus. The TCA6507 collects the bits off the bus, places them in the addressed register, and then loads internal counters as appropriate so that it can turn the particular pin on and off with the right duty cycle to achieve the requested brightness (or something close to it). This pulls current through the LED so that it generates light some of the time, and doesn't for the rest of the time. Our eyes, which naturally integrate those high frequency changes, see something close to the requested brightness level. It seems like a long journey, but it gets from:

    echo 100 > brightness

all the way to my eyes faster than I can blink.

While this is a helpful picture, it is a somewhat idealized picture. It is important to be aware that the "primary" service provider is not necessarily the only service provider. A specific LED can be configured to blink whenever some specific battery other than the primary is charging, for example. So both the LED and the charge monitoring hardware provide services to the "power supply" device.

In most cases where there are multiple services being provided, there will be one that is fairly obviously primary. Other times there is not. A RAID array receives service of equal value from a number of different block devices. And it could stop receiving that service at any moment if one fails. So there is no clear primary service provider for a RAID array, and indeed md/raid devices do not have a parent.

Parent as discoverer

The third idea is again related to the others but provides a helpful alternate perspective. It is "discovery". Every device must be deliberately added to the device model; this process is referred to as "device discovery". When driver code, acting on behalf of a particular device, discovers another device, it needs to assign a parent to the new device; it will typically use the current device as that parent.

Sometimes this discovery is performed by probing a (physical) bus to see what devices are connected to it. A similar situation happens when a bus supports hotplug and the bus controller reports that a new device has been attached. In these cases, the device that detects the new device is almost certainly the device that can address and send commands to the new device, so it is logical for it to be the parent.

Other times, this discovery happens by examining a configured description of available devices. This might be based on information extracted via a BIOS service such as ACPI, it might be encoded using a device tree, or it could be encoded directly in the kernel using arrays of "struct platform_device" and calls to platform_add_devices(), though that approach is deprecated. When ACPI is used, it will provide a top-level device similar to "platform", which is parent to the devices that are discovered and managed using ACPI, even though ACPI is not exactly a physical connection.

Parent as power source

Finally there is the idea of power management. As power management was apparently one of the driving forces that led to the creation of the device model and /sys/devices, you might expect that power management would be fairly central. The reality is ... more complex.

There are two sides to power management. One is system suspend or system hibernate, where the focus is putting as many devices as possible into the lowest power state. The other is runtime power management, where the focus is placing individual devices into a low power state whenever they aren't actively in use. It might seem reasonable to follow parent-child links in a depth-first order, turning off children and then their parents. But that isn't quite what is done.

In the case of system suspend, there is actually a completely separate structure that is used to determine the order for shutting down devices. As I described in 2012, the power-management code maintains a linear list of devices called "dpm_list" that is used to sequence suspend and resume. This list is created roughly in order of device discovery, which is roughly how the tree of devices grows out to the leaves. So processing dpm_list in order will normally visit parents before children. However power often isn't managed in exactly the same way as device addressing, so the power management system needs a little more control, which it gets by having its own list.

Runtime power management does make some use of parent/child relationships, but not always. Possibly not even often.

For connections that are designed to carry both power and control, such as USB, it makes lots of sense for the bus controller to remain powered on (and providing power) whenever an attached device is powered on. However it may well make sense for the bus controller to enter a low-power state while the device continues to do useful work, particularly if the device can trigger some sort of wakeup event.

For buses that do not carry power, such as i2c, it makes no sense to directly link the power state of the bus controller with the power state of the attached device. Typically an i2c controller will power down (stopping its clock) whenever it doesn't have any commands to send. When a device driver needs to talk to its device it will wake up the controller (its parent), send the message, and let the controller go back to sleep, even while the device itself is fully active (blinking those LEDs).

In the runtime power management system the default behavior is that, when a child device is powered on, its parent must be powered on as well. This default can be overridden by a call to pm_suspend_ignore_children(). Given that many buses do not carry power, I was at first surprised at how few calls there are to this function: in 4.1-rc1 there are only 14 calls that enable the ignoring of children.

Part of the reason for this small number is that the default as stated misses an important detail. It only applies to devices that actually support different runtime power-management states. If both a parent and child support runtime power management, then the default will keep the parent powered when the child is active. If either does not support runtime power management, then no connection is implied. This means that when a device wakes up, that wakeup will only propagate up the tree until it reaches a device that sets "ignore_children" or a device that doesn't support runtime power management at all. Then the propagation will stop.
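The propagation rule can be modeled with a simple walk up the parent links. This is a toy model of the behavior described above, not the runtime power-management code; the Device class and its flags are invented for the example, and the flag values given to the sample devices are hypothetical.

```python
class Device:
    def __init__(self, name, parent=None, runtime_pm=True,
                 ignore_children=False):
        self.name = name
        self.parent = parent
        self.runtime_pm = runtime_pm            # supports runtime PM at all?
        self.ignore_children = ignore_children  # parent ignores child state


def ancestors_kept_awake(device):
    """Names of the ancestors that must stay powered while device is active."""
    awake = []
    if not device.runtime_pm:
        return awake
    parent = device.parent
    while (parent is not None and parent.runtime_pm
           and not parent.ignore_children):
        awake.append(parent.name)
        parent = parent.parent
    return awake


# Device names are borrowed from the LED example; the flags are made up.
platform = Device("platform", runtime_pm=False)
adapter = Device("omap_i2c.2", parent=platform)
bus = Device("i2c-2", parent=adapter)
chip = Device("2-0045", parent=bus)

print(ancestors_kept_awake(chip))     # stops below "platform"
bus.ignore_children = True
print(ancestors_kept_awake(chip))     # stops immediately
```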

There are actually quite a lot of devices that do not support runtime power management. Most devices implemented by classes do not.

    $ cat /sys/class/*/*/power/runtime_status | sort | uniq -c
    318 unsupported

Many devices implemented by buses do not either:

    $ cat /sys/bus/*/devices/*/power/runtime_status | sort | uniq -c
     57 active
     11 suspended
    274 unsupported

So while runtime power management can make use of the parent link, it doesn't to a great extent. The parent very often is not the source of power, only of addressing and control.
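For anybody wanting to repeat that survey from a script, the two shell pipelines above translate fairly directly. This is a minimal sketch; the function name runtime_status_counts() is made up, and the results will naturally differ from machine to machine (and will be empty on a system without sysfs).

```python
import glob
from collections import Counter


def runtime_status_counts(pattern):
    """Tally the contents of every runtime_status file matching pattern."""
    counts = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path) as status_file:
                counts[status_file.read().strip()] += 1
        except OSError:
            # Attribute files can vanish or refuse reads; skip those.
            continue
    return counts


for pattern in ("/sys/class/*/*/power/runtime_status",
                "/sys/bus/*/devices/*/power/runtime_status"):
    print(pattern, dict(runtime_status_counts(pattern)))
```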

Despite the apparent richness of meaning we find in parenthood, the device model doesn't really use it much. As a message passes "up" the hierarchy, it doesn't follow the "->parent" links of the device model, but uses references that were provided to each device when it was initialized. The runtime power management code does make direct use of these ->parent links, but not very much. It could just as easily make use of explicit dependencies, just as it does when there are dependencies that are not reflected in the hierarchy.

Possibly the biggest user of these parent links is the "udev" utility, which responds to device addition and removal events in a highly configurable way. This configuration can depend on information about arbitrarily distant ancestors. On the other hand, it cannot depend at all on related devices that are not direct ancestors.

What role for attributes?

Attributes are arbitrary details about a given device; they are presented via small files in /sys/devices known as "attribute files". The particular content and use of these files was discussed here some years ago and has not significantly changed since, so it will not be addressed further.

Each device is presented by a directory in /sys/devices; most attribute files appear as files directly in those directories. There are exceptions, though, and while they are not particularly interesting, they need to be understood. These exceptions collect a group of attributes together into a subdirectory. The most prevalent such subdirectory is power, which contains attributes relating to the power status of the device. Every device has this subdirectory.

The second most prevalent on the machine on which I am typing are the capabilities and id subdirectories that various devices in the input class contain. Additionally, cpu devices can have a two-level attribute tree with state directories in a cpuidle directory. The important point here is that, when you find a directory in the devices tree, it might just be a collection of attributes. The reason why they have been collected will vary from case to case.

Note that "classes" and "buses" can also have attributes. These affect the whole subsystem and so are somewhat similar to module parameters. They do not appear in "/sys/devices", only in "/sys/bus" or "/sys/class", so they are out of scope for this article.

Managing namespaces

While Linux does have the Linux Assigned Names And Numbers Authority, that authority does not oversee the assignment of names to classes, buses, devices, or attributes. In general, we just have to "play nice", try to catch inappropriate duplication during review, and fall back on the kernel complaining if a name gets used twice in one namespace.

Within /sys/devices there can often be names from distinct namespaces appearing in the one directory, so it is worth knowing what namespaces there are and what support there is for conflict avoidance. Firstly, there is a namespace for buses (/sys/bus) and another for classes (/sys/class). These are currently kept separate and are largely disjoint anyway. On my laptop I find:

    $ comm -12 <(ls /sys/bus) <(ls /sys/class)
    mei

so only one name in common — which is more than I expected.

Each bus and each class defines a namespace for devices that are implemented by that bus or class. Typically a simple syntactic rule will distinguish between different types of devices in the bus. So in the i2c bus, devices that represent an adapter are "i2c-%d" while devices that represent attached hardware are "%d-%04x".
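As an illustration of how such a syntactic rule can be applied, here is a small sketch that sorts names from the i2c namespace into the two forms just quoted. The helper name i2c_kind is invented for this example, and the patterns are simply transcriptions of the two format strings.

```shell
# Classify a device name from the i2c bus namespace using the two
# patterns quoted above: "i2c-%d" for adapters and "%d-%04x" for
# attached hardware. i2c_kind is a hypothetical helper name.
i2c_kind() {
    if printf '%s\n' "$1" | grep -Eq '^i2c-[0-9]+$'; then
        echo adapter
    elif printf '%s\n' "$1" | grep -Eq '^[0-9]+-[0-9a-f]{4,}$'; then
        echo client
    else
        echo unknown
    fi
}

# Possible use on a machine that has an i2c bus:
#   for d in /sys/bus/i2c/devices/*; do
#       echo "$(i2c_kind "$(basename "$d")") $d"
#   done
```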

Each bus or class, and to some extent each driver, also defines a namespace of attributes. A driver must clearly not define attributes that conflict with names used by its bus, but different drivers on the same bus could safely use the same name (possibly for different purposes, though that wouldn't be wise). There is also a namespace of system-imposed attributes, of which we have already seen one example: the "power" set of attributes that is imposed on every device. Clearly subsystems and drivers must avoid these.

Reserving the single name "power" for all power management attributes makes it practical to add new power management attributes, such as pm_qos_latency_tolerance_us which was added a little over a year ago. This approach of reserving just one name as a directory to minimize namespace pollution is also used to some extent to reduce the risk of a device name in one subsystem conflicting with an identically named device in another subsystem. When this is done, a directory named for the subsystem is used to contain all relevant devices of that subsystem. But this is only a sometimes thing, as we will shortly see.

Bringing it all together

It seems easiest to describe the whole tree by starting in the middle. Every directory under /sys/devices that contains a file called "uevent" represents a device. This file can be written to in order to synthesize "ADD" events, "REMOVE" events, or other events that can be processed by udev. It can be read to show context information that accompanies those events.
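That rule makes it straightforward to enumerate devices. A minimal sketch, with list_devices being a name chosen for this example only:

```shell
# Print every device directory beneath a root: a directory counts as
# a device if it contains a file named "uevent".
# list_devices is a hypothetical helper; the root defaults to
# /sys/devices.
list_devices() {
    find "${1:-/sys/devices}" -type f -name uevent | sed 's|/uevent$||' | sort
}

# Possible use: count the devices, then show the context information
# that would accompany events for the first one.
#   list_devices | wc -l
#   cat "$(list_devices | head -n 1)/uevent"
```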

In each device directory there is also a "power" directory of attributes, as we have already seen, and often a "subsystem" symlink that points to the owning subsystem in either /sys/bus or /sys/class.

While any device must be implemented by some subsystem, it may not actually be linked into the list of devices for that subsystem. For example, the "usb" subsystem creates "endpoint" and "port" devices but does not identify them as belonging to the "usb" subsystem; as a result they don't get the "subsystem" symbolic link and do not appear in "/sys/bus/usb/devices".
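Devices of this kind can be found by looking for device directories that lack the link. The sketch below assumes only what has been described so far (a "uevent" file marks a device, and "subsystem" is a symbolic link); unlinked_devices is an invented name.

```shell
# Print device directories (those containing "uevent") that have no
# "subsystem" symbolic link.
# unlinked_devices is a hypothetical helper; the root defaults to
# /sys/devices.
unlinked_devices() {
    find "${1:-/sys/devices}" -type f -name uevent | while IFS= read -r f; do
        d=${f%/uevent}
        if [ ! -L "$d/subsystem" ] && [ ! -e "$d/subsystem" ]; then
            echo "$d"
        fi
    done
}
```

On a system matching the description above, the output would be expected to include the USB "endpoint" and "port" devices, though what is actually listed depends on the kernel in use.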

A device may also have a "device" link which is of legacy interest only and usually points to a recent parent, and (in Linux 4.1) an "of_node" link that, if the device was described by a device tree, will be a link to the relevant location in /sys/firmware/devicetree. Other files and symbolic links will be device-specific or subsystem-specific attributes. Other directories can be one of three sorts of things.

Subdirectories of a device directory may also be device directories if the parent is a "class" or "bus" device. This case is easily recognized by looking for "uevent" in the child directory.

Subdirectories of a device directory may be grouping directories for all children of a particular "class", but only when the parent device is a "bus" device. So the bus device for the TCA6507 device has a directory called .../i2c-2/2-0045, and it contains a grouping directory leds that contains a further subdirectory for each leds device. This case is also easily recognized as the child has the same name as a class, and its children contain uevent files.

Finally, subdirectories of a device directory can be grouping directories for attributes and can contain arbitrary attribute files, symbolic links, and further sub-directories, but no device directories.
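These three cases can be told apart mechanically using the tests just described. The following is a sketch rather than a definitive tool; classify_subdirs and the three labels it prints are names made up for this example.

```shell
# For each real subdirectory of a device directory, report which of
# the three sorts it is:
#   device           the subdirectory contains "uevent"
#   class-grouping   its children contain "uevent"
#   attribute-group  anything else
# classify_subdirs is a hypothetical helper name.
classify_subdirs() {
    for sub in "$1"/*; do
        # Symbolic links such as "subsystem" are not subdirectories.
        [ -L "$sub" ] && continue
        [ -d "$sub" ] || continue
        if [ -f "$sub/uevent" ]; then
            kind=device
        elif ls "$sub"/*/uevent >/dev/null 2>&1; then
            kind=class-grouping
        else
            kind=attribute-group
        fi
        printf '%s %s\n' "$kind" "$(basename "$sub")"
    done
}
```

Run against the 2-0045 directory mentioned above, this would be expected to report leds as a class grouping and power as an attribute group.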

When a device has a parent in the device model, the parent of the device directory will either be the device directory of the parent, or will be an intermediate grouping directory named after the class of this device. When a device does not have a parent, there are a few different possibilities.

When a bus is created, it can indicate whether parentless devices should be placed in /sys/devices/virtual, or /sys/devices/system, or directly in /sys/devices. Classes don't have that flexibility: all parentless class devices are placed in /sys/devices/virtual. Being class devices, they have an intermediate directory named after the class, and the device has its device directory attached within that intermediate directory. So md RAID block devices appear as /sys/devices/virtual/block/md0 for example.
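The two-level layout under /sys/devices/virtual can be listed with a short loop. This is a sketch based on the layout described above; virtual_devices is a name invented for the example.

```shell
# List parentless devices under /sys/devices/virtual as
# "<intermediate directory> <device>" pairs, relying on the layout
# /sys/devices/virtual/<class>/<device>/uevent described above.
# virtual_devices is a hypothetical helper; the root is overridable.
virtual_devices() {
    root=${1:-/sys/devices/virtual}
    for dev in "$root"/*/*; do
        [ -f "$dev/uevent" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$dev")")" "$(basename "$dev")"
    done
}
```

On a machine with an md array, the output would be expected to contain a line reading "block md0".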

Summing up

"Entities must not be multiplied beyond necessity" is a popular formulation of Occam's Razor, but could equally be used as advice for software development. Unfortunately the current state of /sys/devices doesn't seem to meet that goal very well as it seems filled with multiplicity, though that is doubtless due to baggage that has been carried along as the design evolved.

Some of the multiplicities are quite valuable: having directories used for both the actual devices and for namespace control seems very sensible, though it is unfortunate that the namespace control is a little ad hoc. Similarly, having attributes that can be imposed system-wide (power, subsystem, of_node), subsystem-wide, or per-device is a fairly natural choice.

Other multiplicities are largely a matter of interpretation, such as the various ways of looking at the meaning of parenthood. It could be that the concept of "parent" isn't really well defined enough to deserve to be enshrined in the hierarchy, but it does seem to be useful to some degree.

And finally there are the multiple subsystems ("class" and "bus") and the multiple roots (/sys/devices, /sys/devices/virtual, and /sys/devices/system) that don't appear to add much value. The fact that there are so many places for parentless devices lends weight to the possibility that parents aren't really as important as we might imagine.

While it might be nice to have more simplicity, the more important thing is understanding what we do have. There is certainly a lot of value in /sys/devices, even if some aspects are not as valuable as others. Understanding what goes where, which parts are important, and which parts don't deserve so much attention, is really where it is most useful to focus. So while it might have been hard to describe, for myself at least it was worth describing.

Patches and updates

Kernel trees

Linus Torvalds: Linux 4.1-rc6

Kamal Mostafa: Linux 3.19.8-ckt1

Kamal Mostafa: Linux 3.13.11-ckt21

Architecture-specific

Hans de Goede: Introduce Allwinner A33 support

Bintian Wang: arm64,hi6220: Enable Hisilicon Hi6220 SoC

Vikas Shivappa: New cpumask API and Intel Cache Allocation support

Core kernel code

Peter Zijlstra: sched: balance callbacks

Paul Gortmaker: Replace module_init with an alternate initcall in non modules

Tycho Andersen: seccomp: add ptrace commands for suspend/resume

Tom Zanussi: tracing: 'hist' triggers

Development tools

Mikko Rapeli: Userspace compile test and fixes for exported uapi header files

Device drivers

Mathieu Olivari: net: dsa: add QCA AR8xxx switch family support

Moritz Fischer: Adding driver for Xilinx LogiCORE IP mailbox.

Anda-Maria Nicolae: Add support for Richtek RT9455 battery charger

Rob Herring: Marvell PXA1928 USB support

Finn Thain: Re-use nvram module

Andy Shevchenko: mfd: introduce a driver for LPSS devices on SPT

Amir Vadai: net/mlx5: ConnectX-4 100G Ethernet driver

fu.wei@linaro.org: Watchdog: introduce ARM SBSA watchdog driver

Device driver infrastructure

wdavis@nvidia.com: IOMMU/DMA map_resource support for peer-to-peer

Dan Williams: pmem api, generic ioremap_cache, and memremap

Filesystems and block I/O

Ming Lei: block: loop: improve loop with AIO

Ross Zwisler: I/O path improvements for ND_BLK and PMEM

Chandan Rajendra: Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size.

Memory management

Eric B Munson: Allow user to request memory to be locked on page fault

Sergey Senozhatsky: zsmalloc auto-compaction

Gioh Kim: enable migration of non-LRU pages

Kirill A. Shutemov: THP refcounting redesign

Virtualization and containers

Xiao Guangrong: KVM: x86: fully implement vMTRR

Miscellaneous

Adrian Hunter: perf tools: Introduce an abstraction for AUX Area and Instruction Tracing

Masami Hiramatsu: perf-probe --cache support

Will Deacon: Stand-alone kvmtool repository

Page editor: Jonathan Corbet