
Kernel development

Brief items

Kernel release status

The current development kernel is 3.0-rc4, released on June 20. It consists of a bunch of fixes (some for a couple of significant performance regressions) and a couple of new drivers. It also apparently has a new compilation error which may require the application of this patch to get around. The full changelog has all the details.

Stable updates: there have been no stable updates released in the last week.


Quotes of the week

Hardware often makes me want to dress all in black, sit at the end of the bar, drink, and cry. Often Matthew Garrett is right there with me so at least I have company on my trip to black, black oblivion.
-- Dan Williams

So being such a hopeless optimist I set out to solve all pin controlling in this subsystem. The sales rap would be something like:

  • Do you need to bias your pin to 3.3 V with an 1MOhm resistor?
  • Do you want to mux your pin with different functions?
  • Do you want to set the load capacitance of your pin to 10pF?
  • Do you want to drive your pin as open collector?
  • Is all or some the above software-controlled in your ASIC?
  • Do you despair from the absence of relevant frameworks?

DON'T WORRY! The pinctrl subsystem is here to save YOU!

-- Linus Walleij

Due to intermittent email access on my vacation right now, the stable and longterm kernels will be delayed until late this week, or early next week.

In place of them, here's a lovely haiku to sooth you:

Sand castle contest
Rain falling sideways and cold
Summer in Oregon
-- Greg Kroah-Hartman


Netconf 2011

By Jonathan Corbet
June 22, 2011
The Linux networking developers held their roughly annual networking minisummit on June 14 and 15 in Toronto. There will probably not be a detailed report from the gathering, and no videotaping was done, but there is still some information out there. Interested readers can start with the agenda for the gathering, which includes the slides for almost all of the presentations. Also worth a read is Ben Hutchings's summary with notes from most of the talks:

World IPv6 Day seems to have mostly worked. However there are still some gaps and silly bugs in IPv6 support in both Linux kernel (e.g. netfilter can't track DHCPv6 properly) and user-space (e.g. ping6 doesn't restrict hostname lookup to IPv6 addresses).


David [Miller] wants to get rid of the IPv4 routing cache. Removing the cache entirely seems to make route lookup take about 50% longer than it currently does for a cache hit, and much less time than for a cache miss. It avoids some potential for denial of service (forced cache misses) and generally simplifies routing.

All told, it looks like an interesting and productive gathering; if past patterns hold, we will get a more thorough summary at the 2011 Kernel Summit in October.


Kernel development news

User-friendly disk names

By Jake Edge
June 22, 2011

Device names, particularly for disks, can be confusing to Linux administrators because they get assigned at boot time based on the order in which the disks are discovered. So the same physical disk can be assigned a different device name (in /dev) on each boot, which means that kernel log messages and the output of various utilities may not correspond with the administrator's view of the system. A recent patch set looks to change that situation, but it is meeting some resistance from kernel hackers who think it should be handled by user space.

The patches posted by Nao Nishijima are pretty straightforward. They just add a preferred_name entry to struct device, which can be updated via sysfs. The patches then change some SCSI log messages and /proc/partitions output to use the preferred name if it has been set. Greg Kroah-Hartman expressed concerns about changing /proc/partitions: various tools parse that file, so it is part of the kernel's user-space interface. Adding the preferred name as a new field on each line might very well confuse utilities that parse the file.

More importantly, though, he notes that one could just change the tools so that they use the names as arguments or in their output. Any scheme that would map preferred names to specific disks requires some kind of mapping file, so tools that wanted to use these preferred names (things like mount, smartd, and other disk-related tools) could do so using that mapping without involving the kernel at all:

Seriously, this could be done by now, it's been over a year since this was first discussed. All distros could have the updated packages by now and this would not be an issue.

I still think this is the correct way to solve the problem as it is a userspace issue, not a kernel one.

While the patches only use preferred_name for disk devices, the idea is to allow them to be added to any device (and then change log messages and utilities to use them). It is modeled after the ifalias entry that was added for network devices back in 2008, but some don't see that as something to emulate. Allowing only one alias for network devices is generally not enough "because people want not a single but multiple names at the same time", Kay Sievers said, so ifalias is only used by some SNMP utilities. Currently, udev maintains a set of links in /dev/disk/by-* that relate disks to kernel devices by a variety of characteristics (ID, label, path, and UUID). James Bottomley would like to see that be extended for the preferred names:

All userspace naming will be taken care of by the usual udev rules, so for disks, something like /dev/disk/by-preferred/<fred> which would be the usual symbolic link.

This will ensure that kernel output and udev input are consistent. It will still require that user space utilities which derive a name for a device will need modifying to print out the preferred name.

But there are problems inherent in that idea. In order for udev to know that the preferred name was set, a uevent would have to be generated. That could be done, but it leads to other problems, as Sievers points out (instead of by-preferred, he uses by-pretty):

What would happen if we mount:
and some tool later thinks the pretty name should better be 'bar', it writes the name to /sys, we get a uevent, the old link disappears, we get a new link, mount has no device node anymore for the mounted device ...

Essentially, udev keeps track of the devices present in the system (and their attributes like, potentially, preferred name), but doesn't have any concept of tracking "no longer valid names" as Sievers puts it. That means that udev can't just leave older entries around when user space changes the preferred name: "We can not just add stuff to /dev without a udev database entry, it would never get removed on device unplug and leave a real mess behind."

One possible solution for the renaming problem is to only allow one write to preferred_name, so that, once established, those aliases couldn't be changed without a reboot. udev could set up the proper links, and various tools could use the aliases as needed. That would solve the renaming problem at the cost of some flexibility. In general, no one was really opposed to the idea of some kind of more-mnemonic name for disks; it is more of a question of how to get there.
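
The write-once idea can be illustrated with a small user-space model. This is only a sketch of the semantics discussed in the thread, not code from the patch set: the structure, the function name, the length limit, and the choice of -EBUSY for a second write are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ALIAS_MAX 32	/* hypothetical limit, chosen for the example */

/* Stand-in for a disk object carrying an alias; not a kernel structure. */
struct model_disk {
	char alias[MODEL_ALIAS_MAX];	/* empty string means "not set yet" */
};

/*
 * Accept only the first write.  Returns 0 on success, -EINVAL for an
 * empty or over-long name, and -EBUSY (an assumed error code) once an
 * alias has already been established.
 */
int model_set_alias(struct model_disk *disk, const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || len >= MODEL_ALIAS_MAX)
		return -EINVAL;
	if (disk->alias[0] != '\0')
		return -EBUSY;
	memcpy(disk->alias, name, len + 1);	/* copy includes the NUL */
	return 0;
}
```

Under this model, the link that udev creates for the first alias can never go stale, since a later attempt to rename the disk is simply refused.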

Sievers proposed adding a way for udev to list all of the symlinks that it creates during device discovery. Anyone (or any tool) that needed to associate an alias with a particular disk could use that output to determine the current device being used (based on the UUID for example), then make the substitution as appropriate. That would work, in general, but Bottomley sees it as overly complex for users:

However, even if we assume they choose one of the current names, they still have to do the mapping manually; even if they have all the information, they can't just cut and paste from dmesg say, they have to cut, edit the buffer to put in the preferred name and then paste ... that's just one annoying step too far for most users. I agree that all the output tools within reason can be fixed to do this automatically, but fixing cat say, just so cat /proc/partitions works would never be acceptable upstream.

The reason for storing this in the kernel is just that it's easier than trying to update all the tools, and it solves 90% of the problem, which makes the solution usable, even if we have to update tools to get to 100%.

But Sievers and driver core maintainer Kroah-Hartman see it as papering over more substantial issues. Sievers, at least, would like to see text-file-style debug and error message output replaced (or supplemented) with something more structured:

We need _smart_ userspace with a debug/error message channel from the kernel to userspace that pops out _structured_ data. Userspace needs to index the data, and merge a lot of userspace information into it.

Adding just another single-name to the kernel just makes the much-too-dumb free-text printk() a bit more readable, but still sounds not like a solution. Pimping up syslog is not the solution to this problem, and it can't be solved in the kernel alone.

But, from the user's perspective, disks may already have names (with labels on the enclosures themselves for example) and it would be quite convenient for the kernel's messages to reflect them. In the end, Sievers isn't opposed to a disk-specific (rather than for all devices) solution, though he thinks it isn't really the right direction to go. Kroah-Hartman agrees and is adamant that this change not go into the driver core. Based on that, Nishijima plans to redo the patches, moving the name to struct gendisk, renaming the field to alias_name (rather than "preferred") to better reflect its purpose, and generating a uevent when the name changes.

But following the lead of the network ifalias means adding to the kernel's user-space interface once more, this time for disks. While it may solve an immediate problem for administrators, it will also leave behind some legacy code when, or if, a better solution comes around. That's unfortunate, but, since it solves a real problem, and the change is restricted to a subsystem whose maintainer (Bottomley) is in favor of it, it may well turn up in the mainline before long. Any change to system error and debug logging along the lines of what Sievers described is certainly quite a ways off, though there have long been calls for structured kernel output. Sometimes it is just easier to make a change like this in one place, rather than trying to identify and fix all of the places outside of the kernel that would need it.


The platform device API

By Jonathan Corbet
June 21, 2011
In the very early days, Linux users often had to tell the kernel where specific devices were to be found before their systems would work. In the absence of this information, the driver could not know which I/O ports and interrupt line(s) the device was configured to use. Happily, we now live in the days of busses like PCI which have discoverability built into them; any device sitting on a PCI bus can tell the system what sort of device it is and where its resources are. So the kernel can, at boot time, enumerate the devices available and everything Just Works.

Alas, life is not so simple; there are plenty of devices which are still not discoverable by the CPU. In the embedded and system-on-chip world, non-discoverable devices are, if anything, increasing in number. So the kernel still needs to provide ways to be told about the hardware that is actually present. "Platform devices" have long been used in this role in the kernel. This article will describe the interface for platform devices; it is meant as needed background material for a following article on integration with device trees.

Platform drivers

A platform device is represented by struct platform_device, which, like the rest of the relevant declarations, can be found in <linux/platform_device.h>. These devices are deemed to be connected to a virtual "platform bus"; drivers of platform devices must thus register themselves as such with the platform bus code. This registration is done by way of a platform_driver structure:

    struct platform_driver {
	int (*probe)(struct platform_device *);
	int (*remove)(struct platform_device *);
	void (*shutdown)(struct platform_device *);
	int (*suspend)(struct platform_device *, pm_message_t state);
	int (*resume)(struct platform_device *);
	struct device_driver driver;
	const struct platform_device_id *id_table;
    };

At a minimum, the probe() and remove() callbacks must be supplied; the other callbacks have to do with power management and should be provided if they are relevant.

The other thing the driver must provide is a way for the bus code to bind actual devices to the driver; there are two mechanisms which can be used for that purpose. The first is the id_table field; the relevant structure is:

    struct platform_device_id {
	char name[PLATFORM_NAME_SIZE];
	kernel_ulong_t driver_data;
    };

If an ID table is present, the platform bus code will scan through it every time it has to find a driver for a new platform device. If the device's name matches the name in an ID table entry, the device will be given to the driver for management; a pointer to the matching ID table entry will be made available to the driver as well. As it happens, though, most platform drivers do not provide an ID table at all; they simply provide a name for the driver itself in the driver field. As an example, the i2c-gpio driver turns two GPIO lines into an i2c bus; it sets itself up as a platform driver with:

    static struct platform_driver i2c_gpio_driver = {
	.driver		= {
		.name	= "i2c-gpio",
		.owner	= THIS_MODULE,
	},
	.probe		= i2c_gpio_probe,
	.remove		= __devexit_p(i2c_gpio_remove),
    };

With this setup, any device identifying itself as "i2c-gpio" will be bound to this driver; no ID table is needed.
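
The two binding mechanisms can be modeled in a few lines of ordinary user-space C. What follows is a sketch of the behavior described above, not the platform bus code itself: the model_ names, the name-size constant, and the assumption that a driver with an ID table is matched through that table alone are all illustrative.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_NAME_SIZE 20	/* stand-in for PLATFORM_NAME_SIZE */

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct model_device_id {
	char name[MODEL_NAME_SIZE];
	unsigned long driver_data;
};

struct model_driver {
	const char *name;			/* plays the role of driver.name */
	const struct model_device_id *id_table;	/* optional; ends with an empty name */
};

/*
 * Returns 1 if a device with the given name binds to the driver, 0 if not.
 * On an ID-table hit, *entry points at the matching entry; for a plain
 * name match (or no match) it is set to NULL.
 */
int model_match(const char *dev_name, const struct model_driver *drv,
		const struct model_device_id **entry)
{
	const struct model_device_id *id;

	*entry = NULL;
	if (drv->id_table != NULL) {
		for (id = drv->id_table; id->name[0] != '\0'; id++) {
			if (strcmp(dev_name, id->name) == 0) {
				*entry = id;
				return 1;
			}
		}
		return 0;
	}
	return strcmp(dev_name, drv->name) == 0;
}
```

A driver declared with only a name, like i2c-gpio above, takes the final strcmp() path; a driver with a table additionally learns which entry matched, which is how per-variant driver_data reaches it.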

Platform drivers make themselves known to the kernel with:

    int platform_driver_register(struct platform_driver *driver);

As soon as this call succeeds, the driver's probe() function can be called with new devices. That function gets as an argument a platform_device pointer describing the device to be instantiated:

    struct platform_device {
	const char	*name;
	int		id;
	struct device	dev;
	u32		num_resources;
	struct resource	*resource;
	const struct platform_device_id	*id_entry;
	/* Others omitted */
    };

The dev structure can be used in contexts where it is needed - the DMA mapping API, for example. If the device was matched using an ID table entry, id_entry will point to the specific entry matched. The resource array can be used to learn where various resources, including memory-mapped I/O registers and interrupt lines, can be found. There are a number of helper functions for getting data out of the resource array; these include:

    struct resource *platform_get_resource(struct platform_device *pdev, 
					   unsigned int type, unsigned int n);
    struct resource *platform_get_resource_byname(struct platform_device *pdev,
					   unsigned int type, const char *name);
    int platform_get_irq(struct platform_device *pdev, unsigned int n);

The "n" parameter says which resource of that type is desired, with zero indicating the first one. Thus, for example, a driver could find its second MMIO region with:

    r = platform_get_resource(pdev, IORESOURCE_MEM, 1);

Assuming the probe() function finds the information it needs, it should verify the device's existence to the extent possible, register the "real" devices associated with the platform device, and return zero.
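
The meaning of the "n" parameter is easy to see in a user-space model of the lookup. This sketch assumes a simplified resource structure and made-up flag values (MODEL_RES_MEM and MODEL_RES_IRQ are not kernel constants); it only illustrates "the n-th resource of a given type", plus the fact that a resource's end address is inclusive.

```c
#include <stddef.h>

/* Hypothetical type flags for this model; the kernel defines its own. */
#define MODEL_RES_MEM 0x1u
#define MODEL_RES_IRQ 0x2u

struct model_resource {
	unsigned long start;
	unsigned long end;	/* inclusive: the last valid address or number */
	unsigned int flags;
	const char *name;
};

/* Return the n-th (zero-based) resource of the given type, or NULL. */
const struct model_resource *
model_get_resource(const struct model_resource *res, size_t count,
		   unsigned int type, unsigned int n)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* n is only counted down for entries of the right type. */
		if (res[i].flags == type && n-- == 0)
			return &res[i];
	}
	return NULL;
}

/* Size of a region whose end address is inclusive. */
unsigned long model_resource_size(const struct model_resource *r)
{
	return r->end - r->start + 1;
}
```

With a table holding two memory regions separated by an interrupt entry, asking for memory resource 1 skips the interrupt and returns the second region, which is what the platform_get_resource() call above would be expected to do for a real device.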

Platform devices

So now we have a driver for a platform device, but no actual devices yet. As was noted at the beginning, platform devices are inherently not discoverable, so there must be another way to tell the kernel about their existence. That is typically done with the creation of a static platform_device structure providing, at a minimum, a name which is used to find the associated driver. So, for example, a simple (fictional) device might be set up this way:

    static struct resource foomatic_resources[] = {
	{
		.start	= 0x10000000,
		.end	= 0x10000fff,	/* end is inclusive */
		.flags	= IORESOURCE_MEM,
		.name	= "io-memory"
	},
	{
		.start	= 20,
		.end	= 20,
		.flags	= IORESOURCE_IRQ,
		.name	= "irq",
	}
    };

    static struct platform_device my_foomatic = {
	.name 		= "foomatic",
	.resource	= foomatic_resources,
	.num_resources	= ARRAY_SIZE(foomatic_resources),
    };

These declarations describe a "foomatic" device with a one-page MMIO region starting at 0x10000000 and using IRQ 20. The device is made known to the system with:

    int platform_device_register(struct platform_device *pdev);

Once both a platform device and an associated driver have been registered, the driver's probe() function will be called and the device will be instantiated. Registration of device and driver are usually done in different places and can happen in either order. A call to platform_device_unregister() can be used to remove a platform device.
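
The "either order" property follows naturally if each registration call tries to pair the newcomer with everything already known to the bus. The following user-space model is a sketch under that assumption; the fixed-size arrays, the model_ names, and the bare name comparison are simplifications invented for the example and do not reflect how the driver core is actually structured.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAX 8	/* arbitrary capacity for the example */

struct model_pdev {
	const char *name;
	int bound;	/* set once a driver's probe() has accepted the device */
};

struct model_pdrv {
	const char *name;
	int (*probe)(struct model_pdev *);
};

struct model_bus {
	struct model_pdev *devs[MODEL_MAX];
	struct model_pdrv *drvs[MODEL_MAX];
	size_t ndevs, ndrvs;
};

/* Bind one device to one driver if names match and probe() returns 0. */
static void model_try_bind(struct model_pdev *dev, struct model_pdrv *drv)
{
	if (!dev->bound && strcmp(dev->name, drv->name) == 0 &&
	    drv->probe(dev) == 0)
		dev->bound = 1;
}

/* Add a device, then offer it to every driver registered so far. */
int model_device_register(struct model_bus *bus, struct model_pdev *dev)
{
	size_t i;

	if (bus->ndevs == MODEL_MAX)
		return -1;
	bus->devs[bus->ndevs++] = dev;
	for (i = 0; i < bus->ndrvs; i++)
		model_try_bind(dev, bus->drvs[i]);
	return 0;
}

/* Add a driver, then offer it every device registered so far. */
int model_driver_register(struct model_bus *bus, struct model_pdrv *drv)
{
	size_t i;

	if (bus->ndrvs == MODEL_MAX)
		return -1;
	bus->drvs[bus->ndrvs++] = drv;
	for (i = 0; i < bus->ndevs; i++)
		model_try_bind(bus->devs[i], drv);
	return 0;
}

/* Example probe(): counts its calls so the binding is observable. */
int model_probe_calls;

int model_foomatic_probe(struct model_pdev *dev)
{
	(void)dev;
	model_probe_calls++;
	return 0;
}
```

Registering a "foomatic" device first leaves it unbound until the matching driver shows up; registering the driver first causes probe() to run as soon as the device is added. Either way probe() runs exactly once per device.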

Platform data

The above information is adequate to instantiate a simple platform device, but many devices are more complex than that. Even the simple i2c-gpio driver described above needs two additional pieces of information: the numbers of the GPIO lines to be used as i2c clock and data lines. The mechanism used to pass this information is called "platform data"; in short, one defines a structure containing the specific information needed and passes it in the platform device's dev.platform_data field.

With the i2c-gpio example, a full configuration looks like this:

    #include <linux/i2c-gpio.h>

    static struct i2c_gpio_platform_data my_i2c_plat_data = {
	.scl_pin	= 100,
	.sda_pin	= 101,
    };

    static struct platform_device my_gpio_i2c = {
	.name		= "i2c-gpio",
	.id		= 0,
	.dev = {
		.platform_data = &my_i2c_plat_data,
	}
    };

When the driver's probe() function is called, it can fetch the platform_data pointer and use it to obtain the rest of the information it needs.
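
What "fetching the platform_data pointer" amounts to can be shown with a user-space model. The structures below are stand-ins that borrow the field names from the i2c-gpio example; the function name and its return convention are assumptions, and the real driver's probe() does considerably more.

```c
#include <stddef.h>

/* Stand-in for the driver's platform data structure. */
struct model_i2c_gpio_pdata {
	unsigned int sda_pin;
	unsigned int scl_pin;
};

/* Stand-in for struct device: the pointer carries no type information. */
struct model_dev {
	void *platform_data;
};

/*
 * probe()-style helper: returns 0 and fills in *sda and *scl on success,
 * or -1 if no platform data was supplied.
 */
int model_i2c_gpio_probe(const struct model_dev *dev,
			 unsigned int *sda, unsigned int *scl)
{
	/* Implicit conversion from void *: the driver simply trusts the type. */
	const struct model_i2c_gpio_pdata *pdata = dev->platform_data;

	if (pdata == NULL)
		return -1;
	*sda = pdata->sda_pin;
	*scl = pdata->scl_pin;
	return 0;
}
```

Nothing in this model would stop board code from pointing platform_data at some other structure entirely; the compiler accepts any pointer, and the driver would silently read the wrong fields.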

Not everybody in the kernel community is enamored with platform devices; they seem like a bit of a hack used to encode information about specific hardware platforms into the kernel. Additionally, the platform data mechanism lacks any sort of type checking; drivers must simply assume that they have been passed a structure of the expected type. Even so, platform devices are heavily used, and that's unlikely to change, though the means by which they are created and discovered is changing. The way of the future appears to be device trees, which will be described in the following article.


Platform devices and device trees

By Jonathan Corbet
June 21, 2011
The first part of this pair of articles described the kernel's mechanism for dealing with non-discoverable devices: platform devices. The platform device scheme has a long history and is heavily used, but it has some disadvantages, the biggest of which is the need to instantiate these devices in code. There are alternatives coming into play, though; this article will describe how platform devices interact with the device tree mechanism.

The current platform device mechanism is relatively easy to use for a developer trying to bring up Linux on a new system. It's just a matter of creating the descriptions for the devices present on that system and registering all of the devices at boot time. Unfortunately, this approach leads to the proliferation of "board files," each of which describes a single type of computer. Kernels are typically built around a single board file and cannot boot on any other type of system. Board files sort of worked when there were relatively small numbers of embedded system types to deal with. Now Linux-based embedded systems are everywhere, architectures which have typically depended on board files (ARM, in particular) are finding their way into more types of systems, and the whole scheme looks poised to collapse under its own weight.

The hoped-for solution to this problem goes by the term "device trees"; in essence, a device tree is a textual description of a specific system's hardware configuration. The device tree is passed to the kernel at boot time; the kernel then reads through it to learn about what kind of system it is actually running on. With luck, device trees will abstract the differences between systems into boot-time data and allow generic kernels to run on a much wider variety of hardware.

This article is a good introduction to the device tree format and how it can be used to describe real-world systems; it is recommended reading for anybody interested in the subject.

It is possible for platform devices to work on a device-tree-enabled system with no extra work at all, especially once Grant Likely's improvements are merged. If the device tree includes a platform device (where such devices, in the device tree context, are those which are direct children of the root or are attached to a "simple bus"), that device will be instantiated and matched against a driver. The memory-mapped I/O and interrupt resources will be marshalled from the device tree description and made available to the device's probe() function in the usual way. The driver need not know that the device was instantiated out of a device tree rather than from a hard-coded platform device definition.

Life is not always quite that simple, though. Device names appearing in the device tree (in the "compatible" property) tend to take a standardized form which does not necessarily match the name given to the driver in the Linux kernel; among other things, device trees really are meant to work with more than one operating system. So it may be desirable to attach specific names to a platform device for use with device trees. The kernel provides an of_device_id structure which can be used for this purpose:

    static const struct of_device_id my_of_ids[] = {
	{ .compatible = "long,funky-device-tree-name" },
	{ }
    };

When the platform driver is declared, it stores a pointer to this table in the driver substructure:

    static struct platform_driver my_driver = {
	/* ... */
	.driver	= {
		.name = "my-driver",
		.of_match_table = my_of_ids
	}
    };

The driver can also declare the ID table as a device table to enable autoloading of the module as the device tree is instantiated:

    MODULE_DEVICE_TABLE(of, my_of_ids);
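
The role of the match table can be sketched with a user-space model of the lookup. The structure here is a simplification (a pointer plus a NULL terminator rather than the kernel's of_device_id layout), the model_ names are invented, and the model compares a single compatible string where a real device tree node may list several.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for an of_device_id entry. */
struct model_of_id {
	const char *compatible;	/* NULL marks the end of the table */
	const void *data;	/* optional per-entry driver data */
};

/* Return the first table entry matching the compatible string, or NULL. */
const struct model_of_id *
model_of_match(const struct model_of_id *table, const char *compatible)
{
	for (; table->compatible != NULL; table++) {
		if (strcmp(table->compatible, compatible) == 0)
			return table;
	}
	return NULL;
}
```

In this model the Linux-side driver name ("my-driver" above) never takes part in the comparison; only the standardized compatible string does, which is what lets the same device tree serve other operating systems.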

The one other thing capable of complicating the situation is platform data. Needless to say, the device tree code is unaware of the specific structure used by a given driver for its platform data, so it will be unable to provide that information in that form. On the other hand, the device tree mechanism is equipped to allow the passing of just about any information that the driver may need to know. Making use of that information will require the driver to become a bit more aware of the device tree subsystem, though.

Drivers expecting platform data should check the dev.platform_data pointer in the usual way. If there is a non-null value there, the driver has been instantiated in the traditional way and device tree does not enter into the picture; the platform data should be used in the usual way. If, however, the driver has been instantiated from the device tree code, the platform_data pointer will be null, indicating that the information must be acquired from the device tree directly.

In this case, the driver will find a device_node pointer in the platform device's dev.of_node field. The various device tree access functions (of_get_property(), primarily) can then be used to extract the needed information from the device tree. After that, it's business as usual.
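
The decision a probe() function has to make can be modeled in user space as follows. Everything here is a sketch: the node and property structures are toy stand-ins for the device tree, model_read_prop() is an invented helper rather than a kernel function, and the property names "sda-pin" and "scl-pin" are hypothetical, not taken from any real binding.

```c
#include <stddef.h>
#include <string.h>

struct model_prop {
	const char *name;
	unsigned int value;
};

/* Toy stand-in for a device tree node: a flat list of integer properties. */
struct model_node {
	const struct model_prop *props;
	size_t nprops;
};

struct model_pdata {
	unsigned int sda_pin, scl_pin;
};

struct model_device {
	const struct model_pdata *platform_data;	/* set by board-file code */
	const struct model_node *of_node;		/* set by device tree code */
};

/* Look up a named property; returns 0 on success, -1 if it is missing. */
int model_read_prop(const struct model_node *np, const char *name,
		    unsigned int *out)
{
	size_t i;

	for (i = 0; i < np->nprops; i++) {
		if (strcmp(np->props[i].name, name) == 0) {
			*out = np->props[i].value;
			return 0;
		}
	}
	return -1;
}

/* Fill in *cfg from platform data if present, else from the node. */
int model_get_config(const struct model_device *dev, struct model_pdata *cfg)
{
	if (dev->platform_data != NULL) {
		*cfg = *dev->platform_data;	/* traditional instantiation */
		return 0;
	}
	if (dev->of_node == NULL)
		return -1;			/* no configuration source at all */
	if (model_read_prop(dev->of_node, "sda-pin", &cfg->sda_pin) != 0 ||
	    model_read_prop(dev->of_node, "scl-pin", &cfg->scl_pin) != 0)
		return -1;
	return 0;
}
```

Once the configuration has been gathered into one structure, the rest of the driver does not need to care which of the two sources it came from.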

In summary: making platform drivers work with device trees is a relatively straightforward task. It is mostly a matter of getting the right names in place so that the binding between a device tree node and the driver can be made, with a bit of additional work required in cases where platform data is in use. The nice result is that the static platform_device declarations can go away, along with the board files that contain them. That should, eventually, allow the removal of a bunch of boilerplate code from the kernel while simultaneously making the kernel more flexible.


Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds