
Kernel development

Brief items

Kernel release status

The current development kernel is 3.0-rc1, released on May 29. Linus said:

So what are the big changes? NOTHING. Absolutely nothing. Sure, we have the usual two thirds driver changes, and a lot of random fixes, but the point is that 3.0 is *just* about renumbering, we are very much *not* doing a KDE-4 or a Gnome-3 here. No breakage, no special scary new features, nothing at all like that. We've been doing time-based releases for many years now, this is in no way about features. If you want an excuse for the renumbering, you really should look at the time-based one ('20 years') instead.

See the separate article below for a summary of the changes merged in the second half of the 3.0 merge window.

Stable updates: The 2.6.38.8 and 2.6.39.1 stable updates were released on June 2. This will be the last 2.6.38 stable release, so users should move to the 2.6.39 series.


Quotes of the week

Users are a really terrible source of interface specifications. "Hackers" are often not much better, but at least if the interface is lousy the developer has the potential to be accountable for it and its improvement.
-- Casey Schaufler

IMHO the key design mistake of LSM is that it detaches security policy from applications: you need to be admin to load policies, you need to be root to use/configure an LSM. Dammit, you need to be root to add labels to files! This not only makes the LSM policies distro specific (and needlessly forked and detached from real security), but also gives the message that:
'to ensure your security you need to be privileged'
which is the anti-concept of good security IMO.
-- Ingo Molnar


Garrett: Rebooting

Matthew Garrett appears to be having some "fun" looking into how to reboot x86 hardware. He lists five different mechanisms to reboot 64-bit x86 hardware including: "kbd - reboot via the keyboard controller. The original IBM PC had the CPU reset line tied to the keyboard controller. Writing the appropriate magic value pulses the line and the machine resets. This is all very straightforward, except for the fact that modern machines don't have keyboard controllers (they're actually part of the embedded controller) and even more modern machines don't even pretend to have a keyboard controller. Now, embedded controllers run software. And, as we all know, software is dreadful. But, worse, the software on the embedded controller has been written by BIOS authors. So clearly any pretence that this ever works is some kind of elaborate fiction. Some machines are very picky about hardware being in the exact state that Windows would program. Some machines work 9 times out of 10 and then lock up due to some odd timing issue. And others simply don't work at all. Hurrah!"


The Wonderful World of Linux 3.0

Once upon a time, Joe Pranevich made a name for himself by writing comprehensive release notes for major kernel releases. He got out of that business during the 2.6.x series, but he is now back with a draft version of "The Wonderful World of Linux 3.0". Readers who are curious about what has happened since the 2.6.0 release may be interested in giving it a look. "This document describes just a few of the thousands of changes and improvements that have been made to the Linux kernel since the launch of Linux 2.6. I have attempted to make it as accessible as possible for a general audience, while not shying away from technical language when necessary."


Kernel development news

3.0 merge window part 2

By Jonathan Corbet
June 1, 2011
In the end, 7,333 non-merge changesets were pulled into the mainline kernel before Linus closed the merge window and decreed that the next release would be called "3.0". There have not been vast numbers of exciting new features added since last week's summary was written, but there are a few. The most significant user-visible changes include:

  • The namespace file descriptors patch, which includes the setns() system call, has been merged. This feature makes it easier to manage containers running in different namespaces. (A rough usage sketch appears below, after this list.)

  • The XFS filesystem now has online discard support.

  • The Cleancache functionality has been merged. Cleancache allows for intermediate storage of pages which have been pushed out of the page cache but which might still be useful in the future. Cleancache is initially supported by ext3, ext4, and ocfs2.

  • A new netlink-based infrastructure allows the management of RDMA clients.

  • It is now possible to move all threads in a group into a control group at once using the cgroup.procs control file.

  • The Blackfin architecture has gained perf events support.

  • The btrfs filesystem has gained support for an administrator-initiated "scrub" operation that can read through a filesystem's blocks and verify checksums. When possible, bad copies of data will be replaced by good copies from another storage device. Btrfs has also gained an auto_defrag mount option that causes the filesystem to notice random writes to files and schedule those files for defragmentation.

  • The no-hlt boot parameter has been deprecated; no machines have needed it in this millennium. Should there be any machines with non-working HLT instructions running current kernels, they can be booted with idle=poll.

  • Support for the pNFS protocol backed by object storage devices has been added.

  • New hardware support includes:

    • Systems and processors: TILE-Gx 64-bit processors and Blackfin SPORT SPI busses.

    • Input: Qualcomm PMIC8XXX keypads.

    • Media: Fintek consumer infrared transceivers, and Fujitsu M-5MOLS 8MP sensors.

    • Network: GPIO-controlled RF-kill switches.

    • USB: VUB300 USB to SDIO/SD/MMC host controllers.

    • Miscellaneous: ST-Ericsson DB5500 power reset control management units, AMD Family 15h processor power monitors, SMSC EMC6W201 hardware monitors, Marvell 88PM860x real-time clocks, HTC ASIC3 LED controllers, Qualcomm PM8921 PMIC chips, Micro Crystal RV3029-C2 RTC chips, VIA/WonderMedia 85xx SoC RTC chips, ST M41T93 SPI RTC chips, EM Microelectronic EM3027 RTC chips, Maxim/Dallas DS2780 stand-alone fuel gauge ICs, Maxim MAX8903 battery chargers, and TI TPS65910 and TPS65911 power management chips.
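
As a quick illustration of the setns() item above, here is a rough userspace sketch. It is illustrative only: the direct syscall() invocation assumes that __NR_setns is defined by the installed kernel headers (a glibc wrapper may not yet exist), and the choice of the network namespace and the "ip" command is arbitrary.

    /*
     * Rough sketch: join the network namespace of an existing process
     * by opening its /proc/<pid>/ns/net file and handing the file
     * descriptor to the new setns() system call.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            char path[64];
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                    return 1;
            }

            snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);
            fd = open(path, O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* 0 means "any namespace type"; the fd determines which one */
            if (syscall(__NR_setns, fd, 0) < 0) {
                    perror("setns");
                    return 1;
            }

            /* from here on, this process sees the target's network devices */
            return execlp("ip", "ip", "link", "show", (char *)NULL);
    }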

Changes visible to kernel developers include:

  • There is a new core support module for GPIO controllers based on memory-mapped I/O.

  • There is a new atomic_or() operation to perform a logical OR operation on an atomic_t value.
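
As a minimal sketch of the new helper - assuming the conventional (int value, atomic_t *target) argument order used by the other atomic operations, and with made-up flag names - setting bits atomically might look like this:

    #include <linux/atomic.h>

    #define MY_FEATURE_FOO  (1 << 0)    /* illustrative flag bits */
    #define MY_FEATURE_BAR  (1 << 1)

    static atomic_t my_flags = ATOMIC_INIT(0);

    static void my_enable_features(void)
    {
            /* atomically OR the feature bits into my_flags */
            atomic_or(MY_FEATURE_FOO | MY_FEATURE_BAR, &my_flags);
    }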

With the -rc1 release, Linus tagged the kernel "3.0.0" (with a new name of "Sneaky Weasel"). His stated intent is to drop the last digit during the stabilization period so that the final kernel would be just "3.0", but that depends on getting various user-space scripts fixed. Either way, the stable updates that most people will actually run will start with 3.0.1.

Linus is clearly hoping for a relatively smooth development cycle this time around; he has hinted that he may be fussier than usual about the fixes that he'll pull from now on. 3.0, it seems, is supposed to be boring, just to drive home the point that the version number change does not really mean much. The final release, boring or not, can be expected sometime in the first half of July.


Forking the ARM kernel?

June 2, 2011

This article was contributed by Thomas Gleixner

In the last few months, many people have suggested forking the ARM kernel and maintaining it as a separate project. While the reasons for forking ARM may seem attractive to some, it turns out that it really doesn't make very much sense for either the ARM community or the kernel as a whole.

Here are the most common reasons given for this suggestion:

  • Time to market

  • It matches the one-off nature of consumer electronics

  • It better suits the diversity of the system-on-chip (SoC) world

  • It avoids the bottleneck of maintainers and useless extra work in response to reviews

Let's have a look at these reasons.

Time to market

The time-to-market advocates reason that it takes less time to hack up a new driver than to get it accepted upstream. I know I'm old school and do not understand the rapidly changing world and new challenges of the semiconductor industry anymore, but I still have enough engineering knowledge and common sense to know that there is no real concept of totally disconnected "new" things in that industry. Most of an SoC's IP blocks (functionality licensed for inclusion in an SoC) will not be new. So what happens to time to market the second time an IP block is used? If the driver is upstream, it can simply be reused. But all too often, a new driver gets written from scratch, complete with a new set of bugs that must be fixed. Are vendors really getting better time to market by rewriting new drivers for old IP blocks on every SoC they sell?

In addition, the real time to market for a completely new generation of chips is not measured in weeks. The usual time frame for a new chip, from the marketing announcement to real silicon usable in devices, is close to a year. This should be ample time to get a new driver upstream. Further, the marketing announcement certainly does not happen the day after the engineering department met for beers after work and some genius engineer sketched the new design on a napkin at the bar, so most projects have even more time for upstreaming.

The one-off nature of embedded

It's undisputed that embedded projects, especially in the consumer market, tend to be one-off, but it is also a fact that the variations of a given SoC family have a lot in common and differ only in small details. In addition, variations of a given hardware type - e.g. a smartphone family whose members differ in certain aspects of functionality - share most of the infrastructure. Even next-generation SoCs often have enough pieces of the previous generation embedded into them, as there is no compelling reason to replace already proven-to-work building blocks when the functionality is sufficient. Reuse of known-to-work parts is not a new concept and is, unsurprisingly, an essential part of meeting time-to-market goals.

We recently discovered the following gem. An SoC with perfect support for almost all peripheral components that was already in the mainline kernel underwent a major overhaul. It replaced the CPU core, while the peripheral IP blocks remained the same - except for a slightly different VHDL glue layer which was necessary to hook them up to the new CPU core. Now the engineer in me would have expected that the Linux support for the resulting SoC generation B would have just been a matter of adjusting the existing drivers. The reality taught us that the vendor assigned a team to create the support for the "new" SoC and all the drivers got rewritten from scratch. While we caught some of the drivers in review, some of the others went into the mainline, so now we have two drivers for the same piece of silicon that are neither bug nor feature compatible.

I have a really hard time accepting that rewriting a dozen drivers from scratch is faster than sitting down and identifying the existing - proven to work - drivers and modifying them for the new SoC design. The embedded industry often reuses hardware. Why not reuse software, too?

The SoC diversity

Of course, every SoC vendor will claim that its chip is unique in all aspects and so different that sharing more than a few essential lines of code is impossible. That's understandable from the marketing side, but if you look at the SoC data sheets, the number of unique peripheral building blocks is not excitingly large. Given the fact that the SoC vendors target the same markets and the same customer base, that's more or less expected. A closer look reveals that different vendors often end up using the same or very similar IP blocks for a given functionality. There are only a limited number of functional ways to implement a given requirement in hardware and there are only a few relevant IP block vendors that ship their "unique" building blocks to all of the SoC vendors. The diversity is often limited to a different arrangement of registers or the fact that one vendor chooses a different subset of functionality than the other.

We have recently seen consolidation work in the kernel which proves this to be correct. When cleaning up the interrupt subsystem I noticed that there are only two widely used types of interrupt controllers. Without much effort it was possible to replace the code for more than thirty implementations of "so different" chips with a generic implementation. A similar effort is underway to replace the ever-repeating pattern of GPIO chip support code. These are the low-hanging fruit and there is way more potential for consolidation.

Avoiding the useless work

Dealing with maintainers and their often limited time for review is seen as a bottleneck. The extra work that results from addressing review comments is also seen as a waste of time. The number of maintainers and the time available for review is indeed a limiting factor that needs to be addressed. The ability to review is not limited to those who maintain a certain subsystem of the kernel, and we would like to see more people participating in the review process. Spending a bit of time reviewing other people's code is a very beneficial undertaking as it opens one's mind to different approaches and helps to better understand the overall picture.

On the other hand, getting code reviewed by others is beneficial as well and, in general, leads to better and more maintainable code. It also helps in avoiding mistakes in the next project. In a recent review, which went through many rounds, a 1200-line driver boiled down to 250 lines and at least a handful of bugs and major mistakes got fixed.

When I have the chance to talk to developers after a lengthy review process, most of them concede that the chance to learn and understand more about the Linux way of development by far outweighs the pain of the review process and the necessary rework. When looking at later patches I've often observed that these developers have improved and avoided the mistakes they made in their first attempts. So review is beneficial for the developer and for their company, as it helps them write better code in a more efficient way. I call out those who still claim that review and the resulting work is a major obstacle as hypocrites who are trying to shift the blame for other deficiencies in their companies to the kernel community.

One deficiency is assigning proprietary RTOS developers to write Linux kernel code without teaching them how to work with the Linux community. There is no dishonor in not knowing how to work with the Linux community; after all, every kernel developer, including myself, started at some point without knowing it. But it took me time to learn how to work with the community, and it will take time for proprietary RTOS developers to learn it as well. It is well worth the time and effort.

What would be solved by forking the ARM kernel?

Suppose there was an ARM-specific Git repository which acted as a dumping ground for all of the vendor trees. It would pull in the enhancements of the real mainline kernel from time to time so that the embedded crowd gets the new filesystems, networking features, etc. Extrapolating the recent flow of SoC support patches into Linux and removing all the sanity checks on them would result in a growth rate of that ARM tree which would exceed the growth rate of the overall mainline kernel in no time. And what if the enhancements from mainline require changes to every driver in the ARM tree, as was required for some of my recent work? Who makes those changes? If the drivers are in mainline, the drivers are changed as part of the enhancement. If there is a separate ARM fork, some ARM-fork maintainer will have to make these changes.

And who is going to maintain such a tree? I have serious doubts that a sufficient number of qualified maintainers, with the bandwidth and the experience to deal with such a flood of changes, would surface out of the blue. So to avoid the bottleneck that is one of the complaints when working with mainline, the maintainers would probably just have the role of integrators who merely aggregate the various vendor trees in a central place.

What's the gain of such an exercise? Nothing, as far as I can see; it would just allow everyone to claim that all of their code is part of a mysterious ARM Git tree and, of course, it would fulfill the ultimate "time to market" requirements - in the short term, anyway.

How long would an ARM fork be sustainable?

I seriously doubt that an ARM fork would work for longer than a few kernel cycles, simply because the changes to core code will result in a completely unmaintainable #ifdef mess with incompatible per-SoC APIs that will drive anyone who has to "maintain" such a beast completely nuts in no time. I'm quite sure that none of the ARM-fork proponents has ever tried to pull five full-flavored vendor trees into a single kernel tree and deal with the conflicting changes to DMA APIs, driver subsystems, and infrastructure. I know that maintainers of embedded distribution kernels became desperate in no time for exactly those reasons, and I doubt that any reasonable kernel developer is insane enough to cope with such a horror for more than a couple of months.

Aside from the dubious usefulness, such an ARM fork would cut off the ARM world from influencing the overall direction of the Linux kernel entirely. ARM would become a zero-interest issue for most of the experienced mainline maintainers and developers, as happens with other out-of-tree projects. I doubt that the ARM industry can afford to disconnect itself in such a way, especially as the complexity of operating-system-level software is increasing steadily.

Is there a better answer?

There is never an ultimate answer which will resolve all problems magically, but there are a lot of small answers which can effectively address the main problem spots.

One of the root causes for the situation today is of a historical nature. For over twenty years the industry dealt with closed-source operating systems where changes to the core code were impossible and collaboration with competitors was unthinkable and unworkable. Now, after moving to Linux, large parts of the industry still think in this well-known model and let their engineers - who have often worked on other operating systems before working on Linux - just continue the way they enabled their systems in the past. This is a perfect solution for management as well, because the existing structures and the idea of top-down software development and management still apply.

That "works" as long as the resulting code does not have to be integrated into the mainline kernel and each vendor maintains its own specialized fork. There are reasonable requests from customers for mainline integration, however, as it makes the adoption of new features easier, there is less dependence on the frozen vendor kernels, and, as seen lately, it allows for consolidation toward multi-platform kernels. The latter is important for enabling sensible distribution work targeted at netbooks, tablets, and similar devices. This applies even more for the long-rumored ARM servers which are expected to materialize in the near future. Such consolidation requires cooperation not only across the ARM vendors, it requires a collaborative effort across many parts of the mainline kernel along with the input of maintainers and developers who are not necessarily part of the ARM universe.

So we need to help management understand that holding on to the known models is not the most efficient way to deal with the growing complexity of SoC hardware and the challenges of efficient and sustainable operating system development in the open source space. At the same time, we need to explain at the engineering level that treating Linux in the same way as other OS platforms is making life harder, and is at least partially responsible for the grief which is observed when code hits the mailing lists for review.

Another area that we need to work on is massive collaborative consolidation, which requires, as a precondition, that silicon vendors accept that, at least at the OS engineering level, their SoCs are not as unique as their marketing departments want everyone to believe. As I explained above, there are only a limited number of ways to implement a given functionality in hardware, which is especially true for hardware with a low complexity level. So we need to encourage developers to first look to see whether existing code might be refactored to fit the new device instead of blindly copying the closest matching driver - or, in the worst case, a random driver - and hacking it into shape somehow.

The kernel maintainers, too, need to be more alert to that fact and help to avoid the reinvention of the wheel. If driver reuse cannot be achieved, we can often pull out common functionality into core code and avoid duplication that way. There is a lot of low-hanging fruit here, and the Linux kernel community as a whole needs to get better and spend more brain cycles on avoiding duplication.

One step forward was taken recently with the ARM SoC consolidation efforts that were initiated by Linaro. But this will only succeed if we are aware of the conflicts with the existing corporate culture and address these conflicts at the non-technical level as well.

Secrecy

Aside from the above issues the secrecy barrier is going to be the next major challenge. Of course a silicon vendor is secretive about the details of its next-generation SoC design, but the information which is revealed in marketing announcements allows us to predict at least parts of the design pretty precisely.

The most obvious recent example is the next-generation ARM SoCs from various vendors targeted for the end of 2011. Many of them will come with USB 3.0 support. Going through the IP block vendor offerings tells us that there are fewer USB 3.0 IP blocks available for integration than there are SoC vendors who have announced new chips with USB 3.0 support. That means that there are duplicate drivers in the works and I'm sure that, while the engineers are aware of this, no team is allowed to talk to the competitor's team. Even if they were allowed to do so, it's impossible to figure out who is going to use which particular IP block. So we will see several engineering teams fighting over the "correct" implementation to be merged as the mainline driver in a couple of months, when the secrecy barrier has been lifted.

Competing implementations are not a bad thing per se, but the inability to exchange information and discuss design variants is not helping anyone in the "time to market" race. I seriously doubt that any of the to-be-released drivers will have a relevant competitive advantage and even if one does, it will not last very long when the code becomes public. It's sad that opportunities to collaborate and save precious engineering resources for all involved parties are sacrificed in favor of historical competition-oriented behaviour patterns. These patterns have not been overhauled since they were invented twenty or more years ago and they have never been subject to scrutiny in the context of competition on the open source operating system playground.

A way forward

The ever-increasing complexity of hardware, which leads to more complex operating-system-level software, long ago caused a shortage of high-profile OS-level software developers that cannot be counterbalanced either by money or by assigning a large enough number of the cheapest available developer resources to the task. The ARM universe is diversified enough that there is no chance for any of the vendors to get hold of a significant enough number of outstanding kernel developers to cause any serious damage to their competitors. That is particularly true given the fact that those outstanding developers generally prefer to work in the open rather than in a semi-closed environment.

It's about time for managers to rethink their competition models and start to comprehend and utilize the massive advantage of collaborative models over an historic and no-longer-working model that assumes the infinite availability of up-to-the-task resources when there is enough money thrown into the ring. Competent developers certainly won't dismiss the chance to get some extra salary lightly, but the ability to get a way more enjoyable working environment for a slightly smaller income is for many of them a strong incentive to refuse the temptation. It's an appealing thought for me that there is no "time to market" howto, no "shareholder value" handbook, and no "human resources management" course which ever took into consideration that people might be less bribable than generally expected, especially if this applies to those people who are some of the scarcest resources in the industry.

I hope that managers working for embedded vendors will start to understand how open source works and why there is a huge benefit in working with the community. After all, the stodgy old server vendors were able to figure out how to work with the Linux community, so it cannot be that hard for the fast-moving embedded vendors. However, the realist in me - sometimes called "the grumpy old man" - who has worked in the embedded industry for more than twenty-five years does not believe that at all. For all too many SoC vendors, the decision to "work" with the community was made due to outside pressure and an obsession with following the hype.

Outside pressure is not what the open source enthusiasts might hope for: the influence of the community itself. No, it's simply the pressure applied by (prospective) customers who request that the chip be - at least basically - supported by the mainline kernel. Following the hype is omnipresent, and it always seems to be a valid argument for avoiding common-sense-driven decisions based on long-term considerations.

The eternal optimist in me still has hope that the embedded world will become a first-class citizen in the Linux community sooner rather than later. The realist in me somehow doubts that it will happen before the "grumpy old man" retires.


Object-oriented design patterns in the kernel, part 1

June 1, 2011

This article was contributed by Neil Brown

Despite the fact that the Linux Kernel is mostly written in C, it makes broad use of some techniques from the field of object-oriented programming. Developers wanting to use these object-oriented techniques receive little support or guidance from the language and so are left to fend for themselves. As is often the case, this is a double-edged sword. The developer has enough flexibility to do really cool things, and equally the flexibility to do really stupid things, and it isn't always clear at first glance which is which, or more accurately: where on the spectrum a particular approach sits.

Instead of looking to the language to provide guidance, a software engineer must look to established practice to find out what works well and what is best avoided. Interpreting established practice is not always as easy as one might like and the effort, once made, is worth preserving. To preserve that effort on your author's part, this article brings another installment in an occasional series on Linux Kernel Design Patterns and attempts to set out - with examples - the design patterns in the Linux Kernel which effect an object-oriented style of programming.

Rather than providing a brief introduction to the object-oriented style, tempting though that is, we will assume the reader has a basic knowledge of objects, classes, methods, inheritance, and similar terms. For those as yet unfamiliar with these, there are plenty of resources to be found elsewhere on the web.

Over two weeks we will look for patterns in just two areas: method dispatch and data inheritance. Despite their apparent simplicity they lead to some rich veins for investigation. This first article will focus on method dispatch.

Method Dispatch

The large variety of styles of inheritance and rules for its usage in languages today seems to suggest that there is no uniform understanding of what "object-oriented" really means. The term is a bit like "love": everyone thinks they know what it means but when you get down to details people can find they have very different ideas. While what it means to be "oriented" might not be clear, what we mean by an "object" does seem to be uniformly agreed upon. It is simply an abstraction comprising both state and behavior. An object is like a record (Pascal) or struct (C), except that some of the names of members refer to functions which act on the other fields in the object. These function members are sometimes referred to as "methods".

The most obvious way to implement objects in C is to declare a "struct" where some fields are pointers to functions which take a pointer to the struct itself as their first argument. The calling convention for method "foo" in object "bar" would simply be: bar->foo(bar, ...args); While this pattern is used in the Linux kernel it is not the dominant pattern so we will leave discussion of it until a little later.
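
A minimal, non-kernel sketch of that convention (all names invented for illustration) might look like:

    #include <stdio.h>

    struct bar {
            int count;
            /* the method is stored directly in the object */
            void (*foo)(struct bar *self, int delta);
    };

    static void default_foo(struct bar *self, int delta)
    {
            self->count += delta;
            printf("count is now %d\n", self->count);
    }

    int main(void)
    {
            struct bar obj = { .count = 0, .foo = default_foo };
            struct bar *bar = &obj;

            /* the calling convention described above: the object comes first */
            bar->foo(bar, 3);
            return 0;
    }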

As methods (unlike state) are not normally changed on a per-object basis, a more common and only slightly less obvious approach is to collect all the methods for a particular class of objects into a separate structure, sometimes known as a "virtual function table" or vtable. The object then has a single pointer to this table rather than a separate pointer for each method, and consequently uses less memory.

This then leads to our first pattern - a pure vtable being a structure which contains only function pointers where the first argument of each is a pointer to some other structure (the object type) which itself contains a pointer to this vtable. Some simple examples of this in the Linux kernel are the file_lock_operations structure which contains two function pointers each of which take a pointer to a struct file_lock, and the seq_operations vtable which contains four function pointers which each operate on a struct seq_file. These two examples display an obvious naming pattern - the structure holding a vtable is named for the structure holding the object (possibly abbreviated) followed by "_operations". While this pattern is common it is by no means universal. Around the time of 2.6.39 there are approximately 30 "*_operations" structures along with well over 100 "*_ops" structures, most if not all of which are vtables of some sort. There are also several structs such as struct mdk_personality which are essentially vtables but do not have particularly helpful names.
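
As a toy illustration of that pattern - using invented names, not actual kernel structures - a "pure vtable" arrangement looks roughly like this:

    #include <stdio.h>

    struct item;

    struct item_operations {                   /* the vtable: nothing but function pointers */
            void (*show)(struct item *it);
            void (*reset)(struct item *it);
    };

    struct item {
            int value;
            const struct item_operations *ops; /* one pointer to the shared table */
    };

    static void item_show(struct item *it)  { printf("value = %d\n", it->value); }
    static void item_reset(struct item *it) { it->value = 0; }

    static const struct item_operations item_ops = {
            .show  = item_show,
            .reset = item_reset,
    };

    int main(void)
    {
            struct item a = { .value = 42, .ops = &item_ops };

            a.ops->show(&a);        /* dispatch through the vtable */
            a.ops->reset(&a);
            a.ops->show(&a);
            return 0;
    }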

Among these nearly 200 vtable structures there is plenty of variability and so plenty of scope to look for interesting patterns. In particular we can look for common variations from the "pure vtable" pattern described above and determine how these variations contribute to our understanding of object use in Linux.

NULL function pointers

The first observation is that some function pointers in some vtables are allowed to be NULL. Clearly trying to call such a function would be futile, so the code that calls into these methods generally contains an explicit test for the pointer being NULL. There are a few different reasons for these NULL pointers. Probably easiest to justify is the incremental development reason. Because of the way vtable structures are initialized, adding a new function pointer to the structure definition causes all existing table declarations to initialise that pointer to NULL. Thus it is possible to add a caller of the new method before any instance supports that method, and have it check for NULL and perform a default behavior. Then as incremental development continues those vtable instances which need it can get non-default methods.

A recent example is commit 77af1b2641faf4 adding set_voltage_time_sel() to struct regulator_ops which acts on struct regulator_dev. Subsequent commit 42ab616afe8844 defines that method for a particular device. This is simply the most recent example of a very common theme.

Another common reason is that certain methods are not particularly meaningful in certain cases so the calling code simply tests for NULL and returns an appropriate error when found. There are multiple examples of this in the virtual filesystem (VFS) layer. For instance, the create() function in inode_operations is only meaningful if the inode in question is a directory. So inode_operations structures for non-directories typically have NULL for the create() function (and many others) and the calling code in vfs_create() checks for NULL and returns -EACCES.
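
The caller-side test described here looks roughly like the following sketch, loosely modeled on vfs_create() but with invented names and simplified arguments:

    #include <errno.h>

    struct xdir;

    struct xdir_operations {
            int (*create)(struct xdir *dir, const char *name, int mode);
    };

    struct xdir {
            const struct xdir_operations *ops;
    };

    static int xdir_create(struct xdir *dir, const char *name, int mode)
    {
            if (!dir->ops->create)          /* method not provided... */
                    return -EACCES;         /* ...so fall back to a default */

            return dir->ops->create(dir, name, mode);
    }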

A final reason that vtables sometimes contain NULL is that an element of functionality might be being transitioned from one interface to another. A good example of this is the ioctl() operation in file_operations. In 2.6.11, a new method, unlocked_ioctl(), was added which was called without the big kernel lock held. In 2.6.36, when all drivers and filesystems had been converted to use unlocked_ioctl(), the original ioctl() was finally removed. During this transition a filesystem would typically define only one of the two, leaving the other defaulting to NULL.

A slightly more subtle example of this is read() and aio_read(), also in file_operations, and the corresponding write() and aio_write(). aio_read() was introduced to support asynchronous IO, and if it is provided the regular synchronous read() is not needed (it is effected using do_sync_read() which calls the aio_read() method). In this case there appears to be no intention of ever removing read() - it will remain for cases where async IO is not relevant such as special filesystems like procfs and sysfs. So it is still the case that only one of each pair need be defined by a filesystem, but it is not simply a transition, it is a long-term state.

Though there seem to be several different reasons for a NULL function pointer, almost every case is an example of one simple pattern - that of providing a default implementation for the method. In the "incremental development" examples and the non-meaningful method case, this is fairly straightforward: e.g., the default for inode->create() is simply to return an error. In the interface transition case it is only slightly less obvious. The default for unlocked_ioctl() would be to take the kernel lock and then call the ioctl() method. The default for read() is exactly do_sync_read() and some filesystems such as ext3 actually provide this value explicitly rather than using "NULL" to indicate a default.

With that in mind, a little reflection suggests that if the real goal is to provide a default, then maybe the best approach would be to explicitly give a default rather than using the circuitous route of using a default of NULL and interpreting it specially.

While NULL is certainly the easiest value to provide as a default - as the C standard assures us that uninitialized members of a structure do get set to NULL - it is not very much harder to set a more meaningful default. I am indebted to LWN reader wahern for the observation that C99 allows fields in a structure to be initialized multiple times with only the final value taking effect and that this allows easy setting of default values such as by following the simple model:

    #define FOO_DEFAULTS  .bar = default_bar, .baz = default_baz

    struct foo_operations my_foo = {
            FOO_DEFAULTS,
            .bar = my_bar,
    };

This will declare my_foo with a predefined default value for baz and a localized value for bar. Thus for the small cost of defining a few "default" functions and including a "_DEFAULTS" entry to each declaration, the default value for any field can easily be chosen when the field is first created, and automatically included in every use of the structure.

Not only are meaningful defaults easy to implement, they can lead to a more efficient implementation. In those cases where the function pointer actually is NULL it is probably faster to test and branch rather than to make an indirect function call. However the NULL case is very often the exception rather than the rule, and optimizing for an exception is not normal practice. In the more common case when the function pointer is not NULL, the test for NULL is simply a waste of code space and a waste of execution time. If we disallow NULLs we can make all call sites a little bit smaller and simpler.

In general, any testing performed by the caller before calling a method can be seen as an instance of the "mid-layer mistake" discussed in a previous article. It shows that the mid-layer is making assumptions about the behavior of the lower level driver rather than simply giving the driver freedom to behave in whatever way is most suitable. This may not always be an expensive mistake, but it is still best avoided where possible. Nevertheless there is a clear pattern in the Linux kernel that pointers in vtables can sometimes be NULLable, typically though not always to enable a transition, and the call sites should in these cases test for NULL before proceeding with the call.

The observant reader will have noticed a hole in the above logic denouncing the use of NULL pointers for defaults. In the case where the default is the common case and where performance is paramount, the reasoning does not hold and a NULL pointer could well be justified. Naturally the Linux kernel provides an example of such a case for our examination.

One of the data structures used by the VFS for caching filesystem information is the "dentry". A "dentry" represents a name in the filesystem, and so each "dentry" has a parent, being the directory containing it, and an "inode" representing the named file. The dentry is separate from the inode because a single file can have multiple names (so an "inode" can have multiple "dentry"s). There is a dentry_operations vtable with a number of operations including, for example, "d_compare" which will compare two names and "d_hash" which will generate a hash for the name to guide the storage of the "dentry" in a hash table. Most filesystems do not need this flexibility. They treat names as uninterpreted strings of bytes so the default compare and hash functions are the common case. A few filesystems define these to handle case-insensitive names but that is not the norm.

Further, filename lookup is a common operation in Linux and so optimizing it is a priority. Thus these two operations appear to be good candidates where a test for NULL and an inlined default operation might be appropriate. What we find though is that when such an optimization is warranted it is not by itself enough. The code that calls d_compare() and d_hash() (and a couple of other dentry operations) does not test these functions for NULL directly. Rather they require that a few flag bits (DCACHE_OP_HASH, DCACHE_OP_COMPARE) in the "dentry" are set up to indicate whether the common default should be used, or whether the function should be called. As the flag field is likely to be in cache anyway, and the dentry_operations structure will often be not needed at all, this avoids a memory fetch in a hot path.
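
Schematically - with made-up names standing in for the real dentry code - that flag-based dispatch looks something like this:

    #include <string.h>

    #define XD_OP_COMPARE   0x01    /* set only when ops->compare is non-default */

    struct xdentry;

    struct xdentry_operations {
            int (*compare)(const struct xdentry *d, const char *a, const char *b);
    };

    struct xdentry {
            unsigned int flags;
            const struct xdentry_operations *ops;
    };

    static int xdentry_compare(const struct xdentry *d, const char *a, const char *b)
    {
            if (d->flags & XD_OP_COMPARE)
                    return d->ops->compare(d, a, b);

            /* inlined default: plain byte-wise comparison, no table fetch needed */
            return strcmp(a, b);
    }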

So we find that the one case where using a NULL function pointer to indicate a default could be justified, it is not actually used; instead, a different, more efficient, mechanism is used to indicate that the default method is requested.

Members other than function pointers

While most vtable-like structures in the kernel contain exclusively function pointers, there are a significant minority that have non-function-pointer fields. Many of these appear quite arbitrary on the surface, and closer inspection suggests that some of them are the result of poor design or bit-rot; their removal would only improve the code.

There is one exception to the "functions only" pattern that occurs repeatedly and provides real value, and so is worth exploring. This pattern is seen in its most general form in struct mdk_personality which provides operations for a particular software RAID level. In particular this structure contains an "owner", a "name", and a "list". The "owner" is the module that provides the implementation. The "name" is a simple identifier: some vtables have string names, some have numeric names, and it is often called something different like "version", "family", "drvname", or "level". But conceptually it is still a name. In the present example there are two names, a string and a numeric "level".

The "list", while part of the same functionality, is less common. The mdk_personality structure has a struct list_head, as does struct ts_ops. struct file_system_type has a simple pointer to the next struct file_system_type. The underlying idea here is that for any particular implementation of an interface (or "final" definition of a class) to be usable, it must be registered in some way so that it can be found. Further, once it has been found it must be possible to ensure that the module holding the implementation is not removed while it is in use.

There seem to be nearly as many styles of registration against an interface in Linux as there are interfaces to register against, so finding strong patterns there would be a difficult task. However it is fairly common for a "vtable" to be treated as the primary handle on a particular implementation of an interface and to have an "owner" pointer which can be used to get a reference on the module which provides the implementation.

So the pattern we find here is that a structure of function pointers used as a "vtable" for object method dispatch should normally contain only function pointers. Exceptions require clear justification. A common exception allows a module pointer and possible other fields such as a name and a list pointer. These fields are used to support the registration protocol for the particular interface. When there is no list pointer it is very likely that the entire vtable will be treated as read-only. In this case the vtable will often be declared as a const structure and so could even be stored in read-only memory.
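
An outline of that registration pattern, in the spirit of struct mdk_personality but with invented names and details, might look like:

    #include <linux/list.h>
    #include <linux/module.h>
    #include <linux/spinlock.h>

    struct widget;                          /* the object type the methods act on */

    struct widget_personality {
            const char              *name;  /* identifier used to find it */
            struct module           *owner; /* keeps the module pinned while in use */
            struct list_head        list;   /* links it into the registration list */

            int     (*start)(struct widget *w);
            void    (*stop)(struct widget *w);
    };

    static LIST_HEAD(widget_personalities);
    static DEFINE_SPINLOCK(widget_lock);

    int register_widget_personality(struct widget_personality *p)
    {
            spin_lock(&widget_lock);
            list_add_tail(&p->list, &widget_personalities);
            spin_unlock(&widget_lock);
            return 0;
    }

    /* A driver module would then declare and register its implementation: */
    static struct widget_personality my_personality = {
            .name   = "example",
            .owner  = THIS_MODULE,
            /* .start and .stop supplied by the driver */
    };

A lookup function would then walk widget_personalities, match on the name, and pin the implementation with try_module_get(p->owner) before invoking its methods.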

Combining Methods for different objects

A final common deviation from the "pure vtable" pattern that we see in the Linux kernel occurs when the first argument to the function is not always the same object type. In a pure vtable which is referenced by a pointer in a particular data structure, the first argument of each function is exactly that data structure. What reason could there be for deviating from that pattern? It turns out that there are few, some more interesting than others.

The simplest and least interesting explanation is that, for no apparent reason, the target data structure is listed elsewhere in the argument list. For example all functions in struct fb_ops take a struct fb_info. While in 18 cases that structure is the first argument, in five cases it is the last. There is nothing obviously wrong with this choice and it is unlikely to confuse developers. It is only a problem for data miners like your author who need to filter it out as an irrelevant pattern.

A slight deviation on this pattern is seen in struct rfkill_ops where two functions take a struct rfkill but the third - set_block() - takes a void *data. Further investigation shows that this opaque data is exactly that which is stored in rfkill->data, so set_block() could easily be defined to take a struct rfkill and simply follow the ->data link itself. This deviation is sufficiently non-obvious that it could conceivably confuse developers as well as data miners and so should be avoided.

The next deviation is seen, for example, in platform_suspend_ops, oprofile_operations, security_operations, and a few others. These take an odd assortment of arguments with no obvious pattern. However, these are really very different sorts of vtable structures in that the objects they belong to are singletons. There is only one active platform, only one profiler, only one security policy. Thus the "object" on which these operations act is part of the global state and so does not need to be included in the arguments of any functions.

Having filtered these two patterns out as not being very interesting we are left with two that do serve to tell us something about object use in the kernel.

quota_format_ops and export_operations are two different operations structures that operate on a variety of different data structures. In each case the apparent primary object (e.g. a struct super_block or a struct dentry) already has a vtable structure dedicated to it (such as super_operations or dentry_operations) and these new structures add new operations. In each case the new operations form a cohesive unit providing a related set of functionality - whether supporting disk quotas or NFS export. They don't all act on the same object simply because the functionality in question depends on a variety of objects.

The best term from the language of object-oriented programming for this is probably the "mixin". Though the fit may not be perfect - depending on what your exact understanding of mixin is - the idea of bringing in a collection of functionality without using strict hierarchical inheritance is very close to the purpose of quota_format_ops and export_operations.

Once we know to be on the lookout for mixins like these we can find quite a few more examples. The pattern to be alert for is not the one that led us here - an operations structure that operates on a variety of different objects - but rather the one we found where the functions in an "operations" structure operate on objects that already have their own "operations" structure. When an object has a large number of operations that are relevant and these operations naturally group into subsets, it makes a lot of sense to divide them into separate vtable-like structures. There are several examples of this in the networking code where for instance both tcp_congestion_ops and inet_connection_sock_af_ops operate (primarily) on a struct sock, which itself has already got a small set of dedicated operations.
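
In schematic form (again with invented names, merely echoing the networking examples), a mixin is a second, optional operations structure whose methods act on an object that already carries its primary vtable:

    struct conn;

    struct conn_operations {                /* the object's primary vtable */
            int  (*send)(struct conn *c, const void *buf, int len);
            void (*close)(struct conn *c);
    };

    struct congestion_ops {                 /* mixin: a related, selectable set of methods */
            const char *name;
            void (*on_ack)(struct conn *c, int acked);
            int  (*cwnd)(const struct conn *c);
    };

    struct conn {
            const struct conn_operations *ops;  /* always present */
            const struct congestion_ops  *cc;   /* chosen independently of ops */
            int snd_cwnd;
    };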

So the pattern of a "mixin" - at least as defined as a set of operations which apply to one or more objects without being the primary operations for those objects - is a pattern that is often found in the kernel and appears to be quite valuable in allowing better modularization of code.

The last pattern which explains non-uniform function targets is probably the most interesting, particularly in its contrast to the obvious application of object-oriented programming style. Examples of this pattern abound with ata_port_operations, tty_operations, nfs_rpc_ops and atmdev_ops all appearing as useful examples. However we will focus primarily on some examples from the filesystem layer, particularly super_operations and inode_operations.

There is a strong hierarchy of objects in the implementation of a filesystem where the filesystem - represented by a "super_block" - has a number of files (struct inode) which may have a number of names or links (struct dentry). Further each file might store data in the page cache (struct address_space) which comprises a number of individual pages (struct page). There is a sense in which all of these different objects belong to the filesystem as a whole. If a page needs to be loaded with data from a file, the filesystem knows how to do that, and it is probably the same mechanism for every page in every file. Where it isn't always the same, the filesystem knows that too. So we could conceivably store every operation on every one of these objects in the struct super_block, as it represents the filesystem and could know what to do in each case.

In practice that extreme is not really helpful. It is quite likely that while there are similarities between the storage of a regular file and a directory, there are also important differences and being able to encode those differences in separate vtables can be helpful. Sometimes small symbolic links are stored directly in the inode while larger links are stored like the contents of a regular file. Having different readlink() operations for the two cases can make the code a lot more readable.

While the extreme of every operation attached to the one central structure is not ideal, it is equally true that the opposite extreme is not ideal either. The struct page in Linux does not have a vtable pointer at all - in part because we want to keep the structure as small as possible because it is so populous. Rather the address_space_operations structure contains the operations that act on a page. Similarly the super_operations structure contains some operations that apply to inodes, and inode_operations contains some operations that apply to dentries.

It is clearly possible to have operations structures attached to a parent of the target object - providing the target holds a reference to the parent, which it normally does - though it is not quite so clear that it is always beneficial. In the case of struct page which avoids having a vtable pointer altogether the benefit is clear. In the case of struct inode which has its own vtable pointer, the benefit of having some operations (such as destroy_inode() or write_inode()) attached to the super_block is less clear.

As there are several vtable structures where any given function pointer could be stored, the actual choice is in many cases little more than historical accident. Certainly the proliferation of struct dentry operations in inode_operations seems to be largely due to the fact that some of them used to act directly on the inode, but changes in the VFS eventually required this to change. For example in 2.1.78-pre1, each of link(), readlink(), followlink() (and some others which are now defunct) were changed from taking a struct inode to take a struct dentry instead. This set the scene for "dentry" operations to be in inode_operations, so when setattr and getattr were added for 2.3.48, it probably seemed completely natural to include them in inode_operations despite the fact that they acted primarily on a dentry.

Possibly we could simplify things by getting rid of dentry_operations altogether. Some operations that act on dentries are already in inode_operations and super_operations - why not move them all there? While dentries are not as populous as struct page there are still a lot of them and removing the "d_op" field could save 5% of the memory used by that structure (on x86-64).

With two exceptions, every active filesystem only has a single dentry operations structure in effect. Some filesystem implementations like "vfat" define two - e.g. one with case-sensitive matching and one with case-insensitive matching - but there is only one active per super-block. So it would seem that the operations in dentry_operations could be moved to super_operations, or at least accessed through "s_d_op". The two exceptions are ceph and procfs. These filesystems use different d_revalidate() operations in different parts of the filesystem and - in the case of procfs - different d_release() operations. The necessary distinctions could easily be made in per-superblock versions of these operations. Do these cases justify the 5% space cost? Arguably not.

Directly embedded function pointers

Finally it is appropriate to reflect on the alternate pattern mentioned at the start, where function pointers are stored directly in the object rather than in a separate vtable structure. This pattern can be seen in struct request_queue which has nine function pointers, struct efi which has ten function pointers, and struct sock which has six function pointers.

The cost of embedded pointers is obviously space. When vtables are used, there is only one copy of the vtable and multiple copies of an object (in most cases) so if more than one function pointer is needed, a vtable would save space. The cost of a vtable is an extra memory reference, though cache might reduce much of this cost in some cases. A vtable also has a cost in flexibility. When each object needs exactly the same set of operations a vtable is good, but if there is a need to individually tailor some of the operations for each object, then embedded function pointers can provide that flexibility. This is illustrated quite nicely by the comment with "zoom_video" in struct pcmcia_socket:

	/* Zoom video behaviour is so chip specific its not worth adding
	   this to _ops */

So where objects are not very populous, where the list of function pointers is small, and where multiple mixins are needed, embedded function pointers are used instead of a separate vtable.

Method Dispatch Summary

If we combine all the pattern elements that we have found in Linux we find that:

Method pointers that operate on a particular type of object are normally collected in a vtable associated directly with that object, though they can also appear:

  • In a mixin vtable that collects related functionality which may be selectable independently of the base type of the object.

  • In the vtable for a "parent" object when doing so avoids the need for a vtable pointer in a populous object

  • Directly in the object when there are few method pointers, or they need to be individually tailored to the particular object.

These vtables rarely contain anything other than function pointers, though fields needed to register the object class can be appropriate. Allowing these function pointers to be NULL is a common but not necessarily ideal technique for handling defaults.

So in exploring the Linux Kernel code we have found that even though it is not written in an object-oriented language, it certainly contains objects, classes (represented as vtables), and even mixins. It also contains concepts not normally found in object-oriented languages such as delegating object methods to a "parent" object.

Hopefully understanding these different patterns and the reasons for choosing between them can lead to more uniform application of the patterns across the kernel, and hence make it easier for a newcomer to understand which pattern is being followed. In the second part of our examination of object oriented patterns we will explore the various ways that data inheritance is achieved in the Linux kernel and discuss the strengths and weaknesses of each approach so as to see where each is most appropriate.


