Brief items
The current development kernel is 3.0-rc1,
released on May 29. Linus said:
So what are the big changes? NOTHING. Absolutely nothing. Sure, we
have the usual two thirds driver changes, and a lot of random
fixes, but the point is that 3.0 is *just* about renumbering, we
are very much *not* doing a KDE-4 or a Gnome-3 here. No breakage,
no special scary new features, nothing at all like that. We've been
doing time-based releases for many years now, this is in no way
about features. If you want an excuse for the renumbering, you
really should look at the time-based one ('20 years')
instead.
See the separate article below for a summary of the changes merged in the
second half of the 3.0 merge window.
Stable updates: The
2.6.38.8 and 2.6.39.1 stable updates were released on June
2. This will be the last 2.6.38 stable release, so users should move to
the 2.6.39 series.
Comments (3 posted)
Users are a really terrible source of interface
specifications. "Hackers" are often not much better, but at least
if the interface is lousy the developer has the potential to be
accountable for it and its improvement.
--
Casey Schaufler
IMHO the key design mistake of LSM is that it detaches security
policy from applications: you need to be admin to load policies,
you need to be root to use/configure an LSM. Dammit, you need to be
root to add labels to files!
This not only makes the LSM policies distro specific (and
needlessly forked and detached from real security), but also gives
the message that:
'to ensure your security you need to be privileged'
which is the anti-concept of good security IMO.
--
Ingo Molnar
Comments (1 posted)
Matthew Garrett appears to be having some "fun"
looking into how to reboot x86 hardware. He lists five different mechanisms to reboot 64-bit x86 hardware including: "
kbd - reboot via the keyboard controller. The original IBM PC had the CPU reset line tied to the keyboard controller. Writing the appropriate magic value pulses the line and the machine resets. This is all very straightforward, except for the fact that modern machines don't have keyboard controllers (they're actually part of the embedded controller) and even more modern machines don't even pretend to have a keyboard controller. Now, embedded controllers run software. And, as we all know, software is dreadful. But, worse, the software on the embedded controller has been written by BIOS authors. So clearly any pretence that this ever works is some kind of elaborate fiction. Some machines are very picky about hardware being in the exact state that Windows would program. Some machines work 9 times out of 10 and then lock up due to some odd timing issue. And others simply don't work at all. Hurrah!"
Comments (21 posted)
Once upon a time, Joe Pranevich made a name for himself by writing
comprehensive release notes for major kernel releases. He got out of that
business during the 2.6.x series, but he is now back with a draft version
of "The Wonderful World of Linux 3.0". Readers who are curious about what
has happened since the 2.6.0 release may be interested in giving it a
look. "
This document describes just a few of the thousands of changes and
improvements that have been made to the Linux kernel since the launch
of Linux 2.6. I have attempted to make it as accessible as possible
for a general audience, while not shying away from technical language
when necessary."
Full Story (comments: 11)
Kernel development news
By Jonathan Corbet
June 1, 2011
In the end, 7,333 non-merge changesets were pulled into the mainline kernel
before Linus closed the merge window and decreed that the next release
would be called "3.0". There have not been vast numbers of exciting new
features added since
last week's summary
was written, but there are a few. The most significant user-visible
changes include:
- The namespace file descriptors patch,
which includes the setns() system call, has been merged.
This feature makes it easier to manage containers running in different
namespaces; a brief usage sketch appears after this list.
- The XFS filesystem now has online discard support.
- The Cleancache functionality has been
merged. Cleancache allows for intermediate storage of pages which
have been pushed out of the page cache but which might still be useful
in the future. Cleancache is initially supported by ext3, ext4, and
ocfs2.
- A new netlink-based infrastructure allows the management of RDMA
clients.
- It is now possible to move all threads in a group into a control group
at once using the cgroup.procs control file.
- The Blackfin architecture has gained perf events support.
- The btrfs filesystem has gained support for an administrator-initiated
"scrub" operation
that can read through a filesystem's blocks and verify checksums.
When possible, bad copies of data will be replaced by good copies from
another storage device. Also supported by btrfs is an
auto_defrag mount option causing the filesystem to notice
random writes to files and schedule them for defragmentation.
- The no-hlt boot parameter has been deprecated; no machines
have needed it in this millennium. Should there be any machines with
non-working HLT instructions running current kernels, they can be
booted with idle=poll.
- Support for the pNFS protocol backed by object storage devices has
been added.
- New hardware support includes:
- Systems and processors: TILE-Gx 64-bit processors and
Blackfin SPORT SPI busses.
- Input: Qualcomm PMIC8XXX keypads.
- Media: Fintek consumer infrared transceivers, and
Fujitsu M-5MOLS 8MP sensors.
- Network:
GPIO-controlled RF-kill switches.
- USB:
VUB300 USB to SDIO/SD/MMC host controllers.
- Miscellaneous:
ST-Ericsson DB5500 power reset control management units,
AMD Family 15h processor power monitors,
SMSC EMC6W201 hardware monitors,
Marvell 88PM860x real-time clocks,
HTC ASIC3 LED controllers,
Qualcomm PM8921 PMIC chips,
Micro Crystal RV3029-C2 RTC chips,
VIA/WonderMedia 85xx SoC RTC chips,
ST M41T93 SPI RTC chips,
EM Microelectronic EM3027 RTC chips,
Maxim/Dallas DS2780 stand-alone fuel gauge ICs,
Maxim MAX8903 battery chargers, and
TI TPS65910 and TPS65911 power management chips.
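As a brief illustration of the setns() system call mentioned above, the sketch below joins the network namespace of another process and then runs a command there. It is a minimal, hypothetical example: it calls through syscall(2) because C libraries of the time may not yet provide a setns() wrapper, and it assumes a kernel and headers new enough to define __NR_setns.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64];
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        exit(1);
    }

    /* The per-process namespace file descriptors live under /proc/<pid>/ns/ */
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    /* CLONE_NEWNET asks the kernel to verify that this really is a
       network namespace descriptor. */
    if (syscall(__NR_setns, fd, CLONE_NEWNET) < 0) {
        perror("setns");
        exit(1);
    }

    /* From here on, network operations happen in the target namespace. */
    execlp("ip", "ip", "link", "show", (char *)NULL);
    perror("execlp");
    return 1;
}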
Changes visible to kernel developers include:
- There is a new core support module for GPIO controllers based
on memory-mapped I/O.
- There is a new atomic_or() operation to perform a logical
OR operation on an atomic_t value; a brief usage sketch appears below.
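A minimal, hypothetical sketch of how kernel code might use the new operation follows. The flag bits are invented for illustration, and the example assumes the generic atomic_or(int, atomic_t *) form; the exact header required can vary by architecture.

#include <linux/atomic.h>

#define MY_FLAG_READY 0x01 /* invented flag bits, for illustration only */
#define MY_FLAG_ERROR 0x02

static atomic_t my_flags = ATOMIC_INIT(0);

static void mark_ready(void)
{
    /* Atomically OR the READY bit into the flag word; no lock needed. */
    atomic_or(MY_FLAG_READY, &my_flags);
}

static int is_ready(void)
{
    return atomic_read(&my_flags) & MY_FLAG_READY;
}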
With the -rc1 release, Linus tagged the kernel "3.0.0" (with a new name of
"Sneaky Weasel"). His stated intent is to drop the last digit during the
stabilization period so that the final kernel would be just "3.0", but that
depends on getting various user-space scripts fixed. Either way, the
stable updates that most people will actually run will start with 3.0.1.
Linus is clearly hoping for a relatively smooth development cycle this time
around; he has hinted that he may be fussier than usual about the fixes
that he'll pull from now on. 3.0, it seems, is supposed to be boring, just
to drive home the point that the version number change does not really mean
much. The final release, boring or not, can be expected sometime in the
first half of July.
Comments (1 posted)
June 2, 2011
This article was contributed by Thomas Gleixner
In the last few months, many people have suggested forking
the ARM kernel and maintaining it as a separate project. While the reasons for
forking ARM may seem attractive to some, it turns out that it really
doesn't make very much sense for either the ARM community or the
kernel as a whole.
Here are the most common reasons given for this suggestion:
- Time to market
- It matches the one-off nature of consumer electronics
- It better suits the diversity of the system-on-chip (SoC) world
- It avoids the bottleneck of maintainers and useless extra work in
response to reviews
Let's have a look at these reasons.
Time to market
The time-to-market advocates reason that it takes less time to
hack up a new driver than to get it accepted upstream.
I know I'm old school and do not understand the rapidly changing world
and new challenges of the semiconductor industry anymore, but I
still have enough engineering knowledge and common sense to know that
there is no real concept of totally disconnected "new" things in that
industry.
Most of an SoC's IP blocks (functionality licensed for inclusion in an SoC)
will not be new.
So what happens to time to market the second time an IP block is used?
If the driver is upstream, it can simply be reused.
But all too often, a new driver gets written from scratch, complete with a new
set of bugs that must be fixed.
Are vendors really getting better time to market by rewriting new drivers
for old IP blocks on every SoC they sell?
In addition, the real time to market for a completely new generation of
chips is not measured in weeks.
The usual time frame for a new chip, from the marketing
announcement to real silicon usable in devices is close to a year.
This should be ample time to get a new driver upstream.
In addition, the marketing announcement certainly does not happen the day
after the engineering department met for beers
after work and some genius engineer sketched the new design on a
napkin at the bar, so most projects have even more time for upstreaming.
The one-off nature of embedded
It's undisputed that embedded projects, especially in the consumer
market, tend to be one-off, but it is also a fact that the variations of
a given SoC family have a lot in common and differ only in small
details. In addition, variations of a given hardware type -
e.g. the smartphone family which differs in certain aspects of
functionality - share most of the infrastructure. Even next
generation SoCs often have enough pieces of the previous generation
embedded into them as there is no compelling reason to replace already
proven-to-work building blocks when the functionality is
sufficient. Reuse of known-to-work parts is not a new concept and
is, unsurprisingly, an essential part of meeting time-to-market goals.
We recently discovered the following gem. An SoC whose peripheral
components almost all had perfect support in the mainline kernel
underwent a major overhaul: the CPU core was replaced, while the
peripheral IP blocks remained the same - except for a slightly
different VHDL glue layer which was necessary to hook them up to the
new CPU core. Now the
engineer in me would have expected that the Linux support for the
resulting SoC generation B would have just been a matter of adjusting
the existing drivers. The reality taught us that the vendor assigned
a team to create the support for the "new" SoC and all the drivers got
rewritten from scratch. While we caught some of the drivers in review,
some of the others went into the mainline, so now we have two drivers for the
same piece of silicon that are neither bug nor feature compatible.
I have a really hard time accepting that rewriting a dozen drivers from
scratch is faster than sitting down and identifying the existing -
proven to work - drivers and modifying them for the new SoC design.
The embedded industry often reuses hardware.
Why not reuse software, too?
The SoC diversity
Of course, every SoC vendor will claim that its chip is unique in all
aspects and so different that sharing more than a few essential lines
of code is impossible. That's understandable from the marketing side,
but if you look at the SoC data sheets,
the number of unique peripheral building blocks is not excitingly
large. Given the fact that the SoC vendors target the same markets and
the same customer base, that's more or less expected. A closer look
reveals that different vendors often end up using the same or
very similar IP blocks for a given functionality. There are only a
limited number of functional ways to implement a given requirement in
hardware and there are only a few relevant IP block vendors that ship
their "unique" building blocks to all of the SoC vendors. The
diversity is often limited to a different arrangement of
registers or the fact that one vendor chooses a different subset of
functionality than the other.
We have recently seen consolidation work in the kernel which proves
this to be correct. When cleaning up the interrupt subsystem I noticed
that there are only two widely used types of interrupt controllers.
Without much effort it was possible to replace the code for
more than thirty implementations of "so different" chips with a
generic implementation. A similar effort is on the way to replace the
ever-repeating pattern of GPIO chip support code. These are the low
hanging fruit and there is way more potential for consolidation.
Avoiding the useless work
Dealing with maintainers and their often limited time for review is
seen as a bottleneck. The extra work which results from addressing
the review comments is also seen as a waste of time. The number of
maintainers and the time available for review are indeed limiting
factors that need to be addressed. The ability to review is not
limited to those who maintain a certain subsystem of the kernel and we
would like to see more people participating in the review process. Spending
a bit of time reviewing other people's code is a very beneficial
undertaking as it opens one's mind to different approaches and helps to
better understand the overall picture.
On the other side, getting code reviewed by others is beneficial as
well and, in general, leads to better and more maintainable code. It
also helps in avoiding mistakes in the next project. In a recent review,
which went through many rounds, a 1200-line driver boiled down to
250 lines and at least a handful of bugs and major mistakes got
fixed.
When I have the chance to talk to developers after a lengthy
review process, most of them concede that the chance to learn and
understand more about the Linux way of development by far outweighs
the pain of the review process and the necessary rework. When looking
at later patches I've often observed that these developers
have improved and avoided the mistakes they made in their first
attempts. So review is beneficial for the developer and for their company,
as it helps them write better code more efficiently. I call
out those who still claim that review and the resulting work is a
major obstacle as hypocrites who are trying to shift the blame for other
deficiencies in their companies to the kernel community.
One deficiency is assigning proprietary RTOS developers to
write Linux kernel code without teaching them how to work with
the Linux community.
There is no dishonor in not knowing how to work with the Linux
community; after all, every kernel developer including myself started
at some point without knowing it.
But it took me time to learn how to work with the community, and it will
take time for proprietary RTOS developers to work with the community.
It is well worth the time and effort.
What would be solved by forking the ARM kernel?
Suppose there was an ARM-specific Git repository which acted as a
dumping ground for all
of the vendor trees. It would pull in the enhancements of the real mainline
kernel
from time to time so that the embedded crowd gets the new filesystems,
networking features, etc.
Extrapolating
the recent flow of SoC support patches into Linux and removing all the
sanity checks on them would result in a growth rate of that ARM tree
which would exceed the growth rate of the overall mainline kernel
in no time.
And what if the enhancements from mainline require changes to every driver in
the ARM tree, as was required for some of my recent work?
Who makes those changes?
If the drivers are in mainline, the drivers are changed as part of the
enhancement.
If there is a separate ARM fork, some ARM-fork maintainer will have to
make these changes.
And who is going to maintain such a tree? I have serious doubts that
a sufficient number of qualified maintainers with the bandwidth and
the experience to deal with such a flood of changes will surface out
of the blue. So to avoid the bottleneck that is one of the
complaints when working with mainline, the maintainers would probably
just have the role of integrators who merely aggregate the various vendor
trees in a central place.
What's the gain of such an exercise? Nothing, as far as I can see; it
would just allow everyone to claim that all of their code is part of a
mysterious ARM Git tree and, of course, it would fulfill the ultimate
"time to market" requirements, in the short term, anyway.
How long would an ARM fork be sustainable?
I seriously doubt that an ARM fork would work longer than a few kernel cycles
simply because the changes to core code will result in a completely
unmaintainable #ifdef mess with incompatible per-SoC APIs that will
drive anyone who has to "maintain" such a beast completely nuts in no
time. I'm quite sure that none of the ARM-fork proponents has ever
tried to pull five full-flavored vendor trees into a single kernel
tree and deal with the conflicting changes to DMA APIs, driver
subsystems, and infrastructure. I know that maintainers of embedded
distribution kernels became desperate in no time exactly for those
reasons and I doubt that any reasonable kernel developer is insane
enough to cope with such a horror for more than a couple of months.
Aside from the dubious usefulness, such an ARM fork would cut off
the ARM world from influencing the overall direction of the Linux
kernel entirely. ARM would become a zero-interest issue for most
of the experienced mainline maintainers and developers as with other
out-of-tree projects. I doubt that the ARM industry can afford to
disconnect itself in such a way, especially as the complexity of
operating-system-level software is increasing steadily.
Is there a better answer?
There is never an ultimate answer which will resolve all problems
magically, but there are a lot of small answers which can effectively
address the main problem spots.
One of the root causes for the situation today is of a historical
nature. For over twenty years the industry dealt with closed source
operating systems where changes to the core code were impossible and
collaboration with competitors was unthinkable and unworkable. Now,
after moving to Linux, large parts of the industry still think in this
well-known model and let their engineers - who have often worked on other
operating systems before working on Linux - just continue the way
they enabled their systems in the past. This is a perfect solution
for management as well, because the existing structures and the idea of
top-down software development and management still apply.
That "works" as long as the resulting code does not have to be integrated
into the mainline kernel and each vendor maintains its own specialized
fork. There are reasonable requests from customers for mainline
integration, however, as it makes the adoption of new features
easier, there is less dependence on the frozen vendor kernels, and, as seen lately,
it allows for consolidation toward multi-platform kernels. The latter is
important for enabling sensible distribution work targeted at netbooks,
tablets, and similar devices. This applies even more for the long-rumored
ARM servers which are expected to materialize in the near
future. Such consolidation requires cooperation not only across the ARM
vendors, it requires a collaborative effort across many parts of the
mainline kernel along with the input of maintainers and developers who are
not necessarily part of the ARM universe.
So we need to help management understand that holding on to the known
models is not the most efficient way to deal with the growing
complexity of SoC hardware and the challenges of efficient and
sustainable operating system development in the open source space. At
the same time, we need to explain at the engineering level that
treating Linux in the same way as other OS platforms is making
life harder, and is at least partially responsible for the grief which
is observed when code hits the mailing lists for review.
Another area that we need to work on is massive collaborative
consolidation, which has as a prerequisite that silicon vendors accept
that, at least at the OS engineering level, their SoCs are not as unique
as their marketing department wants them to believe. As I explained
above, there are only a limited number of ways to implement a given
functionality in hardware, which is especially true for hardware with a
low complexity level. So we need to encourage developers to first look to see
whether existing code might be refactored to fit the new device
instead of blindly copying the closest matching driver - or in the worst case a
random driver - and hacking it into shape somehow.
In addition, the kernel
maintainers need to be more alert to that fact as well and help to
avoid the reinvention of the wheel. If driver reuse cannot be
achieved, we can often pull out common functionality into
core code and avoid duplication that way. There is a lot of low
hanging fruit here, and the Linux kernel community as a whole needs to
get better and spend more brain cycles on avoiding duplication.
One step forward was taken recently with the ARM SoC consolidation
efforts that were initiated by Linaro. But
this will only succeed if we are aware of the conflicts with the
existing corporate culture and address these conflicts at the non-technical
level as well.
Secrecy
Aside from the above issues the secrecy barrier is going to be the next
major challenge. Of course a silicon vendor is secretive about the
details of its next-generation SoC design, but the information which
is revealed in marketing announcements allows us to predict at least
parts of the design pretty precisely.
The most obvious recent example
is the next-generation ARM SoCs from various vendors targeted for the end
of 2011. Many of them will come with USB 3.0 support. Going through
the IP block vendor offerings tells us that there are fewer USB 3.0 IP
blocks available for integration than there are SoC vendors who have announced
new chips with USB 3.0 support. That means that there are duplicate
drivers in the works and I'm sure that, while the engineers are aware
of this, no team is allowed to talk to the competitor's team. Even if they
were allowed to do so, it's impossible to
figure out who is going to use which particular IP block. So we will
see several engineering teams fighting over the "correct" implementation to
be merged as the mainline driver in a couple of months when the secrecy
barrier has
been lifted.
Competing implementations are not a bad thing per se, but
the inability to exchange information and discuss design variants is
not helping anyone in the "time to market" race. I seriously doubt
that any of the to-be-released drivers will have a relevant
competitive advantage and even if one does, it will not last
very long when the code becomes public. It's sad that opportunities to
collaborate and save precious engineering resources for all involved
parties are sacrificed in favor of historical competition-oriented
behaviour patterns. These patterns have not been overhauled since they were
invented twenty or more years ago and they have never been subject to
scrutiny in the context of competition on the open source operating
system playground.
A way forward
The ever-increasing complexity of hardware, which drives ever more
complex operating-system-level software, long ago created a shortage
of high-profile OS-level developers, a shortage that cannot be
counterbalanced either by money or by assigning a large enough number
of the cheapest available developers to the problem.
The ARM universe is diversified enough that there is no
chance for any of the vendors to get hold of a significant enough number
of outstanding kernel developers to cause any serious damage to their
competitors. That is particularly true given the fact that those outstanding
developers generally prefer to work in the open rather than a semi-closed
environment.
It's about time for managers to rethink their competition models
and start to comprehend and utilize the massive advantage of
collaborative models over an historic and no-longer-working model
that assumes the infinite availability of up-to-the-task resources
when there is enough money thrown into the ring. Competent developers
certainly won't dismiss the chance to get some extra salary lightly,
but the ability to get a way more enjoyable working environment for a
slightly smaller income is for many of them a strong incentive to
refuse the temptation. It's an appealing thought for me that there is
no "time to market" howto, no "shareholder value" handbook, and no
"human resources management" course which ever took into consideration
that people might be less bribable than generally expected, especially
if this applies to those people who are some of the scarcest resources
in the industry.
I hope that managers working for embedded vendors will start to
understand how Open Source works and why there is a huge benefit to
working with the community.
After all, the stodgy old server vendors were able to figure out how
to work with the Linux community, so it cannot be that hard for
the fast-moving embedded vendors.
However, the realist in me - sometimes called "the grumpy old man" - who
has worked
in the embedded industry for more than twenty five years does not believe that
at all. For all too many SoC vendors, the decision to "work" with the
community was made due to outside pressure and an obsession with
following the hype.
Outside pressure is not what the open source enthusiasts might hope
for: the influence of the community itself. No, it's simply the
pressure made by (prospective) customers who request that the chip
is - at least basically - supported by the mainline kernel.
Following the hype is omnipresent and it always seems to be a valid
argument to avoid common-sense-driven decisions based on long-term
considerations.
The eternal optimist in me still has hope that the embedded world will
become a first class citizen in the Linux community
sooner rather than later. The realist in me somehow doubts that it
will happen before the "grumpy old man" retires.
Comments (12 posted)
June 1, 2011
This article was contributed by Neil Brown
Despite the fact that the Linux Kernel is mostly written in C, it
makes broad use of some techniques from the field of object-oriented
programming.
Developers wanting to use these object-oriented techniques receive
little support or guidance from the language and so are left to fend
for themselves. As is often the case, this is a double-edged sword.
The developer has enough flexibility to do really cool things, and
equally the flexibility to do really stupid things, and it isn't
always clear at first glance which is which, or more accurately: where
on the spectrum a particular approach sits.
Instead of looking to the language to provide guidance, a software
engineer must look to established practice to find out what works well
and what is best avoided. Interpreting established practice is not
always as easy as one might like and the effort, once made, is worth
preserving.
To preserve that effort on your author's part, this article brings
another installment in an
occasional series on Linux Kernel Design
Patterns and attempts to set out - with examples - the design patterns
in the Linux Kernel which effect an object-oriented style of
programming.
Rather than providing a brief introduction to the object-oriented
style, tempting though that is, we will assume the reader has a basic
knowledge of objects, classes, methods, inheritance, and similar
terms. For those as yet unfamiliar with these, there are plenty of
resources to be found elsewhere on the web.
Over two weeks we will look for patterns in just two areas: method
dispatch and data inheritance. Despite their apparent
simplicity they lead to some rich veins for investigation.
This first article will focus on method dispatch.
Method Dispatch
The large variety of styles of inheritance and rules for its usage in
languages today seems to suggest that there is no uniform
understanding of what "object-oriented" really means. The term is a bit like
"love": everyone thinks they know what it means but when you get down
to details people can find they have very different ideas.
While what it means to be "oriented" might not be clear, what we mean
by an "object" does seem to be uniformly agreed upon. It is simply an
abstraction comprising both state and behavior. An object is like a
record (Pascal) or struct (C), except that some of the names of members
refer to functions which act on the other fields in the object.
These function members are sometimes referred to as "methods".
The most obvious way to implement objects in C is to declare a
"struct" where some fields are pointers to functions which take a
pointer to the struct itself as their first argument. The calling
convention for method "foo" in object "bar" would simply be:
bar->foo(bar, ...args);
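As a concrete sketch of that convention (with entirely invented names, mirroring the bar/foo line above):

/* Hypothetical example: the method lives in the object itself and
 * takes the object as its first argument. */
struct bar {
    int counter;
    void (*foo)(struct bar *self, int increment);
};

static void default_foo(struct bar *self, int increment)
{
    self->counter += increment;
}

static void bar_init(struct bar *bar)
{
    bar->counter = 0;
    bar->foo = default_foo; /* can be tailored per object if needed */
}

static void bar_poke(struct bar *bar)
{
    bar->foo(bar, 1); /* the calling convention shown above */
}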
While this pattern is used in the Linux kernel it is not the dominant
pattern so we will leave discussion of it until a little later.
As methods (unlike state) are not normally changed on a per-object
basis, a more common and only slightly less obvious approach is to
collect all the methods for a particular class of objects into a
separate structure, sometimes known as a "virtual function table" or
vtable.
The object then has a single pointer to this table rather than a
separate pointer for each method, and consequently uses less memory.
This then leads to our first pattern - a pure vtable being a
structure which contains only function pointers where the first argument of
each is a pointer to some other structure (the object type) which itself
contains a pointer to this vtable.
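In outline, and again with invented names rather than anything taken from the kernel, a pure vtable arrangement looks like this:

/* Hypothetical sketch of the "pure vtable" pattern. */
struct gizmo;

struct gizmo_operations {
    /* Only function pointers; each takes the object as its first argument. */
    int  (*open)(struct gizmo *g);
    void (*release)(struct gizmo *g);
};

struct gizmo {
    int state;
    const struct gizmo_operations *g_op; /* one pointer to a shared table */
};

static int simple_open(struct gizmo *g)     { g->state = 1; return 0; }
static void simple_release(struct gizmo *g) { g->state = 0; }

static const struct gizmo_operations simple_gizmo_ops = {
    .open    = simple_open,
    .release = simple_release,
};

static int gizmo_open(struct gizmo *g)
{
    return g->g_op->open(g); /* method dispatch through the vtable */
}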
Some simple examples of this in the Linux kernel are the
file_lock_operations
structure which contains two function pointers
each of which take a pointer to a struct file_lock, and the
seq_operations vtable which contains four function pointers
which each operate on a struct seq_file.
These two examples display an obvious naming pattern - the structure
holding a vtable is named for the structure holding the object
(possibly abbreviated) followed by "_operations". While this pattern is
common it is by no means universal. Around the time of 2.6.39
there are approximately 30 "*_operations" structures along with well over
100 "*_ops" structures, most if not all of which are vtables of some
sort. There are also several structs such as struct mdk_personality
which are essentially vtables but do not have particularly helpful
names.
Among these nearly 200 vtable structures there is plenty of variability
and so plenty of scope to look for interesting patterns. In
particular we can look for common variations from the "pure vtable"
pattern described above and determine how these variations contribute
to our understanding of object use in Linux.
NULL function pointers
The first observation is that some function pointers in some vtables
are allowed to be NULL. Clearly trying to call such a function would
be futile, so the code that calls into these methods generally
contains an explicit test for the pointer being NULL. There are a few
different reasons for these NULL pointers.
Probably easiest to justify is the incremental development
reason. Because of the way vtable structures are initialized, adding
a new function pointer to the structure definition causes all existing
table declarations to initialise that pointer to NULL. Thus it is
possible to add a caller of the new method before any instance
supports that method, and have it check for NULL and perform a default
behavior. Then as incremental development continues those vtable
instances which need it can get non-default methods.
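Continuing the hypothetical gizmo sketch from above, suppose a new sync() pointer has just been added to gizmo_operations; existing static tables implicitly leave it NULL, so the caller tests for that and supplies a default:

static int gizmo_sync(struct gizmo *g)
{
    /* Vtables that predate the new method still have a NULL pointer here. */
    if (g->g_op->sync)
        return g->g_op->sync(g);

    return 0; /* default behaviour: nothing to flush */
}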
A recent example is commit
77af1b2641faf4 adding
set_voltage_time_sel() to struct regulator_ops which acts on
struct regulator_dev. Subsequent commit
42ab616afe8844
defines that method
for a particular device. This is simply the most recent example of a
very common theme.
Another common reason is that certain methods are not particularly
meaningful in certain cases so the calling code simply tests for NULL
and returns an appropriate error when found. There are multiple
examples of this in the virtual filesystem (VFS) layer. For instance,
the create() function in
inode_operations
is only meaningful if the inode in question is a
directory. So inode_operations structures for non-directories
typically have NULL for the create() function (and many others) and
the calling code in vfs_create() checks for NULL and returns -EACCES.
A final reason that vtables sometimes contain NULL is that an element
of functionality might be being transitioned from one interface to
another. A good example of this is the ioctl() operation in
file_operations.
In 2.6.11, a new method, unlocked_ioctl() was added
which was called without the big kernel lock held. In 2.6.36, when all
drivers and filesystems had been converted to use unlocked_ioctl(),
the original ioctl() was finally removed. During this transition a file system
would typically define only one of the two, leaving the other defaulting to NULL.
A slightly more subtle example of this is read() and aio_read(), also in
file_operations, and the corresponding write() and aio_write().
aio_read() was introduced to support asynchronous IO, and if it is
provided the regular synchronous read() is not needed (it is effected
using do_sync_read() which calls the aio_read() method). In this case
there appears to be no intention of ever removing read() - it will
remain for cases where async IO is not relevant such as special
filesystems like procfs and sysfs. So it is still the case that only one of each
pair need be defined by a filesystem, but it is not simply a transition, it is
a long-term state.
Though there seem to be several different reasons for a NULL function pointer,
almost every case is an example of one simple pattern - that of providing a
default implementation for the method. In the "incremental development"
examples and the non-meaningful method case, this is fairly straightforward:
the default for inode->create(), for example, is simply to return an error.
In the interface transition case it is only slightly less obvious. The default for
unlocked_ioctl() would be to take the kernel lock and then call
the ioctl() method. The
default for read() is exactly do_sync_read() and some filesystems such
as ext3
actually provide this value explicitly rather than using
"NULL" to indicate a default.
With that in mind, a little reflection suggests that if the real goal is to
provide a default, then maybe the best approach would be to explicitly give a
default rather than taking the circuitous route of a NULL default which is
interpreted specially.
While NULL is certainly the easiest value to provide as a default - as the C
standard assures us that uninitialized members of a structure do get set to
NULL - it is not very much harder to set a more meaningful default.
I am indebted to LWN reader
wahern for the observation that
C99 allows fields in a structure to be initialized multiple times with only the
final value taking effect and that this allows easy setting of default values
such as by following the simple model:
#define FOO_DEFAULTS .bar = default_bar, .baz = default_baz
struct foo_operations my_foo = { FOO_DEFAULTS,
        .bar = my_bar,
};
This will declare my_foo with a predefined default value for baz and a
localized value for bar. Thus for the small cost of defining a few "default"
functions and including a "_DEFAULTS" entry to each declaration, the
default value for any field can easily be chosen when the field is first
created, and automatically included in every use of the structure.
Not only are meaningful defaults easy to implement, they can lead to a more
efficient implementation. In those cases where the function pointer actually
is NULL it is probably faster to test and branch rather than to make an
indirect function call. However the NULL case is very often
the exception rather than the rule, and optimizing for an
exception is not normal practice. In the more common case when
the function pointer is not NULL, the test for NULL is simply a
waste of code space and a waste of execution time. If we
disallow NULLs we can make all call sites a little bit smaller and simpler.
In general, any testing performed by the caller before calling a method
can be seen as an instance of the "mid-layer mistake" discussed
in a previous article. It shows that the
mid-layer is making
assumptions about the behavior of the lower level driver rather than simply
giving the driver freedom to behave in whatever way is most
suitable. This may not always be an expensive mistake, but it
is still best avoided where possible.
Nevertheless there is a clear pattern in the Linux kernel that
pointers in vtables can sometimes be NULLable, typically though not
always to enable a transition, and the call sites should in these
cases test for NULL before proceeding with the call.
The observant reader will have noticed a hole in the above logic
denouncing the use of NULL pointers for defaults. In the case where the
default is the common case and where performance is paramount, the
reasoning does not hold and a NULL pointer could well be justified.
Naturally the Linux kernel provides an example of such a case for our
examination.
One of the data structures used by the VFS for caching filesystem
information is the "dentry". A "dentry" represents a name in the
filesystem, and so each "dentry" has a parent, being the directory
containing it, and an "inode" representing the named file. The dentry
is separate from the inode because a single file can have multiple
names (so an "inode" can have multiple "dentry"s).
There is a dentry_operations vtable with a number of operations including,
for example, "d_compare" which will compare two names and "d_hash"
which will generate a hash for the name to guide the storage of the
"dentry" in a hash table. Most filesystems do not need this flexibility. They
treat names as uninterpreted strings of bytes so the default compare and hash
functions are the common case. A few filesystems define these to
handle case-insensitive names but that is not the norm.
Further, filename lookup is a common operation in Linux and so
optimizing it is a priority. Thus these two operations
appear to be good candidates where a test for NULL and an inlined
default operation might be appropriate. What we find though is that
when such an optimization is warranted it is not by itself enough.
The code that calls d_compare() and d_hash() (and a couple of other dentry
operations) does not test these functions for NULL directly. Rather
they require that a few flag bits (DCACHE_OP_HASH, DCACHE_OP_COMPARE)
in the "dentry" are set up to indicate whether the common default
should be used, or whether the function should be called. As the flag
field is likely to be in cache anyway, and the dentry_operations
structure will often be not needed at all, this avoids a memory fetch
in a hot path.
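The shape of that technique can be shown with a deliberately simplified, hypothetical sketch; the real dcache code has different and more complex signatures, but the essential point is that a flag bit in the hot object, rather than a NULL test on the vtable, selects the default path:

#include <string.h>

#define OBJ_OP_COMPARE 0x01 /* object has a custom compare() method */

struct obj_ops;

struct obj {
    unsigned int         flags; /* hot: likely already in the CPU cache */
    const struct obj_ops *ops;  /* cold: not fetched in the common case */
    unsigned int         len;
    const unsigned char  *name;
};

struct obj_ops {
    int (*compare)(const struct obj *obj,
                   unsigned int len, const unsigned char *name);
};

static int obj_compare(const struct obj *obj,
                       unsigned int len, const unsigned char *name)
{
    if (obj->flags & OBJ_OP_COMPARE)
        return obj->ops->compare(obj, len, name);

    /* Inlined default: names are uninterpreted byte strings. */
    return obj->len != len || memcmp(obj->name, name, len);
}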
So we find that in the one case where using a NULL function pointer to
indicate a default could be justified, it is not actually used; instead, a
different, more efficient,
mechanism is used to indicate that the default method is requested.
Members other than function pointers
While most vtable-like structures in the kernel contain exclusively
function pointers, there are a significant minority that have
non-function-pointer fields. Many of these appear quite arbitrary on the
surface, and closer inspection suggests that some of them are the result
of poor design or bit-rot; their removal would only improve the code.
There is one exception to the "functions only" pattern that occurs
repeatedly and provides real value, and so is worth exploring.
This pattern is seen in its most general form in
struct mdk_personality
which provides operations for a particular software
RAID level. In particular this structure contains an "owner", a
"name", and a "list". The "owner" is the module that provides the
implementation. The "name" is a simple identifier: some vtables have
string names, some have numeric names, and it is often called
something different like "version", "family", "drvname", or
"level". But conceptually it is still a name. In the present example
there are two names, a string and a numeric "level".
The "list", while part of the same functionality, is less
common. The mdk_personality structure has a struct list_head, as does
struct ts_ops.
struct file_system_type
has a simple pointer to the next
struct file_system_type.
The underlying idea here is that for any particular implementation of
an interface (or "final" definition of a class) to be usable, it must
be registered in some way so that it can be found. Further, once it
has been found it must be possible to ensure that the module holding
the implementation is not removed while it is in use.
There seem to be nearly as many styles of registration against an
interface in Linux as there are interfaces to register against, so
finding strong patterns there would be a difficult task. However it
is fairly common for a "vtable" to be treated as the primary handle on
a particular implementation of an interface and to have an "owner"
pointer which can be used to get a reference on the module which
provides the implementation.
So the pattern we find here is that a structure of function pointers
used as a "vtable" for object method dispatch should normally contain
only function pointers. Exceptions require clear justification. A
common exception allows a module pointer and possibly other fields such
as a name and a list pointer. These fields are used to support
the registration protocol for the particular interface.
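A hypothetical registration scheme along these lines, using the kernel's real list, spinlock, and module primitives but invented structure and function names, might look like:

#include <linux/list.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/string.h>

struct widget; /* the object type the methods act upon */

struct widget_personality {
    /* registration support */
    const char       *name;
    struct module    *owner; /* implementers set this to THIS_MODULE */
    struct list_head  list;

    /* the actual methods */
    int  (*run)(struct widget *w);
    void (*stop)(struct widget *w);
};

static LIST_HEAD(widget_personalities);
static DEFINE_SPINLOCK(widget_lock);

void register_widget_personality(struct widget_personality *p)
{
    spin_lock(&widget_lock);
    list_add_tail(&p->list, &widget_personalities);
    spin_unlock(&widget_lock);
}

/* Finding an implementation also pins the module that provides it. */
struct widget_personality *find_widget_personality(const char *name)
{
    struct widget_personality *p;

    spin_lock(&widget_lock);
    list_for_each_entry(p, &widget_personalities, list) {
        if (!strcmp(p->name, name) && try_module_get(p->owner)) {
            spin_unlock(&widget_lock);
            return p;
        }
    }
    spin_unlock(&widget_lock);
    return NULL;
}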
When there is no list pointer it is very likely that the entire vtable will
be treated as read-only. In this case the vtable will often be declared as
a const structure and so could even be stored in read-only memory.
Combining Methods for different objects
A final common deviation from the "pure vtable" pattern that we see in
the Linux kernel occurs when the first argument to the function is not
always the same object type. In a pure vtable which is referenced by
a pointer in a particular data structure, the first argument of
each function is exactly that data structure. What reason could there
be for deviating from that pattern? It turns out that there are few,
some more interesting than others.
The simplest and least interesting explanation is that, for no apparent
reason, the target data structure is listed elsewhere in the argument
list. For example all functions in
struct
fb_ops
take a struct fb_info. While in 18 cases that structure is the first argument, in
five cases it is the last. There is nothing obviously wrong with this
choice and it is unlikely to confuse developers. It is only a problem
for data miners like your author who need to filter it out as an
irrelevant pattern.
A slight deviation on this pattern is seen in
struct rfkill_ops
where two functions take a struct rfkill but the third - set_block() -
takes a void *data. Further investigation shows that this opaque
data is exactly that which is stored in rfkill->data, so set_block()
could easily be defined to take a struct rfkill and simply to follow
the ->data link itself. This deviation is sufficiently non-obvious
that it could conceivably confuse developers as well as data miners
and so should be avoided.
The next deviation is seen, for example, in
platform_suspend_ops,
oprofile_operations,
security_operations
and a few others. These
take an odd assortment of arguments with no obvious pattern. However
these are really very different sorts of vtable structures in that the
objects they belong to are singletons. There is only one active
platform, only one profiler, only one security policy. Thus the
"object" on which these operations act is part of the global state and
so does not need to be included in the arguments of any functions.
Having filtered these two patterns out as not being very interesting
we are left with two that do serve to tell us something about object
use in the kernel.
quota_format_ops
and
export_operations
are two different operations structures that operate on a variety of different
data structures. In each case the apparent primary object (e.g. a
struct super_block or a struct dentry) already has a vtable
structure dedicated to it (such as super_operations or
dentry_operations) and these new structures add new operations. In
each case the new operations form a cohesive unit providing a related
set of functionality - whether supporting disk quotas or NFS export.
They don't all act on the same object simply because the functionality
in question depends on a variety of objects.
The best term from the language of object-oriented programming for this
is probably the "mixin".
Though the fit may not be perfect -
depending on what your exact understanding of mixin is - the idea of
bringing in a collection of functionality without using strict hierarchical
inheritance is very close to the purpose of quota_format_ops and
export_operations.
Once we know to be on the lookout for mixins like these we can find
quite a few more examples. The pattern to be alert for is not the one
that led us here - an operations structure that operates on a variety
of different objects - but rather the one we found where the functions
in an "operations" structure operate on objects that already have their
own "operations" structure. When an object has a large number of
operations that are relevant and these operations naturally group into
subsets, it makes a lot of sense to divide them into separate
vtable-like structures. There are several examples of this in the
networking code where for instance both
tcp_congestion_ops
and
inet_connection_sock_af_ops
operate (primarily) on a struct sock,
which itself already has a small set of dedicated operations.
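Schematically, and with invented names, such an arrangement looks like the sketch below: the mixin operations structure can be selected independently of the object's primary vtable.

/* Hypothetical sketch of a mixin: a second vtable whose methods act on
 * an object that already has its own primary operations structure. */
struct conn;

struct conn_operations {  /* primary vtable: what kind of connection */
    int (*send)(struct conn *c, const void *buf, int len);
    int (*close)(struct conn *c);
};

struct congestion_ops {   /* mixin: chosen independently of the above */
    void (*on_ack)(struct conn *c);
    void (*on_loss)(struct conn *c);
};

struct conn {
    const struct conn_operations *ops;
    const struct congestion_ops  *cong_ops;
};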
So the pattern of a "mixin" - at least as defined as a set of operations
which apply to one or more objects without being the primary
operations for those objects - is a pattern that is often found in the
kernel and appears to be quite valuable in allowing better
modularization of code.
The last pattern which explains non-uniform function targets is
probably the most interesting, particularly in its contrast to the
obvious application of object-oriented programming style.
Examples of this pattern abound with
ata_port_operations,
tty_operations,
nfs_rpc_ops
and
atmdev_ops
all appearing as
useful examples. However we will focus primarily on some examples
from the filesystem layer, particularly
super_operations
and
inode_operations.
There is a strong hierarchy of objects in the implementation of a
filesystem where the filesystem - represented by a "super_block" - has a
number of files (struct inode) which may have a number of names or
links (struct dentry). Further each file might store data in the page
cache (struct address_space) which comprises a number of individual
pages (struct page).
There is a sense in which all of these different objects belong to the
filesystem as a whole. If a page needs to be loaded with data from a
file, the filesystem knows how to do that, and it is probably the same
mechanism for every page in every file. Where it isn't always the
same, the filesystem knows that too. So we could conceivably store
every operation on every one of these objects in the
struct super_block, as it represents the filesystem and could know what to
do in each case.
In practice that extreme is not really helpful. It is quite likely
that while there are similarities between the storage of a regular file
and a directory, there are also important differences and being able
to encode those differences in separate vtables can be helpful.
Sometimes small symbolic links are stored directly in the inode while
larger links are stored like the contents of a regular file. Having
different readlink() operations for the two cases can make the code a
lot more readable.
While the extreme of every operation attached to the one central
structure is not ideal, it is equally true that the opposite extreme
is not ideal either. The struct page in Linux does not have a
vtable pointer at all - in part because we want to keep the structure
as small as possible, since it is so populous. Rather, the
address_space_operations structure contains the operations that act
on a page. Similarly the super_operations structure contains some
operations that apply to inodes, and inode_operations contains some
operations that apply to dentries.
It is clearly possible to have operations structures attached to a
parent of the target object - providing the target holds a reference
to the parent, which it normally does - though it is not quite so clear that
it is always beneficial. In the case of struct page which avoids
having a vtable pointer altogether the benefit is clear. In the case
of struct inode which has its own vtable pointer, the benefit of
having some operations (such as destroy_inode() or write_inode())
attached to the super_block is less clear.
As there are several vtable structures where any given function
pointer could be stored, the actual choice is in many cases little
more than historical accident. Certainly the proliferation
of struct dentry operations in inode_operations seems to be
largely due to the fact that some of them used to act directly
on the inode, but changes in the VFS eventually required this
to change. For example in 2.1.78-pre1, each of link(),
readlink(), followlink() (and some others which are now
defunct) were
changed
from taking a struct inode to
take a struct dentry instead. This set the scene for "dentry"
operations to be in inode_operations, so when setattr and getattr
were added for 2.3.48, it probably seemed completely natural to
include them in inode_operations despite the fact that they acted
primarily on a dentry.
Possibly we could simplify things by getting rid of
dentry_operations altogether. Some operations that act on dentries
are already in inode_operations and super_operations - why not move
them all there? While dentries are not as populous as struct page
there are still a lot of them and removing the "d_op" field could save
5% of the memory used by that structure (on x86-64).
With two exceptions, every active filesystem only has a single dentry
operations structure in effect. Some filesystem implementations like
"vfat" define two - e.g. one with case-sensitive matching and one with
case-insensitive matching - but there is only one active per
super-block. So it would seem that the operations in
dentry_operations could be moved to super_operations, or at
least accessed through "s_d_op".
The two exceptions are ceph and procfs. These filesystems use
different d_revalidate() operations in different parts of the
filesystem and - in the case of procfs - different d_release()
operations. The necessary distinctions could easily be made in
per-superblock versions of these operations. Do these cases justify the 5%
space cost? Arguably not.
Directly embedded function pointers
Finally it is appropriate to reflect on the alternate pattern
mentioned at the start, where function pointers are stored directly in
the object rather than in a separate vtable structure.
This pattern can be seen in
struct request_queue
which has nine function
pointers,
struct efi
which has ten function pointers, and
struct sock
which has six function pointers.
The cost of embedded pointers is obviously space.
When vtables are used, there is only one copy
of the vtable and multiple copies of an object (in most cases) so if
more than one function pointer is needed, a vtable would save space.
The cost of a vtable is an extra memory reference, though cache might
reduce much of this cost in some cases. A vtable also has a cost in
flexibility. When each object needs exactly the same set of operations
a vtable is good, but if there is a need to individually tailor some
of the operations for each object, then embedded function pointers can
provide that flexibility. This is illustrated quite nicely by the
comment with "zoom_video" in struct pcmcia_socket
/* Zoom video behaviour is so chip specific its not worth adding
this to _ops */
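To make the flexibility point concrete, here is a tiny hypothetical example of two instances of the same structure with individually tailored methods, something a single shared vtable cannot express:

/* Hypothetical: each sensor instance carries its own read method. */
struct sensor {
    int id;
    int (*read_value)(struct sensor *s);
};

static int read_cached(struct sensor *s) { return s->id; }
static int read_direct(struct sensor *s) { return s->id * 100; }

static struct sensor temp  = { .id = 1, .read_value = read_cached };
static struct sensor power = { .id = 2, .read_value = read_direct };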
So where objects are not very populous, where the list of function
pointers is small, and where multiple mixins are needed,
embedded function pointers are used instead of a separate vtable.
Method Dispatch Summary
If we combine all the pattern elements that we have found in
Linux we find that:
Method pointers that operate on a particular type of object are
normally collected in a vtable associated directly with that
object, though they can also appear:
- In a mixin vtable that collects related functionality which
may be selectable independently of the base type of the
object.
- In the vtable for a "parent" object when doing so avoids
the need for a vtable pointer in a populous object.
- Directly in the object when there are few method pointers,
or they need to be individually tailored to the particular
object.
These vtables rarely contain anything other than function
pointers, though fields needed to register the object class can
be appropriate. Allowing these function pointers to be NULL is
a common but not necessarily ideal technique for handling
defaults.
So in exploring the Linux Kernel code we have found that even
though it is not written in an object-oriented language, it
certainly contains objects, classes (represented as vtables),
and even mixins. It also contains concepts not normally
found in object-oriented languages such as delegating
object methods to a "parent" object.
Hopefully understanding these different patterns and the
reasons for choosing between them can lead to more uniform
application of the patterns across the kernel, and hence make
it easier for a newcomer to understand which pattern is being
followed.
In the second part of our examination of object
oriented patterns we will explore the various ways that
data inheritance is achieved in the Linux kernel
and discuss the strengths and weaknesses of each approach so
as to see where each is most appropriate.
Comments (125 posted)