June 1, 2011
This article was contributed by Neil Brown
Despite the fact that the Linux Kernel is mostly written in C, it
makes broad use of some techniques from the field of object-oriented
programming.
Developers wanting to use these object-oriented techniques receive
little support or guidance from the language and so are left to fend
for themselves. As is often the case, this is a double-edged sword.
The developer has enough flexibility to do really cool things, and
equally the flexibility to do really stupid things, and it isn't
always clear at first glance which is which, or more accurately: where
on the spectrum a particular approach sits.
Instead of looking to the language to provide guidance, a software
engineer must look to established practice to find out what works well
and what is best avoided. Interpreting established practice is not
always as easy as one might like and the effort, once made, is worth
preserving.
To preserve that effort on your author's part, this article brings
another installment in an
occasional series on Linux Kernel Design
Patterns and attempts to set out - with examples - the design patterns
in the Linux Kernel which effect an object-oriented style of
programming.
Rather than providing a brief introduction to the object-oriented
style, tempting though that is, we will assume the reader has a basic
knowledge of objects, classes, methods, inheritance, and similar
terms. For those as yet unfamiliar with these, there are plenty of
resources to be found elsewhere on the web.
Over two weeks we will look for patterns in just two areas: method
dispatch and data inheritance. Despite their apparent
simplicity they lead to some rich veins for investigation.
This first article will focus on method dispatch.
Method Dispatch
The large variety of styles of inheritance and rules for its usage in
languages today seems to suggest that there is no uniform
understanding of what "object-oriented" really means. The term is a bit like
"love": everyone thinks they know what it means but when you get down
to details people can find they have very different ideas.
While what it means to be "oriented" might not be clear, what we mean
by an "object" does seem to be uniformly agreed upon. It is simply an
abstraction comprising both state and behavior. An object is like a
record (Pascal) or struct (C), except that some of the names of members
refer to functions which act on the other fields in the object.
These function members are sometimes referred to a "methods".
The most obvious way to implement objects in C is to declare a
"struct" where some fields are pointers to functions which take a
pointer to the struct itself as their first argument. The calling
convention for method "foo" in object "bar" would simply be:
bar->foo(bar, ...args);
While this pattern is used in the Linux kernel it is not the dominant
pattern so we will leave discussion of it until a little later.
As methods (unlike state) are not normally changed on a per-object
basis, a more common and only slightly less obvious approach is to
collect all the methods for a particular class of objects into a
separate structure, sometimes known as a "virtual function table" or
vtable.
The object then has a single pointer to this table rather than a
separate pointer for each method, and consequently uses less memory.
This then leads to our first pattern - a pure vtable being a
structure which contains only function pointers where the first argument of
each is a pointer to some other structure (the object type) which itself
contains a pointer to this vtable.
Some simple examples of this in the Linux kernel are the
file_lock_operations
structure which contains two function pointers
each of which take a pointer to a struct file_lock, and the
seq_operations vtable which contains four function pointers
which each operate on a struct seq_file.
These two examples display an obvious naming pattern - the structure
holding a vtable is named for the structure holding the object
(possibly abbreviated) followed by "_operations". While this pattern is
common it is by no means universal. Around the time of 2.6.39
there are approximately 30 "*_operations" structures along with well over
100 "*_ops" structures, most if not all of which are vtables of some
sort. There are also several structs such as struct mdk_personality
which are essentially vtables but do not have particularly helpful
names.
Among these nearly 200 vtable structures there is plenty of variability
and so plenty of scope to look for interesting patterns. In
particular we can look for common variations from the "pure vtable"
pattern described above and determine how these variations contribute
to our understanding of object use in Linux.
NULL function pointers
The first observation is that some function pointers in some vtables
are allowed to be NULL. Clearly trying to call such a function would
be futile, so the code that calls into these methods generally
contains an explicit test for the pointer being NULL. There are a few
different reasons for these NULL pointers.
Probably easiest to justify is the incremental development
reason. Because of the way vtable structures are initialized, adding
a new function pointer to the structure definition causes all existing
table declarations to initialise that pointer to NULL. Thus it is
possible to add a caller of the new method before any instance
supports that method, and have it check for NULL and perform a default
behavior. Then as incremental development continues those vtable
instances which need it can get non-default methods.
A recent example is commit
77af1b2641faf4 adding
set_voltage_time_sel() to struct regulator_ops which acts on
struct regulator_dev. Subsequent commit
42ab616afe8844
defines that method
for a particular device. This is simply the most recent example of a
very common theme.
Another common reason is that certain methods are not particularly
meaningful in certain cases so the calling code simply tests for NULL
and returns an appropriate error when found. There are multiple
examples of this in the virtual filesystem (VFS) layer. For instance,
the create() function in
inode_operations
is only meaningful if the inode in question is a
directory. So inode_operations structures for non-directories
typically have NULL for the create() function (and many others) and
the calling code in vfs_create() checks for NULL and returns -EACCES.
A final reason that vtables sometimes contain NULL is that an element
of functionality might be being transitioned from one interface to
another. A good example of this is the ioctl() operation in
file_operations.
In 2.6.11, a new method, unlocked_ioctl() was added
which was called without the big kernel lock held. In 2.6.36, when all
drivers and filesystems had been converted to use unlocked_ioctl(),
the original ioctl() was finally removed. During this transition a file system
would typically define only one of two, leaving the other defaulting to NULL.
A slightly more subtle example of this is read() and aio_read(), also in
file_operations, and the corresponding write() and aio_write().
aio_read() was introduced to support asynchronous IO, and if it is
provided the regular synchronous read() is not needed (it is effected
using do_sync_read() which calls the aio_read() method). In this case
there appears to be no intention of ever removing read() - it will
remain for cases where async IO is not relevant such as special
filesystems like procfs and sysfs. So it is still the case that only one of each
pair need be defined by a filesystem, but it is not simply a transition, it is
a long-term state.
Though there seem to be several different reasons for a NULL function pointer,
almost every case is an example of one simple pattern - that of providing a
default implementation for the method. In the "incremental development"
examples and the non-meaningful method case, this is fairly straightforward.
e.g. the default for inode->create() is simply to return an error.
In the interface transition case it is only slightly less obvious. The default for
unlocked_ioctl() would be to take the kernel lock and then call
the ioctl() method. The
default for read() is exactly do_sync_read() and some filesystems such
as ext3
actually provide this value explicitly rather than using
"NULL" to indicate a default.
With that in mind, a little reflection suggests that if the real goal is to
provide a default, then maybe the best approach would be to explicitly give a
default rather than using the circuitous route of using a default of NULL and
interpreting it specially.
While NULL is certainly the easiest value to provide as a default - as the C
standard assures us that uninitialized members of a structure do get set to
NULL - it is not very much harder to set a more meaningful default.
I am indebted to LWN reader
wahern for the observation that
C99 allows fields in a structure to be initialized multiple times with only the
final value taking effect and that this allows easy setting of default values
such as by following the simple model:
#define FOO_DEFAULTS .bar = default_bar, .baz = default_baz
struct foo_operations my_foo = { FOO_DEFAULTS,
.bar = my_bar,
};
This will declare my_foo with a predefined default value for baz and a
localized value for bar. Thus for the small cost of defining a few "default"
functions and including a "_DEFAULTS" entry to each declaration, the
default value for any field can easily be chosen when the field is first
created, and automatically included in every use of the structure.
Not only are meaningful defaults easy to implement, they can lead to a more
efficient implementation. In those cases where the function pointer actually
is NULL it is probably faster to test and branch rather than to make an
indirect function call. However the NULL case is very often
the exception rather than the rule, and optimizing for an
exception is not normal practice. In the more common case when
the function pointer is not NULL, the test for NULL is simply a
waste of code space and a waste of execution time. If we
disallow NULLs we can make all call sites a little bit smaller and simpler.
In general, any testing performed by the caller before calling a method
can be seen as an instance of the "mid-layer mistake" discussed
in a previous article. It shows that the
mid-layer is making
assumptions about the behavior of the lower level driver rather than simply
giving the driver freedom to behave in whatever way is most
suitable. This may not always be an expensive mistake, but it
is still best avoided where possible.
Nevertheless there is a clear pattern in the Linux kernel that
pointers in vtables can sometimes be NULLable, typically though not
always to enable a transition, and the call sites should in these
cases test for NULL before proceeding with the call.
The observant reader will have noticed a hole in the above logic
denouncing the use NULL pointers for defaults. In the case where the
default is the common case and where performance is paramount, the
reasoning does not hold and a NULL pointer could well be justified.
Naturally the Linux kernel provides an example of such a case for our
examination.
One of the data structures used by the VFS for caching filesystem
information is the "dentry". A "dentry" represents a name in the
filesystem, and so each "dentry" has a parent, being the directory
containing it, and an "inode" representing the named file. The dentry
is separate from the inode because a single file can have multiple
names (so an "inode" can have multiple "dentry"s).
There is a dentry_operations vtable with a number of operations including,
for example, "d_compare" which will compare two names and "d_hash"
which will generate a hash for the name to guide the storage of the
"dentry" in a hash table. Most filesystems do not need this flexibility. They
treat names as uninterpreted strings of bytes so the default compare and hash
functions are the common case. A few filesystems define these to
handle case-insensitive names but that is not the norm.
Further, filename lookup is a common operation in Linux and so
optimizing it is a priority. Thus these two operations
appear to be good candidates where a test for NULL and an inlined
default operation might be appropriate. What we find though is that
when such an optimization is warranted it is not by itself enough.
The code that calls d_compare() and d_hash() (and a couple of other dentry
operations) does not test these functions for NULL directly. Rather
they require that a few flag bits (DCACHE_OP_HASH, DCACHE_OP_COMPARE)
in the "dentry" are set up to indicate whether the common default
should be used, or whether the function should be called. As the flag
field is likely to be in cache anyway, and the dentry_operations
structure will often be not needed at all, this avoids a memory fetch
in a hot path.
So we find that the one case where using a NULL function pointer to
indicate a default could be justified, it is not actually used; instead, a
different, more efficient,
mechanism is used to indicate that the default method is requested.
Members other than function pointers
While most vtable-like structures in the kernel contain exclusively
function pointers, there are a significant minority that have
non-function-pointer fields. Many of these appear on the surface
quite arbitrary and a few closer inspections suggest that some of them
result of poor design or bit-rot and their removal would only improve
the code.
There is one exception to the "functions only" pattern that occurs
repeatedly and provides real value, and so is worth exploring.
This pattern is seen in its most general form in
struct mdk_personality
which provides operations for a particular software
RAID level. In particular this structure contains an "owner", a
"name", and a "list". The "owner" is the module that provides the
implementation. The "name" is a simple identifier: some vtables have
string names, some have numeric names, and it is often called
something different like "version", "family", "drvname", or
"level". But conceptually it is still a name. In the present example
there are two names, a string and a numeric "level".
The "list", while part of the same functionality, is less
common. The mdk_personality structure has a struct list_head, as does
struct ts_ops.
struct file_system_type
has a simple pointer to the next
struct file_system_type.
The underlying idea here is that for any particular implementation of
an interface (or "final" definition of a class) to be usable, it must
be registered in some way so that it can be found. Further, once it
has been found it must be possible to ensure that the module holding
the implementation is not removed while it is in use.
There seem to be nearly as many styles of registration against an
interface in Linux as there are interfaces to register against, so
finding strong patterns there would be a difficult task. However it
is fairly common for a "vtable" to be treated as the primary handle on
a particular implementation of an interface and to have an "owner"
pointer which can be used to get a reference on the module which
provides the implementation.
So the pattern we find here is that a structure of function pointers
used as a "vtable" for object method dispatch should normally contain
only function pointers. Exceptions require clear justification. A
common exception allows a module pointer and possible other fields such
as a name and a list pointer. These fields are used to support
the registration protocol for the particular interface.
When there is no list pointer it is very likely that the entire vtable will
be treated as read-only. In this case the vtable will often be declared as
a const structure and so could even be stored in read-only memory.
Combining Methods for different objects
A final common deviation from the "pure vtable" pattern that we see in
the Linux kernel occurs when the first argument to the function is not
always the same object type. In a pure vtable which is referenced by
a pointer in a particular data structure, the first argument of
each function is exactly that data structure. What reason could there
be for deviating from that pattern? It turns out that there are few,
some more interesting than others.
The simplest and least interesting explanation is that, for no apparent
reason, the target data structure is listed elsewhere in the argument
list. For example all functions in
struct
fb_ops
take a struct fb_info. While in 18 cases that structure is the first argument, in
five cases it is the last. There is nothing obviously wrong with this
choice and it is unlikely to confuse developers. It is only a problem
for data miners like your author who need to filter it out as an
irrelevant pattern.
A slight deviation on this pattern is seen in
struct rfkill_ops
where two functions take a struct rkfill but the third - set_block() -
takes a void *data. Further investigation shows that this opaque
data is exactly that which is stored in rfkill->data, so set_block()
could easily be defined to take a struct rfkill and simply to follow
the ->data link itself. This deviation is sufficiently non-obvious
that it could conceivably confuse developers as well as data miners
and so should be avoided.
The next deviation in seen for example in
platform_suspend_ops,
oprofile_operations,
security_operations
and a few others. These
take an odd assortment of arguments with no obvious pattern. However
these are really very different sorts of vtable structures in that the
object they belong to are singletons. There is only one active
platform, only one profiler, only one security policy. Thus the
"object" on which these operations act is part of the global state and
so does not need to be included in the arguments of any functions.
Having filtered these two patterns out as not being very interesting
we are left with two that do serve to tell us something about object
use in the kernel.
quota_format_ops
and
export_operations
are two different operations structures that operate on a variety of different
data structures. In each case the apparent primary object (e.g. a
struct super_block or a struct dentry) already has a vtable
structure dedicated to it (such as super_operations or
dentry_operations) and these new structures add new operations. In
each case the new operations form a cohesive unit providing a related
set of functionality - whether supporting disk quotas or NFS export.
They don't all act on the same object simply because the functionality
in question depends on a variety of objects.
The best term from the language of object-oriented programming for this
is probably the "mixin".
Though the fit may not be perfect -
depending on what your exact understanding of mixin is - the idea of
bringing in a collection of functionality without using strict hierarchical
inheritance is very close to the purpose of quota_format_ops and
export_operations.
Once we know to be on the lookout for mixins like these we can find
quite a few more examples. The pattern to be alert for is not the one
that led us here - an operations structure that operates on a variety
of different objects - but rather the one we found where the functions
in an "operations" structure operate on objects that already have their
own "operations" structure. When an object has a large number of
operations that are relevant and these operations naturally group into
subsets, it makes a lot of sense to divide them into separate
vtable-like structures. There are several examples of this in the
networking code where for instance both
tcp_congestion_ops
and
inet_connection_sock_af_ops
operate (primarily) on a struct sock,
which itself has already got a small set of dedicated operations.
So the pattern of a "mixin" - at least as defined as a set of operations
which apply to one or more objects without being the primary
operations for those objects - is a pattern that is often found in the
kernel and appears to be quite valuable in allowing better
modularization of code.
The last pattern which explains non-uniform function targets is
probably the most interesting, particularly in its contrast to the
obvious application of object-oriented programming style.
Examples of this pattern abound with
ata_port_operations,
tty_operations,
nfs_rpc_ops
and
atmdev_ops
all appearing as
useful examples. However we will focus primarily on some examples
from the filesystem layer, particularly
super_operations
and
inode_operations.
There is a strong hierarchy of objects in the implementation of a
filesystem where the filesystem - represented by a "super_block" - has a
number of files (struct inode) which may have a number of names or
links (struct dentry). Further each file might store data in the page
cache (struct address_space) which comprises a number of individual
pages (struct page).
There is a sense in which all of these different objects belong to the
filesystem as a whole. If a page needs to be loaded with data from a
file, the filesystem knows how to do that, and it is probably the same
mechanism for every page in every file. Where it isn't always the
same, the filesystem knows that too. So we could conceivably store
every operation on every one of these objects in the
struct super_block, as it represents the filesystem and could know what to
do in each case.
In practice that extreme is not really helpful. It is quite likely
that while there are similarities between the storage of a regular file
and a directory, there are also important differences and being able
to encode those differences in separate vtables can be helpful.
Sometimes small symbolic links are stored directly in the inode while
larger links are stored like the contents of a regular file. Having
different readlink() operations for the two cases can make the code a
lot more readable.
While the extreme of every operation attached to the one central
structure is not ideal, it is equally true that the opposite extreme
is not ideal either. The struct page in Linux does not have a
vtable pointer at all - in part because we want to keep the structure
as small as possible because it is so populous. Rather the
address_space_operations structure contains the operations that act
on a page. Similarly the super_operations structure contains some
operations that apply to inodes, and inode_operations contains some
operations that apply to dentries.
It is clearly possible to have operations structures attached to a
parent of the target object - providing the target holds a reference
to the parent, which it normally does - though it is not quite so clear that
it is always beneficial. In the case of struct page which avoids
having a vtable pointer altogether the benefit is clear. In the case
of struct inode which has its own vtable pointer, the benefit of
having some operations (such as destroy_inode() or write_inode())
attached to the super_block is less clear.
As there are several vtable structures where any given function
pointer could be stored, the actual choice is in many cases little
more than historical accident. Certainly the proliferation
of struct dentry operations in inode_operations seems to be
largely due to the fact that some of them used to act directly
on the inode, but changes in the VFS eventually required this
to change. For example in 2.1.78-pre1, each of link(),
readlink(), followlink() (and some others which are now
defunct) were
changed
from taking a struct inode to
take a struct dentry instead. This set the scene for "dentry"
operations to be in inode_operations, so when setattr and getattr
were added for 2.3.48, it probably seemed completely natural to
include them in inode_operations despite the fact that they acted
primarily on a dentry.
Possibly we could simplify things by getting rid of
dentry_operations altogether. Some operations that act on dentries
are already in inode_operations and super_operations - why not move
them all there? While dentries are not as populous as struct page
there are still a lot of them and removing the "d_op" field could save
5% of the memory used by that structure (on x86-64).
With two exceptions, every active filesystem only has a single dentry
operations structure in effect. Some filesystem implementations like
"vfat" define two - e.g. one with case-sensitive matching and one with
case-insensitive matching - but there is only one active per
super-block. So it would seem that the operations in
dentry_operations could be moved to super_operations, or at
least accessed through "s_d_op".
The two exceptions are ceph and procfs. These filesystems use
different d_revalidate() operations in different parts of the
filesystem and - in the case of procfs - different d_release()
operations. The necessary distinctions could easily be made in
per-superblock versions of these operations. Do these cases justify the 5%
space cost? Arguably not.
Directly embedded function pointers
Finally it is appropriate to reflect on the alternate pattern
mentioned at the start, where function pointers are stored directly in
the object rather than in a separate vtable structure.
This pattern can be seen in
struct request_queue
which has nine function
pointers,
struct efi
which has ten function pointers, and
struct sock
which has six function pointers.
The cost of embedded pointers is obviously space.
When vtables are used, there is only one copy
of the vtable and multiple copies of an object (in most cases) so if
more than one function pointer is needed, a vtable would save space.
The cost of a vtable is an extra memory reference, though cache might
reduce much of this cost in some cases. A vtable also has a cost of
flexibility. When each object needs exactly the same set of operations
a vtable is good, but if there is a need to individually tailor some
of the operations for each object, then embedded function pointer can
provide that flexibility. This is illustrated quite nicely by the
comment with "zoom_video" in struct pcmcia_socket
/* Zoom video behaviour is so chip specific its not worth adding
this to _ops */
So where objects are not very populous, where the list of function
pointers is small, and where multiple mixins are needed,
embedded function pointers are used instead of a separate vtable.
Method Dispatch Summary
If we combine all the pattern elements that we have found in
Linux we find that:
Method pointers that operate on a particular type of object are
normally collected in a vtable associated directly with that
object, though they can also appear:
- In a mixin vtable that collects related functionality which
may be selectable independently of the base type of the
object.
- In the vtable for a "parent" object when doing so avoids
the need for a vtable pointer in a populous object
- Directly in the object when there are few method pointers,
or they need to be individually tailored to the particular
object.
These vtables rarely contain anything other than function
pointers, though fields needed to register the object class can
be appropriate. Allowing these function pointers to be NULL is
a common but not necessarily ideal technique for handling
defaults.
So in exploring the Linux Kernel code we have found that even
though it is not written in an object-oriented language, it
certainly contains objects, classes (represented as vtables),
and even mixins. It also contains concepts not normally
found in object-oriented languages such as delegating
object methods to a "parent" object.
Hopefully understanding these different patterns and the
reasons for choosing between them can lead to more uniform
application of the patterns across the kernel, and hence make
it easier for a newcomer to understand which pattern is being
followed.
In the second part of our examination of object
oriented patterns we will explore the various ways that
data inheritance is achieved in the Linux kernel
and discuss the strengths and weaknesses of each approach so
as to see where each is most appropriate.
(
Log in to post comments)