Brief items
The current development kernel is 3.0-rc2,
released on June 6. "
It's been
reasonably quiet, although the btrfs update is bigger than I was hoping
for. Other than that, it's mostly driver fixes, some ubifs updates too, and
a few reverts for the early regressions." The short changelog is in
the announcement, or see
the
full changelog for the details.
Stable updates: no stable updates have been released in the last
week, and none are in the review process as of this writing.
Comments (none posted)
Bugs are like mushrooms - found one, look around for more...
--
Al Viro
Maximizing security is hard: whether a bug has security
implications is highly usecase and bug dependent, and the true
security impact of bugs is not discovered in the majority of
cases. I estimate that in *ANY* OS there's probably at least 10
times more bugs with some potential security impact than ever get a
CVE number...
So putting CVEs into the changelog is harmful, pointless,
misleading and would just create a fake "scare users" and "gain
attention" industry (coupled with a "delay bug fixes for a long
time" aspect, if paid well enough) that operates based on issuing
CVEs and 'solving' them - which disincentivises the *real* bugfixes
and the non-self-selected bug fixers.
I'd like to strengthen the natural 'bug fixing' industry, not the
security circus industry.
--
Ingo Molnar
Comments (47 posted)
By Jonathan Corbet
June 8, 2011
The next3 filesystem patch, which added snapshots to the ext3 filesystem,
appeared just over one year ago; LWN's
discussion of the patch at the time concluded
that it needed to move forward to ext4 before it could possibly be merged.
That change has been made, and recent
ext4
snapshot patches are starting to look close to being ready for merging
into the mainline. That has inspired the airing of new concerns which may
slow the process somewhat.
One complaint came from Josef Bacik:
I probably should have brought this up before, but why put all this
effort into shoehorning in such a big an invasive feature to ext4
when btrfs does this all already? Why not put your efforts into
helping btrfs become stable and ready and then use that, instead of
having to come up with a bunch of hacks to get around the myriad of
weird feature combinations you can get with ext4?
Snapshot developer Amir Goldstein's response is that his employer (CTERA Networks)
wanted the feature in ext4. The feature is shipping in products now, and
btrfs is still not seen as stable enough to use in that environment.
There are general concerns about merging another big feature into a
filesystem which is supposed to be stable and ready for production use.
Nobody wants to see the addition of serious bugs to ext4 at this time.
Beyond that, the snapshot feature does not currently work with all variants
of the ext4 on-disk format. There are a number of ext4 features which do
not currently play well together, leading Eric Sandeen to worry about where the filesystem is going:
If ext4 matches the lifespan of ext3, in 10 years I fear that it
will look more like a collection of various individuals' pet
projects, rather than any kind of well-designed, cohesive project.
How long can we really keep adding features which are semi- or
wholly- incompatible with other features?
Consider this a cry in the wilderness for less rushed feature
introduction, and a more holistic approach to ext4 design...
Ext4 maintainer Ted Ts'o has responded with
a rare (for the kernel community) admission that technical concerns are not
the sole driver of feature-merging decisions:
It's something I do worry about; and I do share your concern. At
the same time, the reality is that we are a little like the Old
Dutch Masters, who had take into account the preference of their
patrons (i.e., in our case, those who pay our paychecks :-).
In this case, he thinks that there are a lot of people who are interested
in the snapshot feature. He worried that
companies like CTERA could move away from ext4 if it can't be made to meet
their needs. So his plan is to merge snapshots once (1) the patches are good
enough and (2) it looks like there is a plan to address the remaining issues.
Comments (4 posted)
Kernel development news
By Jonathan Corbet
June 8, 2011
The "vsyscall" and "vDSO" segments are two mechanisms used to accelerate
certain system calls in Linux. While their basic function (provide fast access
to functionality which does not need to run in kernel mode) is the same,
there are some distinct differences between them. Recently vsyscall has
come to be seen as an enabler of security attacks, so some patches have
been put together to phase it out. The discussion of those patches shows
that the disagreement over how security issues are handled by the community
remains as strong as ever.
The vsyscall area is the older of these two mechanisms. It was added as a
way to execute specific system calls which do not need any real level of
privilege to run. The classic example is gettimeofday(); all it
needs to do is to read the kernel's idea of the current time. There are
applications out there that call gettimeofday() frequently, to the
point that they care about even a little bit of overhead. To address that
concern, the kernel allows the page containing the current time to be
mapped read-only into user space; that page also contains a fast
gettimeofday() implementation. Using this virtual system call,
the C library can provide a fast gettimeofday() which never
actually has to change into kernel mode.
Vsyscall has some limitations; among other things, there is only space for
a handful of virtual system calls. As those limitations were hit, the
kernel developers introduced the more flexible vDSO implementation. A
quick look on a contemporary system will show that both are still in use:
$ cat /proc/self/maps
...
7fffcbcb7000-7fffcbcb8000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
The key to the current discussion can be seen by typing the same command
again and comparing the output:
7fff379ff000-7fff37a00000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Note that the vDSO area has moved, while the vsyscall page remains at the
same location. The location of the vsyscall page is nailed down in the
kernel ABI, but the vDSO area - like most other areas in the user-space
memory layout - has its location randomized every time it is mapped.
Address-space layout randomization is a form of defense against security
holes. An attacker who is able to overrun the stack can often arrange for
a function in the target process to "return" to an arbitrary address.
Depending on what instructions are found at that address, this return can
cause almost anything to happen. Returning into the system()
function in the C library is an obvious example; it can be used to execute
arbitrary commands. If the location of the C library in memory is not
known, though, then it becomes difficult or impossible for an exploit to
jump into a useful place.
There is no system() function in the vsyscall page, but there are
several machine instructions that invoke system calls. With just a bit of setup, these
instructions might be usable in a stack overrun attack to invoke an
arbitrary system call with attacker-defined parameters - not a desirable
outcome. So it would be nice to get rid of - or at least randomize the
location of - the vsyscall page to thwart this type of attack.
Unfortunately, applications depend on the existence and exact address of
that page, so nothing can be done.
Except that Andrew Lutomirski found something
that could be done: remove all of the useful instructions from the
vsyscall page. One was associated with the vsyscall64 sysctl
knob, which is really only useful for user-mode Linux (and does not work
properly even there); it was simply deleted. Others weren't actually
system call instructions as such: the system time, if jumped into (and,
thus, executed as if it were code) when it held just
the right value, looks like a system call instruction. To address that problem,
variables have been moved into a separate page with execute permission
turned off.
The remaining code in the vsyscall page has simply been removed and
replaced by a special trap instruction. An application trying to call into
the vsyscall page will trap into the kernel, which will then emulate the
desired virtual system call in kernel space. The result is a kernel system
call emulating a virtual system call which was put there to avoid the
kernel system call in the first place. The result is a "vsyscall" which
takes a fraction of a microsecond longer to execute but, crucially, does
not break the existing ABI. In any case, the slowdown will only be seen if
the application is trying to use the vsyscall page instead of the vDSO.
Contemporary applications should not be doing that most of the time, except
for one little problem: glibc still uses the vsyscall version of
time(). That has been fixed in the glibc repository, but the fix
may not find its way out to users for a while; meanwhile, time()
calls will be a little slower than they were before. That should not
really be an issue, but one never knows, so Andy put in a configuration
option to preserve the old way of doing things. Anybody worried about the
overhead of an emulated vsyscall page can set
CONFIG_UNSAFE_VSYSCALLS to get the old behavior.
Nobody really objected to the patch series as a whole, but Linus hated the
name of the configuration option; he asked that it be called
CONFIG_LEGACY_VSYSCALLS instead. Or, even better, the change
could just be done unconditionally. That led to a fairly predictable
response
from the PaX developer on how the kernel community likes to hide security
problems, to which Linus said:
Calling the old vdso "UNSAFE" as a config option is just plain
stupid. It's a politicized name, with no good reason except for
your political agenda. And when I call it out as such, you just
spout the same tired old security nonsense.
Suffice to say that the conversation went downhill from there; interested
parties can follow the thread links in the messages cited above.
One useful point from that discussion is that the static vsyscall page is
not, in fact, a security vulnerability; it's simply a resource which can
make it easier for an attacker to exploit a vulnerability elsewhere in the
system. Whether that aspect makes that page "unsafe" or merely "legacy" is
left as an exercise for the reader. Either way, removing it is seen as a
good idea even though that removal might, arguably, cause real security
bugs to remain unfixed in the kernel; the argument is all about naming.
Final versions of the patches have not been posted as of this writing, but
the shape they will take is fairly clear. The static vsyscall page will
not continue to exist in its current form, and applications which still use
it will continue to work but will get a little bit slower. The
configuration option controlling this
behavior may or may not exist, but any distribution shipping a kernel
containing this change (presumably 3.1 or later) will also have a C library
which no longer tries to use the vsyscall page. And, with luck, exploiting
vulnerabilities will get just a little bit harder.
Comments (9 posted)
By Jonathan Corbet
June 7, 2011
Efforts to reduce power consumption on Linux systems have often focused on
the CPU; that emphasis does make sense, since the CPU draws a significant
portion of the power used on most systems. After years of effort to
improve the kernel's power behavior, add instrumentation to track wakeups,
and fix misbehaving applications, Linux does quite well when it comes to
CPU power. So now attention is moving to other parts of the system in
search of power savings to be had; one of those places is main memory.
Contemporary DRAM memory requires power for its self-refresh cycles even if
it is not being used; might there be a way to reduce its consumption?
One technology which is finding its way into some systems is called
"partial array self refresh" or PASR. On a PASR-enabled system, memory is
divided into banks, each of which can be powered down independently. If
(say) half of memory is not needed, that memory (and its self-refresh
mechanism) can be turned off; the result is a reduction in power use, but
also the loss of any data stored in the affected banks. The amount of
power actually saved is a bit unclear; estimates seem to run in the range
of 5-15% of the total power used by the memory subsystem.
The key to powering down a bank of memory, naturally, is to be sure that
there is no important data stored therein first. That means that the
system must either evacuate a bank to be powered down, or it must take care
not to allocate memory there in the first place. So the memory management
subsystem will have to become aware of the power topology of main memory
and take that information into account when satisfying allocation
requests. It will also have to understand the desired power management
policy and make decisions to power banks up or down depending on the
current level of memory pressure. This is going to be fun: memory
management is already a complicated set of heuristics which attempt to
provide reasonable results for any workload; adding power management into
the mix can only complicate things further.
A recent patch set from Ankita Garg does
not attempt to solve the whole problem; instead, it creates an initial
infrastructure which can be used for future power management decisions.
Before looking at that patch, though, a bit of background will be helpful.
The memory management subsystem already splits available memory at two
different levels. On non-uniform memory access (NUMA) systems, memory
which is local to a specific processor will be faster to access than memory
on a different processor. The kernel's memory management code takes NUMA nodes
into account to implement specific allocation policies. In many cases, the
system will try to keep a process and all of its memory on the same NUMA
node in the hope of maximizing the number of local accesses; other times,
it is better to spread allocations evenly across the system. The point is
that the NUMA node must be taken into account for all allocation and
reclaim decisions.
The other important concept is that of a "zone"; zones are present on all
systems. The primary use of zones is to categorize memory by
accessibility; 32-bit systems, for example, will have "low memory" and
"high memory" zones to contain memory which can and cannot (respectively)
be directly accessed by the kernel. Systems may have a zone
for memory accessible with a 32-bit address; many devices can only perform
DMA to such addresses. Zones are also used to separate memory which can
readily be relocated (user-space pages accessed through page tables, for
example) from memory which is hard to move (kernel memory for which there
may be an arbitrary number of pointers). Every NUMA node has a full set of
zones.
PASR has been on the horizon for a little while, so a few people have been
thinking about how to support it; one of the early works would appear to be
this paper by Henrik
Kjellberg, though that work didn't result in code submitted upstream.
Henrik pointed out that the kernel already has a couple of
mechanisms which could be used to support PASR.
One of those is memory hotplug, wherein memory can be physically removed
from the system. Turning off a bank of memory can be thought of as being
something close to removing that memory, so it makes sense to consider
hotplug. Hotplug is a heavyweight operation, though; it is not well suited
to power management, where decisions to power banks of memory up or down
may be made fairly often.
Another approach would be to use zones; the
system could set up a separate zone for each memory bank which could be
powered down independently. Powering down a bank would then be a matter of
moving needed data out of the associated zone and marking that zone so that
no further allocations would be made from it. The problem with this
approach is a number of important memory management operations happen at
the zone level; in particular, each zone has a set of limits on how many
free pages must exist. Adding more zones would increase memory management
overhead and create balancing problems which don't need to exist.
That is the approach that Ankita has taken, though; the patch adds another
level of description called "regions" interposed between nodes and zones,
essentially creating not just one new zone for each bank of memory, but a
complete set of zones for each. The page
allocator will always try to obtain pages from the lowest-numbered region
it can in the hope that the higher regions will remain vacant. Over time,
of course, this simple approach will not work and it will become necessary
to migrate pages out of regions before they can be powered down. The
initial patch does not address that issue, though - or any of the
associated policy issues that come up.
Your editor is not a memory management hacker, but ignorance has never kept
him from having an opinion on things. To a naive point of view, it would
almost seem like this design has been done backward - that regions should
really be contained within zones. That would avoid multiplying the number
of zones in the system and the associated balancing costs. Also,
importantly, it would allow regions to be controlled by the policy of a
single enclosing zone. In particular, regions inside a zone used for
movable allocations would be vacated with relative ease, allowing them to
be powered down when memory pressure is light. Placing multiple zones
within each region, instead, would make clearing a region harder.
The patch set has not gotten a lot of review attention; the people who know
what they are talking about in this area have mostly kept silent.
There are numerous
memory management patches circulating at the moment, so time for review is
probably scarce. Andrew Morton did ask
about the overhead of this work on machines which lack the PASR capability
and about how much power might actually be saved; answers to those
questions don't seem to be available at the moment. So one might conclude
that this patch set, while demonstrating an approach to memory power
management, will not be ready for mainline inclusion in the near future.
But, then, adding power management to such a tricky subsystem was never
going to be done in a hurry.
Comments (17 posted)
June 7, 2011
This article was contributed by Neil Brown
In the
first part
of this analysis we looked at how the polymorphic side of object-oriented
programming was implemented in the Linux kernel using regular C constructs. In
particular we
examined method dispatch, looked at the different forms that vtables could
take, and the circumstances where separate vtables were eschewed in preference
for storing function pointers directly in objects.
In this conclusion we will explore a second important aspect of object-oriented
programming - inheritance, and in particular data inheritance.
Data inheritance
Inheritance is a core concept of object-oriented programming, though it
comes in many forms, whether prototype inheritance, mixin inheritance,
subtype inheritance, interface inheritance etc., some of which overlap.
The form that is of interest when exploring the Linux kernel is most
like subtype inheritance, where a concrete or "final" type inherits
some data fields from a "virtual" parent type. We will call this "data
inheritance" to emphasize the fact that it is the data rather than the
behavior that is being inherited.
Put another way, a number of different implementations of a particular
interface share, and separately extend, a common data structure. They
can be said to inherit from that data structure.
There are three different approaches to this sharing and extending
that can be found in the Linux kernel, and all can be seen by
exploring the struct inode structure and its history, though they
are widely used elsewhere.
Extension through unions
The first approach, which is probably the most obvious but also the
least flexible, is to declare a union as one element of the common
structure and, for each implementation, to declare an entry in that
union with extra fields that the particular implementation needs.
This approach was
introduced
to struct inode in Linux-0.97.2 (August 1992) when
union {
struct minix_inode_info minix_i;
struct ext_inode_info ext_i;
struct msdos_inode_info msdos_i;
} u;
was added to struct inode. Each of these structures remained empty
until 0.97.5 when i_data was
moved
from struct inode to
struct ext_inode_info.
Over the years several more "inode_info" fields were added for
different filesystems, peaking at 28 different "inode_info" structures in
2.4.14.2 when
ext3 was added.
This approach to data inheritance is simple and straightforward, but
is also somewhat clumsy. There are two obvious problems.
Firstly, every new filesystem implementation needs to add an extra
field to the union "u". With 3 fields this may not seem like a
problem, with 28 it was well past "ugly". Requiring every filesystem to
update this one structure is a barrier to adding filesystems that is
unnecessary.
Secondly, every inode allocated will be the same size and will be
large enough to store the data for any filesystem. So a filesystem
that wants lots of space in its "inode_info" structure will impose
that space cost on every other filesystem.
The first of these issues is not an impenetrable barrier as we will see
shortly. The second is a real problem and the general ugliness of the
design encouraged change. Early in the 2.5 development series
this change began; it was completed by 2.5.7 when there were no
"inode_info" structures left in union u (though the union itself
remained until 2.6.19).
Embedded structures
The change that happened to inodes in early 2.5 was effectively an
inversion. The change which
removed
ext3_i from struct inode.u also
added
a struct inode, called vfs_inode, to
struct ext3_inode_info.
So instead of the private structure being embedded in the common data
structure, the common data structure is now embedded in the private
one.
This neatly avoids the two problems with unions; now each filesystem
needs to only allocate memory to store its own structure without any need
to know anything about what other filesystems might need. Of course
nothing ever comes for free and this change brought with it other
issues that needed to be solved, but the solutions were not costly.
The first difficulty is the fact that when the common filesystem code
- the VFS layer - calls into a specific filesystem it passes
a pointer to the common data structure, the struct inode. Using
this pointer, the filesystem needs to find a pointer to its own
private data structure. An obvious approach is to always place the
struct inode at the top of the private inode structure and simply
cast a pointer to one into a pointer to the other. While this can
work, it lacks any semblance of type safety and makes it harder to
arrange fields in the inode to get optimal performance - as some
kernel developers are wont to do.
The solution was to use the
list_entry()
macro to perform the
necessary pointer arithmetic, subtracting from the address of the
struct inode its offset in the private data structure and then casting
this appropriately.
The macro for this was called list_entry() simply because the
"list.h lists" implementation was the first to use this pattern of data
structure embedding. The list_entry() macro did exactly
what was needed and so it was used despite the strange name.
This practice lasted until 2.5.28 when a new container_of() macro was
added
which implemented the same functionality as list_entry(), though
with slightly more type safety and a more meaningful name.
With container_of() it is a simple
matter to map from an embedded data structure to the structure in which it is
embedded.
The second difficulty was that the filesystem had to be responsible
for allocating the inode - it could no longer be allocated by common
code as the common code did not have enough information to allocate
the correct amount of space. This simply involved
adding
alloc_inode() and destroy_inode() methods to the
super_operations structure and calling them as appropriate.
Void pointers
As noted earlier, the union pattern was not an impenetrable barrier to adding
new filesystems independently. This is because the union u had one
more field that was not an "inode_info" structure.
A generic pointer field
called generic_ip was
added
in Linux-1.0.5, but it was not used until 1.3.7.
Any file system that does not own a structure in struct inode itself
could define and allocate a separate structure and link it to the inode
through u.generic_ip. This approach addressed both of the problems
with unions as no changes are needed to shared declarations and each
filesystem only uses the space that it needs. However it again
introduced new problems of its own.
Using generic_ip, each filesystem required two allocations
for each inode instead of one and this could lead to more wastage
depending on how the structure size was rounded up for allocation; it also
required writing more error-handling code.
Also there was memory used for the generic_ip pointer and often for a
back pointer from the private structure to the common struct inode.
Both of these are wasted space compared with the union approach or
the embedding approach.
Worse than this though, an extra memory dereference was needed to
access the private structure from the common structure; such
dereferences are best avoided. Filesystem code will often need to
access both the common and the private structures. This either
requires lots of extra memory dereferences, or it requires holding the
address of the private structure in a register which increases
register pressure.
It was largely these concerns that stopped struct inode from ever
migrating to broad use of the generic_ip pointer. It was certainly
used, but not by the major, high-performance filesystems.
Though this pattern has problems it is still in wide use.
struct super_block has an s_fs_info pointer which serves the same
purpose as u.generic_ip (which has since been renamed to
i_private when the u union was finally removed - why it was
not completely removed is left as an exercise for the reader). This is the only
way to store filesystem-private data in a super_block. A simple search in
the Linux include files shows quite a collection of fields which are
void pointers named "private" or something similar. Many
of these are examples of the pattern of extending a data type by using
a pointer to a private extension, and most of these could be converted to using
the embedded-structure pattern.
Beyond inodes
While inodes serve as an effective vehicle to introduce these three patterns
they do not display the full scope of any of them so it is useful to look
further afield and see what else we can learn.
A survey of the use of unions elsewhere in the kernel shows that they are widely
used though in very different circumstances than in struct inode. The
particular aspect of inodes that is missing elsewhere is that a wide range of
different modules (different filesystems) each wanted to extend an inode in
different ways. In most places where unions are used there are a small fixed
number of subtypes of the base type and there is little expectation of more
being added. A simple example of this is
struct nfs_fattr
which stores file attribute information decoded out of an NFS reply. The
details of these attributes are slightly different for NFSv2 and NFSv3 so there
are effectively two subtypes of this structure with the difference encoded in a
union. As NFSv4 uses the same information as NFSv3 this is very unlikely to
ever be extended further.
A very common pattern in other uses of unions in Linux is for encoding messages
that are passed around, typically between the kernel and user-space.
struct siginfo
is used to convey extra information with a signal delivery.
Each signal type has a different type
of ancillary information, so struct siginfo has a union to
encode six different subtypes.
union inputArgs
appears to be the largest current union with 22 different subtypes. It is used
by the "coda" network file system to pass requests between the kernel module
and a user-space daemon which handles the network communication.
It is not clear whether these examples should be considered as the same pattern
as the original struct inode. Do they really represent different
subtypes of a base type, or is it just one type with internal variants?
The
Eiffel
object-oriented programming language
does not support variant types at all except through subtype inheritance so
there is clearly a school of thought that would want to treat all usages of
union as a form of subtyping. Many other languages, such as C++, provide both
inheritance and unions allowing the programmer to make a choice. So the answer
is not clear.
For our purposes it doesn't really matter what we call it as long as we know
where to use each pattern. The examples in the kernel fairly clearly show that
when all of the variants are understood by a single module, then a union is a
very appropriate mechanism for variants structures, whether you want to refer
to them as using data inheritance or not. When different subtypes are managed
by different modules, or at least widely separate pieces of code, then one of
the other mechanisms is preferred. The use of unions for this case has almost
completely disappeared with only
struct cycx_device
remaining as an example of a deprecated pattern.
Problems with void pointers
Void pointers are not quite so easy to classify. It would probably be fair to say
that void pointers are the modern equivalent of "goto" statements. They can be
very useful but they can also lead to very convoluted designs. A particular
problem is that when you look at a void pointer, like looking at a goto, you
don't really know what it is pointing at. A void pointer called
private is even worse - it is like a "goto destination"
command - almost meaningless without reading lots of context.
Examining all the different uses that void pointers can be put to would be well
beyond the scope of this article. Instead we will restrict our attention to
just one new usage which relates to data inheritance and illustrates how the
untamed nature of void pointers makes it hard to recognize their use in data
inheritance.
The example we will use to explain this usage is
struct
seq_file
used by the seq_file library which makes it easy to synthesize simple
text files like some of those in /proc. The "seq" part of seq_file
simply indicates that the file contains a sequence of lines corresponding to a
sequence of items of information in the kernel, so /proc/mounts is a
seq_file which walks through the mount table reporting each mount
on a single line.
When
seq_open()
is used to create a new seq_file it allocates a
struct seq_file and assigns it to the private_data field of
the struct file which is being opened. This is a straightforward
example of void pointer based data inheritance where the struct file
is the base type and the struct seq_file is a simple extension to that
type. It is a structure that never exists by itself but is always the
private_data for some file.
struct seq_file itself has a private field which is a void
pointer and it can be used by clients of seq_file to add extra state to the
file. For example
md_seq_open()
allocates a struct mdstat_info structure and attaches it via
this private field, using it to meet md's internal needs. Again,
this is simple data inheritance following the described pattern.
However the private field of struct seq_file is used by
svc_pool_stats_open()
in a subtly but importantly different way.
In this case the extra data needed is just a single pointer.
So rather than allocating a local data structure to refer to from
the private field, svc_pool_stats_open simply stores that pointer
directly in the private field itself.
This certainly seems like a sensible optimization - performing an allocation to
store a single pointer would be a waste - but it highlights exactly the source
of confusion that was suggested earlier: that when you look at a void
pointer you don't really know what is it pointing at, or why.
To make it a bit clearer what is happening here, it is helpful to imagine
"void *private" as being like a union of every different possible
pointer type. If the value that needs to be stored is a pointer, it can be
stored in this union following the "unions for data inheritance" pattern. If the
value is not a single pointer, then it gets stored in allocated space following
the "void pointers for data inheritance" pattern.
Thus when we see a void pointer being used it may not be obvious whether it is
being used to point to an extension structure for data inheritance, or being
used as an extension for data inheritance (or being used as something else
altogether).
To highlight this issue from a slightly different perspective it is instructive
to examine
struct v4l2_subdev
which represents a sub-device in a video4linux device, such as a sensor or
camera controller within a webcam.
According to the (rather helpful)
documentation
it is expected that this structure will normally be embedded in a larger
structure which contains extra state. However this structure still has not just one
but two void pointers, both with names suggesting that they are for private use
by subtypes:
/* pointer to private data */
void *dev_priv;
void *host_priv;
It is common that a v4l sub-device (a sensor, usually) will be realized by,
for example, an I2C
device (much as a block device which stores your filesystem might be realized
by an ATA or SCSI device). To allow for this common occurrence,
struct v4l2_subdev provides a void pointer (dev_priv), so
that the driver itself
doesn't need to define a more specific pointer in the larger
structure which struct v4l2_subdev would be embedded in.
host_priv is intended to point back to a "parent" device such as a
controller which acquires video data from the sensor. Of the three drivers
which use this field, one
appears to follow that intention while the other
two
use it to point to an allocated extension structure. So both of these
pointers are intended to be used following the "unions for data
inheritance" pattern, where a void pointer is playing the role of a union
of many other pointer types, but they are not always used that way.
It is not immediately clear that defining this void pointer in case it is
useful is actually a valuable service to provide given that the device
driver could easily enough define its own (type safe) pointer in its extension
structure.
What is clear is that an apparently "private" void pointer can be intended for various qualitatively different
uses and, as we have seen in two different circumstances, they may not be used exactly as expected.
In short, recognizing the "data inheritance through void pointers" pattern is
not easy. A fairly deep examination of the code is needed to determine the
exact purpose and usage of void pointers.
A diversion into struct page
Before we leave unions and void pointers behind a look at
struct
page
may be interesting. This structure uses both of these patterns, though they
are hidden somewhat due to historical baggage. This example is particularly
instructive because it is one case where struct embedding simply is not an option.
In Linux memory is divided into pages, and these pages are put to a
variety of different uses. Some are in the "page cache" used to store the
contents of files. Some are "anonymous pages" holding data used by
applications. Some are used as "slabs" and divided into pieces to answer
kmalloc() requests. Others are
simply part of a multi-page allocation or maybe are on a free list
waiting to be used.
Each of these different use cases could be seen as a subtype of the
general class of "page", and in most cases need some dedicated fields in
struct page,
such as a struct address_space pointer and index
when used in the page cache, or struct kmem_cache and
freelist pointers when used as a slab.
Each page always has the same struct page describing it, so if the
effective type of the page is to change - as it must as the demands for
different uses of memory change over time - the type of the struct page
must change within the lifetime of that structure. While many type systems are
designed assuming that the type of an object is immutable, we find here that
the kernel has a very real need for type mutability. Both unions and void pointers
allow types to change and as noted, struct page uses both.
At the first level of subtyping there are only a small number of different
subtypes as listed above; these are all known to the core memory management
code, so a union would be ideal here. Unfortunately struct page has three
unions with fields for some subtypes spread over all three, thus hiding the
real structure somewhat.
When the primary subtype in use has the page being used in the page cache, the
particular address_space that it belongs to may want to extend the
data structure further. For this purpose there is a private field
that can be used. However it is not a void pointer but is an unsigned
long.
Many places in the kernel assume an unsigned long and a void *
are the same size and this is one of them. Most users of this field actually
store a pointer here and have to cast it back and forth. The "buffer_head"
library provides macros
attach_page_buffers
and
page_buffers
to set and get this field.
So while struct page is not the most elegant example, it is an
informative example of a case where unions and void pointers are the only
option for providing data inheritance.
The details of structure embedding
Where structure embedding can be used, and where the list of possible subtypes
is not known in advance, it seems to be increasingly the preferred choice.
To gain a full understanding of it we will again need to explore a little bit
further than inodes and contrast data inheritance with other uses of structure
embedding.
There are essentially three uses for structure embedding - three reasons for
including a structure within another structure. Sometimes there is nothing
particularly interesting going on. Data items are collected together into
structures and structures within structures simply to highlight the closeness
of the relationships between the different items. In this case the address of
the embedded structure is rarely taken, and it is never mapped back to
the containing structure using container_of().
The second use is the data inheritance embedding that we have already
discussed. The third is like it but importantly different. This third use is
typified by struct list_head and other structs used as an
embedded anchor
when creating abstract data types.
The use of an embedded anchor like struct list_head can be seen as a
style of inheritance as the structure containing it "is-a" member of a list
by virtue of inheriting from struct list_head. However it is not a
strict subtype as a single object can have several struct list_heads
embedded - struct inode has six (if we include the
similar hlist_node). So it is probably best to think of this sort of
embedding more like a "mixin" style of inheritance. The struct
list_head provides a service - that of being included in a list - that can
be mixed-in to other objects, an arbitrary number of times.
A key aspect of data inheritance structure embedding that differentiates it from
each of the other two is the existence of a reference counter in the inner-most
structure. This is an observation that is tied directly to the fact that the
Linux kernel uses reference counting as the primary means of lifetime
management and so would not be shared by systems that used, for example,
garbage collection to manage lifetimes.
In Linux, every object with an independent existence will have a
reference counter, sometimes a simple atomic_t or even
an int, though often a more explicit
struct kref.
When an object is created using several levels of inheritance the reference
counter could be buried quite deeply.
For example a
struct usb_device
embeds a
struct device
which embeds
struct kobject
which has a
struct kref.
So usb_device (which might in turn be embedded in a structure for some
specific device) does have a reference counter, but it is contained several
levels down in the nest of structure embedding.
This contrasts quite nicely with a list_head and similar structures.
These have no reference counter, have no independent existence and simply
provide a service to other data structures.
Though it seems obvious when put this way, it is useful to remember that a
single object cannot have two reference counters - at least not two lifetime
reference counters (It is fine to have two counters like s_active
and s_count in struct super_block which count different
things).
This means that multiple inheritance in the "data inheritance" style is not
possible. The only form of multiple inheritance that can work is the mixin
style used by list_head as mentioned above.
It also means that, when designing a data structure, it is important to think
about lifetime issues and whether this data structure should have its own
reference counter or whether it should depend on something else for its
lifetime management. That is, whether it is an object in its own right, or
simply a service provided to other objects. These issues are not really new
and apply equally to void pointer inheritance. However an important difference
with void pointers is that it is relatively easy to change your mind later and
switch an extension structure to be a fully independent object. Structure
embedding requires the discipline of thinking clearly about the problem up
front and making the right decision early - a discipline that is worth
encouraging.
The other key telltale for data inheritance structure embedding is the set of rules
for allocating and initializing new instances of a structure, as has already
been hinted at.
When union or void pointer inheritance is used the main structure is usually
allocated and initialized by common code (the mid-layer) and then a device
specific open() or create() function is called which can
optionally allocate and initialize any extension object.
By contrast when structure embedding is used the structure needs to be
allocated by the lowest level device driver which then initializes its own
fields and calls in to common code to initialize the common fields.
Continuing the struct inode example from above which has an
alloc_inode() method in the super_block to request allocation, we find
that initialization is provided for with
inode_init_once()
and
inode_init_always()
support functions. The first of these is used when the previous use of a piece
of memory is unknown, the second is sufficient by itself when we know that the
memory was previously used for some other inode.
We see this same pattern of an initializer function separate from allocation in
kobject_init(),
kref_init(),
and
device_initialize().
So apart from the obvious embedding of structures, the pattern of "data
inheritance through structure embedding" can be recognized by the presence of a
reference counter in the innermost structure, by the delegation of structure
allocation to the final user of the structure, and by the provision of
initializing functions which initialize a previously allocated structure.
Conclusion
In exploring the use of method dispatch (last week) and data inheritance (this
week) in the
Linux kernel we find that while some patterns seem to dominate they are
by no means universal. While almost all data inheritance could be
implemented using structure embedding, unions provide real value in a
few specific cases. Similarly while simple vtables are common, mixin
vtables are very important and the ability to delegate methods to a
related object can be valuable.
We also find that there are patterns in use with little to recommend
them. Using void pointers for inheritance may have an initial
simplicity, but causes longer term wastage, can cause confusion, and could
nearly always be replaced by embedded inheritance. Using NULL pointers to indicate
default behavior is similarly a poor choice - when the default is
important there are better ways to provide for it.
But maybe the most valuable lesson is that the Linux kernel is not only a
useful program to run, it is also a useful document to study. Such study can
find elegant practical solutions to real problems, and some less elegant
solutions. The willing student can pursue the former to help improve their mind,
and pursue the latter to help improve the kernel itself.
With that in mind, the following exercises might be of interest to some.
Exercises
As inodes now use structure embedding for inheritance, void
pointers should not be necessary. Examine the consequences and
wisdom of removing "i_private" from "struct inode".
Rearrange the three unions in struct page to just one
union so that the enumeration of different subtypes is more explicit.
As was noted in the text, struct seq_file can be extended both
through "void pointer" and a limited form of "union" data inheritance.
Explain how seq_open_private() allows this structure to also be extended
through "embedded structure" data inheritance and give an example by
converting one usage in the kernel from "void pointer" to "embedded
structure".
Consider submitting a patch if this appears to be an improvement.
Contrast this implementation of embedded structure inheritance
with the mechanism used for inodes.
Though subtyping is widely used in the kernel, it is not uncommon
for a object to contain fields that not all users are interested
in. This can indicate that more fine grained subtyping is possible.
As very many completely different things can be represented by a
"file descriptor", it is likely that struct file could be a
candidate for further subtyping.
Identify the smallest set of fields that could serve as a generic
struct file and explore the implications of embedding that in
different structures to implement regular files, socket files, event
files, and other file types. Exploring more general use of the
proposed
open()
method for inodes might help here.
Identify an "object-oriented" language which has an object model
that would meet all the needs of the Linux kernel as identified
in these two articles.
Comments (27 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
- Mimi Zohar: EVM .
(June 3, 2011)
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>