Brief items
The 2.6.30 kernel is out,
released by Linus on June 9. Some of the bigger changes in
2.6.30 include a number of filesystem improvements, the integrity measurement patches,
the TOMOYO Linux security
module, reliable datagram
socket protocol support, object storage device support,
the FS-Cache filesystem caching layer, the nilfs filesystem, threaded interrupt handler
support, and much more. For more information, see the KernelNewbies 2.6.30
page, the
long-format changelog, and LWN's statistics for
this development cycle.
As of this writing, no changes have been merged for 2.6.31.
There have been no stable kernel updates in the last week. The
2.6.29.5 update is in the review
process (along with a 2.6.27 update),
though, with an expected release on or after June 11.
Kernel development news
What does it mean to say that one is 48% slower? That's like
saying that a squirrel is 48% juicier than an orange - maybe it's
true, but anybody who puts the two in a blender to compare them is
kind of sick.
--
Linus Torvalds
The root cause of this problem: having something called "mode".
Any time we put a "mode" in the kernel, we get in a mess trying to
work out when to set it and to what.
--
Andrew Morton
By Jonathan Corbet
June 10, 2009
Performance counters.
Version 8 of the performance
counters patch has been posted. This code has come a long way since its
beginnings; among other things, there is now a user-space utility which
makes the full functionality of performance counters available in a single
application. See the announcement for more details on what's happening in
this area.
Host protected area. The HPA patches for the IDE
subsystem were mentioned here last week. Since
then, Bartlomiej Zolnierkiewicz has updated the
patches and posted them, under the title "IDE fixes," for merging into
the 2.6.30 kernel. This posting raised some eyebrows, seeing as the 2.6.30
release was imminent at the time. A number of developers objected, noting
that it was the wrong part of the release cycle for the addition of new
features.
Bartlomiej stuck to his request, though, claiming that the addition of HPA
support was really a bug fix. He went as far as to suggest that the real reason for opposition to
merging the patches in 2.6.30 was a fear that they would allow the IDE
layer to surpass the libata code in functionality. In the end, needless to
say, these arguments did not prevail. The HPA code should go in for
2.6.31, but 2.6.30 was released without it.
halt_delay. A crashing kernel often prints a fair amount of useful
information on its way down. That information is often key to diagnosing
and fixing the problem. It also, often, is sufficiently voluminous that
the most important part scrolls off the console before it can possibly be
read. In the absence of a debugging console, that information is
forevermore lost - annoying, to say the least.
This patch by Dave Young takes an
interesting approach to this problem. A new sysfs parameter
(/sys/module/printk/parameters/halt_delay) specifies a delay value
in milliseconds. If the kernel is currently halting or restarting, every
line printed to the console will be accompanied by the requested amount of
delay. Set halt_delay high enough, and it should be possible to
grab the most important information from an oops listing before it scrolls
off the screen. It's a bit of a blunt tool, but sometimes blunt is what is
needed to get the job done when more precise implements are not available.
r8169 ping of death. A longstanding bug in the RTL8169 NIC driver
could lead to an immediate crash on receipt of a packet larger than 1500
bytes, as reported by Michael Tokarev. It turns out that the NIC was
being informed that frames up to 16K in size could be received, but the
skb provided for the receive ring buffer was only 1536 bytes long. This
bug goes back to somewhere before 2.6.10, thus into the mists of pre-git
time. The problem was fixed by a patch from Eric Dumazet, which was one
of the final patches merged before the 2.6.30 release.
By Jonathan Corbet
June 10, 2009
For most general-purpose Linux deployments, power management really only
comes into play at suspend and resume time. Embedded systems tend to have
dedicated code which keeps the device's power usage to a minimum, but
there's no real equivalent for Linux as obtained via a standard
distribution installation. A new patch from Rafael Wysocki may change that
situation, though, and, in the process, yield lower power usage and more
reliable power management on general-purpose machines.
Rafael's patch adds
infrastructure supporting a seemingly uncontroversial goal: let the kernel
power down devices which are not currently being used. To that end,
struct dev_pm_ops gains two new function pointers:
int (*autosuspend)(struct device *dev);
int (*autoresume)(struct device *dev);
The idea is simple: if the kernel decides that a specific device is not in
use, and that it can be safely powered down, the autosuspend()
function will be called. At some future point, autoresume() is
called to bring the device back up to full functionality. The driver
should, of course, take steps to save any necessary device state while the
device is sleeping.
The value in this infrastructure should be reasonably clear. If the kernel
can decide that system operation will not suffer if a given device is
powered down, it can call the relevant autosuspend() function and
save some power. But there's more to it than that; a kernel which
uses this infrastructure will exercise the suspend/resume functionality in
specific drivers in a regular way. That, in turn, should turn up obscure
suspend/resume bugs in ways which make them relatively easy to identify
and, hopefully, fix. If something goes wrong when suspending or resuming a
system as a whole, determining the culprit can be hard to do. If
suspending a single idle device creates trouble, the source of the problem
will be rather more obvious. So this infrastructure should lead to more
reliable power management in general, which is a good thing.
A reading of the patch shows that one thing is missing: the code which
decides when to suspend and resume specific devices. Rafael left that part
as an exercise for the reader. As it turns out, different readers had
rather different ideas of how this decision needs to be made.
Matthew Garrett quickly suggested that
there would need to be a mechanism by which user space could set the policy
to be used when deciding whether to suspend a specific device. In his
view, it is simply not possible for the kernel to know, on its own, that
suspending a specific device is safe. One example he gives is eSATA; an
eSATA interface is not distinguishable (by the kernel) from an ordinary
SATA interface, but it supports hotplugging of disks. If the kernel
suspends an "unused" eSATA controller, it will fail to notice when the user
attaches a new device. Another example is keyboards; it may make sense to
suspend a keyboard while the screen saver is running - unless the user is
running an audio player and expects the volume keys to work. Situations
like this are hard for the kernel to figure out.
Ingo Molnar has taken a different approach, arguing that this kind of decision really
belongs in the kernel. The kernel should, he says, suspend devices by
default whenever it is safe to do so; there's no point in asking user space
first.
Again, the user wont enter anything, in 95% of the cases - in the
remaining 3% of cases what is entered is wrong and only in another
2% of cases is it correct ;-)
Sane kernel defaults are important and the kernel sure knows what
kind of hardware it runs on. This 'let the user decide policy' for
something as fundamental (and also as arcane) as power saving mode
is really a disease that has caused a lot of unnecessary pain in
Linux in the past 15 years.
The discussion went back and forth without much sign of agreement.
But the difference between the two positions is, perhaps, not
as large as it seems. Ingo argues for automatically suspending devices
where this operation is known to be safe, and he acknowledges that this may
not be true for the majority of devices out there. That is not too far
from Matthew's position that a variety of devices cannot be automatically
suspended in a safe way. But, Ingo says, we should go ahead and
apply automatic power management in the cases where we know it can be done,
and we should expect that the number of devices with sophisticated (and
correct) power management functionality will grow over time.
So the only real disagreement is over how the kernel can know whether a
given device can be suspended or not. Ingo would like to see the kernel
figure it out from the information which is already available; Matthew
thinks that some sort of new policy-setting interface will be required.
Answering that question may not be possible until somebody has tried to
implement a fully in-kernel policy, and that may take a while. In the
meantime, though, at least we will have the infrastructure which lets
developers begin to play with potential solutions.
June 10, 2009
This article was contributed by Goldwyn Rodrigues
Reducing the memory footprint of a binary is important for improving
performance. Poke-a-hole (pahole) and the other binary object file
analysis tools developed by Arnaldo Carvalho de Melo help in analyzing
object files for inefficiencies such as holes in data structures, or
functions which were declared inline but ended up un-inlined in the
object code.
Poke-a-hole
Poke-a-hole (pahole) is an object-file analysis tool to find the size
of the data structures, and the holes caused due to aligning the data
elements to the word-size of the CPU by the compiler. Consider a simple
data structure:
struct sample {
char c[2];
long l;
int i;
void *p;
short s;
};
Adding the size of individual elements of the structure, the expected size
of the sample data structure is:
2*1 (char) + 4 (long) + 4 (int) + 4 (pointer) + 2 (short) = 16 bytes
Compiling this on a 32-bit architecture (ILP32, or Int-Long-Pointer 32
bits) reveals that the size is actually
20 bytes. The additional
bytes are inserted by the compiler to align the data elements to the
word size of the CPU. In this case, two bytes of padding are added after
char c[2], and another two bytes after short s. Compiling the
same program on a 64-bit machine (LP64, or Long-Pointer 64 bits) results
in struct sample occupying 40 bytes. In this case, six bytes are added
after char c[2], four bytes after int i, and
six bytes after short s.
Pahole was developed to track down such holes
created by the compiler's word-size alignment. To analyze object files,
the source must be compiled with the debugging flag "-g". In the
kernel, this is enabled by CONFIG_DEBUG_INFO, or "Kernel hacking >
Compile the kernel with debug info".
Analyzing the object file generated by the program with struct sample
on an i386 machine results in:
i386$ pahole sizes.o
struct sample {
char c[2]; /* 0 2 */
/* XXX 2 bytes hole, try to pack */
long int l; /* 4 4 */
int i; /* 8 4 */
void * p; /* 12 4 */
short int s; /* 16 2 */
/* size: 20, cachelines: 1, members: 5 */
/* sum members: 16, holes: 1, sum holes: 2 */
/* padding: 2 */
/* last cacheline: 20 bytes */
};
Each data element of the structure has two numbers listed in C-style
comments. The first number represents the offset of the data element from
the start of the structure and the second number represents the size in
bytes. At the end of the structure, pahole summarizes the details of the
size and the holes in the structure.
Similarly, analyzing the object file generated by the program with
struct sample on an x86_64 machine results in:
x86_64$ pahole sizes.o
struct sample {
char c[2]; /* 0 2 */
/* XXX 6 bytes hole, try to pack */
long int l; /* 8 8 */
int i; /* 16 4 */
/* XXX 4 bytes hole, try to pack */
void * p; /* 24 8 */
short int s; /* 32 2 */
/* size: 40, cachelines: 1, members: 5 */
/* sum members: 24, holes: 2, sum holes: 10 */
/* padding: 6 */
/* last cacheline: 40 bytes */
};
Notice that there is a new hole introduced after int i which was not
present in the object compiled for the 32-bit machine. Code developed
on i386 but compiled on x86_64 may waste more space because of such
alignment problems: long and pointer types graduated to being eight
bytes wide while int remained at four bytes. Neglecting to restructure
data in this way is a common mistake developers make when porting
applications from i386 to x86_64, and it results in a larger memory
footprint than expected. A larger data structure also leads to more
cacheline reads than required, decreasing performance.
Pahole can suggest an alternative, more compact data structure,
reorganizing the data elements within it, by using the
--reorganize option. Pahole also accepts an optional
--show_reorg_steps option to show the steps taken to compress the data
structure.
x86_64$ pahole --show_reorg_steps --reorganize -C sample sizes.o
/* Moving 'i' from after 'l' to after 'c' */
struct sample {
char c[2]; /* 0 2 */
/* XXX 2 bytes hole, try to pack */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
short int s; /* 24 2 */
/* size: 32, cachelines: 1, members: 5 */
/* sum members: 24, holes: 1, sum holes: 2 */
/* padding: 6 */
/* last cacheline: 32 bytes */
}
/* Moving 's' from after 'p' to after 'c' */
struct sample {
char c[2]; /* 0 2 */
short int s; /* 2 2 */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
/* size: 24, cachelines: 1, members: 5 */
/* last cacheline: 24 bytes */
}
/* Final reorganized struct: */
struct sample {
char c[2]; /* 0 2 */
short int s; /* 2 2 */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
/* size: 24, cachelines: 1, members: 5 */
/* last cacheline: 24 bytes */
}; /* saved 16 bytes! */
The --reorganize algorithm tries to compact the structure by moving
data elements from the end of the struct to fill holes, and attempts to
remove the padding at the end of the struct. Pahole also demotes bit
fields to a smaller basic type when the type in use has more bits than
required by the elements of the bit field. For example,
int flag:1 will be demoted to char.
Being over-zealous in compacting a data structure sometimes may reduce
performance. Writes to data elements may flush the cachelines of other
data elements being read from the same cacheline. So, some structures
are defined with ____cacheline_aligned in order to force them to start
from the beginning of a fresh cacheline. An example of a structure
using ____cacheline_aligned, from drivers/net/e100.c, is:
struct nic {
/* Begin: frequently used values: keep adjacent for cache
* effect */
u32 msg_enable ____cacheline_aligned;
struct net_device *netdev;
struct pci_dev *pdev;
struct rx *rxs ____cacheline_aligned;
struct rx *rx_to_use;
struct rx *rx_to_clean;
struct rfd blank_rfd;
enum ru_state ru_running;
spinlock_t cb_lock ____cacheline_aligned;
spinlock_t cmd_lock;
<output snipped>
Analyzing the nic structure with pahole shows holes just before the
cacheline boundaries, after the data elements which precede rxs and cb_lock:
x86_64$ pahole -C nic /space/kernels/linux-2.6/drivers/net/e100.o
struct nic {
u32 msg_enable; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct net_device * netdev; /* 8 8 */
struct pci_dev * pdev; /* 16 8 */
/* XXX 40 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct rx * rxs; /* 64 8 */
struct rx * rx_to_use; /* 72 8 */
struct rx * rx_to_clean; /* 80 8 */
struct rfd blank_rfd; /* 88 16 */
enum ru_state ru_running; /* 104 4 */
/* XXX 20 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
spinlock_t cb_lock; /* 128 4 */
spinlock_t cmd_lock; /* 132 4 */
<output snipped>
Besides finding holes, pahole can be used to identify the data field
sitting at a particular offset from the start of a data structure.
Pahole can also list the sizes of all the data structures:
x86_64$ pahole --sizes linux-2.6/vmlinux | sort -k3 -nr | head -5
tty_struct 1328 10
vc_data 432 9
request_queue 2272 8
net_device 1536 8
mddev_s 792 8
The first field is the data structure name, the second the current
size of the structure, and the third the number of holes present in
the structure.
Similarly, to get a summary of the data structures which can be
packed to reduce their size:
x86_64$ pahole --packable sizes.o
sample 40 24 16
The first field is the data structure name, the second the current
size, the third the packed size, and the fourth the total number of
bytes saved by packing the holes.
Pfunct
The pfunct tool reports on aspects of the functions in object code. It
can show the number of goto labels used, the number of parameters to
functions, the size of functions, and so on. Its most popular use,
however, is finding functions which were declared inline but not
inlined, or functions not declared inline which the compiler inlined
anyway. The compiler tends to optimize functions by inlining or
uninlining them depending on their size.
x86_64$ pfunct --cc_inlined linux-2.6/vmlinux | tail -5
run_init_process
do_initcalls
zap_identity_mappings
clear_bss
copy_bootdata
The compiler may also choose to uninline functions which have been
explicitly declared inline. This can happen for multiple reasons; for
example, inlining a recursive function would cause infinite regress.
pfunct --cc_uninlined shows functions which are declared inline but
have been uninlined by the compiler. Such functions are good candidates
for a second look, or for removing the inline declaration altogether.
Fortunately, pfunct --cc_uninlined on vmlinux (only) did not list
any functions.
Debug Info
These utilities rely on the debug_info section of the object file,
generated when the source code is compiled with the debug option. They
work with the DWARF standard or the Compact C-Type Format (CTF), which
are common debugging formats produced by most compilers; gcc uses the
DWARF format.
The debugging data is organized under the debug_info section of the
ELF (Executable and Linkable Format) file in the form of tags with
values representing variables, function parameters, and so on, placed
in a hierarchically nested format. To read the raw information, you may
use readelf, provided by binutils, or eu-readelf, provided by elfutils.
Most standard distributions do not compile their packages with debug
information because it makes the binaries considerably bigger. Instead,
they ship separate debuginfo packages, which contain the debugging
information and can be analyzed with these tools or with gdb.
The utilities discussed in this article were initially developed to
analyze kernel object files. They are not limited to kernel object
files, however, and can be used with any user-space program built with
debug information. The source code of the pahole utilities is
maintained at
git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git
More information about pahole and the other utilities for analyzing
object files with debug information can be found in the PDF about the 7
dwarves.
June 8, 2009
This article was contributed by Neil Brown
One of the topics of ongoing interest in the kernel community is that
of maintaining quality. It is trivially obvious that we need to
maintain and even improve quality. It is less obvious how best to do
so.
One broad approach that has found some real success is to increase the
visibility of various aspects of the kernel. Making the quality of
those aspects more apparent tends, in itself, to lead to improvement.
This increase in visibility takes many forms:
- The checkpatch.pl script highlights many deviations
from accepted code formatting style. This encourages people (who
remember to use the script) to fix those style errors. So, by
increasing the visibility of the style guide, we increase the
uniformity of appearance and so, to some extent, the quality.
- The "lockdep" system (when enabled) dynamically measures
dependencies between locks (and related states such as whether
interrupts are enabled). It then reports anything that looks odd.
These oddities will not always mean a deadlock or similar problem is
possible, but in many cases they do, and the deadlock possibility
can be removed. So by increasing the visibility of the lock
dependency graph, quality can be increased.
- The kernel contains various other enhancements to visibility such
as the "poisoning" of unused areas of memory so invalid access will
be more apparent, or simply the use of symbolic names rather than
plain hexadecimal addresses in stack traces so that bug reports are more
useful.
- At a much higher level, the "git" revision tracking software that
is used for tracking kernel changes makes it quite easy to see who
did what and when. The fact that it encourages a comment to be
attached to each patch makes it that much easier to answer the
question "Why is the code this way?". This visibility can improve
understanding and that is likely to improve quality as more
developers are better informed.
There are plenty of other areas where increasing visibility does, or
could, improve quality. In this series we will explore one
particular area where your author feels visibility could be increased
in a way that could well lead to qualitative improvements. That area
is the enunciation of kernel-specific design patterns.
Design Patterns
A "design pattern" is a concept that was first expounded in the field of
architecture and was brought to computer engineering, particularly
the object-oriented programming field, through the 1994 publication
Design Patterns: Elements of Reusable Object-Oriented Software.
Wikipedia has further
useful background information on the topic.
In brief, a design pattern describes a particular class of design
problem, and details an approach to solving that class of problem that
has proven effective in the past. One particular benefit of a design
pattern is that it combines the problem description and the solution
description together and gives them a name. Having a simple and
memorable name for a pattern is particularly valuable. If both
developer and reviewer know the same names for the same patterns, then
a significant design decision can be communicated in one or two words,
thus making the decision much more visible.
In the Linux kernel code base there are many design patterns that have
been found to be effective. However most of them have never been
documented so they are not easily available to other developers. It
is my hope that by explicitly documenting these patterns, I can help them
to be
more widely used and, thus, developers will be able to achieve effective
solutions to common problems more quickly.
In the remainder of this series we will be looking at three problem
domains and finding a variety of design patterns of greatly varying
scope and significance. Our goal in doing so is not only to enunciate
these patterns, but also to show the range and value of such
patterns so that others might make the effort to enunciate patterns
that they have seen.
A number of examples from the Linux kernel will be presented
throughout this series as examples are an important part of
illuminating any pattern. Unless otherwise stated they are all from
2.6.30-rc4.
Reference Counts
The idea of using a reference counter to manage the lifetime of an
object is fairly common. The core idea is to have a counter which is
incremented whenever a new reference is taken and decremented when a
reference is released. When this counter reaches zero any resources
used by the object (such as the memory used to store it) can be
freed.
The mechanisms for managing reference counts seem quite straightforward.
However there are some subtleties that make it quite easy to
get the mechanisms wrong. Partly for this reason, the Linux kernel
has (since 2004) a data type known as "kref" with associated support
routines (see Documentation/kref.txt,
<linux/kref.h>, and
lib/kref.c). These encapsulate some of those subtleties and, in
particular, make it clear that a given counter is being used as a
reference counter in a particular way. As noted above, names for
design patterns are very valuable and just providing that name for
kernel developers to use is a significant benefit for reviewers.
In the
words of Andrew Morton:
I care more about being able to say "ah, it uses kref. I understand
that refcounting idiom, I know it's well debugged and I know that
it traps common errors". That's better than "oh crap, this thing
implements its own refcounting - I need to review it for the usual
errors".
This inclusion of kref in the Linux kernel gives both a tick and a
cross to the kernel in terms of explicit support for design patterns.
A tick is deserved as the kref clearly embodies an important design
pattern, is well documented, and is clearly visible in the code when
used.
It deserves a cross however because the kref only encapsulates part of
the story about reference counting. There are some usages of
reference counting that do not fit well into the kref model as we will
see shortly. Having a "blessed" mechanism for reference counting that
does not provide the required functionality can actually encourage
mistakes, as people might use a kref where it doesn't belong and
assume it will just work when in fact it won't.
A useful step to understanding the complexities of reference counting
is to understand that there are often two distinctly different sorts of
references to an object. In truth there can be three or even more, but
that is very uncommon and can usually be understood by generalizing
the case of two.
We will call the two types of references "external" and "internal",
though in some cases "strong" and "weak" might be more appropriate.
An "external" reference is the sort of reference we are probably most
accustomed to thinking about. They are counted with "get" and "put" and
can be held by subsystems quite distant from the subsystem that
manages the object. The existence of a counted external reference has
a strong and simple meaning: This object is in use.
By contrast, an "internal" reference is often not counted, and is only
held internally to the system that manages the object (or some close
relative). Different internal references can have very different
meanings and hence very different implications for implementation.
Possibly the most common example of an internal reference is a cache
which provides a "lookup by name" service. If you know the name of an
object, you can apply to the cache to get an external reference,
providing the object actually exists in the cache. Such a cache
would hold each object on a list, or on one of a number of lists under
e.g. a hash table. The presence of the object on such a list is a
reference to the object. However it is likely not a counted
reference. It does not mean "this object is in use" but only "this
object is hanging around in case someone wants it". Objects are not
removed from the list until all external references have been dropped,
and possibly they won't be removed immediately even then.
Clearly the existence and nature of internal references can have
significant implications on how reference counting is implemented.
One useful way to classify different reference counting styles is by
the required implementation of the "put" operation.
The "get" operation is always the same. It takes an
external reference and produces another external reference.
It is implemented by something like:
assert(obj->refcount > 0); increment(obj->refcount);
or, in Linux-kernel C:
BUG_ON(!atomic_read(&obj->refcnt)); atomic_inc(&obj->refcnt);
Note that "get" cannot be used on an unreferenced object. Something
else is needed there.
The "put" operation comes in three variations. While there can be
some overlap in use cases, it is good to keep them separate to help
with clarity of the code. The three options, in Linux-C, are:
1 atomic_dec(&obj->refcnt);
2 if (atomic_dec_and_test(&obj->refcnt)) { ... do stuff ... }
3 if (atomic_dec_and_lock(&obj->refcnt, &subsystem_lock)) {
..... do stuff ....
spin_unlock(&subsystem_lock);
}
The "kref" style
Starting in the middle, option "2" is the style used for a kref.
This style is appropriate when the object does not outlive its last
external reference. When that reference count becomes zero, the object
needs to be freed or otherwise dealt with, hence the need to test for
the zero condition with atomic_dec_and_test().
Objects that fit this style often do not have any internal references
to worry about, as is the case with most objects in sysfs, which is a
heavy user of kref.
If, instead, an object using the kref style does have internal references, it
cannot be allowed to create an external reference from an internal
reference unless there are known to be other external references.
If this is necessary, a primitive is available:
atomic_inc_not_zero(&obj->refcnt);
which increments the value provided it isn't zero, and returns a result
indicating success or otherwise.
atomic_inc_not_zero() is a relatively recent invention in the Linux
kernel, appearing in late 2005 as part of the lockless page cache
work. For this reason it isn't widely used and some code that could
benefit from it uses spinlocks instead.
Sadly, the kref package does not make use of this primitive either.
An interesting example of this style of reference counting, one which
uses neither kref nor even atomic_dec_and_test() (though it could and
arguably should), is provided by the two reference counts in
struct super: s_count and s_active.
s_active fits the kref style of reference counts exactly. A
superblock starts life with s_active being 1 (set in
alloc_super()), and,
when s_active becomes zero, further external references cannot be
taken. This rule is encoded in grab_super(), though this is not
immediately clear. The current code (for historical reasons) adds a
very large value (S_BIAS) to s_count whenever s_active is non-zero,
and grab_super() tests for s_count exceeding S_BIAS rather than for
s_active being zero. It could just as easily do the latter test using
atomic_inc_not_zero(), and avoid the use of spinlocks.
s_count provides for a different sort of reference which has both
"internal" and "external" aspects. It is internal in that its
semantic is much weaker than that of s_active-counted references.
References counted by s_count just mean "this superblock cannot be
freed just now" without asserting that it is actually active.
It is external in that it is much like a kref starting life
at 1 (well, 1*S_BIAS actually), and when it becomes zero (in
__put_super()) the superblock is destroyed.
So these two reference counts could be replaced by two krefs,
providing:
- S_BIAS was set to 1
- grab_super() used atomic_inc_not_zero() rather than
testing against
S_BIAS
and a number of spinlock calls could go away. The details are left as
an exercise for the reader.
The "kcref" style
The Linux kernel doesn't have a "kcref" object, but that is a name
that seems suitable to propose for the next style of reference count.
The "c" stands for "cached" as this style is very often used in
caches. So it is a Kernel Cached REFerence.
A kcref uses atomic_dec_and_lock() as given in option 3 above,
because, on the last put, the object must be freed or checked to see
whether any other special handling is needed. This must be done under a
lock to ensure that no new reference is taken while the current state
is being evaluated.
A simple example here is the i_count reference counter in
struct inode.
The important part of iput() reads:
if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
iput_final(inode);
where iput_final() examines the state of the inode and decides if it
can be destroyed, or left in the cache in case it could get reused
soon.
Among other things, the inode_lock prevents new external references
from being created from the internal references in the inode hash table.
For this reason converting internal references to external references
is only permitted while the inode_lock is held. It is no accident
that the function supporting this is called iget_locked() (or
iget5_locked()).
A slightly more complex example is in struct dentry, where
d_count is
managed like a kcref. It is more complex because two locks need to be
taken before we can be sure no new reference can be taken - both
dcache_lock and de->d_lock. This requires that either we hold one
lock, then atomic_dec_and_lock() the other (as in
prune_one_dentry()), or
that we atomic_dec_and_lock() the first, then claim the second and
retest the refcount - as in dput().
This is a good example of the fact that you can never assume you have
encapsulated all possible reference-counting styles. The need for two
locks could hardly have been foreseen.
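The second ordering described above (decrement first, then claim the other lock and retest, as in dput()) can be sketched in userspace. Everything here is hypothetical: list_lock and obj_lock stand in for dcache_lock and d_lock, and the object is reduced to the bare minimum needed to show the retest.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a global list lock such as dcache_lock. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

struct obj {
	atomic_int count;
	pthread_mutex_t obj_lock;	/* per-object, like d_lock */
	bool on_list;	/* internal reference from a lookup structure */
};

/* dput()-style put: drop the count and, if it may have reached zero,
 * take both locks and retest before tearing the object down.
 * Returns true if the caller now owns teardown of the object. */
static bool obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->count, 1) != 1)
		return false;	/* other references remain */

	pthread_mutex_lock(&list_lock);
	pthread_mutex_lock(&o->obj_lock);
	/* Retest: between our decrement and taking the locks, another
	 * thread may have promoted the internal list reference into a
	 * new external reference. */
	bool dead = (atomic_load(&o->count) == 0);
	if (dead)
		o->on_list = false;	/* unhook the internal reference */
	pthread_mutex_unlock(&o->obj_lock);
	pthread_mutex_unlock(&list_lock);
	return dead;
}
```

Only once both locks are held is it safe to conclude that no new reference can appear, which is exactly why the single-lock kcref recipe does not transfer directly to the dentry case.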
An even more complex kcref-style refcount is mnt_count in
struct vfsmount. The complexity here is the interplay of the two refcounts
that this structure has: mnt_count, which is a fairly straightforward
count of external references, and mnt_pinned, which counts
internal references from the process accounting module. In particular
it counts the number of accounting files that are open on the
filesystem (and as such could use a more meaningful name). The
complexity comes from the fact that when there are only internal
references remaining, they are all converted to external references.
Exploring the details of this is again left as an exercise for the
interested reader.
The "plain" style
The final style for refcounting involves just decrementing the
reference count (atomic_dec()) and not doing anything else.
This style is relatively uncommon in the kernel, and for good reason.
Leaving unreferenced objects just lying around isn't a good idea.
One use of this style is in struct buffer_head, managed by
fs/buffer.c
and <linux/buffer_head.h>. The put_bh() function
is simply:
static inline void put_bh(struct buffer_head *bh)
{
        smp_mb__before_atomic_dec();
        atomic_dec(&bh->b_count);
}
This is OK because buffer_heads have lifetime rules that are closely
tied to a page. One or more buffer_heads get allocated to a page to
chop it up into smaller pieces (buffers). They tend to remain there
until the page is freed, at which point all the buffer_heads will be
purged (by drop_buffers() called from try_to_free_buffers()).
In general, the "plain" style is suitable if it is known that there
will always be an internal reference so that the object doesn't get
lost, and if there is some process whereby this internal reference
will eventually get used to find and free the object.
Anti-patterns
To wrap up this little review of reference counting as an introduction
to design patterns, we will discuss the related concept of an
anti-pattern.
While design patterns are approaches that have been shown to work and
should be encouraged, anti-patterns are approaches that history shows
us do not work well and should be discouraged.
Your author would like to suggest that the use of a "bias" in a
refcount is an example of an anti-pattern.
A bias in this context is a large value that is added to, or
subtracted from, the reference count and is used to effectively
store one bit of information.
We have already glimpsed the idea of a bias in the management of
s_count for superblocks. In this case the presence of the bias
indicates that the value of s_active is non-zero, which is easy enough
to test directly. So the bias adds no value here and only obscures the
true purpose of the code.
Another example of a bias is in the management of
struct sysfs_dirent,
in fs/sysfs/sysfs.h and fs/sysfs/dir.c. Interestingly,
sysfs_dirent has two refcounts just like superblocks, also called
s_count and s_active. In this case s_active has a large negative bias
when the entry is being deactivated. The same bit of information
could be stored just as effectively and much more clearly in the
flag word s_flags. Storing single bits of information in flags is
much easier to understand than storing them as a bias in a counter,
and should be preferred.
In general, using a bias does not add any clarity as it is not a
common pattern. It cannot add more functionality than a single flag
bit can provide, and it would be extremely rare that memory is so
tight that one bit cannot be found to record whatever would otherwise
be denoted by the presence of the bias. For these reasons, biases in
refcounts should be considered anti-patterns and avoided if at all
possible.
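The contrast can be made concrete with a small sketch. Both functions below answer the same question; all names and constants are invented for illustration, loosely echoing the sysfs_dirent case.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEACT_BIAS	(1 << 30)	/* bias style: magic offset */
#define F_DEACTIVATED	0x1		/* flag style: one bit */

struct entry {
	atomic_int count;
	atomic_uint flags;
};

/* Bias anti-pattern: "deactivated" is inferred from the count being
 * far below zero; the reader must know the magic constant and trust
 * that the real count never legitimately gets that low. */
static bool is_deactivated_bias(struct entry *e)
{
	return atomic_load(&e->count) < -(DEACT_BIAS / 2);
}

/* Flag pattern: the same single bit of information, stated directly
 * where any reader would look for it. */
static bool is_deactivated_flag(struct entry *e)
{
	return atomic_load(&e->flags) & F_DEACTIVATED;
}
```

The flag version also survives refactoring: changing the range of the reference count cannot silently break it, whereas the bias version depends on an implicit bound on the true count.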
Conclusion
This brings to a close our exploration of the various design patterns
surrounding reference counts. Simply having terminology such a "kref"
versus "kcref" and "external" versus "internal" references can be very
helpful in increasing the visibility of the behaviour of different
references and counts. Having code to embody this as we do with kref
and could with kcref, and using this code at every opportunity, would
be a great help both to developers who might find it easy to choose
the right model first time, and to reviewers who can see more clearly
what is intended.
The design patterns we have covered in this article are:
- kref:
When the lifetime of an object extends only to the moment that
the last external reference is dropped, a kref is appropriate.
If there are any internal references to the object, they can only
be promoted to external references with atomic_inc_not_zero().
Examples: s_active and s_count in struct
super_block.
- kcref:
When the lifetime of an object can extend beyond the dropping
of the last external reference, the kcref with its
atomic_dec_and_lock() is appropriate. An internal reference can
only be converted to an external reference while the subsystem
lock is held.
Examples: i_count in struct inode.
- plain:
When the lifetime of an object is subordinate to some other
object, the plain reference pattern is appropriate. Non-zero
reference counts on the object must be treated as internal
references to the parent object, and converting internal references
to external references must follow the same rules as for the
parent object.
Examples: b_count in struct buffer_head.
- biased-reference:
When you feel the need to add a large bias
to the value in a reference count to indicate some particular
state, don't. Use a flag bit elsewhere. This is an
anti-pattern.
Next week we will move on to another area where the Linux kernel has
proved some successful design patterns and explore the slightly
richer area of complex data structures. (Part 2 and part 3 of this series are now
available).
Exercises
As your author has been reminded while preparing this series, there is
nothing like a directed study of code to clarify understanding of
these sorts of issues. With that in mind, here are some exercises for
the interested reader.
- Replace s_active and s_count in struct super_block with krefs, discarding
S_BIAS in the process. Compare the result with the original using
the trifecta of Correctness, Maintainability, and Performance.
- Choose a more meaningful name for mnt_pinned and related
functions that manipulate it.
- Add a function to the kref library that makes use of
atomic_inc_not_zero(), and using it (or otherwise) remove the use of
atomic_dec_and_lock() on a kref in
net/sunrpc/svcauth.c - a usage
which violates the kref abstraction.
- Examine the _count reference count in struct page
(see mm_types.h
for example) and determine whether it behaves most like a kref or a
kcref (hint: it is not "plain"). This should involve identifying
any and all internal references and related locking rules.
Identify why the page cache (struct address_space.page_tree) owns
a counted reference or explain why it should not.
This will involve understanding page_freeze_refs() and its usage in
__remove_mapping(), as well as
page_cache_{get,add}_speculative().
Bonus credit: provide a series of minimal self-contained patches to
implement any changes that the above investigations proved useful.
Comments (16 posted)
Page editor: Jake Edge