Linux kernel design patterns - part 1
This increase in visibility takes many forms:
- The checkpatch.pl script highlights many deviations
from accepted code formatting style. This encourages people (who
remember to use the script) to fix those style errors. So, by
increasing the visibility of the style guide, we increase the
uniformity of appearance and so, to some extent, the quality.
- The "lockdep" system (when enabled) dynamically measures
dependencies between locks (and related states such as whether
interrupts are enabled). It then reports anything that looks odd.
These oddities will not always mean a deadlock or similar problem is
possible, but in many cases they do, and the deadlock possibility
can be removed. So by increasing the visibility of the lock
dependency graph, quality can be increased.
- The kernel contains various other enhancements to visibility such
as the "poisoning" of unused areas of memory so invalid access will
be more apparent, or simply the use of symbolic names rather than
plain hexadecimal addresses in stack traces so that bug reports are more
useful.
- At a much higher level, the "git" revision tracking software that is used for tracking kernel changes makes it quite easy to see who did what and when. The fact that it encourages a comment to be attached to each patch makes it that much easier to answer the question "Why is the code this way". This visibility can improve understanding and that is likely to improve quality as more developers are better informed.
There are plenty of other areas where increasing visibility does, or could, improve quality. In this series we will explore one particular area where your author feels visibility could be increased in a way that could well lead to qualitative improvements. That area is the enunciation of kernel-specific design patterns.
Design Patterns
A "design pattern" is a concept that was first expounded in the field of Architecture and was brought to computer engineering, and particularly the Object Oriented Programming field, though the 1994 publication Design Patterns: Elements of Reusable Object-Oriented Software. Wikipedia has further useful background information on the topic.
In brief, a design pattern describes a particular class of design problem, and details an approach to solving that class of problem that has proven effective in the past. One particular benefit of a design pattern is that it combines the problem description and the solution description together and gives them a name. Having a simple and memorable name for a pattern is particularly valuable. If both developer and reviewer know the same names for the same patterns, then a significant design decision can be communicated in one or two words, thus making the decision much more visible.
In the Linux kernel code base there are many design patterns that have been found to be effective. However most of them have never been documented so they are not easily available to other developers. It is my hope that by explicitly documenting these patterns, I can help them to be more widely used and, thus, developers will be able to achieve effective solutions to common problems more quickly.
In the remainder of this series we will be looking at three problem domains and finding a variety of design patterns of greatly varying scope and significance. Our goal in doing so is to not only to enunciate these patterns, but also to show the range and value of such patterns so that others might make the effort to enunciate patterns that they have seen.
A number of examples from the Linux kernel will be presented throughout this series as examples are an important part of illuminating any pattern. Unless otherwise stated they are all from 2.6.30-rc4.
Reference Counts
The idea of using a reference counter to manage the lifetime of an object is fairly common. The core idea is to have a counter which is incremented whenever a new reference is taken and decremented when a reference is released. When this counter reaches zero any resources used by the object (such as the memory used to store it) can be freed.
The mechanisms for managing reference counts seem quite straightforward. However there are some subtleties that make it quite easy to get the mechanisms wrong. Partly for this reason, the Linux kernel has (since 2004) a data type known as "kref" with associated support routines (see Documentation/kref.txt, <linux/kref.h>, and lib/kref.c). These encapsulate some of those subtleties and, in particular, make it clear that a given counter is being used as a reference counter in a particular way. As noted above, names for design patterns are very valuable and just providing that name for kernel developers to use is a significant benefit for reviewers.
In the words of Andrew Morton:
This inclusion of kref in the Linux kernel gives both a tick and a cross to the kernel in terms of explicit support for design patterns. A tick is deserved as the kref clearly embodies an important design pattern, is well documented, and is clearly visible in the code when used. It deserves a cross however because the kref only encapsulates part of the story about reference counting. There are some usages of reference counting that do not fit well into the kref model as we will see shortly. Having a "blessed" mechanism for reference counting that does not provide the required functionality can actually encourage mistakes as people might use a kref where it doesn't belong and so think it should just work where in fact it doesn't.
A useful step to understanding the complexities of reference counting is to understand that there are often two distinctly different sorts of references to an object. In truth there can be three or even more, but that is very uncommon and can usually be understood by generalizing the case of two. We will call the two types of references "external" and "internal", though in some cases "strong" and "weak" might be more appropriate.
An "external" reference is the sort of reference we are probably most accustomed to think about. They are counted with "get" and "put" and can be held by subsystems quite distant from the subsystem that manages the object. The existence of a counted external reference has a strong and simple meaning: This object is in use.
By contrast, an "internal" reference is often not counted, and is only held internally to the system that manages the object (or some close relative). Different internal references can have very different meanings and hence very different implications for implementation.
Possibly the most common example of an internal reference is a cache which provides a "lookup by name" service. If you know the name of an object, you can apply to the cache to get an external reference, providing the object actually exists in the cache. Such a cache would hold each object on a list, or on one of a number of lists under e.g. a hash table. The presence of the object on such a list is a reference to the object. However it is likely not a counted reference. It does not mean "this object is in use" but only "this object is hanging around in case someone wants it". Objects are not removed from the list until all external references have been dropped, and possibly they won't be removed immediately even then. Clearly the existence and nature of internal references can have significant implications on how reference counting is implemented.
One useful way to classify different reference counting styles is by the required implementation of the "put" operation. The "get" operation is always the same. It takes an external reference and produces another external reference. It is implemented by something like:
assert(obj->refcount > 0) ; increment(obj->refcount);
or, in Linux-kernel C:
BUG_ON(!atomic_read(&obj->refcnt)) ; atomic_inc(&obj->refcnt);
Note that "get" cannot be used on an unreferenced object. Something else is needed there.
The "put" operation comes in three variations. While there can be some overlap in use cases, it is good to keep them separate to help with clarity of the code. The three options, in Linux-C, are:
1 atomic_dec(&obj->refcnt); 2 if (atomic_dec_and_test(&obj->refcnt)) { ... do stuff ... } 3 if (atomic_dec_and_lock(&obj->refcnt, &subsystem_lock)) { ..... do stuff .... spin_unlock(&subsystem_lock); }
The "kref" style
Starting in the middle, option "2" is the style used for a kref. This style is appropriate when the object does not outlive its last external reference. When that reference count becomes zero, the object needs to be freed or otherwise dealt with, hence the need to test for the zero condition with atomic_dec_and_test().
Objects that fit this style often do not have any internal references to worry about, as is the case with most objects in sysfs, which is a heavy user of kref. If, instead, an object using the kref style does have internal references, it cannot be allowed to create an external reference from an internal reference unless there are known to be other external references. If this is necessary, a primitive is available:
atomic_inc_not_zero(&obj->refcnt);
which increments a value providing it isn't zero, and returns a result indicating success or otherwise. atomic_inc_not_zero() is a relatively recent invention in the linux kernel, appearing in late 2005 as part of the lockless page cache work. For this reason it isn't widely used and some code that could benefit from it uses spinlocks instead. Sadly, the kref package does not make use of this primitive either.
An interesting example of this style of reference that does not use kref, and does not even use atomic_dec_and_test() (though it could and arguably should) are the two ref counts in struct super: s_count and s_active.
s_active fits the kref style of reference counts exactly. A superblock starts life with s_active being 1 (set in alloc_super()), and, when s_active becomes zero, further external references cannot be taken. This rule is encoded in grab_super(), though this is not immediately clear. The current code (for historical reasons) adds a very large value (S_BIAS) to s_count whenever s_active is non-zero, and grab_super() tests for s_count exceeding S_BIAS rather than for s_active being zero. It could just as easily do the latter test using atomic_inc_not_zero(), and avoid the use of spinlocks.
s_count provides for a different sort of reference which has both "internal" and "external" aspects. It is internal in that its semantic is much weaker than that of s_active-counted references. References counted by s_count just mean "this superblock cannot be freed just now" without asserting that it is actually active. It is external in that it is much like a kref starting life at 1 (well, 1*S_BIAS actually), and when it becomes zero (in __put_super()) the superblock is destroyed.
So these two reference counts could be replaced by two krefs, providing:
- S_BIAS was set to 1
- grab_super() used atomic_inc_not_zero() rather than testing against S_BIAS
and a number of spinlock calls could go away. The details are left as an exercise for the reader.
The "kcref" style
The Linux kernel doesn't have a "kcref" object, but that is a name that seems suitable to propose for the next style of reference count. The "c" stands for "cached" as this style is very often used in caches. So it is a Kernel Cached REFerence.
A kcref uses atomic_dec_and_lock() as given in option 3 above. It does this because, on the last put, it needs to be freed or checked to see if any other special handling is needed. This needs to be done under a lock to ensure no new reference is taken while the current state is being evaluated.
A simple example here is the i_count reference counter in struct inode. The important part of iput() reads:
if (atomic_dec_and_lock(&inode->i_count, &inode_lock)) iput_final(inode);
where iput_final() examines the state of the inode and decides if it can be destroyed, or left in the cache in case it could get reused soon.
Among other things, the inode_lock prevents new external references being created from the internal references of the inode hash table. For this reason converting internal references to external references is only permitted while the inode_lock is held. It is no accident that the function supporting this is called iget_locked() (or iget5_locked()).
A slightly more complex example is in struct dentry, where d_count is managed like a kcref. It is more complex because two locks need to be taken before we can be sure no new reference can be taken - both dcache_lock and de->d_lock. This requires that either we hold one lock, then atomic_dec_and_lock() the other (as in prune_one_dentry()), or that we atomic_dec_and_lock() the first, then claim the second and retest the refcount - as in dput(). This is good example of the fact that you can never assume you have encapsulated all possible reference counting styles. Needing two locks could hardly be foreseen.
An even more complex kcref-style refcount is mnt_count in struct vfsmount. The complexity here is the interplay of the two refcounts that this structure has: mnt_count, which is a fairly straightforward count of external references, and mnt_pinned, which counts internal references from the process accounting module. In particular it counts the number of accounting files that are open on the filesystem (and as such could use a more meaningful name). The complexity comes from the fact that when there are only internal references remaining, they are all converted to external references. Exploring the details of this is again left as an exercise for the interested reader.
The "plain" style
The final style for refcounting involves just decrementing the reference count (atomic_dec()) and not doing anything else. This style is relatively uncommon in the kernel, and for good reason. Leaving unreferenced objects just lying around isn't a good idea.
One use of this style is in struct buffer_head, managed by fs/buffer.c and <linux/buffer_head.h>. The put_bh() function is simply:
static inline void put_bh(struct buffer_head *bh) { smp_mb__before_atomic_dec(); atomic_dec(&bh->b_count); }
This is OK because buffer_heads have lifetime rules that are closely tied to a page. One or more buffer_heads get allocated to a page to chop it up into smaller pieces (buffers). They tend to remain there until the page is freed at which point all the buffer_heads will be purged (by drop_buffers() called from try_to_free_buffers()).
In general, the "plain" style is suitable if it is known that there will always be an internal reference so that the object doesn't get lost, and if there is some process whereby this internal reference will eventually get used to find and free the object.
Anti-patterns
To wrap up this little review of reference counting as an introduction to design patterns, we will discuss the related concept of an anti-pattern. While design patterns are approaches that have been shown to work and should be encouraged, anti-patterns are approaches that history shows us do not work well and should be discouraged.
Your author would like to suggest that the use of a "bias" in a refcount is an example of an anti-pattern. A bias in this context is a large value that is added to, or subtracted from, the reference count and is used to effectively store one bit of information. We have already glimpsed the idea of a bias in the management of s_count for superblocks. In this case the presence of the bias indicates that the value of s_active is non-zero, which is easy enough to test directly. So the bias adds no value here and only obscures the true purpose of the code.
Another example of a bias is in the management of struct sysfs_dirent, in fs/sysfs/sysfs.h and fs/sysfs/dir.c. Interestingly, sysfs_dirent has two refcounts just like superblocks, also called s_count and s_active. In this case s_active has a large negative bias when the entry is being deactivated. The same bit of information could be stored just as effectively and much more clearly in the flag word s_flags. Storing single bits of information in flags is much easier to understand that storing them as a bias in a counter, and should be preferred.
In general, using a bias does not add any clarity as it is not a common pattern. It cannot add more functionality than a single flag bit can provide, and it would be extremely rare that memory is so tight that one bit cannot be found to record whatever would otherwise be denoted by the presence of the bias. For these reasons, biases in refcounts should be considered anti-patterns and avoided if at all possible.
Conclusion
This brings to a close our exploration of the various design patterns surrounding reference counts. Simply having terminology such a "kref" versus "kcref" and "external" versus "internal" references can be very helpful in increasing the visibility of the behaviour of different references and counts. Having code to embody this as we do with kref and could with kcref, and using this code at every opportunity, would be a great help both to developers who might find it easy to choose the right model first time, and to reviewers who can see more clearly what is intended.
The design patterns we have covered in this article are:
- kref:
When the lifetime of an object extends only to the moment that
the last external reference is dropped, a kref is appropriate.
If there are any internal reference to the object, they can only
be promoted to external references with atomic_inc_not_zero().
Examples: s_active and s_count in struct
super_block.
- kcref:
With this the lifetime of an object can extend beyond the dropping
of the last external reference, the kcref with its
atomic_dec_and_lock() is appropriate. An internal reference can
only be converted to an external reference will the subsystem
lock is held.
Examples: i_count in struct inode.
- plain:
When the lifetime of an object is subordinate to some other
object, the plain reference pattern is appropriate. Non-zero
reference counts on the object must be treated as internal
reference to the parent object, and converting internal references
to external references must follow the same rules as for the
parent object.
Examples: b_count in struct buffer_head.
- biased-reference: When you feel the need to use add a large bias to the value in a reference count to indicate some particular state, don't. Use a flag bit elsewhere. This is an anti-pattern.
Next week we will move on to another area where the Linux kernel has proved some successful design patterns and explore the slightly richer area of complex data structures. (Part 2 and part 3 of this series are now available).
Exercises
As your author has been reminded while preparing this series, there is nothing like a directed study of code to clarify understanding of these sorts of issues. With that in mind, here are some exercises for the interested reader.
- Replace s_active and s_count in struct super with krefs, discarding
S_BIAS in the process. Compare the result with the original using
the trifecta of Correctness, Maintainability, and Performance.
- Choose a more meaningful name for mnt_pinned and related
functions that manipulate it.
- Add a function to the kref library that makes use of
atomic_inc_not_zero(), and using it (or otherwise) remove the use of
atomic_dec_and_lock() on a kref in
net/sunrpc/svcauth.c - a usage
which violates the kref abstraction.
- Examine the _count reference count in struct page (see mm_types.h for example) and determine whether it behaves most like a kref or a kcref (hint: it is not "plain"). This should involve identifying any and all internal references and related locking rules. Identify why the page cache (struct address_space.page_tree) owns a counted reference or explain why it should not. This will involve understanding page_freeze_refs() and its usage in __remove_mapping(), as well as page_cache_{get,add}_speculative().
Bonus credit: provide a series of minimal self-contained patches to
implement any changes that the above investigations proved useful.
Index entries for this article | |
---|---|
Kernel | Development model/Patterns |
GuestArticles | Brown, Neil |
Posted Jun 9, 2009 4:36 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (3 responses)
This trick is useful in cases where it is only necessary to actually test the counter for zero-or-not in special situations, such as when attempting to remove or reconfigure the I/O device. When such a situation arises, one subtracts out the bias, and then runs inefficiently until the I/Os drain and the reference count goes to zero, at which point the device may be safely removed or reconfigured. The bias may be added back in to return to normal operating conditions.
A very specialized use of a reference count, perhaps, but one that has actually appeared in real code in past lives.
Posted Jun 9, 2009 5:39 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (2 responses)
Yes, I wouldn't be surprised if reference counts that are spread across multiple per-cpu variables would have very different trade-offs and hence different Patterns to the more traditional kinds!
I'm having trouble picturing exactly how the counters you describe would work (and particularly exactly what happens when the per-cpu counter hits zero) but it seems possible that the "one bit of information" that I claim a bias stores would, in this case, be the bit "Someone cares about a precise total value" and that one bit could conceivably be stored in a read-mostly
cache line that could be easily shared among multiple processors.
So the code might look like:
Is there any chance that would achieve the same result?
It may well be a case where you don't want to pay the price of an
extra bit though.
Posted Jun 9, 2009 13:50 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (1 responses)
But it has been a good ten years since I messed with this, so I should take another look at it.
In any case, I very much agree with your overall premise that higher-level primitives are a very great improvement over continually re-inventing the wheel, most especially for those wheels with a strong history of being re-invented badly!!!
Posted Dec 23, 2011 5:30 UTC (Fri)
by atiqure (guest, #81951)
[Link]
Posted Jun 9, 2009 10:24 UTC (Tue)
by ebiederm (subscriber, #35028)
[Link] (1 responses)
Placing the extra bit of information in s_flags does not work because that breaks the atomic nature of the current test.
Furthermore you have missed one of the most interesting bits about the reference counting in sysfs. Holding just about any lock when removing a sysfs file and by extention a kobject or struct device is a fun way to dead lock the kernel without lockdep having a clue.
Posted Jun 9, 2009 12:15 UTC (Tue)
by neilbrown (subscriber, #359)
[Link]
Hi.
Thanks for your comments!
Your "furthermore" observation about locks is probably quite true. However I cannot see exactly how it is relevant. I'm not advocating adding any locks in there.
Your statement that placing the information in s_flags "does not work" is one that I don't agree with. Seeing I didn't include this among the exercises, I will give the reason here.
Let us hypothesise a flag SYS_FLAG_ACTIVE which is set at the same time
that s_active is set to 0. This flag indicates that the entry is 'active'
and it implies a counted reference on the sysfs_dirent. So instead
of initialising s_active to 0, we initialise it to 1 (which is more in
keeping with the kref pattern anyway).
In place of the (rather complex) loop in sysfs_get_active, we have the
code
sysfs_deactivate then becomes
Finally sysfs_put_active becomes:
Providing support for atomic_inc_not_zero were added to kref (which
I have already advocated), this would allow s_active to become
a kref.
I believe the net result of these changes would be that the code is easier
to understand, and hence harder to break at a later date. They do not
change the functionality at all.
Posted Jun 9, 2009 14:11 UTC (Tue)
by abatters (✭ supporter ✭, #6932)
[Link] (1 responses)
Heh. I suggested introducing kref_get_not_zero() (based on atomic_inc_not_zero()) back in January, but got a lot of resistance. Better luck to the next guy.
Posted Jun 9, 2009 20:42 UTC (Tue)
by neilbrown (subscriber, #359)
[Link]
Thanks for that reference.
I cannot comment on the code under the microscope in that thread as I am not familiar with it at all. However your observation that kref_get_not_zero() can be safe when used under a lock that excludes the object from being freed is one that I completely agree with. It is therefore tempting to call the function kref_get_locked() to match the inode_get_locked that already exists.
However it can be valid to use atomic_inc_not_zero() when not holding a lock,
as in the example in a previous thread. In that case it is safe not because a lock is held, but because a secondary reference is held (s_count).
Either way, the complete pattern description for kref would need to spell
out how and when to use kref_get_not_zero() (or whatever it gets called).
I must say that Greg's 'ick' in the reply to your linked post surprised me.
kref_get_not_zero is (in my view) much less 'ick' than kref_set which is part of the current kref API (exercise: find all 3 places that use kref_set and replace them with calls to other parts of the api).
Posted Jun 9, 2009 14:21 UTC (Tue)
by jreiser (subscriber, #11027)
[Link] (1 responses)
Please correct this confusing usage. Is there a missing "not"?
Posted Jun 9, 2009 14:44 UTC (Tue)
by jake (editor, #205)
[Link]
oops ... fixed now, thanks!
jake
Posted Jun 10, 2009 20:32 UTC (Wed)
by intgr (subscriber, #39733)
[Link] (2 responses)
Posted Jun 11, 2009 0:56 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (1 responses)
Posted Apr 15, 2012 23:47 UTC (Sun)
by alison (subscriber, #63752)
[Link]
Humorously, I just had a look at the Wikipedia article. "Huh, I thought I understood the comma operator!" Copied the code in question out, compiled and tested it, and sure enough the latest revision to the article was erroneous. So yeah, I guess the Comma Operator causes some confusion!
-- Alison
Posted Jun 11, 2009 14:41 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (2 responses)
Just for fun: http://lwn.net/Articles/306843/
Sometimes I dream of a kernel written in a high-level programming language...
Posted Jun 12, 2009 12:05 UTC (Fri)
by intgr (subscriber, #39733)
[Link] (1 responses)
Obviously none of these is usable as a desktop OS though.
Posted Jun 12, 2009 12:39 UTC (Fri)
by smitty_one_each (subscriber, #28989)
[Link]
Posted Nov 10, 2022 7:23 UTC (Thu)
by dafnaf (guest, #158251)
[Link] (2 responses)
BUG_ON(atomic_read(&obj->refcnt)) ; atomic_inc(&obj->refcnt);
shouldn't it be BUG_ON(!atomic_read(&obj->refcnt)) //(bug if refcount is 0)
Posted Nov 10, 2022 23:47 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Yes it should. Thanks for pointing that out.
Posted Nov 11, 2022 0:13 UTC (Fri)
by corbet (editor, #1)
[Link]
One use for reference-count biases...
One use for reference-count biases...
if (dec_and_test(per_cpu_counter))
if (__test_bit(flag_name, &flag_variable))
do some costly cross-cpu thing;
Hmmmm... I was thinking more in terms of something like the following:One use for reference-count biases...
preempt_disable();
if (per_cpu_counter > 0)
per_cpu_counter++;
else
do some costly global-lock-and-variable thing
preempt_enable();
One use for reference-count biases...
Linux Kernel Design Patterns - Part 1
Linux Kernel Design Patterns - Part 1
if (test_bit(SYS_FLAG_ACTIVE, &sd->s_flags)
&& atomic_inc_not_zero(&sd->s_active))
return sd;
else
return NULL;
For me, that is easier to understand.
sd->s_sibling = (void *)&wait;
clear_bit(SYS_FLAG_ACTIVE, &sd->s_flags);
if (! atomic_dec_and_test(&sd->s_active))
wait_for_completion(&wait);
sd->s_sibling = NULL;
Note that clearing the ACTIVE bit requires that we drop the reference
that it implies. If that was the last reference we don't need to
wait for any completion.
if (atomic_dec_and_test(&sd->s_active)) {
struct completion *cmpl = (void*)sd->s_sibling;
complete(cmpl);
}
A few moments thought shows that this now allows us to make sysfs_deactivate even
simpler.
sd->s_sibling = (void *)&wait;
clear_bit(SYS_FLAG_ACTIVE, &sd->s_flags);
sysfs_put_active(sd);
wait_for_completion(&wait);
This calls "complete" and then "wait_for_completion" in the same
thread, which is a bit unusual, but should work perfectly.
kref_get_not_zero()
kref_get_not_zero()
Sadly, the kref package does make use of this primitive either. [in section The "kref" style at the end of paragraph three.]
Unfortunate wording
Unfortunate wording
Typo
3 if (atomic_dec_and_lock(&obj->refcnt), &subsystem_lock) {
Either we've got variable argument if statements now, or someone put the closing parenthese in the wrong place.
It's a comma operator.
It's a comma operator.
"the infamous http://en.wikipedia.org/wiki/Comma_operator".
Linux Kernel Design Patterns - Part 1
Ha! Now I know why kernel development is an order of magnitude harder than developing user level applications. This is not just the intrinsic problem of most bugs crashing the whole system. This is also the problem that advanced software engineering concepts like code reuse are only starting to reach these shores!
</troll>
Linux Kernel Design Patterns - Part 1
There are many such projects, including Inferno from Bell Labs, Singularity from Microsoft (!) and a whole lot of others.
It's an interesting research area because it completely throws away decades-old CPU memory protection, because typesafe virtual machines guarantee that you can't access anything you don't have a reference to.
Linux Kernel Design Patterns - Part 1
Linux kernel design patterns - part 1
Linux kernel design patterns - part 1
I've applied the fix, thanks. The error has only been there for 13 years...
Fixed