Finding locking bugs with Smatch
Smatch is a GPL-licensed static-analysis tool for C that has a lot of specialized checks for the kernel. Smatch has been used in the kernel for more than 20 years; Dan Carpenter, its primary author, decided last year that some details of its plugin system were due for a rewrite. He spoke at Linaro Connect 2025 about his work on Smatch, the changes to its implementation, and how those changes enabled him to easily add additional checks for locking bugs in the kernel.
Video of the talk is
available, and
Carpenter's slides can be found on
Linaro's website.
Carpenter began by apologizing for the relative complexity of this talk,
compared to some of his presentations about Smatch in prior years. "We're
running out of easy checks to write,
" he explained. Smatch is designed to
permit writing project-specific checks; over the years, a large number of
kernel-specific checks have been added to the code,
so the latest work has moved on to more
complicated topics, such as locking.
One of the things that sets Smatch apart from other static-analysis tools,
Carpenter said, is its support for control-flow analysis and cross-function
analysis. He frequently uses both of these features to understand new
subsystems; Smatch can "tell you where a variable is set, where callers of a
function are, and what a function can return,
" among other things. For
example, Smatch might show that a particular function has three callers, all of
which hold a particular lock when they call it. From that, the programmer can
infer the implicit locking requirements of the function.
That kind of report requires cross-function analysis, to trace nested calls with a lock held, but it also requires some starting knowledge of which functions acquire or release a lock. In the kernel, Smatch obtains that information from a hand-written table of every lock-related function, and then propagates that information through the different call trees. Rebuilding the complete database of calls takes multiple passes and five to six hours, which he said he does every night.
Once the database is built, however, it allows files to be easily checked for common locking problems. The most common mistake is to fail to unlock a lock in a function's error path, he said. Smatch finds places where this has occurred by looking for functions that acquire a lock, and then have some possible control-flow path along which it is not released. That is slightly more complicated than it sounds.
"It's harder than you might think to know what the start state is,
" he
explained. If a function makes a call to spin_lock(), one might
reasonably assume that the lock was not held before that point. But some
functions behave differently depending on whether a lock is already held, so
that is control-flow-sensitive as well. Also, sometimes a lock is referred to by
multiple names, being locked by one name and unlocked via another. This
complexity had resulted in Smatch's lock-tracking code slowly becoming an
unreadable mess. "And I'm the one who wrote it.
"
So, in the past year, Carpenter has rewritten everything. The reimplementation of the locking checks provides a blueprint for how to write modular Smatch checks, he said. Checks can now call add_lock_hook() and add_unlock_hook() to be informed when Smatch finds that a function call acquires or releases a lock somewhere in its call tree. Locks are also now tracked by type, instead of by name, in order to reduce problems with one lock being referred to by more than one name.
There's a slight wrinkle with code that uses C's
cleanup attribute to
automatically unlock locks. On the one hand, it mostly eliminates bugs related
to forgotten unlocks; on the other hand, it's a "headache for static
analysis
" since it makes the lock object harder to access and track.
Ultimately, since they avoid many locking bugs, Smatch can
"mostly just ignore them
".
Carpenter has used the new structure to write checks for double lock and unlock
bugs as well. Unlike other static-analysis projects, Smatch focuses
less on "universal static properties
" and more on "the actual bugs
people are writing
". Smatch will not catch every possible double lock,
double unlock, or forgotten unlock. That increases the number of false negatives
from the tool, but it results in a much bigger reduction in false positives,
he said.
I asked how he found the classes of bugs that people actually write, in order to target them with Smatch checks. He explained that he reviews patches sent to the linux-stable mailing list in order to find bugs that could have been found earlier with static analysis. He encouraged other people to try the same thing, as he has found it educational.
In the future, Carpenter wants to extend Smatch's double-lock checks to operate across function boundaries, to take advantage of Clang's upcoming support for tracking the relationship between locks and their data, and to handle lock-ordering bugs.
As time wound down, one member of the audience wanted to know how Smatch compared to Cppcheck, Coccinelle, and other static-analysis tools. Other open-source tools do not have good control-flow analysis, and practically no cross-function analysis, Carpenter said. Smatch does those things better, but it has its own weaknesses. The main problem is that it hasn't really been tested outside the kernel, he explained, so it's not clear how well Smatch will handle other styles of code. Smatch is also relatively slow.
Coccinelle is fast, and can generate fixes for many problems. Sparse is good at
finding endianness bugs and user-space pointers dereferenced in kernel space. "But in
terms of flow analysis, Smatch is the only tool that we have in open
source.
"
After Carpenter's talk, I ran the tool against my own copy of the kernel; in that process, I learned that Smatch is best run from source. The last release predates Carpenter's rewrite, and there are a number of useful scripts included in the source distribution that are not present in the distribution packages. Smatch's terse documentation covers how to build its analysis database and run existing checks, but not how to query the database in a more free-form manner. The smatch_data/db/smdb.py script included in the source distribution can be used for that purpose.
[Thanks to Linaro for sponsoring my travel to Linaro Connect.]
| Index entries for this article | |
|---|---|
| Kernel | Development tools/Static analysis |
| Conference | Linaro Connect/2025 |
Posted Jun 11, 2025 22:30 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (5 responses)
Posted Jun 12, 2025 7:09 UTC (Thu)
by error27 (subscriber, #8346)
[Link] (4 responses)
> Does your current code track RCU read locking as well as the more normal locks?
Yes.
> And does it follow flow across function pointers in operations structs?
Yep. When you call a function pointer, then it's recorded as "(struct inode_operations)->atomic_open" in the caller_info table. There is a function_ptr table which lists the functions implement that. The command `smdb.py functions inode_operations atomic_open` gives the list. So I could do:
$ echo "select distinct(ptr) from function_ptr;" | sqlite3 smatch_db.sqlite | grep inode_operations | tee ptrs
I ran this and it says nothing is holding the rcu_read lock... Huh. When I run "echo "select * from caller_info where type = 9030;" | sqlite3 smatch_db.sqlite | grep rcu_read_lock" it shows that it's being recorded in other places. One thing is that I'm not currently using the caller information to print warnings. It's stored in the database, but I haven't actually used that information yet so I haven't debugged it.
> So could you, for example, query the database to see what inode_operations are called without rcu_read_lock held, and then which implementations of those functions use rcu_dereference as though the lock were held?
Yeah. That's *supposed* to be easy but since I haven't debugged that part I'm probably going to run into surprises. What I do is I start with the simplest, stupidest test and then re-write it from there. So my first draft will be to complain if we call rcu_read_lock_held() and we're not holding the RCU read lock. Then from I'll go through and silence the false positives. There might be places where we check if the read lock is held and don't immediately print an error message, right? It's a stupid check. But the first draft is never going to work anyway. Maybe I'll print a different warning if we call rcu_read_lock_held() and Smatch thinks it can't possibly be held and I wouldn't publish that, but it might be useful for debugging.
Probably the actual check is to only complain if we call rcu_read_lock_held() from a rcu_dereference*() macro.
Posted Jun 20, 2025 16:12 UTC (Fri)
by error27 (subscriber, #8346)
[Link] (3 responses)
It turns out that rcu_read_locks are nestable and the locking code doesn't handle that correctly. I might be able to copy and paste the Smatch preempt code to handle this. It's slightly different because the preempt code warns if we ever have preempt disabled and in this case it's the opposite where we want to want if we're ever missing the lock. You want to lean towards which ever way gives fewer false positives. There are places where recursion could confuse it for example. Writing special modules to track rcu_read_lock() instead or re-using the smatch_locking.c code is a lot of typing but it's doable.
I'm a bit confused by rcu_dereference_rtnl(). If we're holding the rtnl_lock(), does that mean the rcu_read_lock() is held?
Posted Jun 20, 2025 20:37 UTC (Fri)
by johill (subscriber, #25196)
[Link]
Posted Jun 20, 2025 22:34 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Thanks for digging in to this! Don't try too hard on my account. It was little more than an idle thought because I had seen a d_compare function which uses rcu_dereference but can sometimes be called without rcu_read_locking() - but in practice almost never is (proc_sys_compare) and I wondered how hard it would be for smatch to notice. It seems "not impossible but not trivial" which is a good answer. In this case the anomaly can actually be found simply using sparse as the pointer in question is not marked __rcu.
BTW I tend to think of rcu_read_lock() like a refcount rather than like a read-lock. It is a refcount on "everything". So maybe copying smatch_locking.c wouldn't be the best approach. Does smatch track how many references the code owns to objects? Or does that sound too much like borrow-checking? Could it track references to the RCUniverse? Would that be any easier?
Thanks!
Posted Jun 23, 2025 10:39 UTC (Mon)
by error27 (subscriber, #8346)
[Link]
This is like preempt. Every spinlock bumps the refcount and people often hold multiple spinlocks at the same time.
With preempt the return states are tracked the same way described here. The one subtlety is that if you take multiple locks it only counts as one refcount bump. It doesn't matter the exact number of refcounts we're holding, only that it's greater than one and that the inc/dec functions pair correctly. So it's simpler to just say +1.
For the caller side, the preempt count is tracked as a number. But again it's not an exact number. If any of the callers are holding one or more spinlocks then that's +1. If we take another spinlock then we pump the count. Say we were holding two locks so we started as +1 and then we drop both locks then we'd go to negative and that's fine too. I guess we might miss a bug but that's not as bad as a false positive.
The difference between preempt count and rcu_read_lock() is that for preempt we only need to find one caller which has preempt disabled, but for rcu_read_lock() we need to find all the callers, otherwise it triggers a false positive. It's a much higher bar.
The way that recursion works in Smatch means we're always going to miss some callers. I guess one idea is that I could periodically delete all the simple recursive calls from the database where a function calls itself? It doesn't solve the halting problem but it probably silences quite a few false positives. But the main thing is that I'd have to create a list of functions which need to be manually silenced. There are about 400 warnings so it's probably a day's work to silence all the warnings.
RCU and function pointers
Does your current code track RCU read locking as well as the more normal locks? And does it follow flow across function pointers in operations structs?
So could you, for example, query the database to see what inode_operations are called without rcu_read_lock held, and then which implementations of those functions use rcu_dereference as though the lock were held?
RCU and function pointers
$ IFS="
" ; for i in $(cat ptrs) ; do smdb.py $i ; done | grep LOCK
RCU and function pointers
as for rcu_dereference_rtnl():
RCU and function pointers
* rcu_dereference_rtnl - rcu_dereference with debug checking
* @p: The pointer to read, prior to dereferencing
*
* Do an rcu_dereference(p), but check caller either holds rcu_read_lock()
* or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference()
You need either, not both. Both would be a bit silly.
There are a a lot of calls to rcu_dereference_check() elsewhere which are basically "either RCU or the lock protecting the thing".
The reason (IMHO) it says rtnl_dereference() or rcu_dereference() is preferable is that then you have selected either one of the contexts (holding the lock - RTNL in this case - or under RCU protection), but there's always going to be some code that doesn't know the context in whatever call stack it has.
RCU and function pointers
RCU and function pointers
