LWN: Comments on "Kernel quality control, or the lack thereof" https://lwn.net/Articles/774114/ This is a special feed containing comments posted to the individual LWN article titled "Kernel quality control, or the lack thereof". en-us Sat, 11 Oct 2025 01:14:46 +0000 Sat, 11 Oct 2025 01:14:46 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Kernel quality control, or the lack thereof https://lwn.net/Articles/776066/ https://lwn.net/Articles/776066/ PaulMcKenney <div class="FormattedComment"> Especially in my part of the Linux kernel, there is great value in preventing problems from reaching the -tip tree, let alone Linus's tree, let alone distributions, let alone customers. This great value stems from the fact that RCU bugs tend to be a bit difficult to reproduce and track down. It is therefore quite important to test the common cases.<br> <p> Nevertheless, your last sentence is spot on. It is precisely because rcutorture forces rare code paths and rare race conditions to execute more frequently that the number of RCU bugs reaching customers is kept down to a dull roar.<br> </div> Sat, 05 Jan 2019 22:29:35 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/776057/ https://lwn.net/Articles/776057/ joseph.h.garvin <div class="FormattedComment"> I think you have things backwards. If there is a bug in commonly executed code it's going to be exposed even if there isn't a test. It's the infrequently executed code that tends to contain bugs (e.g. handling error conditions). Testing the frequently executed code still has value in that it can prevent problems from reaching customers, but bugs in frequently executed code will tend to be discovered very quickly. 
In a sense, the entire point of tests is to make some code paths execute more frequently.<br> </div> Sat, 05 Jan 2019 18:06:03 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775737/ https://lwn.net/Articles/775737/ PaulMcKenney <div class="FormattedComment"> Assuming that timing considerations permitted it, one approach would be to run the code in a simulator that provided fault-injection for memory. That said, to your point, you would have to inject the fault rather carefully to trigger that particular set of assertions.<br> <p> But if the point was in fact to warn about unreliable memory, mightn't this sort of fault injection nevertheless be quite useful?<br> </div> Sun, 30 Dec 2018 15:44:34 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775733/ https://lwn.net/Articles/775733/ GoodMirek <div class="FormattedComment"> There are safety nets that are triggered only in cases impossible to reach by software means, but possible in case of HW failure.<br> I saw that multiple times while working on embedded systems.<br> <p> E.g.:<br> explosiveness = 255;<br> if (explosiveness != 255) assert();<br> <p> In theory, it should never assert.
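A minimal C sketch of this pattern (hypothetical names; the variable is marked volatile so the compiler cannot optimize the re-read away, which a real safety net of this kind would require):

```c
#include <assert.h>

/* Hypothetical safety-net sketch: the value is written and then immediately
 * re-checked, so the assertion can only fire if the underlying memory cell
 * (or bus) returns a different value than was stored. */
static volatile unsigned char explosiveness;

int arm(void)
{
    explosiveness = 255;
    assert(explosiveness == 255);   /* in theory, never fires */
    return explosiveness;
}
```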
In reality, it is desirable to minimize the risk that the 'explosiveness' variable is stored in a failed memory cell before that cell is used to indicate explosiveness of any kind.<br> <p> Or this case:<br> if (&lt;green&gt;)<br> explosiveness = 0;<br> else<br> explosiveness = 255;<br> if (explosiveness != 0 &amp;&amp; explosiveness != 255) assert();<br> <p> Such assertions are very rarely triggered and almost impossible to test, but when I saw them triggered in reality, even once in a lifetime, I appreciated their merit.<br> </div> Sun, 30 Dec 2018 11:22:33 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775567/ https://lwn.net/Articles/775567/ PaulMcKenney <div class="FormattedComment"> The safety-critical software that I know of (admittedly an obsolete and vanishingly small fraction of the total) limited "if" statements for exactly this reason.<br> <p> But yes, Murphy will always be with us. So even in safety critical code, at the end of the day it is about reducing risk rather than completely eliminating it.<br> <p> And to your point about Ariane 5's failed proof of correctness... Same issue as the classic failed proof of correctness for the binary search algorithm! Sadly, a proof of correctness cannot prove the assumptions on which it is based. So Murphy will always find a way, but it is nevertheless our job to thwart him. :-)<br> </div> Tue, 25 Dec 2018 00:05:47 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775560/ https://lwn.net/Articles/775560/ anton I would expect that defensive coding practices that lead to unreachable code (and thus &lt;100% coverage) are particularly widespread in safety-critical software. I.e., you cannot trigger this particular safety-net code, and you are pretty sure that it cannot be triggered, but not absolutely sure; or even if you are absolutely sure, you foresee that the safety net might become triggerable after maintenance.
Will you remove the safety net to increase your coverage metric? <p>OTOH, how do you test your safety net? Remember that Ariane 5 was exploded by a safety net that was supposed (and proven) to never trigger. Mon, 24 Dec 2018 20:42:02 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775294/ https://lwn.net/Articles/775294/ Wol <div class="FormattedComment"> That reminds me of a bug I found in Fortran 77 for Univac ... which led to data loss.<br> <p> The guys at the company I was contracted to were trying to abstract out the OS-specific features, like in the OPEN statement. You could declare a file as temporary, which led to it disappearing when it got closed - quite a nice feature IF IMPLEMENTED CORRECTLY!<br> <p> So, I opened a temporary file on a FUNIT, then later on re-used the same FUNIT for a permanent file. THE OS DIDN'T CLEAR THE TEMPORARY STATUS. So when I closed the permanent file, it disappeared ... :-)<br> <p> Cheers,<br> Wol<br> </div> Thu, 20 Dec 2018 12:26:58 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775280/ https://lwn.net/Articles/775280/ PaulMcKenney <div class="FormattedComment"> Dr. Who has to do that every few years as the actors change?<br> </div> Thu, 20 Dec 2018 04:37:01 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775277/ https://lwn.net/Articles/775277/ neilbrown <div class="FormattedComment"> Isn't it a simple matter of writing a virus (or a worm .. or a worm with a virus) which hunts out all bugs on the Internet and removes them. I remember Doctor Who did that once to get rid of all the photos of himself, so it can't be too hard.<br> </div> Thu, 20 Dec 2018 03:45:16 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775231/ https://lwn.net/Articles/775231/ PaulMcKenney <div class="FormattedComment"> In the old days, I would have agreed. 
But these days, "rm" can often be undone using "git checkout", or, in more extreme cases, "git clone". ;-)<br> </div> Wed, 19 Dec 2018 16:04:38 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775210/ https://lwn.net/Articles/775210/ mathstuf <div class="FormattedComment"> I prefer `rm` for that job. Works really well. ;)<br> </div> Wed, 19 Dec 2018 13:43:22 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775143/ https://lwn.net/Articles/775143/ PaulMcKenney <div class="FormattedComment"> If you didn't know better, you might think that there is no magic bullet to take out all bugs. :-)<br> </div> Tue, 18 Dec 2018 14:05:57 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/775138/ https://lwn.net/Articles/775138/ error27 <div class="FormattedComment"> Single return coding style introduces "forgot to set the error code" bugs. A "goto out;" might do a ton of things or it might not do anything so it is a mystery, but a "return -EINVAL;" is unambiguous.<br> <p> </div> Tue, 18 Dec 2018 11:46:02 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774624/ https://lwn.net/Articles/774624/ marcH <div class="FormattedComment"> Interesting, now the question is: how much did/do xfstests offer for the two specific features reported above?<br> <p> </div> Tue, 11 Dec 2018 20:59:21 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774623/ https://lwn.net/Articles/774623/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; Hey, you asked! :-)</font><br> <p> Sincere thanks!<br> </div> Tue, 11 Dec 2018 20:53:55 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774597/ https://lwn.net/Articles/774597/ nix <div class="FormattedComment"> Indeed if there is any part of the kernel this has really happened for, filesystems and in particular XFS must be it, and probably have the best test coverage of all. 
I mean, xfstests is called that for a *reason*. :) (I tell a lie: RCU has gone to the next step beyond this, formal model verification. Coming up with a formal model of XFS would be... a big job!)<br> <p> </div> Tue, 11 Dec 2018 17:37:54 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774559/ https://lwn.net/Articles/774559/ PaulMcKenney <div class="FormattedComment"> I have seen 80% used with some decent results. But it really depends on the code and its user base. 100% of your commonly executed code really does need to be covered. But of course the more users your code has, the larger the fraction of your code is commonly executed.<br> <p> If only (say) 30% of your code is tested, you very likely need to substantially increase your coverage. If (say) 90% of your code is tested, there is a good chance that there is some better use of your time than getting to 91%. But for any rule of thumb like these, there will be a great many exceptions, for example, the safety-critical code mentioned earlier.<br> <p> Hey, you asked! :-)<br> </div> Tue, 11 Dec 2018 14:11:49 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774536/ https://lwn.net/Articles/774536/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; Managers look at the % numbers and it causes developers to rearrange code to have a single return</font><br> <p> Interesting, could you share a simplified example?<br> <p> <font class="QuotedText">&gt; My experience with code coverage metrics has been mostly negative.</font><br> <p> While error-handling code, corner cases and... backup configurations are notoriously untested, I agree there are diminishing returns and better trade-offs past some point. 
I am curious what experts' guesstimate of that percentage typically is.<br> </div> Tue, 11 Dec 2018 07:37:27 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774532/ https://lwn.net/Articles/774532/ JdGordy <div class="FormattedComment"> Meanwhile, MISRA coding standards require single returns... :/<br> </div> Tue, 11 Dec 2018 05:51:38 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774513/ https://lwn.net/Articles/774513/ PaulMcKenney <div class="FormattedComment"> And anything less than 100% race coverage similarly guarantees a hole in your testing. As does anything less than 100% configuration-combination coverage. As does anything less than 100% input coverage. As does anything less than 100% hardware-configuration testing. As does ...<br> <p> For most types of software, at some point it becomes more important to test more races, more configurations, more input sequences, and more hardware configurations than to provide an epsilon increase in coverage by triggering that next assertion. After all, testing and coverage is about reducing risk given the time and resources at hand. Therefore, over-emphasizing one form of testing (such as coverage) will actually increase overall risk due to the consequent neglect of some other form of testing.<br> <p> Of course, there are some types of software where 100% coverage is reasonable, for example, certain types of safety-critical software. But in this case, you will be living under extremely strict coding standards so as to (among a great many other things) make 100% coverage affordable.<br> </div> Mon, 10 Dec 2018 21:50:04 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774510/ https://lwn.net/Articles/774510/ NAR But anything less than 100% coverage <b>guarantees</b> that some part of the code is not tested...
Mon, 10 Dec 2018 21:18:20 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774504/ https://lwn.net/Articles/774504/ sandeen <div class="FormattedComment"> "One hopes that these test will be added to a well maintained test project"<br> <p> You may wish to subscribe to fstests@vger.kernel.org or peruse git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git if this sort of thing is of interest to you.<br> </div> Mon, 10 Dec 2018 20:35:49 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774502/ https://lwn.net/Articles/774502/ PaulMcKenney <div class="FormattedComment"> I certainly have seen this. And it can be even worse, due to penalizing good assertions and discouraging debugging code. So oddly enough, one reason why coverage is not a panacea is exactly because some people believe that it is in fact a panacea! :-) <br> </div> Mon, 10 Dec 2018 20:21:38 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774500/ https://lwn.net/Articles/774500/ shemminger <div class="FormattedComment"> My experience with code coverage metrics has been mostly negative.<br> Managers look at the % numbers and it causes developers to rearrange code to have a single return to maximize the numbers.<br> <p> </div> Mon, 10 Dec 2018 19:59:55 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774497/ https://lwn.net/Articles/774497/ PaulMcKenney <div class="FormattedComment"> Code coverage is table stakes in this game, if that. Even 100% code coverage won't guarantee that you exercised the relevant race conditions, patterns of data on mass storage, combinations of kernel-boot parameters and Kconfig options, and so forth. 
For but one example, 100% code coverage in a CONFIG_PREEMPT=n kernel would not find a bug due to inopportune preemption that could happen in a CONFIG_PREEMPT=y kernel.<br> <p> Don't get me wrong, code coverage is fine as far as it goes, and it seems likely that the Linux kernel community would do well to do more of it, but it is not a panacea. In particular, beyond a certain point, it is probably not the best place to put your testing effort.<br> </div> Mon, 10 Dec 2018 18:42:16 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774446/ https://lwn.net/Articles/774446/ xorbe <div class="FormattedComment"> They need code coverage metrics, not just "billions of operations over a period of days."<br> </div> Mon, 10 Dec 2018 14:43:08 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774439/ https://lwn.net/Articles/774439/ metan <div class="FormattedComment"> Exactly: there is nothing that can replace well-thought-out tests written by senior developers; we only need to throw more manpower at the problem, which seems to be happening, albeit slowly.<br> </div> Mon, 10 Dec 2018 13:14:54 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774438/ https://lwn.net/Articles/774438/ nix <div class="FormattedComment"> Yeah -- and lcov won't help with the sorts of things this LWN post is talking about anyway. 100%-coverage filesystems could easily still have all these bugs, because they relate to specific states of the filesystem, and *no* coverage system could *possibly* track whether we got complete coverage of all possible corrupted filesystems! (Or, indeed, all possible states of the program: for all but the most trivial programs there are far too many.)<br> <p> There is no alternative to thinking about these problems, I'm afraid.
There is no magic automatable road to well-tested software of this complexity.<br> <p> </div> Mon, 10 Dec 2018 12:57:27 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774431/ https://lwn.net/Articles/774431/ metan <div class="FormattedComment"> The tool to generate coverage for the kernel is maintained at <a href="https://github.com/linux-test-project/lcov">https://github.com/linux-test-project/lcov</a>; it should work, but I haven't tried it.<br> <p> However, I can pretty much say that the main problems I see are various corner cases that are rarely hit (i.e. mostly failures and error propagation) and drivers. My take on this is that there is no point in doing coverage analysis when the gaps we have are enormous and easy to spot. Just have a look at our backlog of missing coverage in LTP at the moment <a href="https://github.com/linux-test-project/ltp/labels/missing%20coverage">https://github.com/linux-test-project/ltp/labels/missing%...</a>, and these are just scratching the surface with the most obviously missing syscalls. We may try to proceed with the coverage analysis once we are out of work there, which will hopefully happen at some point.<br> <p> The problems with corner cases can likely be caught by a combination of unit testing and fuzzing. Driver testing is more problematic, though: there is only so much you can do with qemu and emulated hardware. Proper driver testing needs a reasonably sized lab stacked with hardware, and it's much more problematic to set up and maintain, which is not going to happen unless somebody invests a reasonable amount of resources into it.
But there is light at the end of the tunnel as well: as far as I know, Linaro has a big automated lab stacked with embedded hardware to run tests on, we are trying to build an automated server-grade hardware lab here at SUSE, and I'm pretty sure there is a lot more out there that is just not visible to the general public.<br> </div> Mon, 10 Dec 2018 09:49:40 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774401/ https://lwn.net/Articles/774401/ marcH <div class="FormattedComment"> Features or security? Sadly, the priority has to be the former to do business. Have fewer, more secure features and you lose in the marketplace almost every time.<br> <p> Thinking about it, computer security is a bit like... healthcare: extremely opaque and nearly impossible for customers to make educated choices about. From a legal perspective I suspect it's even worse: breach after breach and absolutely zero liability. To top it off, class actions are no more, killed by arbitration clauses in all Terms and Conditions.
Brands might be more useful in security though.<br> <p> <p> <a href="https://www.google.com/search?q=site%3Aschneier.com+liability">https://www.google.com/search?q=site%3Aschneier.com+liabi...</a><br> <p> <p> </div> Sun, 09 Dec 2018 17:28:55 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774393/ https://lwn.net/Articles/774393/ saffroy <div class="FormattedComment"> Fuzzing is extremely useful, but it still needs a *thinking* developer to help it generate interesting cases in reasonable time.<br> <p> Besides tests themselves, it helps a LOT to have some kind of test coverage report, just to remind you of which parts of the code are never touched by any of your current tests.<br> <p> Do people publish such coverage reports for the kernel?<br> <p> </div> Sun, 09 Dec 2018 14:20:29 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774392/ https://lwn.net/Articles/774392/ mupuf <div class="FormattedComment"> <font class="QuotedText">&gt; Validation and automation have a lesser reputation than development and tend to attract less talent. One possible and extremely simple way to address this is to treat the *development* of tests and automation to the same open-source and code review standards.</font><br> <p> This is what we do in the i915 community. No feature lands in DRM without a test in IGT, and CI developers are part of the same team.<br> <p> My view on this is that good quality comes from:<br> 1) Well written driver code, peer reviewed to catch architectural issues<br> 2) Good tests exercising the main use case, and corner cases. 
Tests are considered at the same level as driver code.<br> 3) Good understanding of the CI system that will execute these tests<br> 4) Good follow-up on the bugs filed when these tests fail<br> <p> Point 1) is pretty much well done in the Linux community.<br> <p> Point 2) is hard to justify when tests are not executed, but comes more naturally when we have a good CI system.<br> <p> Point 3) is probably the biggest issue for the Linux CI systems: the driver usually covers a wide variety of HW and configurations which cannot all be tested in CI at all times. This leads to complexity in the CI system that needs to be understood by developers in order to prevent regressions. This is why our CI is maintained and developed in the same team developing the driver.<br> <p> Point 4) comes pretty naturally when introducing a filtering system for CI failures. Some failures are known and pending fixing, and we do not want these to be considered as blocking for a patch series. We have been using bugs to create a forum for developers to discuss how to fix these issues. These bugs are associated with CI failures by a tool doing pattern matching (<a href="https://intel-gfx-ci.01.org/cibuglog/">https://intel-gfx-ci.01.org/cibuglog/</a>). The problem is that these bugs are now every developer's responsibility to fix, and that requires a change in the development culture to hold up some new features until some more important bugs are fixed.<br> <p> I guess we are getting quite good at CI, and I am really looking forward to us in the CI team having more time to share our knowledge and tools for others to replicate!
We have already started working on an open source toolbox for CI (<a href="https://gitlab.freedesktop.org/gfx-ci">https://gitlab.freedesktop.org/gfx-ci</a>), as discussed at XDC 2018 (<a href="https://xdc2018.x.org/slides/GFX_Testing_Workshop.pdf">https://xdc2018.x.org/slides/GFX_Testing_Workshop.pdf</a>).<br> </div> Sun, 09 Dec 2018 13:32:05 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774383/ https://lwn.net/Articles/774383/ iabervon <div class="FormattedComment"> I think, for this case, fuzzing is probably more useful than developer-written tests. If a developer misses the code for some checks necessary to maintain security constraints, what are the chances they'll write tests that verify that using the API in a way they didn't intend doesn't violate security constraints they didn't think about? I'd be more convinced if they taught a fuzzing framework how to call their API and set it loose on a filesystem with a lot of interesting cases. I care somewhat less that it does what it's supposed to do than that whatever it actually does is something the caller is allowed to do.<br> <p> </div> Sun, 09 Dec 2018 11:17:28 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774373/ https://lwn.net/Articles/774373/ luto <div class="FormattedComment"> This reminds me of a bug I found once. In a -rc4 kernel, I decided to play with the new O_TMPFILE feature.
My first basic experiment resulted in fs corruption:<br> <p> <a href="https://lwn.net/Articles/562296/">https://lwn.net/Articles/562296/</a><br> </div> Sun, 09 Dec 2018 01:46:34 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774357/ https://lwn.net/Articles/774357/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; And, tests that aren't run don't really exist, for all practical purposes.</font><br> <p> Agreed 200%, this is the core issue:<br> <p> <font class="QuotedText">&gt; &gt; We ended up here because we *trusted* that ...</font><br> <p> Either tests already exist and it's just a matter of going the extra mile to automate them and share their results.<br> <p> Or there's no decent, repeatable and re-usable test coverage, and new features should simply not be added until there is. "Thanks, your patches look great; now where are your test results, please?". Not exactly ground-breaking software engineering.<br> <p> Exceptions could be tolerated for hardware-specific or pre-silicon drivers which require very specific test environments and for which vendors can only hurt themselves anyway. That clearly doesn't seem to be the case for XFS or the VFS.<br> <p> Validation and automation have a lesser reputation than development and tend to attract less talent. One possible and extremely simple way to address this is to hold the *development* of tests and automation to the same open-source and code-review standards.<br> </div> Sat, 08 Dec 2018 16:45:31 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774305/ https://lwn.net/Articles/774305/ vomlehn <div class="FormattedComment"> One hopes that these tests will be added to a well maintained test project, though such projects are even less common than a well maintained development project. LTP comes to mind but there may be other possibilities.
And, tests that aren't run don't really exist, for all practical purposes.<br> </div> Sat, 08 Dec 2018 01:20:00 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774289/ https://lwn.net/Articles/774289/ mgross <div class="FormattedComment"> This is a hard problem. I'm glad to see it getting attention.<br> </div> Fri, 07 Dec 2018 20:38:57 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774288/ https://lwn.net/Articles/774288/ johannbg <div class="FormattedComment"> Ah, the inevitable cutting off of the (legacy) fat and calling of a flag day draws near, with everyone having to upgrade to that kernel release when that day comes.<br> <p> Dare I predict that 2020 will be the year that day comes, in the spirit of Jon's predictions, which will soon be upon us ;)<br> </div> Fri, 07 Dec 2018 20:27:21 +0000 Kernel quality control, or the lack thereof https://lwn.net/Articles/774285/ https://lwn.net/Articles/774285/ zblaxell <div class="FormattedComment"> <font class="QuotedText">&gt; users could overwrite a setuid file without resetting the setuid bits, time stamps would not be updated, </font><br> <p> I always have to read these patch sets to make sure that when the above are features, they aren't getting removed.<br> <p> There's an "if (!is_dedupe)" in the code, so it's all good.<br> </div> Fri, 07 Dec 2018 19:21:52 +0000