Kernel quality control, or the lack thereof
While tracking down some problems with the XFS file cloning and deduplication features (the FICLONERANGE and FIDEDUPERANGE ioctl() calls in particular), the developers noticed that, in fact, many aspects of those interfaces did not work correctly. Resource limits were not respected, users could overwrite a setuid file without resetting the setuid bits, time stamps would not be updated, maximum file sizes would be ignored, and more. Many of these problems were fixed in XFS itself, but others affected all filesystems offering those features and needed to be fixed at the virtual filesystem (VFS) level. The result was a series of pull requests including this one for 4.19-rc7, this one for the 4.20 merge window, and this one for 4.20-rc4.
More recently, similar problems have been discovered with the copy_file_range() system call, resulting in this patch set full of fixes. Once again, issues include the ability to overwrite setuid files, overwrite swap files, change immutable files, and overshoot resource limits. Time stamps are not updated, overlapping copies are not caught, and behavior between filesystems is inconsistent. Chinner's patch set contains another set of changes, almost all at the VFS level, to straighten these issues out.
Mainline quality assurance
The discovery of these bugs has brought a fair amount of disappointment with it. It seems clear that these new features were not extensively tested before being added to the kernel; certainly no automated tests had been added to the xfstests suite to verify them. Dave Chinner put it this way:
In time, these bugs will be fixed and users of all filesystems should benefit. Tests are being added to help ensure that these features continue to work in the future. This is clearly a necessary effort; Chinner and Darrick Wong are performing a service for the kernel community as a whole by taking it on. This work does raise a couple of interesting issues, though.
The first of those is the prospect of regressions in programs that have
come to depend on the behavior of these system calls as it is supported in
current kernels on specific filesystems. Hyrum's law suggests that such users
are likely to exist. Or, as Chinner put it: "the
API implementation is so broken right now that fixing it is almost
guaranteed to break something somewhere
". That could lead to some
interesting discussions if users start to complain that kernel updates have
broken their programs.
Stable updates
The other question that arises concerns the backporting of all these fixes to the stable kernel updates; indeed, it was the selection of one of the VFS fixes for backporting that set off the current conversation. The XFS developers have long been hostile to the automatic inclusion of their patches in stable updates, feeling that the work to validate those patches in older kernels has not been done and that the risk of creating new regressions is too high. As a result, XFS patches are not normally considered eligible for backporting, but that exclusion does not extend to fixes at the VFS layer.
In this case, Chinner stated that the current set of fixes has been validated for the mainline with a testing regime that runs billions of operations over a period of days; anything less risks not exposing some of the harder-to-hit bugs. Backporting those fixes to a different kernel would require the same level of testing to create the needed confidence that they don't create new problems, he said, and the XFS developers are too busy still fixing bugs to do that testing now.
Chinner followed
up with a lengthy indictment of the kernel development process as a
whole, saying that it is focused on speed and patch quantity rather than
the quality of the final result. The stable kernel process, in particular, is
"optimised to shovel as much change as possible with /as little
effort as possible/ back into older code bases
". He pointed out
that changes often appear in stable releases before they show up in a real
mainline release (as opposed to an -rc release), which doesn't leave a
whole lot of time for real stabilization. It is not, he feels, a small
problem:
Sasha Levin responded that the current process is the best that we can do at the moment:
For the time being, the VFS and XFS patches will not be included in the stable kernel updates. Once the fixes are complete and the filesystem test suites have been filled out, Wong said, it should be possible to safely backport the whole set. At that point, this particular issue will be solved, but that is not likely to happen until after the 4.21/5.0 kernel release.
For the longer term, there is still the problem that, as Wong put it:
"New features show up in the vfs without a lot of design
documentation, incomplete userspace interface manuals, and not much beyond
trivial testing
". One might well argue that this problem extends
beyond VFS features. The kernel community has never had much of a process
around the addition of APIs visible to user space; there are no real
requirements to ensure adequate documentation, testing, or consistency
between interfaces. The results can be seen in our released kernels, and
in the API mistakes that just barely escape release because the right
developer happened to notice them in time.
Over the years, the kernel community has matured considerably in a number
of ways. One need only look back to the days when we had no source-code
management system, no rules on regressions, and no release-management
discipline to see how much things have improved. The last few years have
seen some big improvements
around automated testing in particular. For all of our problems, the
quality of our
releases is quite a bit higher than it once was, even if it is not what it
should be. Given time, it is
reasonable to expect that we can build on that base to further focus our
processes on the quality of the kernels we release, if that is something
that the community decides it wants to do.
Index entries for this article | |
---|---|
Kernel | Development model/Kernel quality |
Kernel | System calls/copy_file_range() |
Posted Dec 7, 2018 19:21 UTC (Fri)
by zblaxell (subscriber, #26385)
[Link]
I always have to read these patch sets to make sure that when the above are features, they aren't getting removed.
There's a "if (!is_dedupe)" in the code, so it's all good.
Posted Dec 7, 2018 20:27 UTC (Fri)
by johannbg (guest, #65743)
[Link]
Dare I predict 2020 will be the year when that day comes in the spirit of Jon predictions will soon be upon us ;)
Posted Dec 7, 2018 20:38 UTC (Fri)
by mgross (guest, #38112)
[Link]
Posted Dec 8, 2018 1:20 UTC (Sat)
by vomlehn (guest, #45588)
[Link] (11 responses)
Posted Dec 8, 2018 16:45 UTC (Sat)
by marcH (subscriber, #57642)
[Link] (9 responses)
Agreed 200%, this is the core issue:
> > We ended up here because we *trusted* that ...
Either tests already exist and it's just the matter of the extra mile to automate them and share their results.
Or there's no decent, repeatable and re-usable test coverage and new features should simply not be added until there is. "Thanks your patches looks great, now where are your tests results please?". Not exactly ground-breaking software engineering.
Exceptions could be tolerated for hardware-specific or pre-silicon drivers which require very specific test environments and for which vendors can only hurt themselves anyway. That clearly doesn't seem the case of XFS or the VFS.
Validation and automation have a lesser reputation than development and tend to attract less talent. One possible and extremely simple way to address this is to treat the *development* of tests and automation to the same open-source and code review standards.
Posted Dec 9, 2018 11:17 UTC (Sun)
by iabervon (subscriber, #722)
[Link] (7 responses)
Posted Dec 9, 2018 14:20 UTC (Sun)
by saffroy (guest, #43999)
[Link] (5 responses)
Besides tests themselves, it helps a LOT to have some kind of test coverage report, just to remind you of which parts of the code are never touched by any of your current tests.
Do people publish such coverage reports for the kernel?
Posted Dec 10, 2018 9:49 UTC (Mon)
by metan (subscriber, #74107)
[Link] (4 responses)
However I can pretty much say that the main problems I see are various corner cases that are rarely hit (i.e. mostly failures and error propagation) and drivers. My take on this is that there is no point in doing coverage analysis when the gaps we have are enormous and easy to spot. Just have a look at our backlog of missing coverage in LTP at the moment https://github.com/linux-test-project/ltp/labels/missing%..., and these are just scratching the surface with most obviously missing syscalls. We may try to proceed with the coverage analysis once we are out of work there, which will hopefully happen at some point.
The problems with corner cases can be likely caught by combination of unit testing and fuzzing. Drivers testing is more problematic though, there is only so much you can do with qemu and emulated hardware. Proper driver testing needs a reasonably sized lab stacked with hardware and it's much more problematic to set up and maintain which is not going to happen unless somebody invests reasonable amount of resources into it. But there is light at the end of the tunnel as well, as far as I know Linaro has a big automated lab stacked with embedded hardware to run tests on, we are trying to tackle automated server grade hardware lab here in SUSE, and I'm pretty sure there is a lot more outside there just not that visible to the general public.
Posted Dec 10, 2018 12:57 UTC (Mon)
by nix (subscriber, #2304)
[Link] (3 responses)
There is no alternative to thinking about these problems, I'm afraid. There is no magic automatable road to well-tested software of this complexity.
Posted Dec 10, 2018 13:14 UTC (Mon)
by metan (subscriber, #74107)
[Link] (2 responses)
Posted Dec 11, 2018 17:37 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Dec 11, 2018 20:59 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
Posted Dec 9, 2018 17:28 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Thinking of it computer security is a bit like... healthcare: extremely opaque and nearly impossible for customers to make educated choices about it. From a legal perspective I suspect it's even worse, breach after breach and absolutely zero liability. To top it up class actions are no longer, killed by arbitration clauses in all Terms and Conditions. Brands might be more useful in security though.
https://www.google.com/search?q=site%3Aschneier.com+liabi...
Posted Dec 9, 2018 13:32 UTC (Sun)
by mupuf (subscriber, #86890)
[Link]
This is what we do in the i915 community. No feature lands in DRM without a test in IGT, and CI developers are part of the same team.
My view on this is that good quality comes from:
Point 1) is pretty much well done in the Linux community.
Point 2) is hard to justify when tests are not executed, but comes more naturally when we have a good CI system
Point 3) is probably the biggest issue for the Linux CI systems: The driver usually covers a wide variety of HW and configuration which cannot all be tested in CI at all time. This leads to complexity in the CI system that needs to be understood by developers in order to prevent regressions. This is why our CI is maintained and developed in the same team developing the driver.
Point 4) is coming pretty naturally when introducing a filtering system for CI failures. Some failures are known and pending fixing, and we do not want these to be considered as blocking for a patch series. We have been using bugs to create a forum of discussion for developers to discuss how to fix these issues. These bugs are associated to CI failures by a tool doing pattern matching (https://intel-gfx-ci.01.org/cibuglog/). The problem is that these bugs are now every developer's responsibility to fix, and that requires a change in the development culture to hold up some new features until some more important bugs are fixed.
I guess we are getting quite good at CI, and I am really looking forward to us in the CI team to have more time to share our knowledge and tools for others to replicate! We have already started working on an open source toolbox for CI (https://gitlab.freedesktop.org/gfx-ci), as discussed at XDC 2018 (https://xdc2018.x.org/slides/GFX_Testing_Workshop.pdf).
Posted Dec 10, 2018 20:35 UTC (Mon)
by sandeen (guest, #42852)
[Link]
You may wish to subscribe to fstests@vger.kernel.org or peruse git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git if this sort of thing is of interest to you.
Posted Dec 9, 2018 1:46 UTC (Sun)
by luto (guest, #39314)
[Link] (1 responses)
Posted Dec 20, 2018 12:26 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
The guys at the company I was contracted to were trying to abstract out the OS-specific features, like in the OPEN statement. You could declare a file as temporary, which led to it disappearing when it got closed - quite a nice feature IF IMPLEMENTED CORRECTLY!
So, I opened a temporary file on a FUNIT, then later on re-used the same FUNIT for a permanent file. THE OS DIDN'T CLEAR THE TEMPORARY STATUS. So when I closed the permanent file, it disappeared ... :-)
Cheers,
Posted Dec 10, 2018 14:43 UTC (Mon)
by xorbe (guest, #3165)
[Link] (21 responses)
Posted Dec 10, 2018 18:42 UTC (Mon)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (20 responses)
Don't get me wrong, code coverage is fine as far as it goes, and it seems likely that the Linux kernel community would do well to do more of it, but it is not a panacea. In particular, beyond a certain point, it is probably not the best place to put your testing effort.
Posted Dec 10, 2018 19:59 UTC (Mon)
by shemminger (subscriber, #5739)
[Link] (13 responses)
Posted Dec 10, 2018 20:21 UTC (Mon)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Dec 11, 2018 5:51 UTC (Tue)
by JdGordy (subscriber, #70103)
[Link] (6 responses)
Posted Dec 18, 2018 11:46 UTC (Tue)
by error27 (subscriber, #8346)
[Link] (5 responses)
Posted Dec 18, 2018 14:05 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (4 responses)
Posted Dec 19, 2018 13:43 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Dec 19, 2018 16:04 UTC (Wed)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
Posted Dec 20, 2018 3:45 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Posted Dec 20, 2018 4:37 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Dec 11, 2018 7:37 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (4 responses)
Interesting, could you share a simplified example?
> My experience with code coverage metrics has been mostly negative.
While error-handling code, corner cases and... backup configurations are notoriously untested, I agree there are diminishing returns and better trade-offs past some point. Curious what is experts' guestimation of where that percentage typically is.
Posted Dec 11, 2018 14:11 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (3 responses)
If only (say) 30% of your code is tested, you very likely need to substantially increase your coverage. If (say) 90% of your code is tested, there is a good chance that there is some better use of your time than getting to 91%. But for any rule of thumb like these, there will be a great many exceptions, for example, the safety-critical code mentioned earlier.
Hey, you asked! :-)
Posted Dec 11, 2018 20:53 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
Sincere thanks!
Posted Jan 5, 2019 18:06 UTC (Sat)
by joseph.h.garvin (guest, #64486)
[Link] (1 responses)
Posted Jan 5, 2019 22:29 UTC (Sat)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Nevertheless, your last sentence is spot on. It is precisely because rcutorture forces rare code paths and rare race conditions to execute more frequently that the number of RCU bugs reaching customers is kept down to a dull roar.
Posted Dec 10, 2018 21:18 UTC (Mon)
by NAR (subscriber, #1313)
[Link] (5 responses)
Posted Dec 10, 2018 21:50 UTC (Mon)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (4 responses)
For most types of software, at some point it becomes more important to test more races, more configurations, more input sequences, and more hardware configurations than to provide epsilon increase in coverage by triggering that next assertion. After all, testing and coverage is about reducing risk given the time and resources at hand. Therefore, over-emphasizing one form of testing (such as coverage) will actually increase overall risk due to the consequent neglect of some other form of testing.
Of course, there are some types of software where 100% coverage is reasonable, for example, certain types of safety-critical software. But in this case, you will be living under extremely strict coding standards so as to (among a great many other things) make 100% coverage affordable.
Posted Dec 24, 2018 20:42 UTC (Mon)
by anton (subscriber, #25547)
[Link] (3 responses)
OTOH, how do you test your safety net? Remember that Ariane 5 was exploded by a safety net that was supposed (and proven) to never trigger.
Posted Dec 25, 2018 0:05 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
But yes, Murphy will always be with us. So even in safety critical code, at the end of the day it is about reducing risk rather than completely eliminating it.
And to your point about Ariane 5's failed proof of correctness... Same issue as the classic failed proof of correctness for the binary search algorithm! Sadly, a proof of correctness cannot prove the assumptions on which it is based. So Murphy will always find a way, but it is nevertheless our job to thwart him. :-)
Posted Dec 30, 2018 11:22 UTC (Sun)
by GoodMirek (guest, #101902)
[Link] (1 responses)
E.g.:
In theory, it should never assert. In reality, it is desirable to minimize a risk that 'explosiveness' variable is stored in a failed memory cell, prior that cell is used for indication of explosiveness of any kind.
Or this case:
It is very rare to trigger and almost impossible to test such assertions, but when I saw them triggered in reality, even once in a lifetime, I appreciated their merit.
Posted Dec 30, 2018 15:44 UTC (Sun)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
But if the point was in fact to warn about unreliable memory, mightn't this sort of fault injection nevertheless be quite useful?
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
1) Well written driver code, peer reviewed to catch architectural issues
2) Good tests exercising the main use case, and corner cases. Tests are considered at the same level as driver code.
3) Good understand of the CI system that will execute these tests
4) Good following of the bugs filed when these tests fail
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Wol
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Managers look at the % numbers and it causes developers to rearrange code to have a single return to maximize the numbers.
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
But anything less than 100% coverage guarantees that some part of the code is not tested...
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
I would expect that defensive coding practices that lead to unreachable code (and thus <100% coverage) are particularly widespread in safety-critical software. I.e., you cannot trigger this particular safety-net code, and you are pretty sure that it cannot be triggered, but not absolutely sure; or even if you are absolutely sure, you foresee that the safety net might become triggerable after maintenance. Will you remove the safety net to increase your coverage metric?
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
Kernel quality control, or the lack thereof
I saw that multiple times while working on embedded systems.
explosiveness=255;
if explosiveness !=255 then assert;
if <green> then
explosiveness=0;
else
explosiveness=255;
if (explosiveness!=0 and explosiveness!=255) then assert;
Kernel quality control, or the lack thereof