Looking forward to 2.7
[Posted October 15, 2003 by corbet]
Some attention has been given to the "2.7 thoughts" list which has been
circulating on linux-kernel. Looking
forward to what can be done in the next development series can be an
interesting exercise. In this case, though, the exercise has mostly been
carried out by people who will not actually be doing the work. As a result,
a few kernel hackers have dismissed the list; one called it
"crackpot wishlist gunk."
So what are the crackpots wishing for? Some of the items they want (marked
"mandatory features" on the list) are already in the works; these include
support for CPU hotplugging, full NTFS support and virtual machine
support. Others are somewhat vague, including "complete user quota
centralization" and "improve kobject model for security, quota rendering."
And some will never happen; there is just not a whole lot of call for
features like an in-kernel Gopher server or a /proc implementation
of the loadable module tools.
Kernel hackers have far more respect for code (and those who produce it)
than they do for list makers. The 2.7 thoughts list may yet inspire
somebody to do some hacking, but its influence on the development process
is likely to remain small.
A more interesting view into what could happen with 2.7 might be found in a
conversation between Linus and Joel Becker of Oracle. The discussion turned
to what information was needed from the kernel to perform direct I/O, which
led to this outburst from Linus:
Have you ever noticed that O_DIRECT is a piece of crap? The
interface is fundamentally flawed, it has nasty security issues, it
lacks any kind of sane synchronization, and it exposes stuff that
shouldn't be exposed to user space.
Linus went on to wish an early death upon disk-based databases; he seems to
think that all but the largest databases should just be done in-memory.
Direct I/O does bring its share of problems. It is hard to keep the kernel
page cache in a coherent condition when I/O operations are allowed to
circumvent it; page cache confusion can lead to corrupted data. Getting
good performance out of direct I/O is hard unless asynchronous I/O is used
as well. Direct I/O can also confuse the disk I/O scheduler by creating
request patterns (especially overlapping requests) which don't otherwise
happen. In other words, the direct I/O idea is hard to get right for both
kernel and user space.
But systems like Oracle do need some of the capabilities that direct I/O
provides. They need to be able to move large amounts of data without
polluting the page cache with stuff that will not be used. Databases which
use shared storage need to be able to force data to be reread from disk
when another system has changed it. Large applications also tend to have a
better idea of how their access patterns work than the kernel does; they
know when a particular block of data will not be used any more. The need
for the level of control and performance direct I/O can provide will
persist, whether it is a "piece of crap" or not.
Linus seems to understand this need; he would just like to push development
toward what he sees as a better interface. Such an interface would work
with the page cache, rather than trying to circumvent it. Some of his
thoughts, as expressed in this posting, include:
- A mechanism for moving pages between user space and the page cache.
An application wishing to do a direct write would then just transfer
ownership of the pages containing the data to the kernel, which would
put them into the page cache. A simple flush finishes the job.
- A way for an application to tell the kernel that certain pages in the
cache are stale and should not be used. This mechanism could also be
used to tell the kernel about pages which are no longer needed and can
be dropped from the cache. The fadvise() system call already
does part of this task.
- The ability to mark I/O on a particular file descriptor (or by a
particular process) as being a one-shot affair that should not be
cached. This idea was suggested in response to a description of performance
problems triggered by the PostgreSQL vacuum operation, which
touches much of the database exactly once.
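Of the three ideas above, only the cache-dropping hint has an interface that
applications can call today. As a rough sketch of what a "touch everything
exactly once" reader might do with it, the following Python uses
os.posix_fadvise(), the standard library's wrapper around the call the list
refers to as fadvise(). The function name read_once and the 1 MiB chunk size
are inventions for this illustration, and the hints are advisory only: the
kernel is free to ignore them, and this does not amount to the per-descriptor
"don't cache" flag being proposed.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def read_once(path, chunk=CHUNK):
    """Read a file front to back, hinting that each chunk will not be reused.

    Returns the number of bytes read. (Hypothetical helper, not an existing API.)
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that access will be sequential; a length of 0 means "to end of file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            # Here a real application would process `data`.
            # Tell the kernel the pages just read are not expected to be needed
            # again, so it may drop them from the page cache.
            os.posix_fadvise(fd, total, len(data), os.POSIX_FADV_DONTNEED)
            total += len(data)
    finally:
        os.close(fd)
    return total
```

The data still passes through the page cache on its way to the application;
the hint just asks the kernel not to keep it around afterward. Transferring
page ownership to the kernel, the first item on the list, has no counterpart
in this sketch.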
Much time and effort over the 2.5 development series went into making
direct I/O work well. This work helped to close a gap between Linux and
some proprietary Unix systems. It could well be that, in 2.7, that effort
goes into coming up with a better way of solving the problem altogether.