Looking forward to 2.7

Some attention has been given to the "2.7 thoughts" list which has been circulating on linux-kernel. Looking forward to what can be done in the next development series can be an interesting exercise. In this case, though, the exercise has mostly been carried out by people who will not actually be doing the work. As a result, a few kernel hackers have dismissed the list; one called it "crackpot wishlist gunk."

So what are the crackpots wishing for? Some of the items they want (marked "mandatory features" on the list) are already in the works; these include support for CPU hotplugging, full NTFS support and virtual machine support. Others are somewhat vague, including "complete user quota centralization" and "improve kobject model for security, quota rendering." And some will never happen; there is just not a whole lot of call for features like an in-kernel Gopher server or a /proc implementation of the loadable module tools.

Kernel hackers have far more respect for code (and those who produce it) than they do for list makers. The 2.7 thoughts list may yet inspire somebody to do some hacking, but its influence on the development process is likely to remain small.

A more interesting view into what could happen with 2.7 might be found in a conversation between Linus and Joel Becker of Oracle. The discussion turned to what information was needed from the kernel to perform direct I/O, which led to this outburst from Linus:

Have you ever noticed that O_DIRECT is a piece of crap? The interface is fundamentally flawed, it has nasty security issues, it lacks any kind of sane synchronization, and it exposes stuff that shouldn't be exposed to user space.

Linus went on to wish an early death upon disk-based databases; he seems to think that all but the largest databases should just be done in-memory.

Direct I/O does bring its share of problems. It is hard to keep the kernel page cache in a coherent condition when I/O operations are allowed to circumvent it; page cache confusion can lead to corrupted data. Getting good performance out of direct I/O is hard unless asynchronous I/O is used as well. Direct I/O can also confuse the disk I/O scheduler by creating request patterns (especially overlapping requests) which don't otherwise happen. In other words, the direct I/O idea is hard to get right for both kernel and user space.
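To give a feel for what the user-space side involves, here is a minimal sketch of a direct read, assuming a Linux system. O_DIRECT generally requires the buffer, file offset, and transfer length to be aligned to the device's block size; that requirement is not discussed above, and the 4096-byte figure, the function names, and the fallback to buffered I/O on filesystems that reject the flag are all illustrative assumptions rather than anything taken from the linux-kernel discussion.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on device and filesystem


def _read_all(path, flags):
    """Read a whole file through one page-aligned buffer."""
    fd = os.open(path, flags)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings start on a page boundary
    try:
        remaining = os.fstat(fd).st_size
        chunks = []
        # Stop at the known file size so that no read is ever issued
        # from an unaligned end-of-file offset.
        while remaining > 0:
            n = os.readv(fd, [buf])  # fills the aligned buffer in place
            if n == 0:
                break
            chunks.append(buf[:n])
            remaining -= n
        return b"".join(chunks)
    finally:
        buf.close()
        os.close(fd)


def read_bypassing_cache(path):
    """Try a direct read; fall back to buffered I/O if O_DIRECT is refused."""
    direct = getattr(os, "O_DIRECT", 0)  # the flag is not defined everywhere
    try:
        return _read_all(path, os.O_RDONLY | direct)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        # Some filesystems reject O_DIRECT with EINVAL.
        return _read_all(path, os.O_RDONLY)
```

Even this toy version has to manage alignment and cope with filesystems that refuse the flag, and it does nothing about asynchronous I/O or coherence with cached copies of the same data.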

But systems like Oracle do need some of the capabilities that direct I/O provides. They need to be able to move large amounts of data without polluting the page cache with stuff that will not be used. Databases which use shared storage need to be able to force data to be reread from disk when another system has changed it. Large applications also tend to have a better idea of how their access patterns work than the kernel does; they know when a particular block of data will not be used any more. The need for the level of control and performance direct I/O can provide will persist, whether it is a "piece of crap" or not.

Linus seems to understand this need; he would just like to push development toward what he sees as a better interface. Such an interface would work with the page cache, rather than trying to circumvent it. Some of his thoughts, as expressed in this posting, include:

  • A mechanism for moving pages between user space and the page cache. An application wishing to do a direct write would then just transfer ownership of the pages containing the data to the kernel, which would put them into the page cache. A simple flush finishes the job.

  • A way for an application to tell the kernel that certain pages in the cache are stale and should not be used. This mechanism could also be used to tell the kernel about pages which are no longer needed and can be dropped from the cache. The fadvise() system call already does part of this task.

  • The ability to mark I/O on a particular file descriptor (or by a particular process) as being a one-shot affair that should not be cached. This idea was suggested in response to a description of performance problems triggered by the PostgreSQL vacuum operation, which touches much of the database exactly once.
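Part of the second and third items can be approximated with the fadvise() call mentioned in the list. What follows is a minimal sketch, assuming a Linux system and Python's os.posix_fadvise wrapper around that interface; the names scan_once and consume and the 1MB chunk size are hypothetical choices made for illustration. The hints are purely advisory, so the kernel remains free to ignore them.

```python
import os


def _hint(fd, offset, length, advice):
    """Pass an advisory hint to the kernel where the platform supports it."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, length, getattr(os, advice))


def scan_once(path, consume, chunk_size=1 << 20):
    """Read a file front to back exactly once, handing each chunk to
    consume() and then telling the kernel the chunk is no longer needed."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # A length of 0 means "from offset to the end of the file".
        _hint(fd, 0, 0, "POSIX_FADV_SEQUENTIAL")
        offset = 0
        while True:
            data = os.pread(fd, chunk_size, offset)
            if not data:
                break
            consume(data)
            # Ask for these pages to be dropped from the page cache.
            # Only clean pages can be dropped this way.
            _hint(fd, offset, len(data), "POSIX_FADV_DONTNEED")
            offset += len(data)
            total += len(data)
    finally:
        os.close(fd)
    return total
```

This still moves every byte through the page cache before discarding it, which is why a true one-shot flag on the file descriptor would be a different (and cheaper) mechanism.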

Much time and effort over the 2.5 development series went into making direct I/O work well. This work helped to close a gap between Linux and some proprietary Unix systems. It could well be that, in 2.7, that effort goes into coming up with a better way of solving the problem altogether.


If they know it better, why don't they do it?

Posted Oct 16, 2003 14:29 UTC (Thu) by NAR (subscriber, #1313) [Link]

They need to be able to move large amounts of data without polluting the page cache with stuff that will not be used. Databases which use shared storage need to be able to force data to be reread from disk when another system has changed it. Large applications also tend to have a better idea of how their access patterns work than the kernel does; they know when a particular block of data will not be used any more.

Why don't they create their own operating system? Of course, this operating system would be based on an existing OS like *BSD or Linux; they could remove the unused features (e.g. file systems, if they only use direct I/O) and add their very specific features. I'm not sure it's a good idea to try to run exactly the same OS for such different tasks as huge database servers and ordinary laptops.

Bye,NAR

If they know it better, why don't they do it?

Posted Oct 16, 2003 20:15 UTC (Thu) by iabervon (subscriber, #722) [Link]

Database machines normally have all of the software on a normal filesystem, and only have the database tables on a special partition; furthermore, they generally also run primarily normal software, with only the database as an important special program.

For that matter, the features that big databases want overlap significantly with the features that people who watch movies, listen to music, or burn CDs on their laptops want. In particular, people want to listen to music from files on their laptops without evicting all of their files from cache.

direct I/O vs cache replacement policy

Posted Oct 18, 2003 20:07 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

People frequently view direct I/O as a cache replacement policy -- something to make file I/O use the cache more efficiently (by not using it at all). That is an entirely inappropriate use of direct I/O.

If you just don't want to pollute the cache, all you need is an adjustment to the way the system decides when to throw a page out of the cache. I.e. a smarter cache replacement policy. Even a policy that says "for this open, never keep anything in cache" is better than O_DIRECT. There is a subtle difference.

What direct I/O is good for is 1) moving traditional OS function into user space for engineering reasons -- maybe in some instance it's easier to build advanced cache management function into a database program than to add it to Linux -- and 2) shared filesystems.

It's OK to use O_DIRECT to make a program go faster because it happens to work. Just remember that's not what O_DIRECT is for.

Looking forward to 2.7

Posted Oct 19, 2003 17:40 UTC (Sun) by komarek (guest, #7295) [Link]

Given the price of storage right now, would it be realistic to require that O_DIRECT only be used on a dedicated device? If so, wouldn't this remove many of the truly hard parts of O_DIRECT?

-Paul Komarek

"CPU hotplugging"?

Posted Oct 20, 2003 18:10 UTC (Mon) by bjn (guest, #2179) [Link]

I saw that one and said, "<Feiss>Ehh?</Feiss>" :-)

Are there actually motherboards that allow a CPU to be removed and replaced... or does that feature mean something else?

"CPU hotplugging"?

Posted Oct 23, 2003 17:03 UTC (Thu) by jwb (guest, #15467) [Link]

Many machines are able to have their CPUs replaced while the system is running. Of course, the system must be designed for this. The up-market offerings from Sun, IBM, and HP tend to have this feature. CPUs can and do fail, and replacing them without disrupting service is a valuable feature.

The only trick is the software support. The kernel must be able to understand that the CPU is going away.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds