Speeding up ext2

The 2.5 kernel development process has put a strong emphasis on scalability and performance issues. So it is somewhat interesting that the core Linux filesystems - ext2 and ext3 - have seen relatively little scalability work in 2.5. That is beginning to change, at least for ext2, but this work is raising some interesting questions about what the role of these two filesystems really is.

Alex Tomas has recently been working on performance bottlenecks in ext2. His first concurrent block allocation patch attacks the problem of allocating blocks within a filesystem. The current ext2 code takes out the superblock lock before performing block allocation; this means that only one thread can be trying to allocate space in a given filesystem at a time. The first patch created a separate "allocation lock" which protects the small piece of code which actually makes allocation decisions; a later revision creates a separate lock for each block group within the filesystem, thus reducing lock contention further.
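
As a rough illustration of the per-block-group idea, here is a minimal userspace sketch in Python. It is not the kernel patch: the names (BlockGroup, ToyFilesystem, alloc_block, run_demo) are invented for this example, and the "bitmap" is a plain list. The point is only that each group carries its own lock, so threads allocating in different groups do not contend with one another.

```python
import threading


class BlockGroup:
    """One block group: a free-block map guarded by its own lock (hypothetical)."""

    def __init__(self, nblocks):
        self.lock = threading.Lock()
        self.free = [True] * nblocks


class ToyFilesystem:
    """Toy model of a filesystem split into equally sized block groups."""

    def __init__(self, ngroups, blocks_per_group):
        self.blocks_per_group = blocks_per_group
        self.groups = [BlockGroup(blocks_per_group) for _ in range(ngroups)]

    def alloc_block(self, goal_group):
        """Return a filesystem-wide block number, or None if no block is free.

        Only the lock of the group being searched is held, never a
        filesystem-wide lock.
        """
        ngroups = len(self.groups)
        for i in range(ngroups):
            gno = (goal_group + i) % ngroups
            group = self.groups[gno]
            with group.lock:
                for bit, is_free in enumerate(group.free):
                    if is_free:
                        group.free[bit] = False
                        return gno * self.blocks_per_group + bit
        return None


def run_demo(nthreads=4, ngroups=4, blocks_per_group=64):
    """Allocate every block from several threads; return all block numbers."""
    fs = ToyFilesystem(ngroups, blocks_per_group)
    results = [[] for _ in range(nthreads)]

    def worker(tid):
        # Each thread starts in "its own" group, then spills into the others.
        while True:
            blk = fs.alloc_block(goal_group=tid % ngroups)
            if blk is None:
                return
            results[tid].append(blk)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [blk for per_thread in results for blk in per_thread]
```

Replacing the per-group locks with a single lock shared by every group would model the original superblock-lock behavior: still correct, but with every allocating thread serialized behind the others.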

The patch was greeted with positive reviews. William Lee Irwin reported a throughput increase from 62 MB/s to 104 MB/s on a benchmark he ran, and exclaimed "This patch is a godsend. Whoever's listening, please apply!" Martin Bligh, for his part, said "SDET on my machine (16x NUMA-Q) has fallen in love with your patch, and has decided to elope with it to a small desert island." Not bad for a patch which is really a pretty straightforward exercise in finer-grained locking.

The block allocation patch was quickly joined by a concurrent inode allocation patch and a distributed counters patch. None of these have found their way into the mainline kernel yet, but they offer enough performance benefits that they will likely get there eventually. Assuming the block allocation patch can be coaxed back from its desert island experience, that is.

A question was raised, however: is ext2 the right place for this sort of work? ext2 is generally thought of as the relatively simple Linux filesystem; ext3 is the place for fancy new stuff. There are a couple of reasons why this sort of work tends to find its way into ext2 first, though.

One of those reasons is the simple fact that ext3 still has bigger scaling problems. The ext3 filesystem is one of the few places in the Linux kernel that still makes heavy use of the big kernel lock (BKL). As a result, ext3 does not scale well to large systems, and tweaking things like block allocation will not help the real problem. Until the BKL dependency is removed from ext3, most other performance work will not make much sense. Removing the BKL is apparently a somewhat tricky job; at this point, it may well not happen before 2.6 is released.

The other reason is that, large-systems scaling issues notwithstanding, ext3 is developing into the default Linux filesystem. For most users, there is little or no incentive to prefer ext2 over ext3; all it takes is one power failure to make the advantages of a journaling filesystem clear. So, as Daniel Phillips put it:

Ext2 is growing into the role of experimental filesystem; Ext3 is now the stable filesystem. Hopefully, the experiments will make Ext2 smaller, cleaner and at the same time, more powerful, over time. Sort of like the role that RAMFS plays: besides being useful, Ext2 should be thought of as a showcase for best filesystem coding practices.

The role reversal, it seems, is nearly complete. Soon, it will be the ext2 users who are living on the bleeding edge.


ext2 for "in flight" data and ext3 for the "stable stuff"?

Posted Mar 20, 2003 23:42 UTC (Thu) by im14u2c (subscriber, #5246) [Link]

Sounds like these changes would tune ext2 more for transient data (stuff like /tmp, and scratch areas used by apps for works-in-progress), and leave ext3 more appropriate for "persistent" data.

Basically, anything that you could easily regenerate or mke2fs away across a reboot would live on whizzy-fast ext2 partitions.

I'm not saying that ext2 is unreliable. Rather, I'm saying you'd use ext3 wherever you want short fsck times and stronger data protection guarantees, and ext2 on filesystems that hold high-bandwidth data you don't mind losing if your machine crashes. You avoid the long fsck times for the ext2 filesystems after a crash by simply re-making them.

Squid caches, /tmp, web browser caches, GIMP caches, partially computed data sets, CDDA images, etc. all go in those 'scratch' areas on ext2 filesystems. In contrast, /home, /usr, /var, etc... all go in ext3.

Does that sound sensible?

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds