Big changes to ext3 and journaling

The ext3 filesystem is, for many, the standard journaling filesystem for the Linux kernel. So it has been somewhat embarrassing that ext3 still uses a number of deprecated interfaces, including the big kernel lock and sleep_on(). The big kernel lock (BKL) is a holdover from the initial Linux symmetric multiprocessing implementation, when it was not safe for more than one processor to run in the kernel at the same time. Its presence in ext3 is not just considered archaic and inelegant; it is also a serious performance constraint on larger SMP systems.

As of 2.5.73, the BKL has been abolished from ext3, thanks to a lengthy series of patches by Andrew Morton and Alex Tomas. These patches never did show up on linux-kernel, but they have been part of the -mm kernel tree for some time. Says Andrew:

My gut feeling is that there should be one, maybe two bugs left in it, but no problems have been discovered...

So, as with all development kernels, a bit of caution is called for.

Removing the BKL from ext3 was actually a simple thing to do. The filesystem itself had no need for the BKL; it is the generic journaling block device (JBD) layer that required that protection. So the first step was to push the BKL down a layer into the JBD code, which left ext3 itself BKL-free. Of course, that didn't solve the real problem, but it was a start. While ext3 was being worked on, a few other patches went in:

  • Concurrent block and inode allocation, much like ext2 has had for some time. This patch puts a separate spinlock on each cylinder group in a filesystem, allowing allocation to happen in multiple groups simultaneously.

  • "Fuzzy counters," which implement approximate counters for free blocks and inodes using per-CPU variables.

  • The ext3 "data=journal" mode has been fixed. This mode, which journals all data written to the disk (rather than just the metadata), had been broken for a long time.
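
The first two items in the list above can be pictured with a small userspace analogy. This is only a sketch in Python, not kernel code: threads stand in for CPUs, `threading.Lock` stands in for a spinlock, and every name here (`ToyFilesystem`, `BlockGroup`, `FuzzyCounter`, the `batch` threshold) is invented for illustration. It assumes the approximate counter works by letting each CPU accumulate a private delta and fold it into the shared total only once the delta grows large enough, which is one common way to build such a counter.

```python
import threading


class FuzzyCounter:
    """Approximate counter: each thread (a stand-in for a CPU) keeps a
    private pending delta and touches the shared total only when that
    delta reaches the batch threshold."""

    def __init__(self, initial, batch=8):
        self.batch = batch
        self._total = initial
        self._lock = threading.Lock()
        self._local = threading.local()

    def add(self, delta):
        pending = getattr(self._local, "pending", 0) + delta
        if abs(pending) >= self.batch:
            # Fold the accumulated delta into the shared total.
            with self._lock:
                self._total += pending
            pending = 0
        self._local.pending = pending

    def read_approx(self):
        # Cheap read; may be off by up to (batch - 1) per thread.
        return self._total


class BlockGroup:
    """One allocation group with its own lock and free-block map."""

    def __init__(self, nblocks):
        self.lock = threading.Lock()
        self.free = [True] * nblocks

    def allocate(self):
        """Return the index of a newly allocated block, or None if full."""
        with self.lock:
            for i, is_free in enumerate(self.free):
                if is_free:
                    self.free[i] = False
                    return i
        return None


class ToyFilesystem:
    def __init__(self, ngroups, blocks_per_group):
        self.groups = [BlockGroup(blocks_per_group) for _ in range(ngroups)]
        self.free_blocks = FuzzyCounter(ngroups * blocks_per_group)

    def allocate(self, goal_group):
        """Try the goal group first, then the others in order.
        Only the group being searched is locked, so allocations in
        different groups do not contend with each other."""
        n = len(self.groups)
        for step in range(n):
            g = (goal_group + step) % n
            block = self.groups[g].allocate()
            if block is not None:
                self.free_blocks.add(-1)
                return (g, block)
        return None

    def exact_free(self):
        """Exact count; walks every group, so it is the slow path."""
        return sum(sum(group.free) for group in self.groups)
```

The trade-off is visible in `read_approx()`: it never takes a lock on behalf of the caller and never walks the groups, but its answer can lag the true value by a bounded amount. That is acceptable for something like a free-space estimate, and not for anything that must be exact.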

With ext3 done, it was time to fix up the JBD layer. This job was not done halfway - a lengthy series of patches adds several locks in a complicated, fine-grained locking scheme. Each transaction gets two separate locks (t_handle_lock and t_jcb_lock) controlling access to various data structures. There is another set for the journal: j_state_lock for scalar state information, j_list_lock for lists and buffers, and j_revoke_lock for the list of revoked blocks. Two more locks protect aspects of the buffer head/journal head combination. And, of course, there is a whole set of ordering rules to control which locks must be taken before which others. Believe it or not, there is even a certain amount of documentation in the code comments describing which locks protect which data structures.

The whole body of work clearly needs wider testing (and benchmarking), so it's probably a good time for it to go into the mainline kernel. Hopefully there won't be too many surprises lurking for the unwary (or unbacked-up). As this work stabilizes, however, another big item can be scratched off the "must-fix" list.


Larry M warned about this

Posted Jun 19, 2003 16:16 UTC (Thu) by pflugstad (subscriber, #224)

Sounds like this part of the code is falling into the ever increasing fine-grained locking that Larry M warned about...

A separate question: does this matter on non-SMP, non-preemptable kernels,
or do all these locks go away (become no-ops) in that case?

Larry M warned about this

Posted Jun 19, 2003 16:19 UTC (Thu) by corbet (editor, #1)

Spinlocks vanish entirely on uniprocessor kernels, so there will be no performance hit there.

typo + noise-reduction suggestion

Posted Jun 20, 2003 2:58 UTC (Fri) by roelofs (subscriber, #2599)

It's presence in ext3 ...

"It's" == "it is" or "it has," nothing else. This is possessive.

Since there are often little typos like this (including the ones in the SCO quarterly report, which made the original text somewhat incomprehensible), and fixit comments contribute absolutely nothing to the discussion beyond (hopefully) getting the typos fixed, would you consider adding a "typos" contact method of some sort? Either an e-mail address or just a new modifier to the regular comment page (e.g., Comment type: [X] Article commentary (public) [ ] Editorial correction (private)) would suffice. The latter could easily be filtered into a special e-mail bin or whatever, which would allow you to set your priority appropriately for reading them. (And there would never be any need to reply to such things. ;-) )

Just a thought.

Greg

typo + noise-reduction suggestion

Posted Jun 20, 2003 15:35 UTC (Fri) by ris (editor, #5)

Send typos to lwn@lwn.net and they will get seen and fixed.

typo + noise-reduction suggestion

Posted Jun 21, 2003 0:39 UTC (Sat) by giraffedata (subscriber, #1954)

That's good information, but it doesn't respond to the suggestion.

Most readers who want to report typos will continue to use the comment facility because they will not have seen this comment.

Maybe a line could be added to the comment instructions: "If you're just reporting a typo, please email instead of commenting."

I'd like to add that sometimes I find something in an article incomprehensible for reasons other than a typo. Missing background information, undefined terms, etc. If LWN editors are willing to fix things like that after publication, email for that should also be encouraged. And it would be nice if the emailer got a personal reply describing the fix, since he's not going to read the article again.

typo + noise-reduction suggestion

Posted Jun 21, 2003 0:44 UTC (Sat) by corbet (editor, #1)

"That's good information, but it doesn't respond to the suggestion."

I've been watching the conversation. Some sort of "feedback" button makes a great deal of sense, and I'll implement it. At least, I've just put it on the list; it may be a bit before I can hack it up. Also on the list for a while now is making it easier to find our contact info in general.

jon

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds