Big changes to ext3 and journaling
[Posted June 18, 2003 by corbet]
The ext3 filesystem is, for many, the standard journaling filesystem for
the Linux kernel. So it has been somewhat embarrassing that ext3 still
uses a number of deprecated interfaces, including the big kernel lock and
sleep_on(). The big kernel lock (BKL) is a holdover from the
initial Linux symmetric multiprocessing implementation, when it was not
safe for more than one processor to run in the kernel at the same time.
Its presence in ext3 is not just considered archaic and inelegant; it is
also a serious performance constraint on larger SMP systems.
As of 2.5.73, the BKL has been abolished from ext3, thanks to a lengthy
series of patches by Andrew Morton and Alex Tomas. These patches never did
show up on linux-kernel, but they have been part of the -mm kernel tree for
some time. Says Andrew:
My gut feeling is that there should be one, maybe two bugs left in
it, but no problems have been discovered...
So, as with all development kernels, a bit of caution is called for.
Removing the BKL from ext3 was actually a simple thing to do. That
filesystem, itself, had no need for the BKL - it is the generic journaling
block device (JBD) layer that required that protection. So the first step
was to push the BKL down a layer, into the JBD code, and ext3 was BKL-free.
Of course, that didn't solve the real
problem, but it was a start. While ext3 was being worked on, a few other
patches went in:
- Concurrent block and inode allocation, much like ext2 has had for
some time. This patch puts a separate spinlock on each cylinder group
in a filesystem, allowing allocation to happen in multiple groups
simultaneously.
- "Fuzzy counters," which implement approximate counters for free
blocks and inodes using per-CPU variables.
- The ext3 "data=journal" mode has been fixed. This mode,
which journals all data written to the disk (rather than just the
metadata), has been broken for a long time.
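
For readers who want a feel for the first two items, here is a minimal
userspace sketch in Python; it is a model of the idea, not the kernel code.
The class names (FuzzyCounter, GroupAllocator), the batch size, and the use
of thread-local storage as a stand-in for per-CPU variables are all
assumptions made for illustration. Each group gets its own lock, so
allocations in different groups do not contend, and the free-block count is
kept approximately by batching updates locally.

```python
import threading


class FuzzyCounter:
    """Approximate counter: each thread batches its updates locally."""

    def __init__(self, initial=0, batch=8):
        self._global = initial            # shared value, written under _lock
        self._lock = threading.Lock()
        self._local = threading.local()   # stand-in for a per-CPU variable
        self._batch = batch               # hypothetical batching threshold

    def add(self, delta):
        pending = getattr(self._local, "pending", 0) + delta
        if abs(pending) >= self._batch:
            # Fold the local delta into the shared value only occasionally.
            with self._lock:
                self._global += pending
            pending = 0
        self._local.pending = pending

    def read_approx(self):
        # Cheap read; ignores deltas still pending in individual threads.
        return self._global

    def flush(self):
        # Push the calling thread's pending delta into the shared value.
        pending = getattr(self._local, "pending", 0)
        with self._lock:
            self._global += pending
        self._local.pending = 0


class GroupAllocator:
    """Block allocator with one lock per group instead of one global lock."""

    def __init__(self, groups, blocks_per_group):
        self._locks = [threading.Lock() for _ in range(groups)]
        self._free = [
            list(range(g * blocks_per_group, (g + 1) * blocks_per_group))
            for g in range(groups)
        ]
        self.free_blocks = FuzzyCounter(groups * blocks_per_group)

    def alloc(self, goal_group):
        """Allocate one block, starting the search at goal_group."""
        n = len(self._locks)
        for i in range(n):
            g = (goal_group + i) % n
            with self._locks[g]:          # only this group is locked
                if self._free[g]:
                    block = self._free[g].pop()
                    break
        else:
            return None                   # every group is full
        self.free_blocks.add(-1)
        return block
```

Threads allocating with different goal groups take different locks, so they
only meet at the counter, and even there only once per batch. The price is
that read_approx() can be off by up to one batch per thread until each
thread calls flush().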
With ext3 done, it was time to fix up the JBD layer. This job was not done
halfway - a lengthy series of patches adds several locks and a whole,
complicated, fine-grained locking scheme. Each transaction gets two separate locks
(t_handle_lock and t_jcb_lock) controlling access to
various data structures. There is another set for the journal:
j_state_lock for scalar state information, j_list_lock
for lists and buffers, and j_revoke_lock for the list of revoked
blocks. Two more locks protect aspects of the buffer head/journal
head combination. And, of course, there is a whole set of ordering rules
to control which locks must be taken before which others. Believe it or
not, there is even a certain amount of documentation in the code comments
describing which locks protect which data structures.
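
To see what a set of ordering rules buys, consider the following userspace
sketch in Python. The lock names are borrowed from the text above, but the
rank numbers are invented for illustration and should not be read as the
actual JBD ordering; the real rules are in the code comments. RankedLock is
a hypothetical helper that refuses to take a lock whose rank is not higher
than that of the lock most recently taken by the same thread.

```python
import threading

# Hypothetical ranks, NOT the real JBD rules: lower ranks are taken first.
LOCK_RANK = {
    "j_state_lock": 1,
    "t_handle_lock": 2,
    "j_list_lock": 3,
    "j_revoke_lock": 4,
}


class RankedLock:
    """A lock that checks a fixed acquisition order at run time."""

    _held = threading.local()   # per-thread stack of locks currently held

    def __init__(self, name):
        self.name = name
        self.rank = LOCK_RANK[name]
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(RankedLock._held, "stack", [])
        # Locks are taken in increasing rank, so the top of the stack
        # always holds the highest rank this thread currently owns.
        if stack and stack[-1].rank >= self.rank:
            raise RuntimeError(
                f"lock order violation: {self.name} taken "
                f"while holding {stack[-1].name}"
            )
        self._lock.acquire()
        stack.append(self)
        RankedLock._held.stack = stack
        return self

    def __exit__(self, *exc):
        RankedLock._held.stack.pop()
        self._lock.release()
        return False
```

If every code path takes its locks in rank order, no two threads can each
hold a lock the other is waiting for, which is the point of having ordering
rules in the first place. A checker like this one turns a potential
deadlock into an immediate, reproducible error.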
The whole body of work clearly needs wider testing (and benchmarking), so
it's probably a good time for it to go into the mainline kernel. Hopefully
there won't be too many surprises lurking for the unwary (or unbacked-up).
As this work stabilizes, however, another big item can be scratched off the
"must-fix" list.