
Improving ext4: bigalloc, inline data, and metadata checksums


Posted Dec 1, 2011 3:22 UTC (Thu) by tytso (subscriber, #9993)
In reply to: Improving ext4: bigalloc, inline data, and metadata checksums by walex
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

One other thought. At least at the beginning, ext4's raison d'être (its reason for being) was as a stopgap file system until btrfs could be ready. We started with ext3 code, which was proven, solid code, and support for delayed allocation, multiblock allocation, and extents had also been in use for quite a while in Clusterfs's Lustre product. So that code wasn't exactly new, either. What I did was integrate Clusterfs's contributions, and then worked on stabilizing them so that we would have something better than ext3 ready in the short term.

At the time when I started working on ext4, XFS developers were mostly still working for SGI, so there was a similar problem with the distributions not having anyone who could support or debug XFS problems. This has changed more recently, as more and more XFS developers have left SGI (voluntarily or involuntarily) and joined companies such as Red Hat. XFS has also improved its small-file performance, which was something it didn't do particularly well simply because SGI didn't optimize for that; its sweet spot was, and still is, really large files on huge RAID arrays.

One of the reasons why I felt it was necessary to work on ext4 was that everyone I talked to who had created a file system in the industry before, whether it was GPFS (IBM's cluster file system), or Digital Unix's advfs, or Sun's ZFS, gave estimates of somewhere between 50 and 200 person-years of effort before the file system was "ready". Even if we assume that open source development practices would make development go twice as fast, and if we ignore the high end of the range because cluster file systems are hard, I was skeptical it would get done in two years (which was the original estimate) given the number of developers it was likely to attract. Given that btrfs started at the beginning of 2007, and here we are almost at 2012, I'd say my fears were justified.

At this point, I'm actually finding that ext4 has found a second life as a server file system in large cloud data centers. It turns out that if you don't need the fancy-schmancy features that copy-on-write file systems give you, those features aren't free. In particular, ZFS has a truly prodigious appetite for memory, and one of the things about cloud servers is that, in order for them to make economic sense, you try to pack as many jobs or VMs onto them as possible, so they are constantly under memory pressure. We've done some further optimizations so that ext4 performs much better when under memory pressure, and I suspect at this point that in a cloud setting, using a CoW file system may simply not make sense.

Once btrfs is ready for some serious benchmarking, it would be interesting to benchmark it under serious memory pressure and see how well it performs. Previous CoW file systems, such as BSD's LFS two decades ago and ZFS more recently, have needed a lot of memory to cache metadata blocks, and it will be interesting to see whether btrfs has similar issues.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds