Not logged in
Log in now
Create an account
Subscribe to LWN
LWN.net Weekly Edition for May 16, 2013
A look at the PyPy 2.0 release
PostgreSQL 9.3 beta: Federated databases and more
LWN.net Weekly Edition for May 9, 2013
(Nearly) full tickless operation in 3.10
Especially if mkfs could somehow align the block clusters to the flash page size.
Improving ext4: bigalloc, inline data, and metadata checksums
Posted Nov 30, 2011 8:35 UTC (Wed) by ebirdie (subscriber, #512)
Posted Nov 30, 2011 21:05 UTC (Wed) by walex (subscriber, #69836)
Posted Dec 1, 2011 23:04 UTC (Thu) by job (guest, #670)
Posted Dec 3, 2011 0:34 UTC (Sat) by walex (subscriber, #69836)
I had to completely re-vase a set of virtual machines that were installed by some less thoughtful predecessor into growable VMware virtual disks. Several virtual disks had several hundred thousand extents (measured with filefrag) and a couple had over a million, all mixed up randomly on the real disk. Performance was horrifying (it did not help that there were another two absurd choices in the setup).
I ended up with just the mostly readonly filesystem in the VM disk, and all the writable subtrees mounted via NFS from the host machine, which was much faster. In particular during backups, because I could run the backup program (BackupPC based on RSYNC) on the real machine and remote backup is a very high IO load operation, and running it inside the virtual machines on a virtual disk was much much slower.
Posted Dec 6, 2011 21:26 UTC (Tue) by job (guest, #670)
Posted Nov 30, 2011 20:22 UTC (Wed) by walex (subscriber, #69836)
And this is relevant to ext4... exactly how?
Posted Nov 30, 2011 21:33 UTC (Wed) by khim (subscriber, #9252)
Posted Nov 30, 2011 23:16 UTC (Wed) by Lennie (subscriber, #49641)
Posted Dec 1, 2011 1:01 UTC (Thu) by SLi (subscriber, #53131)
Posted Dec 1, 2011 1:59 UTC (Thu) by dlang (✭ supporter ✭, #313)
Posted Dec 1, 2011 3:29 UTC (Thu) by tytso (subscriber, #9993)
But yes, a journal by itself has as its primary feature avoiding long fsck times. One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.
If you are replicating your data anyway because you're using a cluster file system such as Hadoopfs, and you're confident that your data center has appropriate contingencies that mitigate against a simultaneous data-center wide power loss event (i.e., you have bat, and diesel generators, etc., and you test all of this equipment regularly), then it may be that going without a journal makes sense. You really need to know what you are doing though, and it requires careful design both at the hardware level, the data center level, as well as the storage stack above the local disk file system.
Posted Dec 2, 2011 18:55 UTC (Fri) by walex (subscriber, #69836)
One nice thing with ext4 is that fsck times are reduced (typically) by a factor of 7-12 times. So a TB file system that previously took 20-25 minutes might now only take 2-3 minutes.
Posted Dec 2, 2011 19:10 UTC (Fri) by dlang (✭ supporter ✭, #313)
Posted Dec 3, 2011 0:40 UTC (Sat) by walex (subscriber, #69836)
An unclean shutdown is usually not that damaged, which can however happen with a particularly bad unclean shutdown (lots of stuff in flight, for example on a wide RAID) or RAM/disk errors. The report I saw was not for a "enterprise" system with battery, ECC and a redundant storage layer.
Posted Dec 2, 2011 21:41 UTC (Fri) by nix (subscriber, #2304)
Fill up the fs, even once, and this benefit goes away -- but a *lot* of filesystems sit for years mostly empty. fscking those filesystems is very, very fast these days (I've seen subsecond times for mostly-empty multi-Tb filesystems).
Posted Dec 2, 2011 22:45 UTC (Fri) by tytso (subscriber, #9993)
Not all of the improvements in fsck time come from being able to skip reading portions of the inode table. Extent tree blocks are also far more efficient than indirect blocks, and so that contributes to much of the speed improvements of fsck'ing an ext4 filesystem compared to an ext2 or ext3 file system.
Posted Dec 2, 2011 23:35 UTC (Fri) by nix (subscriber, #2304)
We could fix things so that as you delete files from a full file system, we reduce the high watermark field for each block group's inode table
Posted Dec 4, 2011 4:25 UTC (Sun) by alankila (subscriber, #47141)
Posted Dec 4, 2011 4:38 UTC (Sun) by dlang (✭ supporter ✭, #313)
journaling writes data twice with the idea being that the first one is to a sequential location that is going to be fast and then the following write will be to the random location
with no seek time, you should be able to write the data to it's final location directly and avoid the second write. All you need to do is to enforce the ordering of the writes and you should be just as safe as with a journal, without the extra overhead.
Posted Dec 4, 2011 4:49 UTC (Sun) by mjg59 (subscriber, #23239)
Posted Dec 4, 2011 5:05 UTC (Sun) by dlang (✭ supporter ✭, #313)
if what you are writing is metadata, it seems like it shouldn't be that hard, since there isn't that much metadata to be written.
Posted Dec 4, 2011 11:32 UTC (Sun) by tytso (subscriber, #9993)
Or when you allocate a disk block, you need to modify the block allocation bitmap (or whatever data structure you use to indicate that the block is in use) and then update the data structures which map a particular inode's logical to physical block map.
Without a journal, you can't do this atomically, which means the state of the file system is undefined after a unclean/unexpected shutdown of the OS.
Posted Dec 4, 2011 17:02 UTC (Sun) by kleptog (subscriber, #1183)
Posted Dec 6, 2011 0:40 UTC (Tue) by cmccabe (guest, #60281)
Soft updates would not work for databases, because database operations often need to be logged "logically" rather than "physically." For example, when you encounter an update statement that modifies every row of the table, you just want to add the update statement itself to the journal, not the contents of every row.
Posted Dec 6, 2011 1:24 UTC (Tue) by tytso (subscriber, #9993)
My favorite line from that article is "...and then I turn to page 8 and my head explodes."
The *BSD's didn't get advanced features such as Extended Attribute until some 2 or 3 years after Linux. My theory why is that it required someone as smart as Kirk McKusick to be able to modify UFS with Soft Updates to add support for Extended Attributes and ACL's.
Also, note that because of how Soft Update works, it requires forcing metadata blocks out to disk more frequently than without Soft Updates; it is not free. What's worse, it depends on the disk not reordering write requests, which modern disks do to avoid seeks (in some cases a write can not make it onto the platter in the absence of a Cache Flush request for 5-10 seconds or more). If you disable the HDD's write cacheing, your lose a lot of performance on HDD's; if you leave it enabled (which is the default) your data is not safe.
Posted Dec 11, 2011 10:18 UTC (Sun) by vsrinivas (subscriber, #56913)
Posted Dec 21, 2011 23:09 UTC (Wed) by GalacticDomin8r (guest, #81935)
Duh. Can you name a file system with integrity features that doesn't introduce a performance penalty? I thought not. The point is that the Soft Updates method is (far) less overhead than most.
> What's worse, it depends on the disk not reordering write requests
Bald faced lie. The only requirement of SU's is that writes reported as done by disk driver are indeed safely landed in the nonvolatile storage.
Posted Dec 22, 2011 11:32 UTC (Thu) by nix (subscriber, #2304)
Posted Dec 4, 2011 17:13 UTC (Sun) by mjg59 (subscriber, #23239)
Posted Dec 4, 2011 10:31 UTC (Sun) by alankila (subscriber, #47141)
Anyway, isn't btrfs going to give us journal-less but atomic filesystem modification behavior?
Posted Dec 4, 2011 11:43 UTC (Sun) by tytso (subscriber, #9993)
So if you modify a node at the bottom of the b-tree, you write a new copy of the leaf block, but then you need to write a copy of its parent node with a pointer to the new leaf block, and then you need to write a copy of its grandparent, with a pointer to the new parent node, all the way up to the root of the tree. This also implies that all of these nodes had better be in memory, or you will need to read them into memory before you can write them back out. Which is why CoW file systems tend to be very memory hungry; if you are under a lot of memory pressure because you're running a cloud server, and are trying to keep lots of VM's packed into a server (or are on an EC2 VM where extra memory costs $$$$), good luck to you.
At least in theory, CoW file systems will try to batch multiple file system operations into a single big transaction (just as ext3 will try to batch many file system operations into a single transaction, to try to minimize writes to the journal). But if you have a really fsync()-happy workload, there definitely could be situations where a CoW file system like btrfs or ZFS could end up needing to update more blocks on an SSD than a traditional update-in-place file system with journaling, such as ext3 or XFS.
Posted Dec 12, 2011 12:13 UTC (Mon) by jlokier (guest, #52227)
Posted Dec 24, 2011 20:56 UTC (Sat) by rich0 (guest, #55509)
I believe Btrfs actually uses a journal, and then updates the tree very 30 seconds. This is a compromise between pure journal-less COW behavior and the memory-hungry behavior described above. So, the tree itself is always in a clean state (if the change propagates to the root then it points to an up-to-date clean tree, and if it doesn't propagate to the root then it points to a stale clean tree), and then the journal can be replayed to catch the last 30 seconds worth of writes.
I believe that the Btfs journal does effectively protect both data and metadata (equivalent to data=ordered). Since data is not overwritten in place you end up with what appears to be atomic writes I think (within a single file only).
Posted Dec 24, 2011 22:17 UTC (Sat) by jlokier (guest, #52227)
You can't implement a COW tree without writing all the way up the tree. You write a new node to the tree, so you have to have the tree point to it. You either copy an existing parent node and fix it, or you overwrite it in place. If you do the latter, then you aren't doing COW. If you copy the parent node, then its parent is pointing to the wrong place, all the way up to the root.
In fact you can. The simplest illustration: for every tree node currently, allocate 2 on storage, and replace every pointer in a current interior node format with 2 pointers, pointing to the 2 allocated storage nodes. Those 2 storage nodes both contain a 2-bit version number. The one with larger version number (using wraparound comparison) is "current node", and the other is "potential node".
To update a tree node in COW fashion, without writing all the way up the tree on every update, simply locate the tree node's "potential node" partner, and overwrite that in place with a version 1 higher than the existing tree node. The tree is thus updated. It is made atomic using the same methods as needed for a robust journal: if it's a single sector and the medium writes those atomically, or by using a node checksum, or by writing version number at start and end if the medium is sure to write sequentially.
Note I didn't say it made reading any faster :-) (Though with non-seeking media, speed might not be a problem.)
That method is clearly space inefficient and reads slowly (unless you can cache a lot of the node selections). It can be made more efficient in a variety of ways, such as sharing "potential node" space among multiple potential nodes, or having a few pre-allocated pools of "potential node" space which migrate into the explicit tree with a delay - very much like multiple classical journals. One extreme of that strategy is a classical journal, which can be viewed as every tree node having an implicit reference to the same range of locations, any of which might be regarded as containing that node's latest version overriding the explicit tree structure.
You can imagine there a variety of structures with space and behaviour in between a single, flat journal and an explicitly replicated tree of micro-journalled nodes.
The "replay" employed by classical journals also has an analogue: preloading of node selections either on mount, or lazily as parts of the tree are first read in after mounting, potentially updating tree nodes at preload time to reduce the number of pointer traversals on future reads.
The modern trick of "mounted dirty" bits for large block ranges in some filesystems to reduce fsck time, also has a natural analogue: Dirty subtree bits, indicating whether the "potential" pointers (implicit or explicit) need to be followed or can be ignored. Those bits must be set with a barrier in advance of using the pointers, but they don't have to be set again for new updates after that, and can be cleaned in a variety of ways; one of which is the preload mentioned above.
Posted May 29, 2012 8:49 UTC (Tue) by marcH (subscriber, #57642)
"You can implement a COW tree without writing all the way up the tree if your tree implements versioning".
Posted Dec 4, 2011 13:11 UTC (Sun) by dlang (✭ supporter ✭, #313)
this is ok with large streaming writes, but horrible with many small writes to the same area of disk.
the journal is many small writes to the same area of disk, exactly the worst case for an SSD
also with rotational media, writing all the block in place requires many seeks before the data can be considered safe, and if you need to write the blocks in a particular order, you may end up seeking back and forth across the disk. with a SSD the order the blocks are written in doesn't affect how long it takes to write them.
by the way, i'm not the OP who said that all journaling filesystems are bad on SSDs, I'm just pointing out some reasons why this could be the case.
Posted Dec 4, 2011 17:39 UTC (Sun) by tytso (subscriber, #9993)
This might be the case for cheap MMC or SD cards that are designed for use in digital cameras, but an SSD which is meant for use in a computer will have a much more sophisticated FTL than that.
Posted Dec 4, 2011 19:38 UTC (Sun) by dlang (✭ supporter ✭, #313)
yes, in theory it could mark that 4k of data as being obsolete and only write new data to a new eraseblock, but that would lead to fragmentation where the disk could have 256 1M chunks, each with 4K of obsolete data in them, and to regain any space it would then need to re-write 255M of data.
given the performance impact of stalling for this long on a write (not the mention the problems you would run into if you didn't have that many blank eraseblocks available), I would assume that if you re-write a 4k chunk, when it writes that data it will re-write the rest of the eraseblock as well so that it can free up the old eraseblock
the flash translation layer lets it mix the logical blocks in the eraseblocks, and the drives probably do something in between the two extremes I listed above (so they probably track a few holes, but not too many)
Posted Dec 5, 2011 1:34 UTC (Mon) by cmccabe (guest, #60281)
> Just to get the argument out in the open, what is the basis
> for making this claim?
Well, SSDs have a limited number of write cycles. With metadata journaling, you're effectively writing all the metadata changes twice instead of once. That will wear out the flash faster. I think a filesystem based on soft updates might do well on SSDs.
Of course the optimal thing would be if the hardware would just expose an actual MTD interface and let us use NilFS or UBIFS. But so far, that shows no signs of happening. The main reason seems to be that Windows is not able to use raw MTD devices, and most SSDs are sold into the traditional Windows desktop market.
Valerie Aurora also wrote an excellent article about the similarities between SSD block remapping layers and log structured filesystems here: http://lwn.net/Articles/353411/
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds