Improving ext4: bigalloc, inline data, and metadata checksums
Posted Nov 29, 2011 23:44 UTC (Tue) by pr1268 (guest, #24648)
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums
> It is solid and reliable
I'm not so sure about that; I've suffered data corruption in a stand-alone ext4 filesystem with a bunch of OGG Vorbis files—occasionally ogginfo(1) reports corrupt OGG files. Fortunately I have backups.
I'm going back to ext3 at the soonest opportunity. FWIW I'm using a multi-disk LVM setup—I wonder if that's the culprit?
Posted Nov 29, 2011 23:49 UTC (Tue)
by yoe (guest, #25743)
[Link] (10 responses)
Try to nail down whether your problem is LVM, one of your disks dying, or ext4, before changing things like that. Otherwise you'll be debugging for a long time to come...
Posted Nov 30, 2011 4:49 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (8 responses)
I recently had a batch of disks in a backup server start eating data because of a HDD firmware bug. It does happen.
Posted Nov 30, 2011 8:29 UTC (Wed)
by hmh (subscriber, #3838)
[Link]
Posted Nov 30, 2011 12:02 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (6 responses)
Screams RAM or cache fault to me. It's that word "occasionally" which does it. Bugs tend to be systematic. Their symptoms may be bizarre, but there's usually something consistent about them, because after all someone has specifically (albeit accidentally) programmed the computer to do exactly whatever it was that happened. Even the most subtle Heisenbug will have some sort of pattern to it.
You should be especially suspicious of the "blame ext4" idea if this "corruption" is one or two corrupted bits rather than big holes in the file. Disks don't tend to lose individual bits. Disk controllers don't tend to lose individual bits. Filesystems don't tend to lose individual bits. These things all deal in blocks; when they lose something, they tend to lose really big pieces.
But dying RAM, heat-damaged CPU cache, or a serial link with too little margin of error, those lose bits. Those are the places to look when something mysteriously becomes slightly corrupted.
Low-level network protocols often lose bits. But because there are checksums in so many layers you won't usually see this in a production system even when someone has goofed (e.g. not implemented Ethernet checksums at all) because the other layers act as a safety net.
Posted Nov 30, 2011 12:44 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
Posted Nov 30, 2011 15:42 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (4 responses)
The corruption I was getting was not merely "one or two bits" but rather a hole in the OGG file big enough to cause an audible "skip" in the playback—large enough that I believe a whole block disappeared from the filesystem. Also, the discussion of write barriers came up; I have noatime,data=ordered,barrier=1 as mount options for this filesystem in my /etc/fstab file—I'm pretty sure those are the "safe" defaults (but I could be wrong).
Posted Nov 30, 2011 17:31 UTC (Wed)
by rillian (subscriber, #11344)
[Link] (3 responses)
Ogg streams carry a CRC on each page, and decoders discard a whole page when its checksum fails. That means that a few bit errors will cause the decoder to drop ~100 ms of audio at a time, and tools will report this as a 'hole in data'. To see whether it's disk or filesystem corruption, look for pages of zeros in a hexdump around where the glitch is.
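(A quick sketch, not part of the comment above, of automating that check: scan a file for long runs of zero bytes, which would point at lost filesystem blocks rather than scattered bit flips. Python is used here purely for illustration, and the 4 KiB threshold is an arbitrary assumption.)

# Report long runs of zero bytes in a file; block-sized, block-aligned runs
# suggest a filesystem block was lost rather than a few bits flipped.
import sys

def find_zero_runs(path, min_len=4096):
    runs = []
    offset = 0          # absolute position of the start of the current chunk
    run_start = None    # absolute position where the current zero run began
    with open(path, "rb") as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            for i, byte in enumerate(chunk):
                if byte == 0:
                    if run_start is None:
                        run_start = offset + i
                else:
                    if run_start is not None and offset + i - run_start >= min_len:
                        runs.append((run_start, offset + i - run_start))
                    run_start = None
            offset += len(chunk)
    if run_start is not None and offset - run_start >= min_len:
        runs.append((run_start, offset - run_start))
    return runs

if __name__ == "__main__":
    for start, length in find_zero_runs(sys.argv[1]):
        print("zero run at byte %d, length %d" % (start, length))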
Posted Dec 1, 2011 3:17 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Dec 1, 2011 10:07 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link]
Posted Dec 1, 2011 18:25 UTC (Thu)
by rillian (subscriber, #11344)
[Link]
The idea with the Ogg checksums was to protect the listener's ears (and possibly speakers) from corrupt output. It's also nice to have a built-in check for data corruption in your archives, which is working as designed here.
What you said is valid for video, because we're more tolerant of high frequency visual noise, and because the extra data dimensions and longer prediction intervals mean you can get more useful information from a corrupt frame than you do with audio. Making the checksum optional for the packet data is one of the things we'd do if we ever revised the Ogg format.
Posted Dec 2, 2011 22:09 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
That's not shotgun debugging (and not what the Jargon File calls it). The salient property of a shotgun isn't that it makes radical changes, but that it makes widespread changes. So you hit what you want to hit without aiming at it.
Shotgun debugging is trying lots of little things, none that you particularly believe will fix the bug.
In this case, the fallback to ext3 is fairly well targeted: the problem came contemporaneously with this one major and known change to the system, so it's not unreasonable to try undoing that change.
The other comments give good reason to believe this is not the best way forward, but it isn't because it's shotgun debugging.
There must be a term for the debugging mistake in which you give too much weight to the one recent change you know about in the area; I don't know what it is. (I've lost count of how many people accused me of breaking their Windows system because after I used it, there was a Putty icon on the desktop and something broke soon after that).
Posted Nov 30, 2011 0:00 UTC (Wed)
by bpepple (subscriber, #50705)
[Link] (17 responses)
Posted Nov 30, 2011 0:32 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (16 responses)
Thanks for the pointer, and thanks also to yoe's reply above. But, my music collection (currently over 10,000 files) has existed for almost four years, ever since I converted the entire collection from MP3 to OGG (via a homemade script which took about a week to run).[1] (I've never converted from FLAC to OGG, although I do have a couple of FLAC files.) I never noticed any corruption in the OGG files until a few months ago, shortly after I did a clean OS re-install (Slackware 13.37) on bare disks (including copying the music files).[2] I'm all too eager to blame the corruption on ext4 and/or LVM, since those were the only two things that changed immediately prior to the corruption, but you both bring up a good point that maybe I should dig a little deeper into finding the root cause before I jump to conclusions.

[1] I've had this collection of (legitimately acquired) songs for years prior, even having it on NTFS back in my Win2000/XP days. I abandoned Windows (including NTFS) in August 2004, and my music collection was entirely MP3 format (at 320 kbit) since I got my first 200GB hard disk. After seeing the benefits of the OGG Vorbis format, I decided to switch.

[2] I have four physical disks (volumes) in which I've set up PV set spanning across all disks for fast I/O performance. I'm not totally impressed at the performance—it is somewhat faster—but that's a whole other discussion.
Posted Nov 30, 2011 0:57 UTC (Wed)
by yokem_55 (subscriber, #10498)
[Link] (9 responses)
Posted Nov 30, 2011 2:11 UTC (Wed)
by dskoll (subscriber, #1630)
[Link] (5 responses)
I also had a very nasty experience with ext4. A server I built using ext4 suffered a power failure and the file system was completely toast after it powered back up. fsck threw hundreds of errors and I ended up rebuilding from scratch.
I have no idea if ext4 was the cause of the problem, but I've never seen that on an ext3 system. I am very nervous... possibly irrationally so, but I think I'll stick to ext3 for now.
Posted Nov 30, 2011 4:52 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (3 responses)
Write-back caching on volatile storage without careful use of write barriers and forced flushes *will* cause severe data corruption if the storage is cleared due to (eg) unexpected power loss.
Posted Nov 30, 2011 9:00 UTC (Wed)
by Cato (guest, #7643)
[Link] (2 responses)
Posted Nov 30, 2011 12:40 UTC (Wed)
by dskoll (subscriber, #1630)
[Link] (1 responses)
My system was using Linux Software RAID, so there wasn't a cheap RAID controller in the mix. You could be correct about the hard drives doing caching, but it seems odd that I've never seen this with ext3 but did with ext4. I am still hoping it was simply bad luck, bad timing, and writeback caching... but I'm also still pretty nervous.
Posted Nov 30, 2011 12:50 UTC (Wed)
by dskoll (subscriber, #1630)
[Link]
Ah... reading http://serverfault.com/questions/279571/lvm-dangers-and-caveats makes me think I was a victim of LVM and no write barriers. I've followed the suggestions in that article. So maybe I'll give ext4 another try.
Posted Nov 30, 2011 20:20 UTC (Wed)
by walex (guest, #69836)
[Link]
It is a very well known issue, usually involving unaware sysadmins and cheating developers.
Posted Nov 30, 2011 2:13 UTC (Wed)
by nix (subscriber, #2304)
[Link] (2 responses)
I'm quite willing to believe that bad RAM and the like can cause data corruption, but even when I was running ext4 on a machine with RAM so bad that you couldn't md5sum a 10Mb file three times and get the same answer thrice, I had no serious corruption (though it is true that I didn't engage in major file writing while the RAM was that bad, and I did get the occasional instances of bitflips in the page cache, and oopses every day or so).
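(A minimal sketch of the repeated-checksum test described above, assuming Python and MD5 purely for illustration: hash the same file several times and compare the digests. On healthy hardware they always match; differing digests point at RAM or cache trouble rather than the filesystem.)

# Hash the same file three times; any disagreement means the bytes read (or the
# hashing itself) changed between runs, which healthy hardware never does.
import hashlib
import sys

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

digests = {file_md5(sys.argv[1]) for _ in range(3)}
print("consistent" if len(digests) == 1 else "INCONSISTENT: %s" % digests)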
Posted Nov 30, 2011 12:49 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
To someone who isn't looking for RAM/cache issues as the root cause, those often look just like filesystem corruption of whatever kind. They try to open a file and get an error saying it's corrupted. Or they run a program and it mysteriously crashes.
If you _already know_ you have bad RAM, then you say "Ha, bitflip in page cache" and maybe you flush a cache and try again. But if you've already begun to harbour doubts about Seagate disks, or Dell RAID controllers, or XFS then of course that's what you will tend to blame for the problem.
Posted Dec 1, 2011 19:23 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Rare bitflips are normally going to be harmless or fixed up by e2fsck, one would hope. There may be places where a single bitflip, written back, toasts the fs, but I'd hope not. (The various fs fuzzing tools would probably have helped comb those out.)
Posted Nov 30, 2011 10:19 UTC (Wed)
by Trou.fr (subscriber, #26289)
[Link] (5 responses)
Posted Nov 30, 2011 15:35 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (4 responses)
From that article: "Mp3 to Ogg: Ogg -q6 was required to achieve transparency against the (high-quality) mp3 with difficult samples." I used -q8 (or higher) when transcoding with oggenc(1); I've done extensive testing by transcoding back-and-forth to different formats (including RIFF WAV) and have never noticed any decrease in audio quality or frequency response, even when measured with a spectrum analyzer. I do value your point, though.
Posted Dec 1, 2011 22:54 UTC (Thu)
by job (guest, #670)
[Link] (3 responses)
Posted Dec 10, 2011 1:04 UTC (Sat)
by ibukanov (subscriber, #3942)
[Link] (2 responses)
Posted Dec 10, 2011 15:20 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
Posted Dec 12, 2011 2:54 UTC (Mon)
by jimparis (guest, #38647)
[Link]
You can't replace missing information, but you could still make something that sounds better -- in a subjective sense. For example, maybe the mp3 has harsh artifacts at higher frequencies that the ogg encoder would remove.
It could apply to lossy image transformations too. Consider this sample set of images.
An initial image is pixelated (lossy), and that result is then blurred (also lossy). Some might argue that the final result looks better than the intermediate one, even though all it did was throw away more information.
But I do agree that this is off-topic, and that such improvement is probably rare in practice.
Posted Nov 30, 2011 8:50 UTC (Wed)
by ebirdie (guest, #512)
[Link]
Lesson learned: it pays to keep data on smaller volumes, although it is very very tempting to stuff data onto ever-bigger volumes and postpone the headache of splitting and managing smaller volumes.
Posted Nov 30, 2011 8:57 UTC (Wed)
by Cato (guest, #7643)
[Link]
This may help: http://serverfault.com/questions/279571/lvm-dangers-and-c...
Posted Nov 30, 2011 21:01 UTC (Wed)
by walex (guest, #69836)
[Link] (58 responses)
But the main issue is not that; by all accounts ext4 is quite reliable (when on a properly set up storage system and properly used by applications).
The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Other than that, for new "typical" systems almost only JFS and XFS make sense (and perhaps in the distant future BTRFS).
In particular, JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse). JFS is still probably (by a significant margin) the best "all-rounder" filesystem (XFS beats it in performance only on very parallel large workloads, and it is way more complex, and JFS has two uncommon but amazingly useful special features).
Sure it was very convenient to let people (in particular Red Hat customers) upgrade in place from 'ext' to 'ext2' to 'ext3' to 'ext4' (each in-place upgrade keeping existing files unchanged and usually with terrible performance), but given that when JFS was introduced the Linux base was growing rapidly, new installations could be expected to outnumber old ones very soon, making that point largely moot.
PS: There are other little known good filesystems, like OCFS2 (which is pretty good in non-clustered mode) and NILFS2 (probably going to be very useful on SSDs), but JFS is amazingly still very good. Reiser4 was also very promising (it seems little known that the main developer of BTRFS was also the main developer of Reiser4). As a pet peeve of mine UDF could have been very promising too, as it was quite well suited to RW media like hard disks too (and the Linux implementation almost worked in RW mode on an ordinary partition), and also to SSDs.
Posted Nov 30, 2011 22:07 UTC (Wed)
by yokem_55 (subscriber, #10498)
[Link]
Posted Nov 30, 2011 23:12 UTC (Wed)
by Lennie (subscriber, #49641)
[Link]
Posted Dec 1, 2011 0:53 UTC (Thu)
by SLi (subscriber, #53131)
[Link] (37 responses)
The only filesystem, years back, that could have been said to outperform ext4 on most counts was ReiserFS 4. Unfortunately, on each of the three times I stress tested it I hit different bugs that caused data loss.
Posted Dec 1, 2011 2:03 UTC (Thu)
by dlang (guest, #313)
[Link]
I haven't benchmarked against ext4, but I have done benchmarks with the filesystems prior to it, and I've run into many cases where JFS and XFS are clear winners.
even against ext4, if you have a fileserver situation where you have lots of drives involved, XFS is still likely to be a win; ext4 just doesn't have enough developers/testers with large numbers of disks to work with (this isn't my opinion, it's a statement from Ted Ts'o in response to someone pointing out where ext4 doesn't do as well as XFS with a high-performance disk array)
Posted Dec 2, 2011 18:52 UTC (Fri)
by walex (guest, #69836)
[Link] (35 responses)
JFS or XFS being the preferable filesystem on normal Linux use. Believe me, I've tried them both, benchmarked them both, and on almost all counts ext4 outperforms the two by a really wide margin (note that strictly speaking I'm not comparing the filesystems but their Linux implementations). In addition any failures have tended to be much worse on JFS and XFS than on ext4.
Posted Dec 2, 2011 23:15 UTC (Fri)
by tytso (subscriber, #9993)
[Link] (34 responses)
So benchmarking JFS against file systems that are engineered to be safe against power failures, such as ext4 and XFS, isn't particularly fair. You can disable cache flushes for both ext4 and XFS, but would you really want to run in an unsafe configuration for production servers? And JFS doesn't even have an option for enabling barrier support, so you can't make it run safely without fixing the file system code.
Posted Dec 3, 2011 0:56 UTC (Sat)
by walex (guest, #69836)
[Link] (31 responses)
As to JFS and performance and barriers with XFS and ext4:
Posted Dec 3, 2011 1:56 UTC (Sat)
by dlang (guest, #313)
[Link] (27 responses)
Posted Dec 3, 2011 3:06 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (26 responses)
Posted Dec 3, 2011 6:29 UTC (Sat)
by dlang (guest, #313)
[Link] (25 responses)
it should make barriers very fast so there isn't a big performance hit from leaving them on, but if you disable barriers and think the battery will save you, you are sadly mistaken
Posted Dec 3, 2011 11:05 UTC (Sat)
by nix (subscriber, #2304)
[Link] (24 responses)
If the power is out for months, civilization has probably fallen, and I'll have bigger things to care about than a bit of data loss. Similarly I don't care that battery backup doesn't defend me against people disconnecting the controller or pulling the battery while data is in transit. What other situation does battery backup not defend you against?
Posted Dec 3, 2011 15:39 UTC (Sat)
by dlang (guest, #313)
[Link] (15 responses)
There are two steps in getting data onto the disks:
1. writing from the OS to the raid card
2. writing from the raid card to the drives
Battery backup on the raid card makes step 2 reliable. This means that if the data is written to the raid card, it should be considered as safe as if it were on the actual drives (it's not quite that safe, but close enough).
However, without barriers, the data isn't sent from the OS to the raid card in any predictable pattern. It's sent at the whim of the OS cache flushing algorithm. This can result in some data making it to the raid controller and other data not making it there if you have an unclean shutdown. If the data is never sent to the raid controller, then the battery there can't do you any good.
With barriers, the system can enforce that data gets to the raid controller in a particular order, and so the only data that would be lost is the data written since the last barrier operation completed.
note that if you are using software raid, things are much uglier as the OS may have written the stripe to one drive and not to another (barriers only work on a single drive, not across drives). this is one of the places where hardware raid is significantly more robust than software raid.
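(For illustration only: a rough user-space analogue of that ordering guarantee, in Python. This is not the kernel's barrier machinery; it just shows the property the filesystem wants for its journal, namely that write B must not reach stable storage before write A. The file name is made up.)

# Enforce "A before B" ordering from an application: flush A to stable storage
# before issuing B. Filesystem barriers give the journal the same guarantee
# inside the storage stack without an application-level fsync for every record.
import os

fd = os.open("journal.log", os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o644)
try:
    os.write(fd, b"A: record describing the pending change\n")
    os.fsync(fd)   # A is durable before B is even issued (the ordering point)
    os.write(fd, b"B: the change itself\n")
    os.fsync(fd)
finally:
    os.close(fd)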
Posted Dec 3, 2011 18:04 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (14 responses)
Posted Dec 3, 2011 19:31 UTC (Sat)
by dlang (guest, #313)
[Link] (11 responses)
barriers preserve the ordering of writes throughout the entire disk subsystem, so once the filesystem decides that a barrier needs to be at a particular place, going through a layer of LVM (before it supported barriers) would run the risk of the writes getting out of order
with barriers on software raid, the raid layer won't let the writes on a particular disk get out of order, but it doesn't enforce that all writes before the barrier on disk 1 get written before the writes after the barrier on disk 2
Posted Dec 4, 2011 6:17 UTC (Sun)
by raven667 (subscriber, #5198)
[Link] (10 responses)
In any event, there is a bright line between how the kernel handles internal data structures and what the hardware does. For storage with a battery-backed write cache, once an I/O is posted to the storage it is as good as done, so there is no need to ask the storage to commit its blocks in any particular fashion. The only issue is that the kernel must issue the I/O requests in a responsible manner.
Posted Dec 4, 2011 6:41 UTC (Sun)
by dlang (guest, #313)
[Link] (8 responses)
per the messages earlier in this thread, JFS does not, for a long time (even after it was the default in Fedora), LVM did not.
so barriers actually working correctly is relatively new (and very recently they have found more efficient ways to enforce ordering than the older version of barriers).
Posted Dec 4, 2011 11:24 UTC (Sun)
by tytso (subscriber, #9993)
[Link]
It shouldn't be that hard to add support, but no one is doing any development work on it.
Posted Dec 4, 2011 16:26 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link] (6 responses)
Posted Dec 4, 2011 16:50 UTC (Sun)
by dlang (guest, #313)
[Link] (5 responses)
Fedora has actually been rather limited in its support of various filesystems. The kernel supports the different filesystems, but the installer hasn't given you the option of using XFS or JFS for your main filesystem, for example.
Posted Dec 4, 2011 17:41 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
"JFS does not, for a long time (even after it was the default in Fedora)"
You are inaccurate about your claim on the installer as well. XFS has been a standard option in Fedora for several releases, ever since Red Hat hired Eric Sandeen from SGI to maintain it (and help develop ext4). JFS is a non-standard option.
Posted Dec 4, 2011 19:22 UTC (Sun)
by dlang (guest, #313)
[Link] (3 responses)
re: XFS, I've been using linux since '94, so XFS support in the installer is very recent :-)
I haven't been using Fedora for quite a while; my experience with Red Hat distros is mostly RHEL (and CentOS), which lag behind. I believe that RHEL 5 still didn't support XFS in the installer.
Posted Dec 4, 2011 19:53 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link]
http://fedoraproject.org/wiki/Releases/10/Beta/ReleaseNot...
That is early 2008. RHEL 6 has XFS support as an add-on subscription, and it is supported within the installer as well IIRC.
Posted Dec 5, 2011 16:15 UTC (Mon)
by wookey (guest, #5501)
[Link] (1 responses)
(I parsed it the way rahulsundaram did too - it's not clear).
Posted Dec 5, 2011 16:59 UTC (Mon)
by dlang (guest, #313)
[Link]
Posted Jan 30, 2012 8:50 UTC (Mon)
by sbergman27 (guest, #10767)
[Link]
Posted Dec 8, 2011 17:54 UTC (Thu)
by nye (subscriber, #51576)
[Link] (1 responses)
Surely what you're describing is a cache flush, not a barrier?
A barrier is intended to control the *order* in which two pieces of data are written, not when or even *if* they're written. A barrier *could* be implemented by issuing a cache flush in between writes (maybe this is what's commonly done in practice?) but in that case you're getting slightly more than you asked for (ie. you're getting durability of the first write), with a corresponding performance impact.
Posted Dec 8, 2011 23:24 UTC (Thu)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 12, 2011 12:01 UTC (Mon)
by jlokier (guest, #52227)
[Link] (7 responses)
Some battery-backed disk write caches can commit the RAM to flash storage or something else, on battery power, in the event that the power supply is removed for a long time. These systems don't need a large battery and provide stronger long-term guarantees.
Even ignoring ext3's no barrier default, and LVM missing them for ages, there is the kernel I/O queue (elevator) which can reorder requests. If the filesystem issues barrier requests, the elevator will send writes to the storage device in the correct order. If you turn off barriers in the filesystem when mounting, the kernel elevator is free to send writes out of order; then after a system crash, the system recovery will find inconsistent data from the storage unit. This can happen even after a normal crash such as a kernel panic or hard-reboot, no power loss required.
Whether that can happen when you tell the filesystem not to bother with barriers depends on the filesystem's implementation. To be honest, I don't know how ext3/4, xfs, btrfs etc. behave in that case. I always use barriers :-)
Posted Dec 12, 2011 15:40 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (6 responses)
Posted Dec 12, 2011 18:14 UTC (Mon)
by dlang (guest, #313)
[Link] (5 responses)
there is no modern filesystem that waits for the data to be written before proceeding. Every single filesystem out there will allow its writes to be cached and actually written out later (in some cases, this can be _much_ later)
when the OS finally gets around to writing the data out, it has no idea what the application (or filesystem) cares about, unless there are barriers issued to tell the OS that 'these writes must happen before these other writes'
Posted Dec 12, 2011 18:15 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
Posted Dec 12, 2011 18:39 UTC (Mon)
by dlang (guest, #313)
[Link] (3 responses)
it actually doesn't stop processing requests and wait for the confirmation from the disk; it issues a barrier to tell the rest of the storage stack not to reorder around that point and goes on to process the next request and get it in flight.
Posted Dec 12, 2011 18:53 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
It worked a little bit more like you describe before 2.6.37, but back then it waited if barriers were disabled.
Posted Dec 13, 2011 13:35 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Dec 13, 2011 13:38 UTC (Tue)
by andresfreund (subscriber, #69562)
[Link]
Posted Dec 3, 2011 11:00 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Dec 3, 2011 18:06 UTC (Sat)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 3, 2011 20:33 UTC (Sat)
by tytso (subscriber, #9993)
[Link]
ext3 was first supported by RHEL as of RHEL 2 which was released May 2003 --- and as you can see from the dates above, we had developers working at a wide range of companies, thus making it a community-supported distribution, long before Red Hat supported ext3 in their RHEL product. In contrast, most of the reiserfs developers worked at Namesys (with one or two exceptions, most notably Chris Mason when he was at SuSE), and most of the XFS developers worked at SGI.
Posted Dec 5, 2011 16:29 UTC (Mon)
by wookey (guest, #5501)
[Link] (1 responses)
When I managed to repair them I found that many files had big blocks of zeros in them - essentially anything that was in the journal and had not been written. Up to that point I had naively thought that the point of the journal was to keep actual data, not just filesystem metadata. Files that have been 'repaired' by being silently filled with big chunks of zeros did not impress me.
So I now believe that XFS is/was good, but only on properly UPSed servers. Am I wrong about that?
Posted Dec 5, 2011 17:03 UTC (Mon)
by dlang (guest, #313)
[Link]
XFS caches more stuff than ext does, so a crash loses more stuff.
so XFS or ext* with barriers disabled is not good to use. For a long time, running these things on top of LVM had the side effect of disabling barriers; it's only recently that LVM gained the ability to support them.
JFS is not good to use (as it doesn't have barriers at all)
note that while XFS is designed to be safe, that doesn't mean that it won't lose data, just that the metadata will not be corrupt.
the only way to not lose data in a crash/power failure is to do no buffering at all, and that will absolutely kill your performance (and we are talking hundreds of times slower, not just a few percentage points)
Posted Dec 1, 2011 2:58 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (2 responses)
JFS was a very good file system, and at the time when it was released, it certainly was better than ext3. But there's a lot more to having a successful open source project beyond having the best technology. The fact that ext2 was well understood, and had a mature set of file system utilities, including tools like "debugfs", are one of the things that do make a huge difference towards people accepting the technology.
At this point, though, ext4 has a number of features which JFS lacks, including delayed allocation, fallocate, punch, and TRIM/discard support. These are all features which I'm sure JFS would have developed if it still had a development community, but when IBM decided to defund the project, there were few or no developers who were not IBM'ers, and so the project stalled out.
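(As an aside, a minimal sketch, assuming Python on Linux, of what the fallocate support mentioned above looks like from user space; the file name and size are made up.)

# Preallocate 16 MiB of blocks for a file without writing any data; on a
# filesystem with fallocate support (ext4, XFS) this reserves the space cheaply
# instead of writing zeros.
import os

fd = os.open("preallocated.dat", os.O_CREAT | os.O_WRONLY, 0o644)
try:
    os.posix_fallocate(fd, 0, 16 * 1024 * 1024)
finally:
    os.close(fd)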
---
People who upgrade in place from ext3 to ext4 will see roughly half the performance increase compared to doing a backup, reformat to ext4, and restore operation. But they *do* see a performance increase if they do an upgrade-in-place operation. In fact, even if they don't upgrade the file system image, and use ext4 to mount an ext2 file system image, they will see some performance improvement. So this gives them flexibility, which from a system administrator's point of view, is very, very important!
---
Finally, I find it interesting that you consider OCFS2 "pretty good" in non-clustered mode. OCFS2 is a fork of the ext3 code base[1] (it even uses fs/jbd and now fs/jbd2) with support added for clustered operation, and with support for extents (which ext4 has as well, of course). It doesn't have delayed allocation. But ext4 will be better than ocfs2 in non-clustered mode, simply because it's been optimized for it. The fact that you seem to think OCFS2 is "pretty good", while you don't seem to think much of ext4, makes me wonder whether you have some pretty strong biases against the ext[234] file system family.
[1] Ocfs2progs is also a fork of e2fsprogs. Which they did with my blessing, BTW. I'm glad to see that the code that has come out of the ext[234] project has been useful in so many places. Heck, parts of e2fsprogs (the UUID library, which I relicensed to BSD for Apple's benefit) can be found in Mac OS X! :-)
Posted Dec 1, 2011 20:25 UTC (Thu)
by sniper (guest, #13219)
[Link] (1 responses)
ocfs2 is not a fork of ext3 and neither is ocfs2-tools a fork of e2fsprogs. But both have benefited a _lot_ from ext3. In some instances, we copied code (non-indexed dir layout). In some instances, we used a different approach because of collective experience (indexed dir). grep ext3 fs/ocfs2/* for more.
The toolset has a lot more similarities to e2fsprogs. It was modeled after it because it is well designed and to also allow admins to quickly learn it. The tools even use the same parameter names where possible. grep -r e2fsprogs * for more.
BTW, ocfs2 has had bigalloc (aka clusters) since day 1, inline-data since 2.6.24 and metadata checksums since 2.6.29. Yes, it does not have delayed allocations.
Posted Apr 13, 2012 19:30 UTC (Fri)
by fragmede (guest, #50925)
[Link]
LVM snapshots are a joke if you have *lots* of snapshots, though I haven't looked at btrfs snapshots since it became production ready.
Posted Dec 1, 2011 3:22 UTC (Thu)
by tytso (subscriber, #9993)
[Link]
At the time when I started working on ext4, XFS developers were all mostly still working for SGI, so there was a similar problem with the distributions not having anyone who could support or debug XFS problems. This has changed more recently, as more and more XFS developers have left (voluntarily or involuntarily) SGI and joined companies such as Red Hat. XFS has also improved its small file performance, which was something it didn't do particularly well simply because SGI didn't optimize for that; its sweet spot was and still is really large files on huge RAID arrays.
One of the reasons why I felt it was necessary to work on ext4 was that everyone I talked to who had created a file system before in the industry, whether it was GPFS (IBM's cluster file system), or Digital Unix's advfs, or Sun's ZFS, gave estimates of somewhere between 50 to 200 person years worth of effort before the file system was "ready". Even if we assume that open source development practices would make development go twice as fast, and if we ignore the high end of the range because cluster file systems are hard, I was skeptical it would get done in two years (which was the original estimate) given the number of developers it was likely to attract. Given that btrfs started at the beginning of 2007, and here we are almost at 2012, I'd say my fears were justified.
At this point, I'm actually finding that ext4 has found a second life as a server file system in large cloud data centers. It turns out that if you don't need the fancy-schmancy features that copy-on-write file systems give you, they aren't free. In particular, ZFS has a truly prodigious appetite for memory, and one of the things about cloud servers is that in order for them to make economic sense, you try to pack as many jobs or VMs onto them as possible, so they are constantly under memory pressure. We've done some further optimizations so that ext4 performs much better when under memory pressure, and I suspect at this point that in a cloud setting, using a CoW file system may simply not make sense.
Once btrfs is ready for some serious benchmarking, it would be interesting to benchmark it under serious memory pressure, and see how well it performs. Previous CoW file systems, such as BSD's lfs two decades ago, and ZFS more recently, have needed a lot of memory to cache metadata blocks, and it will be interesting to see if btrfs has similar issues.
Posted Dec 1, 2011 19:36 UTC (Thu)
by nix (subscriber, #2304)
[Link] (13 responses)
I also see that I was making some sort of horrible mistake by installing ext4 on all my newer systems, but you never make clear what that mistake might have been.
I've been wracking my brains and I can't think of one thing Greg has done that has come to public knowledge and could be considered bad. So this looks like groundless personal animosity to me.
Posted Dec 1, 2011 19:41 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
Posted Dec 2, 2011 11:35 UTC (Fri)
by alankila (guest, #47141)
[Link] (5 responses)
Posted Dec 2, 2011 18:40 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(Yes, I read the release notes, so didn't fall into these traps, but FFS, at least the latter problem was trivial to work around -- one line in the makefile to drop a symlink in /sbin -- and they just didn't bother.)
Posted Dec 2, 2011 23:40 UTC (Fri)
by walex (guest, #69836)
[Link] (3 responses)
As to udev, some people dislike smarmy shysters who replace well designed working subsystems seemingly for the sole reason of making a political landgrab, because the replacement has both more kernel complexity and more userland complexity and less stability. The key features of devfs were that it would populate /dev automatically from the kernel with basic device files (major, minor) and then use a very simple userland daemon to add extra aliases as required. It turns out that, after several attempts to get it to work, udev relies on the kernel exposing exactly the same information in /sys, so there has been no migration of functionality from kernel to userspace. And the userland part is also far more complex and unstable than devfsd ever was (for example, devfs did not require cold start). And udev is just the most shining example of a series of similar poor decisions (which however seem to have been improving a bit with time).
Posted Dec 3, 2011 3:16 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Dec 3, 2011 11:07 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Dec 3, 2011 4:04 UTC (Sat)
by alankila (guest, #47141)
[Link]
Posted Dec 3, 2011 0:12 UTC (Sat)
by walex (guest, #69836)
[Link] (5 responses)
«tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.»

Quite irrelevant: a lot of file systems were somebody's hobby file systems, but they did not achieve prominence and instant integration into mainline even if rather alpha, and Red Hat did not spend enormous amounts of resources quality-assuring them to make them production ready either, and quality assurance is a pretty vital detail for file systems, as the Namesys people discovered. Pointing to tytso is just misleading. Also because ext4 really was seeded by Lustre people before tytso became active on it in his role as ext3 curator (and in 2005, which is 5 years later than when JFS became available).

Similarly for BTRFS: it was initiated by Oracle (who have an ext3 installed base), but its main appeal is still as the next in-place upgrade on the Red Hat installed base (thus the interest in trialing it in Fedora, where EL candidate stuff is mass tested), even if for once it is not just an extension of the ext line but has some interesting new angles.

But considering ext4 on its own is a partial view; one must consider the pre-existing JFS and XFS stability and robustness and performance. From a technical point of view ext4 is not that interesting (euphemism) and its sole appeal is in-place upgrades, and the widest installed base for that is Red Hat, and to a large extent that could have been said of ext3 too.
Posted Dec 3, 2011 0:52 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
And if you're claiming that btrfs is effectively RH-controlled merely because RH customers will benefit, then *everything* that happens to Linux must by your bizarre definition be RH-controlled. That's a hell of a conspiracy: so vague that the coconspirators don't even realise they're conspiring!
Posted Apr 13, 2012 19:34 UTC (Fri)
by fragmede (guest, #50925)
[Link]
Posted Dec 3, 2011 19:45 UTC (Sat)
by tytso (subscriber, #9993)
[Link] (2 responses)
But you can't have it both ways. If that code had been in use by paying Lustre companies, then it's hardly alpha code, wouldn't you agree?
And why did the Lustre developers at ClusterFS choose ext3? Because the engineers they hired knew ext3, since it was a community-supported distribution, whereas JFS was controlled by a core team that was all IBM'ers, and hardly anyone outside of IBM was available who knew JFS really well.
But as others have already pointed out, there was no grand conspiracy to pick ext2/3/4 over its competition. It won partially due to its installed base, and partially because of the availability of developers who understood it (and books written about it, etc., etc., etc.) The way you've been writing, you seem to think there was some secret cabal (at Red Hat?) that made these decisions, and that there was a "mistake" because they didn't choose your favorite file systems.
The reality is that file systems all have trade-offs, and what's good for some people is not so great for others. Take a look at some of the benchmarks at btrfs.boxacle.net; they're a bit old, but they are well done, and they show that across many different workloads at that time (2-3 years ago) there was no one single file system that was the best across all of the different workloads. So anyone who only uses a single workload, or a single hardware configuration, and tries to use that to prove that their favorite file system is the "best" is trying to sell you something, or is a slashdot kiddie who has a fan-favorite file system. The reality is a lot more complicated than that, and it's not just about performance. (Truth be told, for many/most use cases, the file system is not the bottleneck.) Issues like availability of engineers to support the file system in a commercial product, the maturity of the userspace support tools, ease of maintainability, etc. are at least as important if not more so.
Posted Dec 3, 2011 20:43 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
Add to this the fact that you did not need to reformat your system to use ext3 when upgrading, and the fact that ext3 became the standard (taking over from ext2, which was the prior standard) is a no-brainer, and no conspiracy.
In those days XFS would outperform ext3, but only in benchmarks on massive disk arrays (which were even more out of people's price ranges at that point than they are today)
XFS was scalable to high-end systems, but its low-end performance was mediocre
looking at things nowadays, XFS has had a lot of continuous improvement and integration, both improving its high-end performance and reliability, and improving its low-end performance without losing its scalability. There are also more people, working for more companies, supporting it, making it far less of a risk today, with far more in the way of upsides.
JFS has received very little attention after the initial code dump from IBM, and there is now nobody actively maintaining/improving it, so it really isn't a good choice going forward.
reiserfs had some interesting features and performance, but it suffered from some seriously questionable benchmarking (the one that turned me off to it entirely was a spectacular benchmarking test that reiserfs completed in 20 seconds but that took several minutes on ext*; then we discovered that reiserfs defaulted to a 30 second delay before writing everything to disk, so the entire benchmark was complete before any data started getting written to disk, and after that I didn't trust anything that they claimed), and a few major problems (the fsck scrambling is a huge one). It was then abandoned by the developer in favor of the future reiserfs4, with improvements that were submitted being rejected as they were going to be part of the new, incompatible filesystem.
ext4 is in large part a new filesystem whose name just happens to be similar to what people are running, but it has now been out for several years, with developers who are responsive to issues, are a diverse set (no vendor lock-in or dependencies) and are willing to say where the filesystem is not the best choice.
btrfs is still under development (the fact that they don't yet have a fsck tool is telling), is making claims that seem too good to be true, and they have already run into several cases of pathological behavior where they have had to modify things significantly. I wouldn't trust it for anything other than non-critical personal use for another several years.
as a result, I am currently using XFS for the most part, but once I get a chance to do another round of testing, ext4 will probably join it. I have a number of systems that have significant numbers of disks, so XFS will probably remain in use.
Posted Dec 4, 2011 1:12 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Dec 1, 2011 3:46 UTC (Thu)
by eli (guest, #11265)
[Link] (1 responses)
Posted Dec 6, 2011 1:06 UTC (Tue)
by pr1268 (guest, #24648)
[Link]
Thanks for the suggestion; I'll give it a try sometime if/when I find a corrupt OGG file. Just to bring some closure to this discussion, I wish to make a few points: Many thanks to everyone's discussion above; I always learn a lot from the comments here on LWN.
Posted Dec 8, 2011 15:34 UTC (Thu)
by lopgok (guest, #43164)
[Link] (8 responses)
I wrote it when I had a ServerWorks chipset on my motherboard that corrupted IDE hard drives when DMA was enabled. These days, however, the utility lets me know there is no bit rot in my files.
It can be found at http://jdeifik.com/ , look for 'md5sum a directory tree'. It is GPL3 code. It works independently from the files being checksummed and independently of the file system. I have found flaky disks that passed every other test with this utility.
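(The following is not lopgok's utility, just a sketch of the same approach in Python for readers who want the idea at a glance. It keeps a per-directory manifest of MD5 digests, creating the manifest on the first run and verifying files against it on later runs; the manifest name MD5SUMS is an assumption.)

# Walk a tree; in each directory, create an MD5 manifest if none exists,
# otherwise re-hash the files and report any that no longer match.
import hashlib
import os
import sys

MANIFEST = "MD5SUMS"

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for dirpath, _, filenames in os.walk(sys.argv[1]):
    names = sorted(n for n in filenames if n != MANIFEST)
    manifest = os.path.join(dirpath, MANIFEST)
    if not os.path.exists(manifest):
        with open(manifest, "w") as f:
            for n in names:
                f.write("%s  %s\n" % (file_md5(os.path.join(dirpath, n)), n))
        continue
    recorded = {}
    with open(manifest) as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            recorded[name] = digest
    for n in names:
        if n in recorded and file_md5(os.path.join(dirpath, n)) != recorded[n]:
            print("possible bit rot: %s" % os.path.join(dirpath, n))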
The other thing that can corrupt files is memory errors. Many new computers do not support ECC memory. If you care about data integrity, you should use ECC memory. Intel has this feature for their server chips (Xeons) and AMD has this feature for all of their processors (though not all motherboard makers support it).
Posted Dec 8, 2011 16:24 UTC (Thu)
by nix (subscriber, #2304)
[Link] (7 responses)
ECCRAM is worthwhile, but it is not at all cheap once you factor all that in.
Posted Dec 8, 2011 17:47 UTC (Thu)
by tytso (subscriber, #9993)
[Link] (6 responses)
It's like people who balk at spending an extra $200 to mirror their data, or to provide a hot spare for their RAID array. How much would you be willing to spend to get back your data after you discover it's been vaporized? What kind of chances are you willing to take against that eventuality happen?
It will vary from person to person, but traditionally people are terrible at figuring out cost/benefit tradeoffs.
Posted Dec 8, 2011 19:10 UTC (Thu)
by nix (subscriber, #2304)
[Link] (5 responses)
(Also, last time I tried you couldn't buy a desktop with ECCRAM for love nor money. Servers, sure, but not desktops. So of course all my work stays on the server with battery-backed hardware RAID and ECCRAM, and I just have to hope the desktop doesn't corrupt it in transit.)
Posted Dec 9, 2011 0:57 UTC (Fri)
by tytso (subscriber, #9993)
[Link] (2 responses)
I really like how quickly I can build kernels on this machine. :-)
I'll grant it's not "cheap" in absolute terms, but I've always believed that skimping on a craftsman's tools is false economy.
Posted Dec 9, 2011 7:41 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
I have the same machine. Oddly enough, it only supports 12GB of non-ECC memory, at least according to Dell's manual. How does that happen?
(Also, Intel's processor datasheet claims that several hundred gigabytes of either ECC or non-ECC memory should be supported using the integrated memory controller. I wonder why Dell's system supports less.)
Posted Dec 9, 2011 12:40 UTC (Fri)
by nix (subscriber, #2304)
[Link]
EDAC support for my Nehalem systems landed in mainline a couple of years ago but I'll admit to never having looked into how to get it to tell me what errors may have been corrected, so I have no idea how frequent they might be.
(And if it didn't mean dealing with Dell I might consider one of those machines myself...)
Posted Dec 9, 2011 13:53 UTC (Fri)
by james (subscriber, #1325)
[Link] (1 responses)
Even ECC memory isn't that much more expensive: Crucial do a 2x2GB ECC kit for £27 + VAT ($42 in the US) against £19 ($30).
Posted Dec 9, 2011 15:19 UTC (Fri)
by lopgok (guest, #43164)
[Link]
If you buy assembled computers and can't get ECC support without spending big bucks, it is time to switch vendors.
It is true that ECC memory is more expensive and less available than non-ECC memory, but the price difference is around 20% or so, and Newegg and others sell a wide variety of ECC memory. Mainstream memory manufacturers, including Kingston, sell ECC memory.
Of course, virtually all server computers come with ECC memory.
Posted Jan 15, 2012 3:45 UTC (Sun)
by sbergman27 (guest, #10767)
[Link]
The very first time we had a power failure, with a UPS with a bad battery, we experienced corruption in several of those files. Never *ever* *ever* had we experienced such a thing with EXT3. I immediately added nodelalloc as a mount option, and the EXT4 filesystem now seems as resilient as EXT3 ever was. Note that at around the same time as 2.6.30, EXT3 was made less reliable by adding the same 2.6.30 patches to it, and making data=writeback the default journalling mode. So if you do move back to EXT3, make sure to mount with data=journal.
BTW, I've not noted any performance differences mounting EXT4 with nodelalloc. Maybe in a side by side benchmark comparison I'd detect something.
Posted Feb 19, 2013 10:23 UTC (Tue)
by Cato (guest, #7643)
[Link]
You might also like to try ZFS or btrfs - both have enough built-in checksumming that they should detect issues sooner, though in this case Ogg's checksumming is doing that for audio files. With a checksumming FS you could detect whether the corruption is in RAM (seen when writing to file) or on disk (seen when reading from file). ZFS also does periodic scrubbing to validate checksums.
A few wrong bits in a Vorbis stream seem likely to give you more than just "one wrong sample".
shotgun debugging
What you're trying to do with moving back to ext3 is what the Jargon File calls shotgun debugging: trying out some radical move in hopes that this will fix your problem.
Pretty far off-topic, but: it is a rare situation indeed where the removal of information will improve the fidelity of a signal. One might not be able to hear the difference, but I have a hard time imagining how conversion between lossy formats could do anything but degrade the quality. You can't put back something that the first lossy encoding took out, but you can certainly remove parts of the signal that the first encoding preserved.
Most well-done benchmarks I have seen show mostly equivalent performance, with XFS leading the group in scalability, JFS pretty good across the field, and ext4, just like the previous exts, good only on totally freshly loaded filesystems (as it packs newly created files pretty densely) and when there is ample caching (no use of 'O_DIRECT'); both fresh loading and caching mask its fundamental, BSD FFS-derived, downsides.
It is very very easy to do meaningless filesystem benchmarks (the vast majority that I see on LWN and most others are worthless).
"..., for a long time (even after it was the default in Fedora), LVM did not"
I am rather sure at least ext4 and xfs do it that way.
jbd does something similar but I don't want to look it up unless you're really interested.
I shouldn't respond to this troll-bait, but nonetheless...
The big problem with 'ext4' is that its only reason to be is to allow Red Hat customers to upgrade in place existing systems, and what Red Hat wants, Red Hat gets (also because they usually pay for that and the community is very grateful).
Interesting. tytso wasn't working for RH when ext4 started up, and still isn't working for them now. So their influence must be more subtle.
In particular JFS should have been the "default" Linux filesystem instead of ext[23] for a long time. Not making JFS the default was probably the single worst strategic decision for Linux (but it can be argued that letting GKH near the kernel was even worse).
Ah, yeah. Because stable kernels, USB support, mentoring newbies, the driver core, -staging... all these things were bad.
Also, uhm. Didn't he work for Suse?
$ ls -ld /dev/tty9
crw--w---- 1 root tty 4, 9 2011-11-28 14:03 /dev/tty9
$ cat /sys/class/tty/tty9/dev
4:9
ext4 is in large part a new filesystem whose name just happens to be similar to what people are running
ext4 is ext3 with a bunch of new extensions (some incompatible): indeed, initially the ext4 work was going to be done to ext3, until Linus asked for it to be done in a newly-named clone of the code instead. It says a lot for the ext2 code and disk formats that they've been evolvable to this degree.
I wrote a trivial python script to generate a checksum file for each directory's files. If you run it, and it finds a checksum file, it checks that the files in the directory match the checksum file, and if they don't it reports that.
It is very cheap insurance.
It is very cheap insurance.
Look at the price differential between the motherboards and CPUs that support ECCRAM and those that do not. Now add in the extra cost of the RAM.