Ext3 and RAID: silent data killers?
The conversation actually began last March, when Pavel Machek posted a proposed documentation patch describing the assumptions that he saw as underlying the design of Linux filesystems. Things went quiet for a while, before springing back to life at the end of August. It would appear that Pavel had run into some data-loss problems when using a flash drive with a flaky connection to the computer; subsequent tests done by deliberately removing active drives confirmed that it is easy to lose data that way. He hadn't expected that:
In an attempt to prevent a surge in refund requests at universities worldwide, Pavel tried to get some warnings put into the kernel documentation. He has run into a surprising amount of opposition, which he (and some others) have taken as an attempt to sweep shortcomings in Linux filesystems under the rug. The real story, naturally, is a bit more complex.
Journaling technology like that used in ext3 works by writing some data to the filesystem twice. Whenever the filesystem must make a metadata change, it will first gather together all of the block-level changes required and write them to a special area of the disk (the journal). Once it is known that the full description of the changes has made it to the media, a "commit record" is written, indicating that the filesystem code is committed to the change. Once the commit record is also safely on the media, the filesystem can start writing the metadata changes to the filesystem itself. Should the operation be interrupted (by a power failure, say, or a system crash or abrupt removal of the media), the filesystem can recover the plan for the changes from the journal and start the process over again. The end result is to make metadata changes transactional; they either happen completely or not at all. And that should prevent corruption of the filesystem structure.
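For readers who want to see the shape of that transaction logic, here is a toy model in Python; the block layout, record format, and function names are invented for illustration and are not how ext3's journaling (JBD) layer is actually structured.

```python
# Toy model of metadata journaling: changes go to the journal first,
# a commit record makes them durable, and recovery replays only
# fully-committed transactions. Names and layout are invented.

disk = {}        # block number -> data (the "filesystem proper")
journal = []     # sequence of journal records

def journal_write(changes):
    """Stage a set of metadata block changes as one transaction."""
    txid = len(journal)
    for block, data in changes.items():
        journal.append(("data", txid, block, data))
    journal.append(("commit", txid))          # transaction is now durable
    checkpoint(changes)                       # may be interrupted safely

def checkpoint(changes):
    """Copy journaled blocks to their final locations."""
    for block, data in changes.items():
        disk[block] = data

def recover():
    """Replay committed transactions after a crash; drop the rest."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "data" and rec[1] in committed:
            _, _, block, data = rec
            disk[block] = data                # idempotent: safe to redo

journal_write({10: b"inode update", 11: b"bitmap update"})
recover()   # harmless here; after a crash it would redo the changes
```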
One thing worth noting here is that actual data is not normally written to the journal, so a certain amount of recently-written data can be lost in an abrupt failure. It is possible to configure ext3 (and ext4) to write data to the journal as well, but, since the performance cost is significant, this option is not heavily used. So one should keep in mind that most filesystem journaling is there to protect metadata, not the data itself. Journaling does provide some data protection anyway - if the metadata is lost, the associated data can no longer be found - but that's not its primary reason for existing.
It is not the lack of journaling for data which has created grief for Pavel and others, though. The nature of flash-based storage makes another "interesting" failure mode possible. Filesystems work with fixed-size blocks, normally 4096 bytes on Linux. Storage devices also use fixed-size blocks; on rotating media, those blocks have traditionally been 512 bytes in length, though larger block sizes are on the horizon. The key point is that, on a normal rotating disk, the filesystem can write a block without disturbing any unrelated blocks on the drive.
Flash storage also uses fixed-size blocks, but they tend to be large - typically tens to hundreds of kilobytes. Flash blocks can only be rewritten as a unit, so writing a 4096-byte "block" at the operating system level will require a larger read-modify-write cycle within the flash drive. It is certainly possible for a careful programmer to write flash-drive firmware which does this operation in a safe, transactional manner. It is also possible that the flash drive manufacturer was rather more interested in getting a cheap device to market quickly than careful programming. In the commodity PC hardware market, that possibility becomes something much closer to a certainty.
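A toy model may make the danger clearer. The erase-block size, the translation logic, and the "power fails here" hook below are all assumptions for illustration; real firmware is both more complicated and, on cheap devices, possibly less careful.

```python
# Toy flash translation layer: rewriting one 4 KB logical block means
# rewriting the whole erase block that contains it. If power fails
# after the erase but before the reprogram, unrelated blocks are lost.
ERASE_BLOCK = 128 * 1024      # assumed erase-block size
LOGICAL     = 4096            # filesystem block size

flash = bytearray(b"\xff" * ERASE_BLOCK)   # one erase block, starts erased

def write_logical(offset, data, power_fails_mid_cycle=False):
    """Read-modify-erase-write cycle for one 4 KB block."""
    assert len(data) == LOGICAL
    shadow = bytearray(flash)              # 1. read the whole erase block
    shadow[offset:offset + LOGICAL] = data # 2. modify the one block
    flash[:] = b"\xff" * ERASE_BLOCK       # 3. erase (old contents gone)
    if power_fails_mid_cycle:
        return                             # neighboring blocks are now garbage
    flash[:] = shadow                      # 4. program the new contents

write_logical(0, b"A" * LOGICAL)
write_logical(LOGICAL, b"B" * LOGICAL, power_fails_mid_cycle=True)
print(flash[:8])   # all 0xff -- block 0's data was collateral damage
```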
What this all means is that, on a low-quality flash drive, an interrupted write operation could result in the corruption of blocks unrelated to that operation. If the interrupted write was for metadata, a journaling filesystem will redo the operation on the next mount, ensuring that the metadata ends up in its intended destination. But the filesystem cannot know about any unrelated blocks which might have been trashed at the same time. So journaling will not protect against this kind of failure - even if it causes the sort of metadata corruption that journaling is intended to prevent.
This is the "bug" in ext3 that Pavel wished to document. He further asserted that journaling filesystems can actually make things worse in this situation. Since a full fsck is not normally required on journaling filesystems, even after an improper dismount, any "collateral" metadata damage will go undetected. At best, the user may remain unaware for some time that random data has been lost. At worst, corrupt metadata could cause the code to corrupt other parts of the filesystem over the course of subsequent operation. The skipped fsck may have enabled the system to come back up quickly, but it has done so at the risk of letting corruption persist and, possibly, spread.
One could easily argue that the real problem here is the use of hidden translation layers to make a flash device look like a normal drive. David Woodhouse did exactly that:
The manufacturers of flash drives have, thus far, proved impervious to this line of reasoning, though.
There is a similar failure mode with RAID devices which was also discussed. Drives can be grouped into a RAID5 or RAID6 array, with the result that the array as a whole can survive the total failure of any drive within it. As long as only one drive fails at a time, users of RAID arrays can rest assured that the smoke coming out of their array is not taking their data with it.
But what if more than one drive fails? RAID works by combining blocks into larger stripes and associating checksums with those stripes. Updating a block requires rewriting the stripe containing it and the associated checksum block. So, if writing a block can cause the array to lose the entire stripe, we could see data loss much like that which can happen with a flash drive. As a normal rule, this kind of loss will not occur with a RAID array. But it can happen if (1) one drive has already failed, causing the array to run in "degraded" mode, and (2) a second failure occurs (Pavel pulls the power cord, say) while the write is happening.
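The arithmetic behind that failure mode is ordinary XOR parity, as in this small sketch (the stripe layout and block values are made up). With all drives present, stale parity can simply be recomputed after a crash; in degraded mode there is nothing left to recompute it from.

```python
# RAID5-style stripe: parity is the XOR of the data blocks, and a lost
# block is recovered by XORing the survivors.
from functools import reduce

def xor(*blocks):
    """XOR an arbitrary number of equal-length byte strings."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks, [0] * len(blocks[0])))

d0, d1, d2 = b"\x01" * 4, b"\x02" * 4, b"\x04" * 4
parity = xor(d0, d1, d2)                 # written alongside the data

# Healthy array, drive holding d2 dies later: reconstruct from the rest.
assert xor(d0, d1, parity) == d2

# Degraded array (d2 already missing): updating d0 means rewriting both
# d0 and parity. If power fails after d0 hits the disk but before the
# parity does, the stripe is inconsistent and d2 cannot be recomputed.
new_d0 = b"\x08" * 4
reconstructed = xor(new_d0, d1, parity)  # new data, stale parity
assert reconstructed != d2               # the untouched block d2 is gone
```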
Pavel concluded from this scenario that RAID devices may actually be more dangerous than storing data on a single disk; he started a whole separate subthread (under the subject "raid is dangerous but that's secret") to that effect. This claim caused a fair amount of concern on the list; many felt that it would push users to forgo technologies like RAID in favor of single, non-redundant drive configurations. Users who do that will avoid the possibility of data loss resulting from a specific, unlikely double failure, but at the cost of rendering themselves entirely vulnerable to a much more likely single failure. The end result would be a lot more data lost.
The real lessons from this discussion are fairly straightforward:
- Treat flash drives with care: do not expect them to be more reliable
  than they are, and do not remove them from the system until all writes
  are complete (a sketch of forcing writes out appears after this list).
- RAID arrays can increase data reliability, but an array which is not
running with its full complement of working, populated drives has lost
the redundancy which provides that reliability. If the consequences
of a second failure would be too severe, one should avoid writing to
arrays running in degraded mode.
- As Ric Wheeler pointed out, the
easiest way to lose data on a Linux system is to run the disks with
their write cache enabled. This is especially true on RAID5/6
systems, where write barriers are still not properly supported. There
has been some talk of
disabling drive write caches and enabling barriers by default, but no
patches have been posted yet.
- There is no substitute for good backups. Your editor would add that any backups which have not been checked recently have a strong chance of not being good backups.
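On the first point above, an application can at least push its own data out before the drive is unplugged. The following sketch uses a hypothetical mount point; note that fsync() only flushes what the kernel is holding and cannot help with a device that acknowledges writes it has not made durable.

```python
import os

def write_durably(path, data):
    """Write data to path and force it, and its directory entry, to the
    device before returning. fsync() covers only what the kernel knows
    about; it cannot fix a drive that lies about write completion."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                     # flush file data and inode
    finally:
        os.close(fd)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)                  # make the directory entry durable
    finally:
        os.close(dirfd)

# Hypothetical usage, before unplugging the stick:
# write_durably("/media/usbstick/important.txt", b"data that must survive\n")
```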
How this information will be reflected in the kernel documentation remains
to be seen. Some of it seems like the sort of system administration
information which is not normally considered appropriate for inclusion in
the documentation of the kernel itself. But there is value in knowing what
assumptions one's filesystems are built on and what the possible failure
modes are. A better understanding of how we can lose data can only help us
to keep that from actually happening.
| Index entries for this article | |
|---|---|
| Kernel | Data integrity |
| Kernel | Filesystems/ext3 |
| Kernel | RAID |
Posted Aug 31, 2009 21:32 UTC (Mon)
by chrish (guest, #351)
[Link] (3 responses)
Posted Sep 1, 2009 6:05 UTC (Tue)
by k8to (guest, #15413)
[Link] (2 responses)
Posted Sep 1, 2009 17:33 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
Date: Wed, 26 Aug 2009 09:28:50 +1000
From: Neil Brown <neilb@suse.de>
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday August 24, greg.freemyer@gmail.com wrote:
> > +Don't damage the old data on a failed write (ATOMIC-WRITES)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > + Because RAM tends to fail faster than rest of system during
> > + powerfail, special hw killing DMA transfers may be necessary;
> > + otherwise, disks may write garbage during powerfail.
> > + This may be quite common on generic PC machines.
> > +
> > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > + because it needs to write both changed data, and parity, to
> > + different disks. (But it will only really show up in degraded mode).
> > + UPS for RAID array should help.
>
> Can someone clarify if this is true in raid-6 with just a single disk
> failure? I don't see why it would be.

It does affect raid6 with a single drive missing.

After an unclean shutdown you cannot trust any Parity block as it
is possible that some of the blocks in the stripe have been updated,
but others have not. So you must assume that all parity blocks are
wrong and update them. If you have a missing disk you cannot do that.

To take a more concrete example, imagine a 5 device RAID6 with
3 data blocks D0 D1 D2 as well a P and Q on some stripe.
Suppose that we crashed while updating D0, which would have involved
writing out D0, P and Q.
On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3
of D0, P and Q have been updated and the others not.

We can try to recompute D2 from D0 D1 and P, from
D0 P and Q or from D1, P and Q.
We could conceivably try each of those and if they all produce the
same result we might be confident of it.
If two produced the same result and the other was different we could
use a voting process to choose the 'best'. And in this particular
case I think that would work. If 0 or 3 had been updates, all would
be the same. If only 1 was updated, then the combinations that
exclude it will match. If 2 were updated, then the combinations that
exclude the non-updated block will match.

But if both D0 and D1 were being updated I think there would be too
many combinations and it would be very possibly that all three
computed values for D2 would be different.

So yes: a singly degraded RAID6 cannot promise no data corruption
after an unclean shutdown. That is why "mdadm" will not assemble such
an array unless you use "--force" to acknowledge that there has been a
problem.

NeilBrown
Posted Sep 4, 2009 23:32 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
But you don't have to build a RAID6 array that way. Ones I've looked at use a journaling scheme to provide atomic parity update. No matter where you get interrupted in the middle of updating a stripe, you can always get back the pre-update parity-consistent stripe (minus whatever 1 or 2 components that might have died at the same time).
I suspect Linux 'md' doesn't have the resources to do this feasibly, but a SCSI RAID6 unit probably would. I don't expect there's much market for the additional component loss protection of RAID6 without getting the interrupted write protection too.
Posted Aug 31, 2009 21:33 UTC (Mon)
by yohahn (guest, #4107)
[Link] (6 responses)
I can imagine much script writing, but if it is a general need, shouldn't there be a general method?
(an even more fun question would be, "Will applications fail in a reasonable fashion if they suddenly have their backing store mounted read-only".)
Posted Sep 1, 2009 11:54 UTC (Tue)
by Klavs (guest, #10563)
[Link] (5 responses)
Will remount the device mounted on / - read-only.
Posted Sep 3, 2009 15:37 UTC (Thu)
by yohahn (guest, #4107)
[Link] (4 responses)
Posted Sep 3, 2009 21:42 UTC (Thu)
by Cato (guest, #7643)
[Link] (1 responses)
Sending notifications of serious system events like this would be very helpful, with a bundle of standard event filters that can easily generate an email or other alert.
Posted Sep 11, 2009 17:59 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Posted Sep 4, 2009 8:22 UTC (Fri)
by njs (subscriber, #40338)
[Link]
Posted Sep 4, 2009 23:39 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
I think this would be only marginally better than shutting the whole system down. In many ways it would be worse, since users have ways to deal with a missing system but not a system acting strangely.
Posted Aug 31, 2009 21:49 UTC (Mon)
by me@jasonclinton.com (subscriber, #52701)
[Link] (25 responses)
Posted Aug 31, 2009 22:31 UTC (Mon)
by proski (subscriber, #104)
[Link] (12 responses)
Posted Aug 31, 2009 22:34 UTC (Mon)
by me@jasonclinton.com (subscriber, #52701)
[Link] (11 responses)
Posted Aug 31, 2009 22:45 UTC (Mon)
by pizza (subscriber, #46)
[Link] (10 responses)
Granted, that "firmware" may be in the form of mask ROM, but I know of at least one case where a CF card had a firmware update released for it.
SD and MS are a lot simpler, but even they require something to translate the SD/MS wire protocols into flash read/write ops.
Posted Aug 31, 2009 22:52 UTC (Mon)
by me@jasonclinton.com (subscriber, #52701)
[Link] (9 responses)
Posted Aug 31, 2009 23:27 UTC (Mon)
by drag (guest, #31333)
[Link] (8 responses)
They all are 'smart devices'.
If it was not for the firmware MTD-to-Block translation then you could not use them in Windows and they could not be formatted Fat32.
When I have dealt with Flash in the past, the raw flash type, the flash just appears as a memory region. Like I have this old i386 board I am dealing with that has its flash just starting at 0x80000 and it goes on for about eight megs or so.
That's it. That's all the hardware does for you. You have to know how to communicate with it and its underlying structure and know the proper way to write to it and everything. All that has to be done in software.
I suppose most of that is rather old fashioned.. the flash was soldered directly onto the traces on the board.
I can imagine it would be quite difficult and would require new hardware protocols to allow an OS to manage flash directly and properly over something like SATA or USB.
But fundamentally MTD are quite a bit different from Block devices. It's a different class of I/O completely. Just like how a character device like a mouse or a keyboard can't be written to with Fat32. You can fake MTD by running a Block-to-MTD layer on SD flash or a file or anything else and some people think that helps with wear leveling, but I think that is foolish and may actually end up being self-defeating as you have no idea how the algorithms in the firmware work.
Posted Aug 31, 2009 23:59 UTC (Mon)
by BenHutchings (subscriber, #37955)
[Link]
Not all. SmartMedia, xD and Memory Stick variants provide a raw flash interface - that's a major reason why they have had to be revised repeatedly to allow for higher-capacity chips. They rely on an external controller to do write-buffering, and do not support any wear-leveling layer. It is possible for a flash controller to map NOR flash into memory since it is random-access for reading. However, large flash chips are all NAND flash which only supports block reads.
Posted Sep 1, 2009 0:00 UTC (Tue)
by me@jasonclinton.com (subscriber, #52701)
[Link] (6 responses)
Posted Sep 1, 2009 0:52 UTC (Tue)
by drag (guest, #31333)
[Link] (3 responses)
Remember that SD stands for 'Secure Digital' and is DRM'd. So there has to be some smarts in it to do that.
Posted Sep 1, 2009 6:22 UTC (Tue)
by Los__D (guest, #15263)
[Link] (2 responses)
(Still doesn't change the point, though. SDs are probably designed with internal firmware)
Posted Sep 1, 2009 9:36 UTC (Tue)
by alonz (subscriber, #815)
[Link] (1 responses)
As for firmware, the SD card interface (available for free at www.sdcard.org) defines accesses in terms of 512-byte “logical” sectors, practically mandating the card to implement a flash translation layer.
Posted Sep 1, 2009 12:49 UTC (Tue)
by Los__D (guest, #15263)
[Link]
I read "devices" as the SD cards themselves.
Posted Sep 1, 2009 16:57 UTC (Tue)
by Baylink (guest, #755)
[Link]
Posted Sep 1, 2009 17:37 UTC (Tue)
by iabervon (subscriber, #722)
[Link]
Posted Sep 1, 2009 3:16 UTC (Tue)
by zlynx (guest, #2285)
[Link] (2 responses)
From what I gather, MD does not use write-intent logging by default, and when it is enabled it is very inefficient. Probably because it doesn't spread the write intent logs around the disks. Also, MD does not detect an unclean shutdown, so it does not start a RAID scrub and go into read+verify mode. And all that is a problem even when the array isn't degraded.
And of course it doesn't have a battery backup. :)
All that said, Linux MD hasn't given me any problems, and I prefer it over most cheap hardware RAID.
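The idea behind write-intent logging is simple enough to sketch; the model below is purely conceptual and is not a description of how md's bitmap is actually laid out or persisted.

```python
# Conceptual write-intent bitmap: a region is marked dirty (and the mark
# made durable) before its data is written, and cleared lazily afterward.
# After a crash, only regions still marked dirty need their parity checked.
dirty = set()

def begin_write(region):
    dirty.add(region)        # would be flushed to the bitmap area first

def end_write(region):
    dirty.discard(region)    # cleared some time after the write completes

def regions_to_resync():
    """Regions to re-check after an unclean shutdown; without a bitmap,
    every stripe in the array would have to be checked instead."""
    return sorted(dirty)

begin_write(7)
# ... unclean shutdown happens here ...
print(regions_to_resync())   # [7]
```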
Posted Sep 1, 2009 5:14 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (1 responses)
And MD most certainly does detect an unclean shutdown and will validate all parity blocks on restart.
But you are right that it doesn't have battery backup. If fast NVRAM were available on commodity server hardware, I suspect we would get support for it in md/raid5 in fairly short order.
Posted Sep 1, 2009 15:30 UTC (Tue)
by zlynx (guest, #2285)
[Link]
It may start a background verify, although it seemed to me that was dependent on what the distro's startup scripts did...
Posted Sep 2, 2009 0:09 UTC (Wed)
by Richard_J_Neill (subscriber, #23093)
[Link] (8 responses)
Yes...but it's only good for a few hours. So if your power outage lasts more than that, then the BBWC (battery backed write cache) is still toast.
On a related note, I've just bought a pair of HP servers (DL380s) and an IBM X3550. It's very annoying that there is no way to buy either of these without hardware raid, nor can the raid card be turned off in the BIOS. For proper reliability, I only really trust software (md) raid in Raid 1 mode (with write caching off). [Aside: this kills the performance for database workloads (fdatasync), though the Intel X25-E SSDs outperform 10k SAS drives by a factor of about 12.]
Posted Sep 2, 2009 0:13 UTC (Wed)
by dlang (guest, #313)
[Link] (7 responses)
Posted Sep 4, 2009 10:32 UTC (Fri)
by nix (subscriber, #2304)
[Link] (6 responses)
Posted Sep 5, 2009 0:01 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (5 responses)
The battery in a RAID adapter card only barely addresses the issue; I don't care how long it lives.
But the comment also addressed "RAID enclosures," which I take to mean storage servers that use RAID technology. Those, if they are at all serious, have batteries that power the disk drives as well, and only for a few seconds -- long enough to finish the write. It's not about backup power, it's about a system in which data is always consistent and persistent, even if someone pulled a power cord at some point.
Posted Sep 5, 2009 0:31 UTC (Sat)
by dlang (guest, #313)
[Link]
it's possible that they have heavy duty power supplies that keep power up for a fraction of a second after the power fail signal goes out to the drives, but they definitely do not keep the drives spinning long enough to flush their caches
Posted Sep 7, 2009 22:47 UTC (Mon)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Sep 7, 2009 23:18 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (2 responses)
No one said anything about pulling a disk. I did mention pulling a power cord. I meant the power cord that supplies the RAID enclosure (storage server).
A RAID enclosure with a battery inside that powers only the memory can keep the data consistent in the face of a power cord pull, but fails the persistence test, because the battery eventually dies. I think when people think persistent, they think indefinite. High end storage servers do in fact let you pull the power cord and not plug it in again for years and still be able to read back all the data that was completely written to the server before the pull. Some do it by powering disk drives (not necessarily the ones that normally hold the data) for a few seconds on battery.
Also, I think some people expect of persistence that you can take the machine, once powered down, apart and put it back together and the data will still be there. Battery backed memory probably fails that test.
Posted Sep 8, 2009 4:56 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
Posted Sep 8, 2009 6:25 UTC (Tue)
by giraffedata (guest, #1954)
[Link]
Now that you mention it, I do remember that earlier IBM Sharks had nonvolatile storage based on a battery. Current ones don't, though. The battery's only job is to allow the machine to dump critical memory contents to disk drives after a loss of external power. I think that's the trend, but I haven't kept up on what EMC, Hitachi, etc. are doing. IBM's other high end storage server, the former XIV Nextra, is the same.
Posted Aug 31, 2009 23:22 UTC (Mon)
by pr1268 (guest, #24648)
[Link] (1 responses)
I can't imagine that these issues (especially the flash-based disk problems Pavel experienced) are unique to Ext3. Microsoft's NTFS and Mac OS X's HFS+ are both journaling file systems that perform transactional commits, aren't they?
Posted Sep 2, 2009 7:01 UTC (Wed)
by Cato (guest, #7643)
[Link]
There was a 'journal checksum patch' for ext3 by the authors, I think, but I can't track it down.
Not sure about NTFS, but HFS+ does seem to have journal checksums - see http://developer.apple.com/mac/library/technotes/tn/tn115...
Posted Sep 1, 2009 0:41 UTC (Tue)
by ncm (guest, #165)
[Link] (13 responses)
Pavel's complaint is a consequence of this failure.
I note in passing that there is no need, in general, for a file system that journals data as well as metadata to write the data twice. That's just a feature of common, naive designs. Journaling data along with metadata, howsoever sophisticated, doesn't protect against power drops either. Nothing does. Use battery backup if you care. A few seconds is enough if you can shut down the host fast enough.
Posted Sep 1, 2009 6:40 UTC (Tue)
by IkeTo (subscriber, #2122)
[Link] (11 responses)
Posted Sep 1, 2009 7:58 UTC (Tue)
by ncm (guest, #165)
[Link] (10 responses)
Some drives only report blocks written to the platter after they really have been, but that's bad for benchmarks, so most drives fake it, particularly when they detect benchmark-like behavior. Everyone serious about reliability uses battery backup, so whoever's left isn't serious, and (they reason) deserve what they get, because they're not paying. Building in better reliability manifestly doesn't improve sales or margins.
If you pay twice as much for a drive, you might get better behavior. Or you might only pay more.
If you provide a few seconds' battery backup for the drive but not the host, then the blocks in the buffer that the drive said were on the disk get a chance to actually get there.
Posted Sep 1, 2009 17:09 UTC (Tue)
by Baylink (guest, #755)
[Link]
{{citation-needed}}
> If you pay twice as much for a drive, you might get better behavior. Or you might only pay more.
I generally find the difference per GB to be 6:1 going from even enterprise SATA drives to Enterprise SCSI (U-160 or faster, 10K or faster). My experience is that I get what I pay for, YMMV.
Posted Sep 1, 2009 17:20 UTC (Tue)
by markusle (guest, #55459)
[Link] (1 responses)
> Some drives only report blocks written to the platter after they really
> have been, but that's bad for benchmarks, so most drives fake it,
> particularly when they detect benchmark-like behavior.

I'd be very interested in some additional references or a list of drives that do or don't do this.
Posted Sep 1, 2009 17:44 UTC (Tue)
by ncm (guest, #165)
[Link]
Start by looking at very, very expensive, slow drives. Then forget about them. Instead, rely on redundancy and battery backup. There are lots of companies that aggregate cheap disks, batteries, cache, and power in a nice box, and each charges what they can get for it. Some work well, others less so. Disk arrays work like insurance: spread the risk, and cover for failures. Where they inadvertently concentrate risk, you get it all.

The storage industry is as mature as any part of the computer business. It is arranged such as to allow you to spend as much money as you like, and can happily absorb as much as you throw at it. If you know what you're doing, you can get full value for your money. If you don't know what you're doing, you can spend just as much and get little more value than the raw disks in the box. There is no substitute for competence.
Posted Sep 1, 2009 23:28 UTC (Tue)
by dododge (guest, #2870)
[Link]
The old DeskStar drive manual (circa 2002) explicitly stated that power loss in the middle of a write could lead to partially-written sectors, which would trigger a hard error if you tried to read them later on. According to an LKML discussion back then, the sectors would stay in this condition indefinitely and would not be remapped; so the drive would continue to throw hard errors until you manually ran a repair tool to find and fix them.
Posted Sep 5, 2009 0:10 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
But then you also get the garbage that the host writes in its death throes (e.g. update of a random sector) while the drive is still up.
To really solve the problem, you need much more sophisticated shutdown sequencing.
Posted Sep 8, 2009 20:54 UTC (Tue)
by anton (subscriber, #25547)
[Link] (2 responses)
Posted Sep 10, 2009 20:58 UTC (Thu)
by Cato (guest, #7643)
[Link] (1 responses)
UPSs are useful at least to regulate the voltage and cover against momentary power cuts, which are very frequent where I live, and far more frequent than UPS failures in my experience.
Posted Sep 10, 2009 21:34 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Posted Sep 10, 2009 9:00 UTC (Thu)
by hensema (guest, #980)
[Link] (1 responses)
Which is no problem. The CRC for the sector will be incorrect, which will be reported to the host adapter. The host adapter will then reconstruct the data and write back the correct sector.
Of course you do need RAID for this.
Posted Sep 10, 2009 20:52 UTC (Thu)
by Cato (guest, #7643)
[Link]
Without RAID, the operating system will have no idea the sector is corrupt - this is why I like ZFS's block checksumming, as you can get a list of files with corrupt blocks in order to restore from backup.
Posted Sep 1, 2009 16:10 UTC (Tue)
by davecb (subscriber, #1574)
[Link]
This, by the way, is an oversimplified description of ZFS (;-))

--dave
Posted Sep 1, 2009 1:35 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link] (2 responses)
It's right in the definition of RAID5: you cannot expect to maintain data integrity after a double-fault. The filesystem used on top of it is irrelevant.
Posted Sep 1, 2009 3:22 UTC (Tue)
by zlynx (guest, #2285)
[Link]
Most people, myself included, expect a degraded RAID array to fail **only if another drive fails**. I do NOT expect to lose 256KB of data because a single 4KB write failed.
And in fact, the array didn't lose it. It just can't tell which 4KB went bad, which it could if MD did good write-intent logging.
Posted Sep 1, 2009 7:51 UTC (Tue)
by job (guest, #670)
[Link]
Posted Sep 1, 2009 4:09 UTC (Tue)
by sureshb (guest, #25018)
[Link] (1 responses)
Posted Sep 1, 2009 7:45 UTC (Tue)
by job (guest, #670)
[Link]
Posted Sep 1, 2009 5:16 UTC (Tue)
by k8to (guest, #15413)
[Link] (18 responses)
RAID 6 is slightly less bad. If you want to avoid problems with crashes and outages, you should have multiple hot standbys. If you want performance you should use RAID 10.
Either way you should use a backup as your data loss reduction strategy.
Posted Sep 1, 2009 7:46 UTC (Tue)
by job (guest, #670)
[Link] (4 responses)
Posted Sep 1, 2009 8:05 UTC (Tue)
by drag (guest, #31333)
[Link] (1 responses)
With RAID 5 the amount of time it takes to recover is so long nowadays that the chances of having a double fault are pretty high. It was one thing to have 20GB with 30MB/s performance, but it's quite another to have 1000GB with 50MB/s performance...
Posted Sep 11, 2009 1:18 UTC (Fri)
by Pc5Y9sbv (guest, #41328)
[Link]
My cheap MD RAID5 with three 500 GB SATA drives allows me to have 1TB and approximately 100 MB/s per drive throughput, which implies a full scan to re-add a replacement drive might take 2 hours or so (reading all 500 GB from 2 drives and writing 500 GB to the third at 75% of full speed). I have never been in a position where this I/O time was worrisome as far as a double fault hazard. Having a commodity box running degraded for several days until replacement parts are delivered is a more common consumer-level concern, which has not changed with drive sizes.
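The rebuild-time estimate is easy to reproduce; the capacity, throughput, and 75%-efficiency figures below are the commenter's own assumptions, not measurements.

```python
# Back-of-the-envelope rebuild time for the 3-drive RAID5 described above.
capacity_gb   = 500          # data to rewrite onto the replacement drive
per_drive_mbs = 100          # sequential throughput per drive
efficiency    = 0.75         # assume the rebuild runs at 75% of full speed

seconds = capacity_gb * 1000 / (per_drive_mbs * efficiency)
print(f"rebuild takes about {seconds / 3600:.1f} hours")   # ~1.9 hours
```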
Posted Sep 3, 2009 5:05 UTC (Thu)
by k8to (guest, #15413)
[Link] (1 responses)
Meanwhile, you also get vastly better performance, and higher reliability of implementation.
It's really a no brainer unless you're poor.
Posted Sep 3, 2009 5:26 UTC (Thu)
by dlang (guest, #313)
[Link]
in digging further I discovered that the key to performance was to have enough queries in flight to keep all disk heads fully occupied (one outstanding query per drive spindle), and you can do this with both raid 6 and raid 10.
Posted Sep 1, 2009 8:11 UTC (Tue)
by drag (guest, #31333)
[Link] (11 responses)
RAID = availability/performance
BACKUPS = data protection.

Any other way of looking at it is pretty much doomed to be flawed.
Posted Sep 1, 2009 15:43 UTC (Tue)
by Cato (guest, #7643)
[Link] (1 responses)
Posted Sep 1, 2009 16:05 UTC (Tue)
by jonabbey (guest, #2736)
[Link]
Posted Sep 1, 2009 16:47 UTC (Tue)
by martinfick (subscriber, #4455)
[Link] (7 responses)
Backups are good for certain limited chores such as backing up your version control system! :) But ONLY if you have a mechanism to verify the sanity of your previous backup and the original before making the next backup. Else, you are back to backing up corrupted data.
A good version control system protects you from corruption and accidental deletion since you can always go to an older version. And the backup system with checksums (often built into VCS) should protect the version control system.
If you don't have space for VCing your data you don't likely really have space for backing it up either, so do not accept this as an excuse to not vcs your data instead of backing it up.
Posted Sep 1, 2009 17:44 UTC (Tue)
by Cato (guest, #7643)
[Link]
rsnapshot is pretty good as a 'sort of' version control system for any type of file including binaries. It doesn't do any compression, just rsync plus hard links, but works very well within its design limits. It can backup filesystems including the hard links (use rsync -avH in the config file), and is focused on 'pull' backups i.e. backup server ssh's into the server to be backed up. It's used by some web hosting providers who back up tens of millions of files every few hours, with scans taking a surprisingly short time due to the efficiency of rsync. Generally rsnapshot is best if you have a lot of disk space available, and not much time to run the backups in.
rdiff-backup may be closer to what you are thinking of - unlike rsnapshot it only stores the deltas between versions of a file, and stores permissions etc as metadata (so you don't have to have root on the box being backed up to rsync arbitrary files). It's a bit slower than rsnapshot but a lot of people like it. It does include checksums which is a very attractive feature.
duplicity is somewhat like rsnapshot, but can also do encryption, so it's more suitable for backup to a system you don't control.
There are a lot of these tools around, based on Mike Rubel's original ideas, but these ones seem the most actively discussed.
For a non-rsync backup, dar is excellent but not widely mentioned - it includes per-block encryption and compression, and per-file checksums, and is generally much faster for recovery than tar, where you must read through the whole archive to recover.
rdiff-backup, like VCS tools, will have difficulty with files of 500 MB or more - it's been reported that such files don't get backed up, or are not delta'ed. Very large files that change frequently (databases, VM images, etc) are a problem for all these tools.
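What these rsync-based tools have in common is snapshots built from hard links: unchanged files are linked to the previous snapshot, changed files are copied. Here is a minimal local-only sketch of that idea, with made-up paths and none of the metadata handling, delta storage, or remote support the real tools provide.

```python
import os, shutil, filecmp

def snapshot(source, prev_snap, new_snap):
    """Copy changed files from source into new_snap; hard-link files
    that are identical to the copy already present in prev_snap."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            old = os.path.join(prev_snap, rel, name)
            new = os.path.join(new_snap, rel, name)
            if os.path.exists(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)        # unchanged: share the disk blocks
            else:
                shutil.copy2(src, new)   # changed or new: store a fresh copy

# Hypothetical usage:
# snapshot("/home/user/data", "/backups/2009-08-31", "/backups/2009-09-01")
```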
Posted Sep 1, 2009 17:55 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
there are lots of things that can happen to your computer (including your house burning down) that will destroy everything on it.
no matter how much protection you put into your storage system, you still need backups.
Posted Sep 1, 2009 18:05 UTC (Tue)
by martinfick (subscriber, #4455)
[Link]
Thus, locality is unrelated to whether you are using backups or version control. Yes, it is better to put it on another computer, or, at least another physical device. But, this is in no way an argument for using backups instead of version control.
Posted Sep 1, 2009 18:05 UTC (Tue)
by joey (guest, #328)
[Link] (1 responses)
I'd agree, but you may not have memory to VCS your data. Git, in particular, scales memory usage badly with large data files.
Posted Sep 1, 2009 18:16 UTC (Tue)
by martinfick (subscriber, #4455)
[Link]
Posted Sep 2, 2009 0:39 UTC (Wed)
by drag (guest, #31333)
[Link] (1 responses)
If you're using version control for backups then that is your backup. Your sentence does not really make a whole lot of sense and is nonsensical. There is no difference.

My favorite form of backup is to use Git to sync data on geographically disparate machines. But this is only suitable for text data. If I have to backup photographs then source code management systems are utter shit.

> Backups are horrible to recover from.

They are only horrible to recover with if the backup was done poorly. If you (or anybody else) does a shitty job of setting them up then it's your (or their) fault they are difficult.

Backing up is a concept.

Anyways it's much more horrible to recover data that has ceased to exist.

> Backups provide no sane automatable mechanism for pruning older data
> (backups) that doesn't suffer from the same corruption/accidental deletion
> problem that originals have, but worse, amplified since they don't even
> have a good mechanism for sanity checking (usage)! Backups tend to backup
> corrupted data without complaining.

You're doing it wrong.

The best form of backup is to do full backups to multiple places. Ideally they should be in a different region. You don't go back and prune data or clean them up. That's WRONG. Incremental backups are only useful to reduce the amount of data loss between full backups. A full copy of _EVERYTHING_ is a requirement. And you save it for as long as that data is valuable. Usually 5 years.

It depends on what you're doing but an ideal setup would be like this:
* On-site backups every weekend. Full backups. Stored for a few months.
* Incremental backups twice a day, and resets at the weekend with the full backup.
* Every month 2 full backups are stored for 2-3 years.
* Off-site backups once a month, stored for 5 years.
etc. etc.

That would probably be a good idea for most small/medium businesses.

If you're relying on a server or a single datacenter to store your data reliably then you're a fool. I don't give a shit on how high quality your server hardware is or file system or anything. A single fire, vandalism, hardware failure, disaster, sabotage, or any number of things can utterly destroy _everything_.
Posted Sep 3, 2009 7:51 UTC (Thu)
by Cato (guest, #7643)
[Link]
Posted Sep 3, 2009 5:06 UTC (Thu)
by k8to (guest, #15413)
[Link]
Posted Sep 4, 2009 10:38 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Furthermore, reliability is fine *if* you can be sure that once RAID parity computations have happened the stripe will always hit the disk, even if there is a power failure. With battery-backed RAID, this is going to be true (modulo RAID controller card failure or a failure of the drive you're writing to). Obviously if the array is sufficiently degraded reliability isn't going to be good anymore, but doesn't everyone know that?
Posted Sep 1, 2009 8:15 UTC (Tue)
by Cato (guest, #7643)
[Link] (22 responses)
My standard setup now is to:
1. Avoid LVM completely
2. Disable write caching on all hard drives using hdparm -W0 /dev/sdX.
3. Enable data=journal on ext3 (tune2fs -o journal_data /dev/sdX is the best way to ensure partitions are mounted with this option, including the root partition and when installed in another system, post-reinstall, etc).
The performance hit from these changes is trivial compared to the two days I spent rebuilding a PC where the root filesystem lost thousands of files and the backup filesystem was completely lost.
I suspect that the reason LVM is seen as reliable despite being the default for Fedora and RHEL/CentOS is that enterprise Linux deployments use hardware RAID cards with battery-backed cache, and perhaps higher quality drives that don't lie about write completion.
Linux is far worse at losing data with a default ext3 setup than I once thought it was, unfortunately. If correctly configured it's fine, but the average new Linux user has no way to know how to configure this. I can't recall losing data like this on Windows in the absence of a hardware problem.
Posted Sep 1, 2009 8:51 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (10 responses)
It is always nice to imagine that if you find the right combination of DIP switches, set the right values in the configuration file, choose the correct compiler flags, your problems will magically vanish.
But sometimes the real problem is just that your hardware is broken, and all the voodoo and contortions do nothing. You've changed _three_ random things about your setup based on, AFAICT, no evidence at all, and you think it's a magic formula. Until something bad happens again and you return in another LWN thread to tell us how awful ext3 is again...
Posted Sep 1, 2009 9:03 UTC (Tue)
by Cato (guest, #7643)
[Link] (9 responses)
I did base these changes mostly on the well known lack of journal checksumming in ext3 (going to data=journal and avoiding write caching) - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal. Dropping LVM is harder to justify, it's really just a hunch based on a number of reports of LVM being involved in data corruption, and on my own data point that the LVM volumes on one disk were completely inaccessible (i.e. corrupted LVM metadata) - hence it was not just ext3 involved here, though it might have been write caching as well.
I'm interested to hear responses that show these steps are unnecessary, of course.
I really doubt the hardware is broken: there are no disk I/O errors in any of the logs, there were 2 disks corrupted (1 SATA, 1 PATA), and there are no symptoms of memory errors (random application/system crashes).
Posted Sep 1, 2009 17:48 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
disabling write caches
this is mandatory unless you have battery backed cache to recover from failed writes. period, end of statement. if you don't do this you _will_ lose data when you lose power.
avoiding LVM
I have also run into 'interesting' things with LVM, and so I also avoid it. I see it as a solution in search of a problem for just about all users (just about all users would be just as happy, and have things work faster with less code involved, if they just used a single partition covering their entire drive.)
I suspect that some of the problem here is that ordering of things gets lost in the LVM layer, but that's just a guess.
data=journal,
this is not needed if the application is making proper use of fsync. if the application is not making proper use of fsync it's still not enough to make the data safe.
by the way, ext3 does do checksums on journal entries. the details of this were posted as part of the thread on linux-kernel.
Posted Sep 1, 2009 18:05 UTC (Tue)
by Cato (guest, #7643)
[Link] (3 responses)
Do you know roughly when ext3 checksums were added, or by whom, as this contradicts the Wikipedia page? Must be since 2007, based on http://archives.free.net.ph/message/20070519.014256.ac3a2.... I thought journal checksumming was only added to ext4 (see first para of http://lwn.net/Articles/284037/) not ext3.
This sort of corruption issue is one reason to have multiple partitions; parallel fscks are another. In fact, it would be good if Linux distros automatically scheduled a monthly fsck for every filesystem, even if journal-based.
Posted Sep 1, 2009 18:15 UTC (Tue)
by dlang (guest, #313)
[Link] (2 responses)
I'm not sure I believe that parallel fscks on partitions on the same drive do you much good. the limiting factor for speed is the throughput of the drive. do you really gain much from having it bounce around interleaving the different fsck processes?
as for protecting against this sort of corruption, I don't think it really matters.
for flash, the device doesn't know about your partitions, so it will happily map blocks from different partitions to the same eraseblock, which will then get trashed on a power failure, so partitions don't do you any good.
for a raid array it may limit corruption, but that depends on how your partition boundaries end up matching the stripe boundaries.
Posted Sep 1, 2009 18:44 UTC (Tue)
by Cato (guest, #7643)
[Link]
This Usenix paper mentions that JBD2 will ultimately be usable by other filesystems, so perhaps that's how ext3 does (or will) support this: http://www.usenix.org/publications/login/2007-06/openpdfs... - however, I don't think ext3 has journal checksums in (say) 2.6.24 kernels.
Posted Sep 2, 2009 6:36 UTC (Wed)
by Cato (guest, #7643)
[Link]
One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Tso as potentially causing write reordering and hence FS corruption here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...
Posted Sep 1, 2009 22:37 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
If this is true (and I don't doubt that it is), why on earth is it not the default? Shipping software with such an unsafe default setting is stupid. Most users have no ideas about these settings... surely they shouldn't be handed a delicious pizza smeared with nitroglycerin topping, and then be blamed when they bite into it and it explodes...
Posted Sep 1, 2009 22:41 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
the market has shown that people are willing to take this risk by driving all vendors that didn't make the change out of the marketplace
Posted Sep 3, 2009 7:58 UTC (Thu)
by Cato (guest, #7643)
[Link]
It would still be better if distros made this the default but I don't see much prospect of this.
One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on batteries. This is very dubious - the fsck might exhaust the battery, but the user might well prefer a while without use of their laptop due to no battery to a long time without use of their valuable data when the system gets corrupted later on...
Fortunately on my desktop with a UPS, on_ac_power returns 255 which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.
Posted Sep 3, 2009 23:18 UTC (Thu)
by landley (guest, #6789)
[Link]
People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (ala kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component, in the second kind of failure the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There's _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)
Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."
That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if physical block size grows bigger than filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.
I.E. a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle, and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.
Another point that was sort of glossed over is that journaling isn't exactly a culprit here, it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning you could rsync the damaged stuff over your backups if you don't notice.) If it's seldom-used data it may be long gone before you notice.
The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to.
I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.
Rob
Posted Sep 1, 2009 16:43 UTC (Tue)
by ncm (guest, #165)
[Link] (6 responses)
I don't buy that your experience suggests there's anything wrong with ext3, if you didn't protect against power drops. The same could happen with any file system. The more efficient the file system is, the more likely is corruption in that case -- although some inefficient file systems seem especially corruptible.
Posted Sep 1, 2009 17:31 UTC (Tue)
by Cato (guest, #7643)
[Link] (5 responses)
Since the rebuild, I have realised that the user of the PC has been turning it off via the power switch accidentally, which perhaps caused the write cache of the disk(s) to get corrupted and is a fairly severe test. Despite several sudden poweroffs due to this, with the new setup there has been no corruption yet. It seems unlikely that the writes would be pending in the disk's write cache for so long that they couldn't be written out while power was still up, but the fact is that both ext3 and LVM data structures got corrupted.
It's acknowledged that ext3's lack of journal checksumming can cause corruption when combined with disk write caching (whereas XFS does have such checksums I think). The only question is whether the time between power starting to drop and the power going completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted. Betting the data integrity of a remotely administered system on this time window is not something I want to do.
Posted Sep 1, 2009 18:06 UTC (Tue)
by ncm (guest, #165)
[Link]
That's easy: No. When power starts to drop, everything is gone at that moment. If the disk is writing at that moment, the unfinished sector gets corrupted, and maybe others. UPS for the computer and disk together helps only a little against corruption unless power drops are almost always shorter than the battery time, or it automatically shuts down the computer before getting used up. You may be better off if the computer loses power immediately, and only the disk gets the UPS treatment.
it really shouldn't be necessary to use a UPS just to avoid filesystem/LVM corruption.
Perhaps, but it is. (What is this "should"?) The file system doesn't matter near so much as you would like. They can be indefinitely bad, but can be no more than fairly good. The good news is that the UPS only needs to support the disk, and it only needs to keep power up for a few seconds; then many file systems are excellent, although the bad ones remain bad.
Posted Sep 9, 2009 20:35 UTC (Wed)
by BackSeat (guest, #1886)
[Link] (3 responses)
It may only be semantics, but it's unlikely that the lack of journal checksumming causes corruption, although it may make it difficult to detect. As for LVM, I've never seen the point. Just another layer of ones and zeros between the data and the processor. I never use it, and I'm very surprised some distros seem to use it by default.
Posted Sep 10, 2009 20:50 UTC (Thu)
by Cato (guest, #7643)
[Link] (2 responses)
In the absence of ext3 journal checksumming, and if there is a crash requiring replay of this journal block, horrible things will happen - presumably garbage is written to various places on disk from the 'journal' entry. One symptom may be log entries saying 'write beyond end of partition', which I've seen a few times with ext3 corruption and I think is a clear indicator of corrupt filesystem metadata.
This is one reason why JBD2 added journal checksumming for use with ext4 - I hope this also gets used by ext3. In my view, it would be a lot better to make that change to ext3 than to make data=writeback the default, which will speed up some workloads and most likely corrupt some additional data (though I guess not metadata).
Posted Sep 10, 2009 21:05 UTC (Thu)
by Cato (guest, #7643)
[Link] (1 responses)
Posted Sep 11, 2009 16:33 UTC (Fri)
by nix (subscriber, #2304)
[Link]
notice that 'hey, this doesn't look like a journal' once the record that spanned the block boundary is complete. But that's a bit late...

(this is all supposition from postmortems of shagged systems. Thankfully we no longer use hardware prone to this!)
Posted Sep 1, 2009 23:49 UTC (Tue)
by dododge (guest, #2870)
[Link] (3 responses)
I don't use EXT3 much, but from a quick googling it sounds like you have to explicitly turn on barrier support in fstab and it still won't warn you about the LVM issue until it actually tries to use one.
Posted Sep 2, 2009 7:18 UTC (Wed)
by Cato (guest, #7643)
[Link]
Posted Sep 3, 2009 7:18 UTC (Thu)
by job (guest, #670)
[Link] (1 responses)
Posted Sep 3, 2009 10:23 UTC (Thu)
by Cato (guest, #7643)
[Link]
All RAID enclosures and RAID host-adapters worth their salt have a BBU
(battery backup unit) option for exactly this purpose. Why the small block
write buffer on an SSD cannot be backed up by a suitably small zinc-air
battery is a mystery to me.
I imagine that at least the more expensive SSDs have something like that, or maybe they can finish the write if the external power is disconnected. But when it comes to flash cards, like those used in digital cameras, the cost difference would be prohibitive.
Isn't the ATA/MMC<->MTD translation done in the consumer "reader" that you
stick these devices in? CF is electrically compatible with ATA. That's not
even remotely the case with the electrical interfaces on either SD or MS.
That's not what Wikipedia says—they say few devices support CPRM. Which is more-or-less true—almost no devices in the western market use CPRM, while in Japan every single device does (it is required as part of i-Mode, which is mandated by DoCoMo).
Ah, I see, the point is that even if you turn off the power *and pull the
disk* halfway through a write, the disk state is still consistent? Yeah,
battery-backed cache alone obviously can't ensure that.
I don't know what 'high end storage servers' you are talking about; even the multi-million dollar arrays from EMC and IBM do not have the characteristics that you are claiming.
[Engineers at drive manufacturers] say they happily stop
writing halfway in the middle of a sector, and respond to power drop
only by parking the head.
The results from my experiments
on cutting power on disk drives are consistent with the theory
that the drives I tested complete the sector they write when the power
goes away. However, I have seen drives that corrupt sectors on
unusual power conditions; the manufacturers of these drives (IBM,
Maxtor) and their successors (Hitachi) went to my don't-buy list and
are still there.
Some drives only report blocks written to the platter
after they really have been, but that's bad for benchmarks, so most
drives fake it, particularly when they detect benchmark-like
behavior.
Write-back caching (reporting completion before the data hits the
platter) is normally enabled in PATA and also SATA drives (running
benchmarks or not), because without tagged commands (mostly absent in
PATA, and not universally supported for SATA) performance is very bad
otherwise. You can disable that with hdparm -W0. Or you
can ask for barriers (e.g., as an ext3 mount option), which should
give the same consistency guarantees at lower cost if the file system
is implemented properly; however, my trust in the proper
implementation in Linux is severely undermined by the statements that
some prominent kernel developers have made in recent months on file
systems.
Everyone serious about reliability uses battery
backup
Do you mean a UPS? So how does that help when the UPS fails? Yes, we
have had that (while power was alive), and we concluded that our power
grid is just as reliable as a UPS. One could protect against a
failing UPS with dual (redundant) power supplies and dual UPSs, but
that would probably double the cost of our servers. A better option
would be to have an OS that sets up the hardware for good reliability
(i.e., disable write caching if necessary) and works hard in the OS to
ensure data and metadata consistency. Unfortunately, it seems that
that OS is not Linux.
It depends on where you live. Here power outages are quite
infrequent, but mostly take so long that the UPS will run out of
power. So then the UPS only gives the opportunity for a clean
shutdown (and that opportunity was never realized by our sysadmin when
we had UPSs), and that is unnecessary if you have all of the
following: decent drives that complete the last sector on power
failure; a good file system; and a setup that gives the file system
what it needs to stay consistent (e.g., barriers or hdparm -W0). And
of course we have backups around if the worst comes to worst. And
while we don't have the ultimate trust in ext3 and the hard drives we
use, we have not yet needed the backups for that.
What's wrong with providing battery backup for your drives? If they have power for a little while after the last write request arrives, then write caching, re-ordering writes, and lying about what's already on the disk don't matter. You still need to do backups, of course, but you'll need to use them less often because your file system will be that much less likely to have been corrupted.