
Barriers and journaling filesystems

Posted May 21, 2008 18:17 UTC (Wed) by raven667 (subscriber, #5198)
Parent article: Barriers and journaling filesystems

This is all well and good, but I think it would be most effective if drives had their own
battery-backed write-back cache which can be pinned by the OS to the journal.  If the feature
can be made persistent, by committing the data to flash or something on drive shutdown, then
the data doesn't even need to be committed to the disk.  It seems to me that this kind of
design could remove the inherent penalties associated with journaling using proper write
barriers.  This kind of technology could be used for transactional databases as well, by
making the transaction record work at memory and interface speed rather than be limited by
the rotational speed of the drives.  This should reduce contention on the platters by removing
one of the more constant sources of activity.


Barriers and journaling filesystems

Posted May 21, 2008 18:42 UTC (Wed) by jwb (guest, #15467) [Link]

Or you could just use a CF device for your journal.

Barriers and journaling filesystems

Posted May 22, 2008 5:05 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

Except that flash has a limited number of rewrites before it's toast?

Barriers and journaling filesystems

Posted May 22, 2008 9:39 UTC (Thu) by ekj (guest, #1524) [Link]

Would you please stop whipping this long-dead horse?

Typical flash-memory today is rated for 1M writes. There is internal wear-leveling that
ensures that this is the number of writes to the ENTIRE module (i.e. it is impossible to
wear out flash faster by writing repeatedly to the same location).

So, even a SMALL 1GB flash-memory requires the writing of a minimum of 1000TB worth of data
before it'll start failing. (or another order of magnitude if it's a 1M flash-module)

To put this in perspective, if you are writing 24x365 to the module at a constant speed of
1MB/s, then you'll wear it out in 31 years. If you write constantly at 10MB/s, you wear it out
in 3 years.

For most uses, this simply isn't a concern. Most computers, even file-servers, don't write
constantly to disk 24x365, and even for those that do, the journal is metadata-only by default,
so only writes that change the filesystem-structure generate load on the journal at ALL.

For those extremely rare servers that DO a gargantuan amount of writes that change the
filesystem-structure, there's a simple cure: Buy a slightly LARGER flash-module.

Because of the wear-leveling a flash-module that is 4 times the size will sustain 4 times the
amount (measured in GB) of data written before the cells start reaching their limit.

A server that -MUST- journal more than a PETABYTE of data before the end of its lifespan can
also afford to buy 16GB or 64GB of flash-storage rather than a paltry 1GB.
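
The arithmetic in this comment can be checked with a short sketch. It assumes ideal wear leveling across the whole module (which later replies dispute), decimal units (1 GB = 10^9 bytes), and the 1M-cycle rating quoted above; the function names are made up for illustration.

```python
# Back-of-the-envelope flash endurance under IDEAL whole-module wear leveling.
# Assumption: every cell receives an equal share of all writes.

SECONDS_PER_YEAR = 365 * 24 * 3600
MB = 10**6
GB = 10**9


def total_writable_bytes(capacity_bytes, rated_cycles):
    """Bytes that can be written before cells reach their rated cycle count."""
    return capacity_bytes * rated_cycles


def lifetime_years(capacity_bytes, rated_cycles, write_rate_bytes_per_s):
    """Years of continuous writing at a constant rate before the rated limit."""
    seconds = total_writable_bytes(capacity_bytes, rated_cycles) / write_rate_bytes_per_s
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for rate in (1 * MB, 10 * MB):
        years = lifetime_years(1 * GB, 1_000_000, rate)
        print(f"1 GB module at {rate // MB} MB/s: {years:.1f} years")
    # A module four times the size lasts four times as long at the same rate.
    print(f"4 GB module at 1 MB/s: {lifetime_years(4 * GB, 1_000_000, 1 * MB):.1f} years")
```

With a 100k or 10k cycle rating instead, the same functions give lifetimes 10 or 100 times shorter.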

Barriers and journaling filesystems

Posted May 22, 2008 11:51 UTC (Thu) by SimonKagstrom (subscriber, #49801) [Link]

You are right, but it's also important to look up the specs of the CompactFlash card before buying it. How they actually do wear-levelling differs, and I've heard of brands which perform wear-levelling only within regions of the disk - not over the entire disk.

In such cases, you can still wear out the flash in "reasonable" times. There are also some brands (I know of SiliconSystems) which allow reading CF-internal wear-levelling data (spare blocks, number of erases etc), although I'm not sure if there is any non-proprietary software to actually use this.

// Simon

Barriers and journaling filesystems

Posted May 22, 2008 12:44 UTC (Thu) by joern (subscriber, #22392) [Link]

> Typical flash-memory today is rated for 1M writes. There is internal wear-leveling that
ensures that this is the number of writes to the ENTIRE module (i.e. it is impossible to
wear out flash faster by writing repeatedly to the same location).

This happens to be wrong on both accounts.  Typical flash-memory today is rated at either 10k
or 100k - the lower number being for MLC flashes, which are cheaper and therefore can be
expected in your average cheap medium from Fry's.

Far worse, the normal wear leveling scheme does _not_ cover the complete device.  It covers
some part of it, typically 16M or so.  The next part is also wear-leveled in itself, but not
wrt. any other part of the device.  Therefore having a really hot area like a 32M journal is
comparable to disabling the wear leveling for the device completely.  After 10k journal wraps
you're depending on pure luck.

The horse may be locked away in a broken shed, but it's still kicking.

[ To be fair, some expensive devices are far far better.  Some expensive devices are just that
- more expensive.  So do your own QA to be sure. ]
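
To put a rough number on "10k journal wraps", here is a minimal sketch. It assumes the 32M journal exactly fills the wear-leveling regions it sits in, so each wrap erases every block in those regions about once, and it borrows the sustained 1 MB/s write rate from the earlier comment (a real journal usually sees far less). The function name is hypothetical.

```python
# Time until a hot area reaches its rated cycles when wear leveling is
# confined to the regions that area occupies.
# Assumption: one journal wrap costs roughly one erase per block in those regions.

MIB = 2**20
SECONDS_PER_DAY = 24 * 3600


def seconds_until_rated_limit(hot_area_bytes, rated_cycles, write_rate_bytes_per_s):
    """Seconds of continuous writing before the hot area has wrapped rated_cycles times."""
    seconds_per_wrap = hot_area_bytes / write_rate_bytes_per_s
    return seconds_per_wrap * rated_cycles


if __name__ == "__main__":
    secs = seconds_until_rated_limit(32 * MIB, 10_000, 1 * MIB)
    print(f"32M journal, 10k cycles, 1 MiB/s: {secs / SECONDS_PER_DAY:.1f} days")
```

Under those assumptions the rated limit arrives in days rather than decades, which is why the scope of the wear leveling matters more than the module's total size.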

Barriers and journaling filesystems

Posted May 22, 2008 15:13 UTC (Thu) by drag (subscriber, #31333) [Link]

> This happens to be wrong on both accounts.  Typical flash-memory today is rated at either
10k or 100k - the lower number being for MLC flashes, which are cheaper and therefore can be
expected in your average cheap medium from Fry's.


Well you wouldn't buy the cheapest thing you can find when you want to use it in a server,
right? So you make sure you get the 'high endurance' versions of the drives with SLC NAND
chips and make sure that you go through a cycle of swapping it out every year or so.

The thing is that people are actually using flash to help speed up disk access in their
datacenters.  This isn't the first time I've heard of people doing this sort of thing.

Personally I work with a lot of flash media. The cheaper stuff. I haven't been here long, but
I've talked to people that have been working here for 20 years. (of course they haven't been
using CF cards that long). Nobody has yet to see any sort of flash media failing that they can
remember. The actual physical connections (the plastic holes for the pins get malformed) get
all screwed up before any actual data ever gets corrupted.


Where do you get your numbers for the 16M wear leveling? Typically you're dealing with media
that is at least 512 megs and soon you'll have a hard time finding new stuff that is under 2
gigs. I am doubtful that only 16megs out of 2gigs is going to be wear leveled.

Barriers and journaling filesystems

Posted May 22, 2008 15:24 UTC (Thu) by joern (subscriber, #22392) [Link]

> Personally I work with a lot of flash media. The cheaper stuff. I haven't been here long,
but I've talked to people that have been working here for 20 years. (of course they haven't
been using CF cards that long). Nobody has yet to see any sort of flash media failing that
they can remember. The actual physical connections (the plastic holes for the pins get
malformed) get all screwed up before any actual data ever gets corrupted.

I know of reports.  Ext3 on CF seems to be fairly good at corrupting stuff, particularly data
that is stored close to the journal.  Whether the cards in question used SLC or MLC I don't
know.  The pesky thing about them is that vendors hardly ever publish information at all.

> Where do you get your numbers for the 16M wear leveling? Typically you're dealing with media
that is at least 512 megs and soon you'll have a hard time finding new stuff that is under 2
gigs. I am doubtful that only 16megs out of 2gigs is going to be wear leveled.

http://www.linuxconf.eu/2007/papers/Engel.pdf
Mainly based on the smartmedia spec and some reverse engineering.  I didn't do the latter
myself, though.

Barriers and journaling filesystems

Posted May 23, 2008 2:21 UTC (Fri) by drag (subscriber, #31333) [Link]

> Whether the cards in question used SLC or MLC I don't know.  The pesky thing about them is
that vendors hardly ever publish information at all.


Ya. There is only a handful of people that actually make flash media. Maybe four companies in
total, I forget. 

Everybody that sells that flash media to end users uses a hodgepodge of different sources for
different products. Cheaper folks will mix in different flashes for different  production runs
on the same product... We ran into this problem with Kingston shipping devices that had any
sizes from 470megs on up for their 512 meg products.

So I'd stay far away from vendors that don't publish details about their products for anything
serious. 

Barriers and journaling filesystems

Posted Jun 12, 2008 14:42 UTC (Thu) by salimma (subscriber, #34460) [Link]

Not commenting on whether the 16MB information is correct or not, but grandparent's point is
not that only 16MB gets write-leveled; it's that for the purpose of write-leveling, the drive
is treated as a series of 16MB blocks, each of which is write-leveled only within itself.

(The write-leveling circuitry would then be much simpler -- imagine a parallel series of, say,
8-bit adders, compared to a 64-bit adder made up of 8-bit adders)

Barriers and journaling filesystems

Posted May 22, 2008 22:30 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

> i.e. it is impossible to wear out flash faster by writing repeatedly to the same location)

Actually that's not true. Many (possibly most) large flash memory cards have the wear leveling done in sections, so it's possible to wear out one section before the others.

The vendors tend to be fairly secretive about the details of their wear-leveling algorithms.

Barriers and journaling filesystems

Posted May 23, 2008 2:29 UTC (Fri) by drag (subscriber, #31333) [Link]

What would be great would be to get flash manufacturers to, optionally, allow the OS to access
the flash media more directly as an MTD, which reflects the true nature of flash media.

This way Linux folks can take advantage of MTD-specific file systems that can handle wear
leveling in a very effective and open manner. (and probably get better performance, to boot)

(running MTD file systems on top of Block-to-MTD emulation in software on top of MTD-to-Block
emulation in hardware on top of MTD flash seems self-defeating..)

This way the 'industrial' flash people, who use flash for performance reasons on Unix
systems, can get the most benefit, while their Windows-using counterparts can continue to use
that stuff to speed up swap file access and application pre-caching in Vista using the block
emulation hardware.

wearing out Flash memory

Posted May 23, 2008 3:07 UTC (Fri) by sbishop (guest, #33061) [Link]

Part of the trouble is that people confuse Flash memory with devices implemented using Flash.
A location within a Flash memory chip, for example, will certainly wear out faster if it's
written to repeatedly.  The chips themselves do absolutely no wear leveling.  But, of course,
it would be insane to build a Flash-based device without built-in wear-leveling logic and CRC
checks, which may have been the reason for the "it is impossible to wear out flash faster by
writing repeatedly" comment.

By the way, I work for a memory manufacturer, and it's my job to do reliability testing on
this stuff.  My co-workers and I have all come to hate Flash.  It is expected that the chips
will wear out, and transient failures are okay.  The controllers are expected to deal with
these issues; it's the nature of the beast.  So what does "working" mean?!  Oh, and the state
machine of each one of these *#$%!@ things is different.  I miss DRAM...

Barriers and journaling filesystems

Posted May 29, 2008 18:22 UTC (Thu) by mcortese (guest, #52099) [Link]

> Or you could just use a CF device for your journal.

But then how do you guarantee the synchronization between the data written to the HD and the journal written to the flash? The whole issue here is to avoid any reordering that would spoil the journaling strategy. You are merely moving the problem from flushing within a device, to syncing two devices!
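
One way to picture the cross-device ordering problem is a host-side sketch in which the writer waits for the journal flush to complete before issuing the data write. This is only an illustration, not how any particular filesystem does it: the two paths stand in for the two devices, the `BEGIN`/`COMMIT` record format and the function name are invented, and `os.fsync` only helps if each device actually honours the flush, which is the very issue the article discusses.

```python
import os
import tempfile


def journaled_write(journal_path, data_path, payload):
    """Write payload with journal-before-data ordering enforced by waiting on flushes."""
    # 1. Record the intent on the journal device and wait until it is flushed.
    with open(journal_path, "ab") as journal:
        journal.write(b"BEGIN " + payload + b"\n")
        journal.flush()
        os.fsync(journal.fileno())
    # 2. Only after the journal flush returns, write the data to the other device.
    with open(data_path, "ab") as data:
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())
    # 3. Mark the transaction complete in the journal.
    with open(journal_path, "ab") as journal:
        journal.write(b"COMMIT\n")
        journal.flush()
        os.fsync(journal.fileno())


# Demonstration with two temporary files standing in for the two devices.
with tempfile.TemporaryDirectory() as tmp:
    journal_file = os.path.join(tmp, "journal.log")
    data_file = os.path.join(tmp, "data.bin")
    journaled_write(journal_file, data_file, b"hello")
    with open(journal_file, "rb") as f:
        journal_contents = f.read()
    with open(data_file, "rb") as f:
        data_contents = f.read()
```

The cost is that every step stalls on a full flush of one device, so the synchronization problem is moved to the host rather than removed.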

Barriers and journaling filesystems

Posted May 21, 2008 20:15 UTC (Wed) by evgeny (guest, #774) [Link]

> it would be I think most effective if drives had their own battery backed write-back cache

This would be as controversial as using battery modules for some RAID controllers. Many folks
don't like this idea - if we talk reliability, a UPS is a must. Then an extra battery could
only be a source of extra trouble.

Barriers and journaling filesystems

Posted May 21, 2008 22:28 UTC (Wed) by iabervon (subscriber, #722) [Link]

Having a UPS is great until the cat turns it off or the battery ages to the point where power
fluctuations cause it to reset or somebody turns off the computer's power switch or the power
supply burns out. Even if the outside power situation is well-protected, there's value to
having the drive store enough power to finish with its buffers and spin down carefully and
such.

Barriers and journaling filesystems

Posted May 22, 2008 18:42 UTC (Thu) by amikins (guest, #451) [Link]

> Having a UPS is great until the cat turns it off

I'm NOT the only one that's been plagued by this? I've since learned to make cat-proof covers
for my UPS buttons...

Barriers and journaling filesystems

Posted May 22, 2008 20:03 UTC (Thu) by v13 (subscriber, #42355) [Link]

Why would anyone not like the battery backed cache?

Have you ever considered the effects of not having to do a disk write 
when doing synchronous disk operations? Journals and databases perform a 
*lot* faster on BB controllers. (based on my experience).

Consider not having to do any safety-related writes to disk.
Even barriers have zero overhead.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds