
The bcachefs filesystem

Kent Overstreet, author of the bcache block caching layer, has announced that bcache has metamorphosed into a fully featured copy-on-write filesystem. "Well, years ago (going back to when I was still at Google), I and the other people working on bcache realized that what we were working on was, almost by accident, a good chunk of the functionality of a full blown filesystem - and there was a really clean and elegant design to be had there if we took it and ran with it. And a fast one - the main goal of bcachefs to match ext4 and xfs on performance and reliability, but with the features of btrfs/zfs."


The bcachefs filesystem

Posted Aug 21, 2015 17:05 UTC (Fri) by butlerm (subscriber, #13312) [Link] (84 responses)

I believe it is pretty much a physical impossibility for a copy on write filesystem to match the performance of a non-copy-on-write filesystem for hosting large database files and nested block devices on physical, non-solid-state media. Large files subject to random writes end up severely fragmented on the underlying medium and cannot be scanned with any efficiency any more.

The bcachefs filesystem

Posted Aug 21, 2015 19:59 UTC (Fri) by ms (subscriber, #41272) [Link] (66 responses)

Probably doesn't matter. It'll be 4 years before this becomes ready to trust (guessing) and by that point spinning rust will be a tiny fraction of the market. Sure, they'll need special case FSs but I'm certainly of the opinion that if you're starting to build a general purpose FS today you don't worry about spinning rust.

The bcachefs filesystem

Posted Aug 21, 2015 21:13 UTC (Fri) by fishface60 (subscriber, #88700) [Link] (58 responses)

I wouldn't be so sure; I suspect some form of spinning rust will survive in the form of SMR devices.
They might actually benefit from a copy-on-write filesystem, since it's cheaper to write a new version of the file than to go back and modify the old one, which would require rewriting all the overlapping data.

The bcachefs filesystem

Posted Aug 21, 2015 22:13 UTC (Fri) by pedrocr (guest, #57415) [Link] (57 responses)

>I suspect some form of spinning rust to survive in the form of SMR devices.

What makes you say that? The recent 16TB SSD drive announcement means that in terms of density SSDs have won. Will manufacturing silicon always be more expensive than manufacturing magnetic discs? Won't the economies of scale of building chips eventually make flash the overall winner?

The bcachefs filesystem

Posted Aug 22, 2015 4:45 UTC (Sat) by drag (guest, #31333) [Link] (17 responses)

It's difficult to argue with Moore's law on this one.

The bcachefs filesystem

Posted Aug 22, 2015 7:47 UTC (Sat) by pedrocr (guest, #57415) [Link] (16 responses)

Magnetic disk density has actually historically grown faster than Moore's law. It's called Kryder's law, apparently, and it has recently slowed down, so maybe Moore is catching up on the storage side:

https://en.wikipedia.org/wiki/Moore%27s_law#Other_formula...

The bcachefs filesystem

Posted Aug 22, 2015 16:04 UTC (Sat) by magila (guest, #49627) [Link] (6 responses)

Then again Moore's law isn't exactly going full steam ahead either. The latest 16/14nm lithography processes have been late to market relative to Moore's law and they are extremely expensive compared to the previous generation. The cost problem is only going to get worse as newer processes are forced to switch away from silicon to much more expensive substrates like gallium arsenide.

The bcachefs filesystem

Posted Aug 23, 2015 21:12 UTC (Sun) by jospoortvliet (guest, #33164) [Link]

There is of course progress with the new 3D NAND, and through-silicon vias allow stacking chips. The higher density and lower cost of those technologies might get flash even closer to spinning rust.

The bcachefs filesystem

Posted Aug 24, 2015 20:36 UTC (Mon) by rahvin (guest, #16953) [Link] (4 responses)

Intel believes they can get it down to less than 7nm, but I'm skeptical. The lithography at that point would need to be x-ray. I believe Moore's law has probably already failed, because the lithography is getting so much harder and more expensive. The near-UV at the bleeding edge is so expensive I'm surprised they can make up the margins on the equipment. I expect things will fail in stages: the 2-year cycle will begin to extend to 3 years, then 5, then 10 as the lithography gets more and more expensive. I can remember reading stories a few years ago about how hard UV lithography was going to be to get working.

The bcachefs filesystem

Posted Aug 24, 2015 21:16 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

It's still within the extreme UV range. After 7nm it really gets complicated, but even 5nm is probably possible.

The bcachefs filesystem

Posted Aug 25, 2015 23:51 UTC (Tue) by oshepherd (guest, #90163) [Link]

Today we are drawing 14nm features with 193nm light, and contemplating continuing with it for 10nm.

13.5nm EUV will cope with 5nm.

The bcachefs filesystem

Posted Aug 26, 2015 16:47 UTC (Wed) by rahvin (guest, #16953) [Link]

I'm skeptical, but I won't say it's not possible. I'm also skeptical that quantum effects won't make lithography this small impossible. Silicon atoms are 0.2nm wide; 5nm lithography is cutting circuits just 25 atoms wide. That's frankly incredible, but I have to wonder what bad quantum effects they are going to run into at that level. They already have trouble with quantum tunneling, and it's only going to get worse as things get smaller.

The bcachefs filesystem

Posted Aug 25, 2015 11:31 UTC (Tue) by mmechri (subscriber, #95694) [Link]

This, as a matter of fact, has already started with Haswell/Broadwell/Skylake.

The bcachefs filesystem

Posted Aug 27, 2015 8:58 UTC (Thu) by NRArnot (subscriber, #3033) [Link] (8 responses)

Hard disks using current production technology have run up against their physical limits. They can't pack magnetic domains closer together along a track without them becoming unstable. They can't make a narrower write head because of energy density considerations, so they cannot pack tracks closer (shingling is a partial, performance-damaging solution). They can't pack more platters in the same volume because of mechanical considerations (vibrational modes and reduced stiffness, mostly).

However, HAMR (and maybe BPM) offer solutions. HAMR is working in the labs. It ought to enable at least one order of magnitude increase in magnetic storage density.

Magnetic storage has one intrinsic advantage: a disk platter is merely an extremely homogeneous, smooth film of material (BPM changes this, which is why I'm doubtful about it compared to HAMR). Manufacturing is intrinsically far simpler than the complex material patterning of solid-state memory. Further, a magnetic disk is tolerant of defects in that film and can dynamically map out future defects should they develop.

I expect large magnetic disks to be with us for quite some years yet, though the disadvantage of relying on physical movement of heads means that solid-state has a huge advantage for many purposes. And there's the obvious question: if we could buy a 20TB drive for the price of a 2TB drive, would we buy them in the same volumes?

The bcachefs filesystem

Posted Aug 27, 2015 15:44 UTC (Thu) by martinfick (subscriber, #4455) [Link]

> Further, a magnetic disk is tolerant of defects on that film and can dynamically map out future defects should such develop.

This doesn't seem unique to magnetic disks.

The bcachefs filesystem

Posted Aug 28, 2015 17:31 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

there is nothing magic about the current physical sizes.

We've migrated to smaller platters because the density was climbing fast enough to have this work and smaller platters are easier to spin fast.

the smaller the platter, the worse the ratio of total drive footprint to usable sq in of recording space is.

The bcachefs filesystem

Posted Aug 30, 2015 19:28 UTC (Sun) by Wol (subscriber, #4433) [Link]

You mean the 8-inch drive might make a comeback :-)

But yes, many computers still have 5 1/4" bays to accommodate DVD drives, so bringing back hard drives that size makes sense. The first drives I bought were Bigfoot drives - cheap, big and slow, but very good value in £/MB.

Cheers,
Wol

The bcachefs filesystem

Posted Aug 30, 2015 19:24 UTC (Sun) by Wol (subscriber, #4433) [Link] (4 responses)

> And there's the obvious question: if we could buy a 20TB drive for the price of a 2TB drive, would we buy them in the same volumes?

In the past, we bought single drives because we couldn't afford to pay for unnecessary capacity.

My computer now has two mirrored drives, because I've got too much to trust to just one drive.

When I can afford it, I'm going RAID-6 for safety - pretty much a necessity in the near future as capacities outstrip reliability (reading a typical big drive today twice from stem to stern will - according to the specs - pretty much guarantee at least one read error). And yes I know drives typically perform "better than spec", but I wouldn't like to bank on it!
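
As a rough back-of-the-envelope check of that claim (a sketch only, assuming the common consumer-drive spec of one unrecoverable read error per 10^14 bits read and a hypothetical 4TB drive; real drives and workloads vary):

    # Expected unrecoverable read errors when reading a drive end to end,
    # assuming a spec of 1 error per 1e14 bits read (check your datasheet).
    drive_bytes = 4 * 10**12            # a 4 TB drive (assumed size)
    bits_per_full_read = drive_bytes * 8
    ure_rate = 1e-14                    # errors per bit read (spec-sheet figure)

    for passes in (1, 2):
        expected = passes * bits_per_full_read * ure_rate
        print(f"{passes} full read(s): ~{expected:.2f} expected read errors")
    # 1 full read(s): ~0.32 expected read errors
    # 2 full read(s): ~0.64 expected read errors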

Cheers,
Wol

The bcachefs filesystem

Posted Sep 2, 2015 22:15 UTC (Wed) by malor (guest, #2973) [Link] (3 responses)

>My computer now has two mirrored drives

Remember, RAID is not a backup. It's to prevent downtime from drive failure. It does nothing about controller problems, and it especially does nothing for fat-finger mistakes.

I would strongly suggest breaking that mirror, and then maintaining two separate filesystems, backing up from one to the other. If you're down a day or two because a drive croaked, you're probably not going to be hurt too bad, but if you rm -rf /home/userdir on a mirrored filesystem, you are *so* screwed.

Plus, assuming the drive isn't full, separate filesystems will let you do versioning, so you can have backups over time.
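
A minimal sketch of that kind of versioning, assuming two locally mounted filesystems and made-up paths: it hardlinks unchanged files against the newest previous snapshot, so each dated directory looks like a full tree but only changed files consume new space (the same idea rsync's --link-dest implements):

    import datetime, filecmp, os, shutil

    def snapshot(src, backup_root):
        """One dated snapshot per day; unchanged files are hardlinked to the
        newest previous snapshot, new or changed files are copied."""
        prior = sorted(os.listdir(backup_root)) if os.path.isdir(backup_root) else []
        prev_dir = os.path.join(backup_root, prior[-1]) if prior else None
        dest = os.path.join(backup_root, datetime.date.today().isoformat())
        for dirpath, _, filenames in os.walk(src):
            rel = os.path.relpath(dirpath, src)
            os.makedirs(os.path.join(dest, rel), exist_ok=True)
            for name in filenames:
                s = os.path.join(dirpath, name)
                d = os.path.join(dest, rel, name)
                p = os.path.join(prev_dir, rel, name) if prev_dir else None
                if p and os.path.exists(p) and filecmp.cmp(s, p, shallow=False):
                    os.link(p, d)        # unchanged: share the old copy
                else:
                    shutil.copy2(s, d)   # new or changed: store a fresh copy

    # hypothetical paths:
    snapshot("/home/user", "/mnt/backupdisk/snapshots")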

The bcachefs filesystem

Posted Sep 2, 2015 23:22 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (2 responses)

> Remember, RAID is not a backup. It's to prevent downtime from drive failure. It does nothing about controller problems, and it especially does nothing for fat-finger mistakes.

> I would strongly suggest breaking that mirror, and then maintaining two separate filesystems, backing up from one to the other. If you're down a day or two because a drive croaked, you're probably not going to be hurt too bad, but if you rm -rf /home/userdir on a mirrored filesystem, you are *so* screwed.

Hm. I've lost about ~10 disks as part of a raid on my workstations over the years. Not once did I have to stop work for more than a single reboot (disks used to have timeouts too long for them always to be sanely detected as dead without a reboot, and I'm still hesitant to hot-plug drives in consumer hardware). I've never lost data via an rm -rf or similar incident. Once a couple files through a kernel bug.

Lost a bit more data in my laptops where I don't do raids.

I *do* additionally have backups, but I pretty much never access them outside of testing whether they still work. A couple of times I looked in there for files I thought I might have deleted, only to find that I just had them stored in a different place than I remembered.

Actually just ordered replacement SSDs for my workstation a couple minutes ago.

The bcachefs filesystem

Posted Sep 3, 2015 1:38 UTC (Thu) by malor (guest, #2973) [Link] (1 responses)

> Hm. I've lost about ~10 disks as part of a raid on my workstations over the years. Not once did I have to stop work for more than a single reboot

Yep, it did what it was supposed to do. That's what RAID is for.

>I've never lost data via an rm -rf or similar incident. Once a couple files through a kernel bug.

You got lucky. Just a few weeks ago, I lost a 4TB data drive when I re-partitioned it instead of a flash key. I didn't have it backed up at all, because it didn't have anything I really cared about. Except... I forgot it had a batch file to build chromium, with quite a bit of embedded knowledge (that stuff's kind of a pain on Windows), and it took me several days of off-and-on fiddling to recreate it. Hardly a disaster, but if I'd had it backed up like my SSD, it would have taken an hour or so to recover everything, with just about zero attention.

Anyway, just a reminder. If all you have is a RAID, you're one thoughtless moment away from data loss. Separate filesystems mean that recovery from hardware failure is more painful, but you avoid many of the truly dire outcomes.

I give this advice a lot: do a backup first, THEN do RAID if the budget permits. It's real nice, but a true backup is the #1 priority.

The bcachefs filesystem

Posted Sep 4, 2015 23:49 UTC (Fri) by Wol (subscriber, #4433) [Link]

Well, most of my "critical" stuff is owned by root, and hardlinked into my personal area, so it's protected against accidents in the normal course of events.

Then, I tend to back things up to DVD.

And stuff often gets copied "accidentally".

But like many home users, I don't have "proper" backups - the problem is disks nowadays are huge, and to some extent it's "where do I back up to?". Bearing in mind, however, that my current drives are Seagate Barracudas: when I switch to RAID 5/6, I'll also switch to WD Reds, so I'll have a couple of large hard disks to stick in some sort of enclosure for backups. I'll probably try and use btrfs or somesuch :-)

Cheers,
Wol

The bcachefs filesystem

Posted Aug 22, 2015 8:19 UTC (Sat) by Otus (subscriber, #67685) [Link] (38 responses)

Eventually, yes.

Currently, however, the cheapest SSDs are still about €0.50 per GB, while the cheapest HDDs are €0.05 per GB. That's over three Moore's law doublings to catch up... if HDDs stay exactly where they are. More likely four or five, which would be closer to ten years than four.
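
For what it's worth, the arithmetic behind that estimate looks roughly like this (the 18-24 month doubling period is an assumption, and HDD prices are held frozen):

    import math

    ssd_eur_per_gb = 0.50
    hdd_eur_per_gb = 0.05
    doublings = math.log2(ssd_eur_per_gb / hdd_eur_per_gb)   # ~3.3

    print(f"price gap: {ssd_eur_per_gb / hdd_eur_per_gb:.0f}x, "
          f"i.e. ~{doublings:.1f} halvings of SSD cost needed")
    for months in (18, 24):
        print(f"at one halving per {months} months: ~{doublings * months / 12:.1f} years")
    # ~5 years at 18 months, ~6.6 years at 24 -- before any further HDD
    # price drops, which is what pushes the estimate toward ten years.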

The bcachefs filesystem

Posted Aug 22, 2015 8:37 UTC (Sat) by ms (subscriber, #41272) [Link] (11 responses)

But it's not just about price, it's about performance too. Sure, for big data centres and semi-cold storage (or whatever it's called) there will no doubt be some need for very high capacities. But for all normal desktop/laptops, I'd have thought that within the next few years almost nothing will come with spinning rust preinstalled - it'll be SSD all the way (or memristor/xpoint/whatever). And of course smartphones too. In fact surely even today, by shipment, the volume of flash drives must vastly exceed the volume of spinning rust drives.

The bcachefs filesystem

Posted Aug 22, 2015 9:27 UTC (Sat) by warrax (subscriber, #103205) [Link] (10 responses)

> it's not just about price, it's about performance too.

True, but beyond the first 100 GB (the OS), performance mostly doesn't matter for most regular users. Assuming that "regular users" will actually be using desktop machines and laptops in Imagined Future(TM).

Video editing is about the only remotely-mainstream activity that would benefit from very fast storage, but even that's probably not really mainstream.

The bcachefs filesystem

Posted Aug 24, 2015 10:34 UTC (Mon) by moltonel (subscriber, #45207) [Link] (5 responses)

> True, but beyond the first 100 GB (the OS), performance mostly doesn't matter for most regular users. Assuming that "regular users" will actually be using desktop machines and laptops in Imagined Future(TM).
>Video editing is about the only remotely-mainstream activity that would benefit from very fast storage, but even that's probably not really mainstream.

Gaming is another use case that requires lots of disk space and benefits significantly from SSDs. It's much more mainstream than (high-end) video editing, and the gaming industry has a long history of pushing performance forward.

Last but not least, power usage and resistance to shock are performance metrics that affect a huge portion of the market and will push SSDs forward.

The bcachefs filesystem

Posted Aug 24, 2015 19:31 UTC (Mon) by raven667 (subscriber, #5198) [Link] (3 responses)

[tangent]
Although it's funny, when adding an SSD to a gaming machine, how much of the startup time is spent not loading assets from disk but compiling shaders from the standard OpenGL/DirectX shader language to the GPU's actual native architecture. In some games that's 30s of LLVM run time between levels that are already entirely in RAM.
[/tangent]

The bcachefs filesystem

Posted Aug 25, 2015 1:53 UTC (Tue) by flussence (guest, #85566) [Link] (2 responses)

I wonder why they don't cache compiled shaders to disk. It's certainly possible - E17 does it.

The bcachefs filesystem

Posted Aug 25, 2015 6:52 UTC (Tue) by drago01 (subscriber, #50715) [Link]

Some graphics drivers (like the NVIDIA proprietary one) do that. There have been patches/discussions about doing something similar for Mesa, but that has not been finished yet.

The bcachefs filesystem

Posted Aug 25, 2015 7:58 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

There's an OpenGL extension for that, but it's not implemented by Mesa.

The bcachefs filesystem

Posted Aug 25, 2015 7:15 UTC (Tue) by jezuch (subscriber, #52988) [Link]

> Gaming is another use case that requires lots of disk space and benefits significantly from SSDs.

Another is cold-starting LibreOffice.

(Sorry, couldn't resist :) )

The bcachefs filesystem

Posted Aug 26, 2015 4:47 UTC (Wed) by bandrami (guest, #94229) [Link] (3 responses)

> True, but beyond the first 100 GB (the OS)

I just want to let that sink in for a second.

The bcachefs filesystem

Posted Aug 26, 2015 12:35 UTC (Wed) by madscientist (subscriber, #16861) [Link] (2 responses)

>> True, but beyond the first 100 GB (the OS)

>I just want to let that sink in for a second.

My fully-installed Ubuntu GNOME Linux system at home has just 8 GB used for the root partition (everything that's not /home). My development system at work has 13 GB used for the root partition. In contrast my Windows 8 VM with similar tools including MSVC 2012 installed runs over 80 GB used.

The bcachefs filesystem

Posted Aug 27, 2015 22:23 UTC (Thu) by dashesy (guest, #74652) [Link] (1 responses)

I recently installed an image of Windows 7 to have a 64-bit Windows build (I have XP, but it is 32-bit). Not expecting this ridiculous space requirement, I selected the default 25GB suggested by VirtualBox. After install it was close to 14GB, I think, but as soon as I updated the OS it was 21GB, with absolutely nothing on it but the OS, and with System Restore turned off! I could not even install Visual Studio, because it required an additional 10GB. I ended up resizing the partition, which corrupted the filesystem, so I started from the original snapshot, resized it, did a check-disk immediately, and then updated the OS.

The bcachefs filesystem

Posted Sep 2, 2015 22:21 UTC (Wed) by malor (guest, #2973) [Link]

By default, Win7 makes a hibernate file that's as big as your RAM, and it makes a swap file that's something like twice that size. (I'd have to look it up to be sure.) You can disable the former and resize the latter to something much more reasonable.

I never go past two gigs -- if you're trying to swap two gigs of data, the machine won't be usable anyway. Windows will complain about not being able to do crash dumps if you set the swap to lower than the amount of RAM, but everything else seems to work fine.

On my 32 gig system, a Win7 install uses about 75 gigs of my SSD before I've done anything... but after removing hibernate and moving swap to spinning rust, it's more like 8 or 10.

The bcachefs filesystem

Posted Aug 22, 2015 10:18 UTC (Sat) by pedrocr (guest, #57415) [Link] (25 responses)

It's an innovator's dilemma situation. SSDs don't need to reach cost parity with HDDs for most usages since disks are growing faster than people's needs for them. For most regular users a 250GB SSD is now enough for everything, so new laptops are pretty much all moving to SSDs. For me, to hold my full collection of RAW files from my camera I need 1TB; that's currently 5x more expensive as an SSD for my laptop, so I'm doing the SSD+HDD setup, but I don't expect that to be the case 1-2 years from now. For my NAS I need about 10TB of raw space to run as RAID6 and get around 5TB, so for that 4-5 years sounds about right.

The bcachefs filesystem

Posted Aug 22, 2015 10:32 UTC (Sat) by dlang (guest, #313) [Link] (15 responses)

> SSDs don't need to reach cost parity with HDDs for most usages since disks are growing faster than people's needs for them. For most regular users a 250GB SSD is now enough for everything...

this is why there are so many USB SSD drives sold

ohh right, nobody is bothering to try and sell SSDs in USB drives because the cost would be so high and the performance would not improve that much due to the limitations of USB.

There is always going to be a market for archival storage where capacity matters far more than performance. and this market doesn't exclude home users.

I just saw a post where there was a reminder of the problem of ransomware and the expectation that it will expand to cover cloud storage accounts. This is exactly the type of thing that cheap, offline storage like USB drives is perfect for.

The bcachefs filesystem

Posted Aug 22, 2015 10:36 UTC (Sat) by pedrocr (guest, #57415) [Link]

>ohh right, nobody is bothering to try and sell SSDs in USB drives because the cost would be so high and the performance would not improve that much due to the limitations of USB.

This too will be fixed by better interconnects and eventually people will prefer the speed over the capacity.

>There is always going to be a market for archival storage where capacity matters far more than performance. and this market doesn't exclude home users.

There will always be a market for photographic film, it will just be very small.

>I just saw a post where there was a reminder of the problem of ransomware and the expectation that it will expand to cover cloud storage accounts. This is exactly the type of thing that cheap, offline storage like USB drives is perfect for.

Sure, and in the past it was what LTO tapes were perfect for. Large capacity HDDs will become the new tape.

The bcachefs filesystem

Posted Aug 22, 2015 19:07 UTC (Sat) by khim (subscriber, #9252) [Link] (11 responses)

> ohh right, nobody is bothering to try and sell SSDs in USB drives because the cost would be so high and the performance would not improve that much due to the limitations of USB.

I'm just not sure what you are talking about. Why would anyone bother with two translation layers when one is sufficient? USB sticks are certainly more popular than USB HDDs, and many of the USB3 ones could easily outperform a SATA SSD in a USB enclosure. These are your "SSDs in USB drives". They are not that popular because they are expensive, but they are made and sold quite frequently.

The bcachefs filesystem

Posted Aug 23, 2015 1:13 UTC (Sun) by dlang (guest, #313) [Link] (10 responses)

so do you think USB sticks are going to reach the multi TB range in the immediate future?

Also, the ongoing reliability of USB sticks is FAR lower than drives. They do not do anywhere close to the same wear leveling. I've had a lot more USB sticks become unusable than external drives.

The bcachefs filesystem

Posted Aug 23, 2015 1:49 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

2TB thumb drives are already out there, but they are darn expensive right now. For stuff like backup (where you don't really rewrite data) they are OK.

The bcachefs filesystem

Posted Aug 24, 2015 17:51 UTC (Mon) by the.wrong.christian (guest, #73127) [Link] (8 responses)

>> 2TB thumb drives are already out there, but they are darn expensive right now. For stuff like backup (where you don't really rewrite data) they are OK.

Holy Jesus, please tell me you're only doing incremental backups to thumb drives? Flash drives are not designed for long-term, archive-like storage. While I would trust 1 year's worth of storage to a flash drive (as much as any other type of drive), I certainly wouldn't trust 5 years of storage to one.

AFAIK, hard disk drives (and magnetic storage in general) are just better long term storage options for backups and archives.

The bcachefs filesystem

Posted Aug 26, 2015 18:52 UTC (Wed) by drag (guest, #31333) [Link] (7 responses)

> AFAIK, hard disk drives (and magnetic storage in general) are just better long term storage options for backups and archives.

Why?

For most purposes flash drives are awesome because they are almost indestructible compared to rotating disks.

The bcachefs filesystem

Posted Aug 26, 2015 23:39 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

The charge in flash cells slowly leaks. So some drives might become unreadable in just a year. I'm using flash for backups, not archives because of this.

Digital Archival

Posted Aug 27, 2015 9:16 UTC (Thu) by NRArnot (subscriber, #3033) [Link] (4 responses)

Digital archival is problematic in many ways. There's no medium available which one can pop on a shelf and rely on being readable decades hence. Flash: charge leakage. Disk: mechanical degradation (bearings mostly). Anything electronic: it just takes diffusion to ruin one VLSI transistor for a controller to cease functioning. Optical: probably the best, if in complete darkness, but are the dye layers long-term stable against air and atmospheric moisture? Even paper: make sure you use archival grade, because other modern paper contains acid that will cause it to crumble within a few decades. (Don't store your precious optical disks in a folder with paper records! )

The best archive at present is active, not passive. Multiple redundant data storage on disk, with continuous monitoring for disk failures and deterioration, and timely replacement and data regeneration. It costs ....
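
A toy sketch of that "active, not passive" approach: keep checksums alongside the data and scrub periodically, so silent deterioration is detected while a good replica still exists. The manifest name and layout here are invented for illustration:

    import hashlib, json, os

    MANIFEST = "manifest.json"   # invented checksum database name

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root):
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                if name == MANIFEST:
                    continue
                p = os.path.join(dirpath, name)
                sums[os.path.relpath(p, root)] = sha256(p)
        with open(os.path.join(root, MANIFEST), "w") as f:
            json.dump(sums, f)

    def scrub(root):
        # re-read everything; anything that no longer matches needs to be
        # regenerated from another replica while that replica is still good
        with open(os.path.join(root, MANIFEST)) as f:
            sums = json.load(f)
        for rel, digest in sums.items():
            p = os.path.join(root, rel)
            if not os.path.exists(p) or sha256(p) != digest:
                print("DEGRADED:", rel)

    # usage (hypothetical path): build_manifest("/mnt/archive1"), then run
    # scrub("/mnt/archive1") on a schedule.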

Digital Archival

Posted Aug 27, 2015 9:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Tape is pretty good. Cassettes have no complicated moving parts and can last for many decades.

Digital Archival

Posted Aug 27, 2015 23:32 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

Try telling NASA that ...

I gather they are desperately trying (or were) to recover all their Gemini and Apollo mission data. They had two problems: (1) finding a working tape drive, and (2) reading the tape without causing it to shed oxide all over the tape drive.

Cheers,
Wol

Digital Archival

Posted Aug 28, 2015 15:49 UTC (Fri) by opalmirror (guest, #23465) [Link] (1 responses)

Sad to hear that NASA is having trouble restoring its mag tape data. Tape lasts a while but has to be maintained. When I was a summer co-op student back in the '80s, I worked in an electric utility's data services group. In the tape library, we would service all the mag tapes in the library using a tape reader/writer machine that would evaluate the magnetic and physical health of the archive tape as it read it, and if necessary, duplicate it onto new archive tape. A database program kept track of the tape maintenance schedule and replacement. It was boring work, but the data lived on. The utility also had a room full of large filing cabinets for old programs archived on card decks. Many of these programs had moved on to mag tape and then fixed disc packs. Slash slash sys in dee dee star, how I wonder what you are...

Digital Archival

Posted Aug 31, 2015 15:42 UTC (Mon) by mstone_ (subscriber, #66309) [Link]

sounds like active archive ;-)

The bcachefs filesystem

Posted Aug 28, 2015 10:20 UTC (Fri) by jezuch (subscriber, #52988) [Link]

The late-ish Samsung SSDs hit this problem at one point - the flash cells holding data that had not been accessed for some time degraded and required a much longer time to read, causing performance problems severe enough that it made the news. The solution (I think) was to periodically refresh all the cells. Dunno how they handle drives that are powered off for months :) So, yeah, this illustrates how much flash is not suited for long-term storage.

The bcachefs filesystem

Posted Aug 22, 2015 22:30 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> this is why there are so many USB SSD drives sold
Actually, they are getting popular. For example, I have one of these for backups: http://www.newegg.com/Product/Product.aspx?Item=N82E16820...

USB3 is fast enough to make it worthwhile.

The bcachefs filesystem

Posted Aug 24, 2015 5:46 UTC (Mon) by liam (guest, #84133) [Link]

USB 3 has UAS mode.
No, it is not implemented in many places, but the standard exists, and 3.1 provides enough bandwidth even if you ignore alternate modes.

The bcachefs filesystem

Posted Aug 24, 2015 7:03 UTC (Mon) by ibukanov (subscriber, #3942) [Link] (8 responses)

250GB is not enough the moment one starts to record video. I am not even talking about, say, a semi-professional blogger with at least 1TB per year of video. Just consider a family member who likes to record HD video on a smartphone. Even that gives at least 50 GB per year. And soon phones will start to record in 4K.

The bcachefs filesystem

Posted Aug 24, 2015 16:57 UTC (Mon) by zlynx (guest, #2285) [Link] (1 responses)

Use external drives for video. Even the pros do it that way. The last video guy I worked with several years ago used about six external Firewire drives on a Mac and used different drives for source and output and backups. Very nice streaming behavior that way. I don't think anyone would want to do video processing on a spinning laptop drive no matter how big it is because the IO speed and IOPS are abysmal.

I've been using laptops with SSDs for many years now and I haven't needed more than 128 GB. This one has 250 but I kept a Windows partition so Linux only has about 160 GB available. If I need more space I have 128 GB memory sticks and 1.5 TB external backup drives.

The bcachefs filesystem

Posted Aug 24, 2015 18:23 UTC (Mon) by ibukanov (subscriber, #3942) [Link]

I commented about the general need for storage, even at the consumer level, where SSD is just too expensive. Plus, consider reliability. From a physical point of view, storing information as magnetisation is just more reliable than those triple-level cells holding charge. So as long as there is a need for backup, the demand for magnetic media should stay.

As for ultimate performance during video editing, my personal experience is that maxing out RAM and enabling zram or similar has more effect than using a faster SSD.

The bcachefs filesystem

Posted Aug 24, 2015 19:24 UTC (Mon) by raven667 (subscriber, #5198) [Link] (4 responses)

It's getting pretty easy to rent this kind of space as well: upload to YouTube or Google Photos/Drive and stream back to your laptop for viewing.

The bcachefs filesystem

Posted Sep 2, 2015 22:46 UTC (Wed) by jschrod (subscriber, #1646) [Link] (3 responses)

Just curious: Where do you live?

Here, in Germany, uploading a TB of video to some cloud service is a major undertaking in most parts of the country. There are ridiculously few areas with FTTH/FTTB, and even with VDSL connections, such uploads are not easily done.

E.g., the best I get is 3 Mb/s download and a few hundred Kb/s upload. Uploading lots of media data to the cloud for backup purposes is a pipe dream. And that's 25km away from Frankfurt, in an area that's supposed to be an economically active part of Western Europe.

The bcachefs filesystem

Posted Sep 3, 2015 3:49 UTC (Thu) by raven667 (subscriber, #5198) [Link] (2 responses)

I live in the US, and my cable Internet connection gets 4Mbit upload and 20Mbit download, and that's not even the fastest on offer. I don't have TBs of video to upload every day, but the tens or hundreds of MB of photos and videos I can create on my phone in a day can sync in the background whenever I am connected to unmetered WiFi, so there is really very little effort needed to have all of my data accessible in the cloud. With 0.1 Mbit upload I can see this being less usable, are you out in the countryside? My family live on a farm that has speeds much like you describe - and that's after fiber was run to a pedestal a kilometer from their house; before that they had dialup or satellite Internet.

The bcachefs filesystem

Posted Sep 8, 2015 1:07 UTC (Tue) by jschrod (subscriber, #1646) [Link] (1 responses)

> With 0.1 Mbit upload I can see this being less usable,

Well, it's not 0.1 Mbit, but 0.3 Mbit, but that doesn't make any difference for the sake of this discussion...

> are you out in the countryside?

Hard to say. What is »out in the countryside«? I live 25km (roughly 15 miles) away from Frankfurt, the central finance city of continental Europe. The city I live in has ca. 25.000 inhabitants. My area is populated by people with above-average income, still working, though slightly older.

But I live in the city's border area, with two houses between us and the forest. (Forest is almost everywhere in the Rhein-Main region, one of the advantages of living here.) Two streets away from me, 16 Mbit download and 0.7 Mbit upload is available. They didn't pull the cables down to us. :-( Still, no cable connection; that's only available in the inner city, not on our outskirts. For the US readers: the outskirts of cities of this size here in Europe means hundreds of houses 1 mile away from the city center, not some far-outlying farm that you have to drive hours to get to.

Nevertheless, what I wanted to say: even if you live in the first world, 15 miles away from the important banks, don't take high-speed Internet connectivity for granted.

The bcachefs filesystem

Posted Sep 8, 2015 6:29 UTC (Tue) by anselm (subscriber, #2796) [Link]

You can live right in the middle of town in an affluent residential district, where everyone around you has 50-MBit/s VDSL, and still be stuck on 6 MBit/s DSL because all the ports on the VDSL box in the local exchange are in use and the telco won't put another expensive VDSL box in just for you.

The bcachefs filesystem

Posted Aug 24, 2015 23:12 UTC (Mon) by luya (subscriber, #50741) [Link]

The Samsung Galaxy S5 already records in 4K @ 30fps.

The bcachefs filesystem

Posted Aug 22, 2015 7:37 UTC (Sat) by liam (guest, #84133) [Link] (5 responses)

My concern would be the apparently imminent arrival of PCM/ReRAM/STT-RAM/memristor/etc.

The bcachefs filesystem

Posted Aug 22, 2015 9:31 UTC (Sat) by warrax (subscriber, #103205) [Link] (4 responses)

Indeed. Samsung can announce whatever they want, but I want to see an actual device I can install before I'll believe any of it. (We've been promised all kinds of crazy things, but very few "magical" new technologies have materialized. Anyone remember the "store your data in a 1cm^3 cube" thing?)

The bcachefs filesystem

Posted Aug 22, 2015 9:39 UTC (Sat) by ms (subscriber, #41272) [Link] (3 responses)

Given the amount of noise Intel are making about 3D XPoint, I'm pretty convinced this at least is going to be Arriving Soon.
https://en.wikipedia.org/wiki/3D_XPoint
http://anandtech.com/show/9541/intel-announces-optane-sto...

The bcachefs filesystem

Posted Aug 24, 2015 5:43 UTC (Mon) by liam (guest, #84133) [Link] (2 responses)

This is what I had in mind. Additionally, IIRC, AnandTech reported that Micron has ANOTHER storage/memory replacement technology that's supposed to be announced at the end of the year. That seems odd to me, but there's no lack of effort in this area, regardless.

The bcachefs filesystem

Posted Aug 24, 2015 20:45 UTC (Mon) by rahvin (guest, #16953) [Link] (1 responses)

Micron and Intel are partners. The 3D XPoint stuff is from their joint development company, IM Flash, in Utah. So the rumored Micron announcement happened when Intel announced 3D XPoint. Intel also said the memory is sampling right now, so it's already in production.

The bcachefs filesystem

Posted Aug 25, 2015 7:08 UTC (Tue) by liam (guest, #84133) [Link]

Yes, I know that they are partners in the XPoint venture.

www.anandtech.com/show/9470/intel-and-micron-announce-3d-xpoint-nonvolatile-memory-technology-1000x-higher-performance-endurance-than-nand/2

The bcachefs filesystem

Posted Aug 24, 2015 16:57 UTC (Mon) by dfsmith (guest, #20302) [Link]

FYI, I don't think any company has sold spinning rust since 14" drives were popular. There's no iron oxide in any current storage devices, and hasn't been for decades. (Except, maybe, FeRAM, which still has low penetration in the market, and I'm really not sure of the chemistry.)

Use the phrase "spinning thin film ferromagnetic cobalt/iron/nickel microcrystalline substrate" or "spinning aluminum" (current, 1950s-1990s) or "spinning glass" (1990s-2000s) if you want to denigrate hard drives, while still appearing with-the-times. B-)

The bcachefs filesystem

Posted Aug 21, 2015 22:59 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link]

Yes, this is true - you see this in the O_SYNC 1 thread dbench results, where bcachefs is ~half the speed of xfs and ext4. That's the journal flushing.

But for damn sure we can do better than btrfs, and on more typical workloads - especially on flash - that effect ought to be within the margin of error.

Also, with a large enough filesystem when there's lots of random metadata writing, we ought to solidly beat ext4 and xfs at writing out that metadata - I wasn't trying to show that here.

The bcachefs filesystem

Posted Aug 21, 2015 23:00 UTC (Fri) by Tobu (subscriber, #24111) [Link]

This is for tiered storage with flash and non-flash; I think accesses are in-place and fairly linear by the time they hit a spinning disk.

The bcachefs filesystem

Posted Aug 23, 2015 3:09 UTC (Sun) by dgm (subscriber, #49227) [Link]

In a naive implementation, you're probably right. But it need not be like this. The filesystem can reserve contiguous disk space on copy, instead of allocating blocks as they are modified. A very good filesystem could even add the feature of moving data so as to make the most-used version contiguous, at the expense of less-used ones.
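
A toy illustration of that reservation idea (not how any real filesystem allocates; the bump allocator and block counts are purely illustrative): grab a contiguous run for the whole file on its first copy-on-write update, so later random writes to that file land next to each other instead of scattering:

    class ToyAllocator:
        def __init__(self, total_blocks):
            self.next_free = 0               # bump allocator over free space
            self.total = total_blocks
            self.reservations = {}           # file_id -> [start, cursor, end]

        def cow_write(self, file_id, file_blocks):
            if file_id not in self.reservations:
                start = self.next_free       # reserve a contiguous run up front
                assert start + file_blocks <= self.total
                self.next_free += file_blocks
                self.reservations[file_id] = [start, start, start + file_blocks]
            res = self.reservations[file_id]
            block = res[1]                   # place the new copy inside the run
            res[1] += 1
            return block

    alloc = ToyAllocator(total_blocks=1 << 20)
    # ten random COW writes to the same 100-block file stay contiguous:
    print([alloc.cow_write("dbfile", 100) for _ in range(10)])   # [0, 1, ..., 9]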

The bcachefs filesystem

Posted Aug 23, 2015 14:17 UTC (Sun) by Wol (subscriber, #4433) [Link] (13 responses)

> I believe it is pretty much a physical impossibility for a copy on write filesystem to match the performance of a non-copy-on-write filesystem for hosting large database files and nested block devices on physical, non-solid-state media. Large files subject to random writes end up severely fragmented on the underlying medium and cannot be scanned with any efficiency any more.

And with a relational database, what are you going to do about the data getting fragmented across all those tables and rows?

I know I go on about Pick/MV, but where it really scores is that, if you do the design properly (and the database structure encourages this), then all data that is statistically closely related ends up in one database RECORD, and that record (provided it's not too big) ends up in one disk block. And because it's a hashed file, the CPU can calculate which disk block it's in, without having to scan the disk to look for it. (So, in a single disk access, I can retrieve what would take an RDBMS multiple accesses across indexes, tables and rows.)

Which is why, for the exact same dataset, the disk and memory footprints of an MV database are much less than relational, and the CPU load is a lot less too.
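
A toy sketch of the hashed-file idea described here: the key alone determines the group (disk block), so a single access fetches the whole nested record. The modulo hash and the record layout are illustrative only, not the actual Pick implementation:

    GROUPS = 997                        # number of groups (buckets), ideally prime

    def group_of(key):
        return sum(key.encode()) % GROUPS    # stand-in for Pick's hashing

    # One "record" holds the whole order, values and sub-values nested inside,
    # so closely related data lands in a single block instead of being spread
    # across several tables and their indexes.
    order = {
        "key": "SO1001",
        "customer": "ACME",
        "lines": [                      # what an RDBMS would put in a child table
            {"part": "P-100", "qty": 5},
            {"part": "P-200", "qty": 2},
        ],
    }
    print("record", order["key"], "lives in group", group_of(order["key"]))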

Oh - and what about SMR drives, where the drive itself enforces "copy on write" :-) You're falling into the trap of optimising your layer of the stack in isolation, without realising that your optimisations are actually making life *worse* in other layers. I'm reminded of a story about a factory - a new high-up manager came in, looked at the *entire* process, and realised all the individual line managers were focussed on reducing their own costs. He then made several changes that *increased* some line-managers' costs, but the *overall* cost went *down*.

Cheers,
Wol

The bcachefs filesystem

Posted Aug 23, 2015 15:12 UTC (Sun) by dlang (guest, #313) [Link]

not all filesystems are equally suitable for all uses

not all storage devices are suitable for all uses

> I'm reminded of a story about a factory - a new high-up manager came in, looked at the *entire* process, and realised all the individual line managers were focussed on reducing their own costs. He then made several changes that *increased* some line-managers' costs, but the *overall* cost went *down*.

That is important to keep in mind; it's a perfect example of "(premature) optimization is the root of all evil".

But it's also important to realize that there isn't one overall process, and different use cases make different trade-offs worthwhile.

SMR drives are a perfect example of this. They effectively enforce COW and what is effectively a gigantic block size. But they provide significantly higher storage density in exchange. For a database this is a disaster. For archives, this is fantastic.

some people want lots of storage for things like databases that change all the time

other people want lots of storage for things that seldom change (think video/audio/picture/document storage)

This is less like line managers in a company and more like separate companies/subcontractors offering different services and allowing you to pick which one works best for your process.

The bcachefs filesystem

Posted Aug 24, 2015 4:46 UTC (Mon) by butlerm (subscriber, #13312) [Link] (11 responses)

> what about SMR drives, where the drive itself enforces "copy on write"

Anyone planning on hosting a very large database on SMR drives should be prepared for write transaction throughput to drop by a factor of one hundred or so. They are not remotely suitable for the purpose.

Hosting a very large database on a copy on write filesystem is a milder but still significant problem. It is not just that it is no longer possible to scan a data file in an efficient, physically linear order, it means that every file offset turns into an index lookup.

There are no extents that go on for hundreds of megabytes on the disk anymore; instead you have tiny extents, where every access means the filesystem has to look up the location of the extent in the extent tree. For a large database, where one or more layers of the extent tree do not fit in cache, that gets awfully expensive.
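
A toy illustration of that cost, with made-up numbers: the same 1GB file as one extent versus a fully COW-fragmented map of 4KB extents, where every offset translation becomes a search in an index that may no longer fit in cache:

    import bisect

    # extent map: sorted list of (file_offset, length, disk_offset) tuples
    def lookup(extents, starts, off):
        i = bisect.bisect_right(starts, off) - 1
        f_off, length, d_off = extents[i]
        assert f_off <= off < f_off + length
        return d_off + (off - f_off)

    contiguous = [(0, 10**9, 0)]                          # one 1 GB extent
    fragmented = [(i * 4096, 4096, hash(i) % 10**12)      # 4 KB pieces, scattered
                  for i in range(10**9 // 4096)]

    for name, ext in (("contiguous", contiguous), ("fragmented", fragmented)):
        starts = [e[0] for e in ext]
        print(name, len(ext), "extents; offset 123456789 maps to disk offset",
              lookup(ext, starts, 123456789))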

The bcachefs filesystem

Posted Aug 24, 2015 8:33 UTC (Mon) by ms (subscriber, #41272) [Link] (10 responses)

A couple of questions:

> Hosting a very large database on a copy on write filesystem is a milder but still significant problem. It is not just that it is no longer possible to scan a data file in an efficient, physically linear order, it means that every file offset turns into an index lookup.

With increasing RAM sizes, is this really a problem these days? - I genuinely don't know. My guess is that for the vast majority of cases, the entire database is unlikely to be bigger than the GBs of system RAM, so caching of the whole dataset should be possible. 5 years ago, I learnt to aggressively normalize, so as to shrink the data size enough that the whole thing fitted in RAM - joins on tables which are in RAM are much cheaper than reads off disk, or at least they were back then. Are there really that many users out there with TB-sized databases? Is this just volume of data, or is it abuse of the database in storing big assets/blobs, or is the attitude these days that real databases should have no problem storing arbitrary blobs?

> Anyone planning on hosting a very large database on SMR drives should be prepared for write transaction throughput to drop by a factor of one hundred or so. They are not remotely suitable for the purpose.

But IME you're always dominated by fsync time anyway - even good SATA SSDs (I've not tried PCIe ones) top out at about 600 fsyncs / sec IIRC. If you're not doing an fsync after every txn then you don't have durability, and if you are doing an fsync after every txn then the volume of data you're writing is unlikely to be vast. If you really need the best of both worlds then aren't you looking at battery-backed / super-cap drives / arrays?

The bcachefs filesystem

Posted Aug 24, 2015 9:26 UTC (Mon) by ms (subscriber, #41272) [Link]

Gah, sorry, I can read - eventually. You clearly said:

> Hosting a very large database

which negates much of what I wrote. Sorry.

The bcachefs filesystem

Posted Aug 24, 2015 12:39 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

> With increasing RAM sizes, is this really a problem these days? - I genuinely don't know.

Yes. Data is growing *fast*. And systems with terabytes of RAM are still really, really expensive.

> But IME you're always dominated by fsync time anyway - even good SATA SSDs (I've not tried PCIe ones) top out at about 600 fsyncs / sec IIRC. If you're not doing an fsync after every txn then you don't have durability, and if you are doing an fsync after every txn then the volume of data you're writing is unlikely to be vast. If you really need the best of both worlds then aren't you looking at battery-backed / super-cap drives / arrays?

You can easily get more than 600 xacts/sec, even if your drive is limited to 600 fsyncs, in a concurrent environment. The trick is that journal writes are usually sequential, so several writes and queue flushes (fsyncs) can be combined if there's concurrency. But that still requires a bunch of fsyncs per second to the same file - nothing you can do on an SMR drive.
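
A toy sketch of that combining (group commit), with an invented journal format: many transactions append their records, and a single fsync makes everyone who got in before it durable at once:

    import os

    class Journal:
        def __init__(self, path):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
            self.pending = []                # txns waiting for the next flush

        def commit(self, txn_id, payload):
            os.write(self.fd, payload)       # sequential append, cheap
            self.pending.append(txn_id)

        def flush(self):
            os.fsync(self.fd)                # one expensive flush...
            done, self.pending = self.pending, []
            return done                      # ...covers every pending txn

    j = Journal("/tmp/toy-journal")          # invented path
    for i in range(50):                      # 50 concurrent-ish commits
        j.commit(i, b"record %d\n" % i)
    print("durable after a single fsync:", len(j.flush()), "transactions")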

Additionally, you can trivially load humongous amounts of data at 600 tps if you do more than a single write per xact.

> If you're not doing an fsync after every txn then you don't have durability

In many cases that's perfectly fine. A data-loss window of about 1/3 s (which is approximately what you get in Postgres if you disable synchronous commit for an xact, allowing tens of thousands of write tps) is often a stricter guarantee than is actually needed.

The bcachefs filesystem

Posted Aug 24, 2015 15:21 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

> joins on tables which are in RAM are much cheaper than reads off disk, or at least they were back then

Or pick a database where, to quote wikipedia, "these types of databases *don't* *need* expensive joins".

Incidentally, where you "aggressively normalise to shrink the dataset", I'd shove it (as NF2, or "non-first-normal-form") into a MultiValue database, and shrink the dataset even more :-) See my comment earlier, about how for the same data, MV has a much smaller disk/memory footprint (although that would probably be somewhat negated by indices, which can be tricky to shrink).

Cheers,
Wol

The bcachefs filesystem

Posted Aug 24, 2015 18:36 UTC (Mon) by rschroev (subscriber, #4164) [Link] (2 responses)

> Or pick a database where ...
You just couldn't resist, could you ;)

The bcachefs filesystem

Posted Aug 24, 2015 18:42 UTC (Mon) by ms (subscriber, #41272) [Link] (1 responses)

> You just couldn't resist, could you ;)

Meh, if you don't learn the lessons of the past, you're doomed to repeat them. So said someone. I'd never heard of Pick (and from there I've found my way to MUMPS and Intersystems Cache, and from there to a lot of opinion on dailywtf about the virtues or otherwise of Cache), so I've certainly been edumacted. :)

The bcachefs filesystem

Posted Aug 24, 2015 19:20 UTC (Mon) by Wol (subscriber, #4433) [Link]

:-)

Cache and MUMPS are nothing to do with Pick, apart from dating from the same era, and having a similar philosophy.

Also, one of the authors of jBase (a Pick variant) wrote a MultiValue layer that sits on top of Cache and uses it as its backend. (There was a war-story recently about a shoot-out between Oracle and Cache - the target was 100K inserts/sec. Oracle had to cheat, Cache coasted it, and then promptly hit 250K in production.)

If you want some fun reading about Pick, go to tincat and read the blog

http://www.tincat-group.com/

The author is a manager who was converted to MV because she was managing two groups of developers: one group was using relational technologies and was always over budget and late, while the other was using MultiValue and was always on budget and on time.

Also go to http://www.pickwiki.com/index.php/Main_Page which is a community wiki, with a fair chunk of interesting resources.

If you want to join us we're a welcoming bunch, but be warned, it is a massive jolt if you're coming from a relational background - think of a Pascal programmer suddenly having to code in C :-)

Cheers,
Wol

The bcachefs filesystem

Posted Aug 24, 2015 19:09 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

> My guess is that for the vast majority of cases, the entire database is unlikely to be bigger than the GBs of system RAM, so caching of the whole dataset should be possible.

In a recent LWN article, there was a discussion of Postgres, which as we know is a very fast, efficient, relational database. If you want to cache a 5GB database in RAM, you'll probably need about 15GB of RAM to do so!

The discussion explained why - it's something to do with needing to keep a copy of the current working set, uncommitted transactions, caching at the Linux level, caching at the disk access layer, etc. etc.

http://lwn.net/Articles/590214/#Comments

(And again, why I go on about MV - it was designed right from the start to treat disk as virtual memory; the original Pick machines didn't have the concept of disk, it was just treated logically as "permanent memory". When discussing speed, I *always* start with the assumption that the machine is "cold" and no data is cached in RAM; an RDBMS simply doesn't have a hope in hell of competing in those circumstances - it's a "drag racer versus penny-farthing" sort of race.)

Cheers,
Wol

The bcachefs filesystem

Posted Aug 24, 2015 19:34 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (2 responses)

> In a recent LWN article, there was a discussion of Postgres, which as we know is a very fast, efficient, relational database. If you want to cache a 5GB database in RAM, you'll probably need about 15GB of RAM to do so!

No, you don't. That's utter FUD.

The bcachefs filesystem

Posted Aug 25, 2015 8:53 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

So let's say I want to run a query, that is going to update about 1GB of database. How much ram do I need? Going by the article I linked to, it's probably going to want to (a) cache the data that is read from disk, (b) store all that data in logs, (c) buffer all the updates in user memory, and (d) flood the write cache.

Actually, I make that even worse than my initial estimate! A single query might result in *four* copies of the data lying around in ram, and if you read the article, then they make a point of saying they can't try and save memory by using the same physical ram for different buffers, because the OS won't let them guarantee write order, which they need to guarantee data integrity.

So, while I might have been *slightly* hyperbolic, if your database is being actively accessed, then you really do need ram equal to about three times your active dataset in order to successfully cache it.

Cheers,
Wol

The bcachefs filesystem

Posted Aug 25, 2015 9:13 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

> So let's say I want to run a query, that is going to update about 1GB of database. How much ram do I need?

A couple of kilobytes (three pages at 8KB each, a WAL buffer page also 8KB, some process-local memory, maybe 100KB). More *can* be used, but by no means do you need it.

You can update a TB of data with only a couple megabytes of RAM. Obviously it's going to be relatively slow unless you have fast storage, but that's pretty fundamental.

> Going by the article I linked to, it's probably going to want to (a) cache the data that is read from disk,

It *can* cache all the data in shared_buffers *iff* you configure it to be so large.

> (b) store all that data in logs

The new values need to be stored in there for durability. It's sequential writes, marked as DONTNEED, unless you have replicas feeding off that log. Only changed columns are stored these days (if on the same page).

If you don't need durability for a relation, you can mark it as UNLOGGED.

> (c) buffer all the updates in user memory,

No.

> (d) flood the write cache.

Pages will only be written out immediately if a) there's not enough space in Postgres' shared_buffers, or b) a checkpoint is in progress, which happens somewhere from every 5 to every 60 minutes, depending on configuration. If your working set fits into RAM, only b) matters.

> and if you read the article

I was taking part in the original discussion.

> then they make a point of saying they can't try and save memory by using the same physical ram for different buffers, because the OS won't let them guarantee write order, which they need to guarantee data integrity.

It actually sounds like you're referencing a paragraph in which I'm quoted... The point here only is that we can't just use mmap() and use directly kernel mapped memory, instead of using our own buffering mechanism. That means there's some interactions with the kernel's buffering on writeout, yes, but the kernel is reasonably good at recognizing pages that are only written once.

> So, while I might have been *slightly* hyperbolic, if your database is being actively accessed, then you really do need ram equal to about three times your active dataset in order to successfully cache it.

You need a little bit more than the amount of data itself, not a multiple of it.

How does bcachefs beat btrfs?

Posted Aug 22, 2015 12:16 UTC (Sat) by gmatht (subscriber, #58961) [Link] (3 responses)

"the main goal of bcachefs to match ext4 and xfs on performance and reliability, but with the features of btrfs/zfs."

I am curious: How does bcachefs achieve this? Are there mistakes btrfs made that bcachefs learnt from? Does the benefit in speed come mainly from using large log structured b+ tree nodes over regular b+ tree nodes?

How does bcachefs beat btrfs?

Posted Aug 24, 2015 8:57 UTC (Mon) by Lennie (subscriber, #49641) [Link]

Not sure, but maybe the advantage of bcache is that they started at the bottom and focused on that alone, not thinking about all the stuff that goes on top.

How does bcachefs beat btrfs?

Posted Aug 24, 2015 21:00 UTC (Mon) by rahvin (guest, #16953) [Link]

I'm curious about the claim as well. I've got an 8-drive btrfs RAID5 array and it's subjectively faster than a good hardware RAID controller. Btrfs is pretty darn state of the art; its only problem is its slow development speed, because there are so few full-time developers. I suspect that when it's feature-complete and totally stable it's going to blow everything else out of the water.

How does bcachefs beat btrfs?

Posted Aug 28, 2015 16:48 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link]

"large log structured b+ tree nodes over regular b+ tree nodes" - that's a large part of it.

But really it mostly comes from working slowly, focusing on the design of the lower layers, keeping things clean and simple. It's been over 5 years to get to this point.

btrfs tried to do too much (in LOC) too quickly; they've got way too much code now and too many bad design decisions are baked in.

The bcachefs filesystem

Posted Aug 24, 2015 18:48 UTC (Mon) by orodeh (guest, #4219) [Link] (4 responses)

This looks like an interesting new filesystem. I read the design document, and understood the overall design, although not the details.

As a rule, one major difficulty in developing new Linux filesystems is that they take around ten years to mature. This means that it takes at least that long until users are willing to trust the new FS with their data.

On a technical issue, if I understand correctly, mark-sweep garbage collection is used instead of a more conventional free-space bitmap/counter-map. This will, most likely, work well for filesystems that are not close to full. For example, the original LFS paper (http://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf) showed that until around 70%, log-structured GC is a good choice. The trouble is that users sometimes wish to use all of their disk space, or nearly all of it. Therefore, filesystems have to be well behaved even if the disk is 95% used or more. In a typical GC, you might need to scan the entire FS metadata to figure out if there is a live pointer to a data block. This causes high CPU overheads when the system is close to full, and a general slowdown.
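
A toy sketch of why that scan hurts (inode layout invented for illustration): to learn which blocks are free, a mark-and-sweep pass has to walk every file's block pointers and then the whole block map, work that grows with metadata size and has to run ever more often as the disk fills:

    def mark_and_sweep(inodes, total_blocks):
        live = set()
        for inode in inodes:                 # mark: walk all metadata
            live.update(inode["blocks"])
        return [b for b in range(total_blocks) if b not in live]   # sweep

    inodes = [{"ino": 1, "blocks": [0, 1, 2]},
              {"ino": 2, "blocks": [5, 6]}]
    print("free blocks:", mark_and_sweep(inodes, total_blocks=10))
    # free blocks: [3, 4, 7, 8, 9]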

Now, you might say that it is a bad idea to operate a filesystem at close to full capacity, and I would agree. However, such situations are unfortunately common and long lasting.

Having said all of this, perhaps this is a good choice for moderately full filesystems. I wonder how much it saves from the overall overheads.

My two cents.
Ohad.

The bcachefs filesystem

Posted Aug 24, 2015 19:59 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Since this is a filesystem, it's pretty easy to keep an accurate track of changed blocks (i.e. keep an accurate remembered set). So it's not that hard to create a high-throughput generational GC that will degrade gracefully until the last few space percentages.

The bcachefs filesystem

Posted Aug 26, 2015 13:42 UTC (Wed) by orodeh (guest, #4219) [Link] (1 responses)

It is true that you can use a remembered set. However, with a filesystem GC there are new issues that do not occur with memory GC.

A generational GC uses an empty contiguous area to write new data. When you are down to the last 5% of space, you will fill the new generation up and then copy it out, causing a second-generation mark-sweep collection. In other words, every piece of data will be written twice, and you will have many expensive mark-sweep collections. This will not work well for writing large sequential files.

In addition, in a filesystem, you end up having to maintain reverse mappings for every data block, to be able to address bad disk areas.

The bcachefs filesystem

Posted Aug 26, 2015 17:49 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

You can introduce a translation layer for block pointers; it's prohibitive for classic GCs but should be perfectly fine for filesystems.

The bcachefs filesystem

Posted Aug 28, 2015 16:53 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link]

bcachefs doesn't use mark and sweep at runtime - technically, it _can_, because it did originally do mark and sweep (upstream bcache does mark and sweep) and I'm still maintaining that functionality, but now we keep free space accounting up to date as the index is being updated and we overwrite extents.

Copying GC of the data itself is a distinct thing from mark and sweep of the index - we need copy GC for data because we don't overwrite partial buckets, but because the free space accounting is always up to date copy GC only runs when there's work for it to do.
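
A toy sketch of the general shape of that idea as described here - not bcachefs's actual code or data structures: each bucket keeps a live-sector count, maintained as extents are overwritten, so deciding whether copy GC has work to do is a cheap counter check rather than a mark-and-sweep pass:

    BUCKET_SECTORS = 256

    live = {}                                # bucket -> live sectors
    extent_index = {}                        # (inode, offset) -> (bucket, sectors)

    def write_extent(key, bucket, sectors):
        old = extent_index.get(key)
        if old:                              # overwrite: the old copy dies now
            live[old[0]] -= old[1]
        extent_index[key] = (bucket, sectors)
        live[bucket] = live.get(bucket, 0) + sectors

    def copy_gc_candidates(threshold=BUCKET_SECTORS // 4):
        # mostly-empty buckets: not reusable until their live data is moved out
        return [b for b, n in live.items() if 0 < n <= threshold]

    for off in range(4):
        write_extent(("file1", off), bucket=7, sectors=64)   # fills bucket 7
    for off in range(3):
        write_extent(("file1", off), bucket=8, sectors=64)   # rewrites 3 of them
    print(copy_gc_candidates())   # [7]: 64 live sectors left, worth evacuating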

