Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 1:57 UTC (Thu) by khim (subscriber, #9252)
In reply to: Btrfs aims for the mainline by lmb
Parent article: Btrfs aims for the mainline

I have to admit that the idea of merging yet another RAID/volume management implementation, this time within a filesystem, does not make me very happy.

Actually I cannot imagine sane volume management outside of the filesystem. For example, I have 4 HDD drives in my system. What do I really want?
1. Keep most of the data on just one drive (for movies from my own DVDs).
2. Keep the rest in RAID-5 form (for movies in games and such: PITA to reinstall but can be done if needed).
3. Keep my own personal files (1% of total size or so) duplicated 4 times (on 4 HDDs).
Pretty easy and simple requirements, right? Yet totally unachievable with the usual LVM/filesystem separation. Currently btrfs cannot support this mode of operation either, but potentially it's doable...

P.S. Actually I got the idea after reading in the GFS paper: "Users can specify different replication levels for different parts of the file namespace." I was dumbfounded when I read this: this is exactly what I need from a normal filesystem. Why was it never actually done? There are two answers:
1. It's harder to do for a normal filesystem (GFS works with huge chunks and so fragmentation is not an issue).
2. RAID/volume management is separated from the filesystem: there is just not enough info in the filesystem to make it happen!



Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 2:54 UTC (Thu) by Da_Blitz (guest, #50583) [Link]

you can do some of the things you mentioned with LVM

to keep most of the data on one partition you create a new vg with vgcreate -p 1 <name> <dev>; -p limits the maximum number of physical volumes in the vg

while you can't do raid 5 in LVM (something i am assuming can be patched in) you can do RAID 1 using LVM, with support for migrating the data from one drive to another should you remove a drive or have one fail; use lvcreate --mirrors 3 <volume name> <pvname> (-m is the short form of --mirrors)

this will create one partition that is mirrored to 3 other drives

at the end you would end up with 2 volume groups, one for your DVDs that contains one lv partition and another volume group that includes all 4 drives that is replicated 4 times by lvm

i would pay for raid5 support in lvm as i could then specify different levels of replication for different partitions like you suggested.

I don't need volume groups!

Posted Jan 8, 2009 3:48 UTC (Thu) by khim (subscriber, #9252) [Link]

Yes, that's more or less how I do it today. But it's insane: I have a perfectly capable system developed to automate tasks, it can do billions of operations per second and yet it cannot automagically reallocate space for me? Pathetic. Plus I want ALL metainformation kept on ALL drives: it's a small amount of data, after all, and even if I can restore all files from DVDs it's certainly easier to do if I know what exactly was stored on that broken drive!

Sure I can develop a lot of palliatives (I do ls -lR regularly and store it on RAIDed partition), but it's all ugly and stupid.

I don't need volume groups!

Posted Jan 8, 2009 5:43 UTC (Thu) by Ze (guest, #54182) [Link]

Yes, that's more or less how I do it today. But it's insane: I have a perfectly capable system developed to automate tasks, it can do billions of operations per second and yet it can not automagically reallocate space for me? Pathetic. Plus I want ALL metainformation kept on ALL drives: it's small amount of data, after all, and even if I can restore all files from DVDs it's certainly easier to do if I'll know what exactly was stored on that broken drive!

Sure I can develop a lot of palliatives (I do ls -lR regularly and store it on RAIDed partition), but it's all ugly and stupid.

So what you really want is one logical volume with arbitrary tags on block allocation and replication on a per file basis.

Tux3 approach of keeping metadata information in normal files strikes me as being the nicest way to map meta-data onto multiple physical volumes.

Perhaps in the future we'll handle separation of concerns between the different layers properly instead of munging them all into one.

The real bitch though with that is backward compatibility.

That's because we'd effectively be splitting existing file systems up into two separate concerns: a block format and a file system format composed of blocks.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 11:51 UTC (Thu) by lmb (subscriber, #39048) [Link]

Sure; you want redundancy, encryption etc being managed at the file/directory object level. That makes perfect sense, and is exactly what I was hinting at by being able to stack "filesystems".

But we need a general solution for that. Not yet another, 3rd one inside one component. Or, rather, maybe we need to start with that now, and then phase out or at least deprecate the past.

But what we definitely don't need is three or more permanent solutions to this problem.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 16:58 UTC (Thu) by roblucid (subscriber, #48964) [Link]

> 1. Keep most of the data on just one drive (for movies from my own DVDs).
> 2. Keep the rest in RAID-5 form (for movies in games and such: PITA to
> reinstall but can be done if needed).
> 3. Keep my own personal files (1% of total size or so) duplicated 4 times
> (on 4 HDDs).
> Pretty easy and simple requirements, right? Yet totally unachievable with
> usual LVM/filesystem separation. Currently btrfs can not support this mode
> of operation too, but potentially - it's doable...

What's wrong with using partitions and bind mounts?

3) A partition that's mirrored on all disks
2) A RAID 5 (yuck!) using partition on 3 disks
1) A partition for your data disk

You can bind mount stuff to be at convenient points in the filesystem hierarchy. That is "specifying different methods of replication for different parts of the namespace".

( Frankly if you have 4 disks, then I'd rather stripe, lower level mirrors with RAID 10 and accept the lower capacity for the performance and reliability benefits of avoiding RAID5 in a 3 disk configuration. )

Using a RAID layer to do RAID
An LVM layer to provide logical volumes (caveat on FS barriers)
File System layer to handle filesystem structure, and journalling

Would appear like a logical structure, though it does require initial planning.

Perhaps you want to be able to expand the allotted space, given over to RAID1, RAID5, RAID10 etc?

Couldn't that be done, by using LVM type block devices, used by the RAID layer which is then exposed to filesystems, so they can grow/shrink their capacity, as chunks of disk are (de)allocated to partitions?

Call me a cynic if you like but pushing every feature you could want into 1 layer, the filesystem, which should then become a generic dynamic disk management system, would appear to become a very complicated monolithic block of code. If sane implementations would then use generic layers, then you're really back to where you started.

That's the setup I have - and HATE it

Posted Jan 8, 2009 18:03 UTC (Thu) by khim (subscriber, #9252) [Link]

Perhaps you want to be able to expand the allotted space, given over to RAID1, RAID5, RAID10 etc?

I want to forget words "RAID", "LVM" and related. Forever. I want sane options. Like:
A. Store data cheaply (say... $0.10/GB) but unreliably: a single disk failure and you must be ready to redownload/reinstall.
B. Store data reliably but expensively ($0.40/GB): up to three disks can fail without any problems.
C. Some intermediate versions: cheap and reliable ($0.12/GB to $0.15/GB), OK with a single disk failure (if it happens the OS will of course restore the status quo if possible), but sloooooow (still much faster than DVD).
Just as filesystems were invented to make manual management of data on a single disk unnecessary, I want something to hide all this RAID/LVM/etc stuff from me. Whether it will be btrfs or a stack of other technologies, I don't care, as long as I have a nice simple option list in the "Save As..." dialog.
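One way to read the prices in options A to C is as raw disk cost multiplied by the redundancy overhead. The sketch below works that arithmetic through in Python. It assumes a raw price of $0.10/GB (the option A figure), plain mirroring for option B and RAID-5-style single parity for option C; the function names are made up for illustration and the comment itself does not say how the prices were derived.

```python
from fractions import Fraction

# Assumption: raw disk space costs $0.10/GB, the figure given for option A.
RAW_COST_PER_GB = Fraction(1, 10)


def mirror_cost(copies: int) -> Fraction:
    """Cost per usable GB when every block is stored on `copies` disks."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return RAW_COST_PER_GB * copies


def single_parity_cost(disks: int) -> Fraction:
    """Cost per usable GB for RAID-5-style single parity over `disks` drives.

    One drive's worth of space holds parity, so only (disks - 1) / disks
    of the raw space is usable.
    """
    if disks < 3:
        raise ValueError("single parity needs at least 3 disks")
    return RAW_COST_PER_GB * Fraction(disks, disks - 1)


if __name__ == "__main__":
    print(f"A, one copy:     ${float(mirror_cost(1)):.3f}/GB")
    print(f"B, four copies:  ${float(mirror_cost(4)):.3f}/GB")
    for disks in (3, 4, 6):
        print(f"C, parity on {disks}: ${float(single_parity_cost(disks)):.3f}/GB")
```

Under these assumptions four copies cost $0.40/GB and survive three disk failures, while single parity over three to six disks lands between $0.15/GB and $0.12/GB, which is consistent with the ranges quoted in the list.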

Couldn't that be done, by using LVM type block devices, used by the RAID layer which is then exposed to filesystems, so they can grow/shrink their capacity, as chunks of disk are (de)allocated to partitions?

Maybe. But then it'll need huge, very complex schemes to make it work well as a whole. This is the "microkernel vs monolithic kernel" discussion all over again.

Call me a cynic if you like but pushing every feature you could want into 1 layer, the filesystem, which should then become a generic dynamic disk management system, would appear to become a very complicated monolithic block of code. If sane implementations would then use generic layers, then you're really back to where you started.

Huh? Why "monolithic block of code"??? I'm perfectly happy with separating of functions - different filesystems are free to use the same implementation of RAID, LVM, etc - if their authors decide it's the best way to do things. Just as long as it's not exposed to userspace (or at least to user).

That's the setup I have - and HATE it

Posted Jan 8, 2009 18:56 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

what you want isn't anything resembling a traditional filesystem. what you want is something like the 'object based storage' things that are being discussed (but you want something far more complex than what has been proposed, let alone implemented or accepted).

defining the redundancy for each file as it is saved will also require changes to every single program out there, which is very unlikely to happen.

if you are willing to deal with different directories having different redundancy options, then what you want is doable today, with no kernel changes. it just needs userspace tools written to make it easier to deal with.

That's the setup I have - and HATE it

Posted Jan 8, 2009 20:11 UTC (Thu) by roblucid (subscriber, #48964) [Link]

> defining the redundancy for each file as it is saved will also require
> changes to every single program out there, which is very unlikely to happen.

Probably a lot of ppl still expect filesystems to be fast, and having every node in the hierarchy and all files, having some different form of backing storage depending on redundancy requirements... *sucks through teeth* sounds expensive to me.

The "I want to know nothing" and just have it managed by some kind of Storage management system that takes care of details, does sound a better requirement to me.

Actually I don't think "every program" would need modifying, as when files are created, they can inherit the characteristics of the parent directory, as would new sub-directories.
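The inheritance idea can be sketched in a few lines of Python. This is only an illustration of the lookup rule (a new file takes the policy of its nearest tagged ancestor directory); the policy names and the `POLICIES` table are hypothetical, and no existing filesystem interface is implied.

```python
from pathlib import PurePosixPath

# Hypothetical policy table: only explicitly tagged directories are listed.
POLICIES = {
    PurePosixPath("/"): "single",
    PurePosixPath("/games"): "parity",
    PurePosixPath("/home/me/docs"): "mirror-4",
}


def effective_policy(path: str, policies=POLICIES) -> str:
    """Return the policy of `path` itself or of its nearest tagged ancestor."""
    p = PurePosixPath(path)
    # p.parents walks from the immediate parent up to the root.
    for candidate in (p, *p.parents):
        if candidate in policies:
            return policies[candidate]
    raise KeyError(f"no policy covers {path}")


if __name__ == "__main__":
    for name in ("/home/me/docs/taxes/2008.ods", "/games/save.dat", "/movies/a.avi"):
        print(name, "->", effective_policy(name))
```

Applications that simply create files never see the policy; only whoever tags a directory needs a tool for it, which is the point made above about not modifying every program.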

That's the setup I have - and HATE it

Posted Jan 8, 2009 20:26 UTC (Thu) by roblucid (subscriber, #48964) [Link]

> Huh? Why "monolithic block of code"??? I'm perfectly happy with
> separating of functions - different filesystems are free to use the same
> implementation of RAID, LVM, etc - if their authors decide it's the best
> way to do things. Just as long as it's not exposed to userspace (or at
> least to user).

Well you seemed to imply it by saying you couldn't imagine it being done outside of the filesystem.

Permitting layers, I could imagine some specialised "meta" filesystem being feasible, that would give you a name space, that lets you tag directories and files with storage characteristics using more traditional type file systems as backing stores for bulk data storage. The real files might end up in a file hierarchy, spread over a number of volumes a bit like http proxy caches, to provide manageable chunks of RAID1, RAID0, RAID5, RAID10 storage, which can be increased (and freed) on demand by the Disk Management System.
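As a rough illustration of the "spread over a number of volumes a bit like http proxy caches" part, the Python sketch below maps a logical name and a storage class to a location inside a conventional backing filesystem, using hashed two-level subdirectories the way proxy caches typically do. The mount points, class names and `backing_path` function are all hypothetical; this is one possible layout under those assumptions, not a description of any existing meta filesystem.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical mount points of ordinary filesystems, one per storage class.
BACKING_STORES = {
    "single": PurePosixPath("/mnt/plain"),
    "parity": PurePosixPath("/mnt/raid5"),
    "mirror-4": PurePosixPath("/mnt/raid1"),
}


def backing_path(logical_path: str, storage_class: str) -> PurePosixPath:
    """Map a name in the meta namespace to a file in a backing store.

    The file is placed under two levels of hashed subdirectories so that
    no single backing directory grows without bound.
    """
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    return BACKING_STORES[storage_class] / digest[:2] / digest[2:4] / digest


if __name__ == "__main__":
    print(backing_path("/home/me/notes.txt", "mirror-4"))
```

Changing a file's storage class would then mean copying its backing file to the other store and updating the namespace entry, which is where most of the real complexity would sit.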

The only thing is, how many ppl would actually need that? And if it were provided, how many would ever use funky "Save As" options, in applications, compared to the number who would whinge about excessive options, being confusing and unclean in their precious GUI?

That's the setup I have - and HATE it

Posted Jan 9, 2009 18:43 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

Permitting layers, I could imagine some specialised "meta" filesystem being feasible, that would give you a name space, that lets you tag directories and files with storage characteristics using more traditional type file systems as backing stores for bulk data storage.

I believe you're describing object storage. The lower of those layers is the object layer. But it differs from a traditional filesystem in that the only names the objects (files) have are made up by the system (essentially, inode numbers).

In most of the work on object storage, that layer actually lives in a hardware unit separate from the one with the POSIX filesystem image, but it could be in the kernel between the filesystem drivers and the block device drivers as well (if it hasn't been done already).

But we do need to ask whether there is a need to have more than one future filesystem type with this kind of storage function before we put a lot of effort into making a reusable layer.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 16, 2009 0:02 UTC (Fri) by topher (guest, #2223) [Link]

What you're asking for is quite doable right now, using RAID+LVM (actually, only RAID is needed).

1. Keep most of the data on just one drive (for movies from my own DVDs).
2. Keep the rest in RAID-5 form (for movies in games and such: PITA to reinstall but can be done if needed).
3. Keep my own personal files (1% of total size or so) duplicated 4 times (on 4 HDDs).

Assuming 4 500GB Disks:
Drive 1: 10GB partition, RAID1; 490GB partition, stand alone
Drive 2: 10GB partition, RAID1; 490GB partition, RAID5
Drive 3: 10GB partition, RAID1; 490GB partition, RAID5
Drive 4: 10GB partition, RAID1; 490GB partition, RAID5

The 10GB partitions are all part of a 4x replicated RAID1 for your personal files. For additional redundancy across the system, put /boot and / on that RAID1 also, install GRUB in the boot sector of each disk, and you can lose any disk and still boot the system. The stand alone 490GB partition is for your movies. The 3 490GB partitions in the RAID5 are for the rest of your stuff.

It's not required, but you could make use of LVM on top of those to more easily split things out as desired.
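For reference, the usable capacity of the layout above can be worked out with a short Python sketch. The helper names are made up for illustration; the only inputs are the partition sizes given in the post and the standard capacity rules for RAID1 (size of the smallest member) and RAID5 (one member's worth of space goes to parity).

```python
def raid1_usable(members_gb):
    """A mirror holds one copy of the data: capacity of the smallest member."""
    return min(members_gb)


def raid5_usable(members_gb):
    """Single parity: one member's worth of space is lost to parity."""
    if len(members_gb) < 3:
        raise ValueError("RAID5 needs at least 3 members")
    return (len(members_gb) - 1) * min(members_gb)


personal_gb = raid1_usable([10, 10, 10, 10])  # 4-way mirror across all drives
movies_gb = 490                               # stand alone partition on drive 1
rest_gb = raid5_usable([490, 490, 490])       # RAID5 on drives 2 to 4

raw_gb = 4 * 500
usable_gb = personal_gb + movies_gb + rest_gb

if __name__ == "__main__":
    print(f"personal files: {personal_gb} GB")
    print(f"movies:         {movies_gb} GB")
    print(f"everything else: {rest_gb} GB")
    print(f"usable {usable_gb} GB of {raw_gb} GB raw")
```

The split is fixed at partitioning time: the mirrored area stays at 10 GB no matter how much of the other 1470 GB is free.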

What you're asking for in a couple of your other posts, however, is not a simple thing. It doesn't fit well with how computers work in general. You seem to be saying you want to just save something and have the computer magically understand that it's "important" or "not important" or "kind of important" and know what that means. But computers don't do that. Someone has to tell them what each of those categories means, and how they're defined. And since it's your data, it's going to be hard for someone else to do that.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 17, 2009 21:34 UTC (Sat) by speedster1 (subscriber, #8143) [Link]

What you're asking for in a couple of your other posts, however, is not a simple thing. It doesn't fit well with how computers work in general. You seem to be saying you want to just save something and have the computer magically understand that it's "important" or "not important" or "kind of important" and know what that means. But computers don't do that. Someone has to tell them what each of those categories means, and how they're defined. And since it's your data, it's going to be hard for someone else to do that.

I don't think this automatic classification of "importance" is really the killer feature khim is pining for! There are a couple of key things that are painful or impossible with current LVM+RAID (which khim does know about and is currently using, for lack of a better alternative):

  1. allocation of space among the areas of differing redundancy
  2. ability to store metadata in a higher-redundancy area than the corresponding normal data
If the normal filesystem in your example 10GB partition of high-redundancy storage fills up, your app will get an out-of-space error and you will have to hack around with LVM tools to try to reclaim some space from that big RAID 5 partition. Are you at all confident that a typical user could shrink it safely? I think a typical user would end up having to buy more disks, even though there was lots of unused space in the existing disks!

In a smart filesystem that handled levels of redundancy internally, you would not have this problem at all. The filesystem would have one big pool of storage, and would create additional regions of high redundancy as needed.
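A toy model of that single pool, in Python: every write is charged raw space according to its redundancy level, so the highly redundant area grows on demand instead of being capped by a fixed partition. The `Pool` class, the policy names and the overhead factors (which assume 4 disks) are hypothetical, and the model ignores placement constraints such as copies having to sit on distinct disks, which a real filesystem would have to respect.

```python
import errno
from fractions import Fraction


class Pool:
    """One shared pool; raw space is consumed according to redundancy."""

    # Raw GB consumed per logical GB, assuming a 4-disk system.
    OVERHEAD = {
        "single": Fraction(1),
        "parity-4": Fraction(4, 3),  # single parity striped over 4 disks
        "mirror-4": Fraction(4),     # one copy on each of 4 disks
    }

    def __init__(self, raw_gb):
        self.raw_gb = Fraction(raw_gb)
        self.used_raw = Fraction(0)

    def free_raw(self) -> Fraction:
        return self.raw_gb - self.used_raw

    def write(self, logical_gb, policy: str) -> Fraction:
        """Charge the pool for `logical_gb` of data; return raw GB consumed."""
        cost = Fraction(logical_gb) * self.OVERHEAD[policy]
        if cost > self.free_raw():
            raise OSError(errno.ENOSPC, "pool exhausted")
        self.used_raw += cost
        return cost


if __name__ == "__main__":
    pool = Pool(2000)              # four 500 GB disks
    pool.write(400, "single")      # movies
    pool.write(15, "mirror-4")     # more than a fixed 10 GB mirror could hold
    print(f"raw space left: {float(pool.free_raw()):.0f} GB")
```

In this model an out-of-space error only occurs when the disks are actually full, rather than when one pre-sized region fills up while others sit empty.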

I know it is possible to put filesystem journals on separate partitions, but I don't think point #2 is even possible with current filesystems.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 18, 2009 0:33 UTC (Sun) by Kamilion (subscriber, #42576) [Link]

... Yknow, you've basically just described ZFS on OpenSolaris. Check it out.

There's already a FUSE module for linux. If someone bothered reimplementing the ZFS filesystem with FUSE, shouldn't it be much easier to use it as the basis of a cleanroom GPL2 kernel implementation of ZFS?

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds