LWN: Comments on "Ceph distributed filesystem merged for 2.6.34"
https://lwn.net/Articles/379554/
This is a special feed containing comments posted to the individual LWN article titled "Ceph distributed filesystem merged for 2.6.34".
en-us Sat, 01 Nov 2025 14:26:45 +0000 lwn@lwn.net

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/382591/
cmccabe:

> And as for the network bandwidth, this is less of a problem with a
> distributed FS than with NFS since at least it has the chance of being
> striped across various point-to-point connections instead of all points to
> one single connection.

You're missing a very important point here. Ethernet connections in an office are almost never point-to-point. Everyone uses a star topology, so all 30 of the computers on a floor might go through one gigE switch.

In theory, the bandwidth of a dedicated gigE connection is comparable to the bandwidth between a PC and another PC on a gigE switch. In practice, the switch may not be able to handle everyone talking at once.

There may be latency issues with network I/O too. It depends on the network.

Thu, 08 Apr 2010 01:29:51 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380672/
joib:

> would you ask your local filesystem developers to ignore POSIX? Would you expect them to agree that it is too restrictive?

Why yes; after all, things like noatime and relatime are existence proofs that local fs developers are sometimes willing to sacrifice POSIX compliance for performance reasons. Heck, relatime is even the default nowadays.

For performance, many if not all distributed or cluster fs's have some kind of lazy statvfs() and lazy *stat(), and so forth, in order to get around some of the serialization that POSIX would otherwise force upon them.

And of course, some aspects of POSIX are outright incompatible with a distributed fs, e.g. fcntl(..., F_GETLK) (which node does the PID refer to?).

Sat, 27 Mar 2010 10:08:17 +0000
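A side note on that last point, not part of joib's comment: fcntl(F_GETLK) reports a conflicting lock through struct flock, and the only identification of the lock's owner it can return is the l_pid field, a bare process ID. On a single machine that is unambiguous; on a distributed filesystem there is no way to say which node's PID it is. A minimal sketch of the call, using the hypothetical path /shared/somefile:

```c
/* Sketch of the F_GETLK ambiguity mentioned above. The path is a
 * placeholder; the point is that fl.l_pid is only a PID, with no
 * notion of which node that PID lives on. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/shared/somefile", O_RDWR);
    if (fd < 0)
        return 1;

    struct flock fl = {
        .l_type   = F_WRLCK,   /* ask: could I take a write lock? */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 means "to end of file" */
    };

    if (fcntl(fd, F_GETLK, &fl) == 0 && fl.l_type != F_UNLCK)
        printf("conflicting lock held by pid %ld -- on which node?\n",
               (long)fl.l_pid);

    close(fd);
    return 0;
}
```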
Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380660/
martinfick:

Not to mention that the itty-bitty channel to the drives is there whether the FS is local or distributed.

And as for the network bandwidth, this is less of a problem with a distributed FS than with NFS since at least it has the chance of being striped across various point-to-point connections instead of all points to one single connection.

Sat, 27 Mar 2010 04:15:38 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380651/
dlang:

You are thinking of the itty-bitty data channel to the drives, but you are forgetting that the local system doesn't need to go to the drives most of the time. It can drastically speed things up by keeping the data cached in memory, which has a pretty fast and low-latency interface to the cores.

Sat, 27 Mar 2010 02:06:23 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380646/
martinfick:

> POSIX semantics, especially atime and the serialization constraints, tend to consume even more of the scarce network bandwidth.

Methinks you are confused. Those work better with (require is too strong a word) low-latency links, not high bandwidth.

> POSIX semantics are useful in local filesystems because programs rely on them for IPC. Distributed filesystems are rarely used for IPC -- or if they are, the semantics of the FS are customized ahead of time to work well for that, like in Hadoop's filesystem (HDFS), Google-FS, or MPI-IO.

I think you are stuck and cannot think beyond what is done today. The point is not to design something new to use a distributed POSIX filesystem, but to take currently working applications and entire virtual machine operating system images and simply run them over a distributed FS instead of locally, so that you get all the benefits of a distributed FS (yes, that also means you get the downsides as well, such as higher latency). But in many cases the gains will simply outweigh the downsides: not everything needs supercomputer low-latency interconnects, but almost everything can benefit from HA and larger, single-namespace filesystems.

Just because things were slow yesterday does not mean they will be slow tomorrow. What is commonplace today for supercomputers will be commonplace tomorrow on desktops. I would even argue that distributed FSes should have been common yesterday; we are way behind the curve on this one. Fast networking gear is getting cheaper and cheaper and can already outpace disk bandwidth. With the growing size of hard disks, so much space is wasted, but even worse, the failure rates are going up and nothing is addressing this. The average desktop has way too big a drive, but rarely uses RAID since that requires a second drive. Why should we stick with NFS when it is a) not POSIX, b) not HA or distributed, and c) does not scale disk space easily enough?

Sat, 27 Mar 2010 01:32:06 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380641/
cmccabe:

First of all, Ceph is a big achievement! Block-level filesystems are hard, and distributed systems are often even harder!

> None of the proprietary solutions are things that your average company is
> going to install on their desktops which likely have virtually unused TB
> drives (except for the possibly large mp3 collections scattered here and
> there). There clearly is an opportunity to harness tons of corporate
> wasted disk space and CPU into a shared pool instead of wasting good money
> on obsolete high end fileservers which typically don't even provide HA.

Modern PCs have enormous multi-core, multi-GHz CPUs connected to itty-bitty I/O channels.
POSIX semantics, especially atime and the serialization constraints, tend to consume even more of the scarce network bandwidth.

POSIX semantics are useful in local filesystems because programs rely on them for IPC. Distributed filesystems are rarely used for IPC -- or if they are, the semantics of the FS are customized ahead of time to work well for that, like in Hadoop's filesystem (HDFS), Google-FS, or MPI-IO.

So just go the NFS route and make something useful, rather than something that emulates the behavior of a local filesystem circa 1976.

Sat, 27 Mar 2010 01:08:09 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380564/
martinfick:

Bluntly saying POSIX is too restrictive seems overgeneralized. For some cases I am sure that is true, but again, would you ask your local filesystem developers to ignore POSIX? Would you expect them to agree that it is too restrictive? If you start with the mindset that you want distributed FSes for things that are typically done on distributed FSes, then it will likely seem restrictive. But if you start with the frame of reference of things that are typically done on local FSes, and you simply want to migrate to a distributed FS without recoding applications or worrying about consistency issues, then I can assure you that POSIX semantics become very important.

- GPFS: very nice.

- Lustre: does it provide HA, or just fast (striped) access?

- GFS (I assume you mean Red Hat's GFS, not Google FS) is not really a distributed FS; it requires a shared backend.

- Panasas: is this really POSIX (hard to figure out from their site)? pNFS, yes. Also, this seems to use their custom hardware on the backend which happens to be distributed, a black box so to speak. So this is not really a distributed FS that others can install and use.

So it seems like possibly only 2 or 3 solutions, and there are many, many more distributed FSes out there. I would say that overall POSIX is quite rare for distributed FSes. And since none of those rare exceptions are free/libre/open source, surely you would concede that Ceph is not filling an already crowded space?

None of the proprietary solutions are things that your average company is going to install on their desktops, which likely have virtually unused TB drives (except for the possibly large mp3 collections scattered here and there). There clearly is an opportunity to harness tons of corporate wasted disk space and CPU into a shared pool instead of wasting good money on obsolete high-end fileservers which typically don't even provide HA.

Fri, 26 Mar 2010 16:58:26 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380550/
roblatham:

Yes, NFS is too weak. POSIX is too restrictive. I'd like to point out that especially in the supercomputing domain there is a "third way" to do consistency semantics: the MPI standard defines "MPI-IO consistency semantics", which give applications assurances as to when data will be visible (no bizarro NFS client-side caching behaviors) but relax the POSIX rules in ways that only help performance.

Because you asked, distributed file systems that implement POSIX consistency semantics:

- GPFS
- Lustre
- GFS
- Panasas
- now Ceph, apparently

Fri, 26 Mar 2010 15:07:39 +0000
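To give a concrete feel for the "third way" roblatham describes (this example is mine, not his): under the default, non-atomic MPI-IO semantics, data written by one process only has to become visible to another process after a sync/barrier/sync sequence, which is weaker than POSIX but much cheaper for a parallel filesystem to provide. A minimal sketch, with a made-up file name and payload:

```c
/* Sketch of MPI-IO's relaxed consistency: rank 0 writes an int, and the
 * sync / barrier / sync sequence is what guarantees rank 1 can read it.
 * "datafile" and the single-int payload are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42, readback = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    if (rank == 0)
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

    /* Without this sequence, rank 1 may legally see stale data. */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    if (rank == 1) {
        MPI_File_read_at(fh, 0, &readback, 1, MPI_INT, MPI_STATUS_IGNORE);
        printf("rank 1 read %d\n", readback);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```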
Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380308/
bronson:

Merge windows are starting to feel like the paradox of the unexpected hanging...
http://2000clicks.com/MathHelp/JokeProofUnexpectedHanging.aspx

Thu, 25 Mar 2010 13:31:30 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/380306/
clugstj:

It was unpredictable (that he would stretch the merge window out so far).

Thu, 25 Mar 2010 13:07:28 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379961/
cmccabe:

Thanks for the informative reply.

It sounds like Ceph has some features that even Lustre doesn't have. I hope that the supercomputer guys get a chance to try it out sometime!

Wed, 24 Mar 2010 00:13:02 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379898/
martinfick:

> Also, why do people still like implementing POSIX filesystem semantics on distributed filesystems? Intuitively it seems like POSIX semantics demand a really fast interconnect between nodes. In contrast, systems that implement weaker forms of consistency, like the Andrew File System, can perform well with lower-end networking gear. I think most system administrators would be satisfied with something like AFS's open-to-close consistency guarantee. I mean, most system administrators still use NFS, which has horrible consistency semantics.

Why do people implement them? Probably because users want them! I know I do. After all, surely you can agree that POSIX semantics are useful? We like them on our local filesystems. Why would I not like them on distributed ones? Granted, they are likely to have high latencies, but that doesn't mean that I don't want POSIX semantics.

Yes, AFS/NFS semantics are fine for some applications, but there are many for which they are not. With today's virtualization trend, it becomes more and more important to be able to use shared/distributed filesystems, but this is of little use if one has to think about which virtualized applications really need POSIX and which do not, and then segregate accordingly.

So, users want POSIX whether local or distributed (try asking the ext4/ZFS/BTRFS developers why they feel the need to implement POSIX). And, of course, the holy grail for many sysadmins would be a fast/distributed/scalable/HA/POSIX fs, which Ceph promises to be. I am happy that Ceph is at least attempting to do this. You seem to imply that there are tons of FSes doing this; who else is?

Tue, 23 Mar 2010 18:58:25 +0000

Subscribe!
https://lwn.net/Articles/379693/
sunr2007:

Thanks! I will subscribe by the end of this month. Waiting for April 1st so that I'll get my salary!! :)

Sun, 21 Mar 2010 13:25:36 +0000
Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379689/
joib:

Disclaimer: I've never used Ceph, and while I have used Lustre extensively, I've never admined it.

Ceph has fault tolerance as an integral part of the design. You specify data to be replicated N times, and Ceph takes care of the rest; if a node fails, the data on the failed node is automatically replicated to other nodes in order to bring the system up to the desired replication level. To get HA with Lustre, you set up nodes in active-passive pairs with "traditional" HA software; Lustre itself has nothing to do with HA.

Ceph seems to have some kind of automatic data migration to avoid hotspots. Ceph has distributed metadata; Lustre has had it on the roadmap for quite a while, but AFAIK it's not yet delivered.

Lustre, AFAIK, uses file allocation tables to specify on which nodes a file is striped; Ceph uses some kind of hash function.

Lustre server components are kernel-space, using a customized ext3/4 fs; Ceph servers are user-space, using a normal fs, though apparently snapshot functionality requires btrfs. I think the Lustre devs are in the process of moving the servers to user space and using the ZFS DMU for storage.

Performance-wise, I suspect Lustre is very hard to beat, especially when using MPI-IO, for which it has been tuned for years on some very large systems.

Wrt. POSIX semantics, yes, I believe they might not always be a good match for distributed filesystems. Heck, even on local fs's we often relax POSIX semantics to improve performance (noatime, relatime). For concurrent operation, AFAICS the problem is not necessarily that POSIX requires a fast interconnect, but rather that strict adherence to POSIX serializes many operations due to the requirement that metadata is always consistent. E.g. if you have multiple clients writing to a file concurrently, what is stat() supposed to return? I believe both Ceph and Lustre provide, at least optionally, relaxed POSIX semantics for metadata consistency. Also, strict POSIX compliance in the face of concurrent access limits IO buffering. Anyway, many applications seem to do just fine with the close-to-open consistency that NFS provides, plus explicit locks (I believe at least as of NFSv4 acquiring a lock forces cache revalidation).

Altogether, Ceph looks very interesting IMHO. I'd love to be able to build a scalable fault-tolerant fs Google-style using cheap commodity hardware.

Sun, 21 Mar 2010 11:14:37 +0000
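On joib's remark that Ceph places data with "some kind of hash function" rather than allocation tables: the real algorithm is CRUSH, which is far more elaborate than anything shown here, but the general idea of table-free placement can be sketched with plain rendezvous-style hashing, where every client independently computes which nodes hold an object. The node count, replica count, and hash below are all made up for illustration:

```c
/* Toy illustration of hash-based placement (NOT Ceph's actual CRUSH
 * algorithm): any client can compute which nodes store an object by
 * scoring every node against the object id and taking the top N,
 * with no central allocation table to consult. */
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 8
#define REPLICAS  3

/* Small 64-bit mixing function, standing in for a real hash. */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* Fill out[0..REPLICAS-1] with the nodes responsible for object_id,
 * ordered from highest score to lowest. */
static void place(uint64_t object_id, int out[REPLICAS])
{
    uint64_t best[REPLICAS] = { 0 };

    for (int i = 0; i < REPLICAS; i++)
        out[i] = -1;

    for (int node = 0; node < NUM_NODES; node++) {
        uint64_t score = mix(object_id ^ mix((uint64_t)node));

        for (int i = 0; i < REPLICAS; i++) {
            if (out[i] == -1 || score > best[i]) {
                /* Shift lower-ranked entries down and insert here. */
                for (int j = REPLICAS - 1; j > i; j--) {
                    best[j] = best[j - 1];
                    out[j]  = out[j - 1];
                }
                best[i] = score;
                out[i]  = node;
                break;
            }
        }
    }
}

int main(void)
{
    int nodes[REPLICAS];

    place(12345, nodes);
    printf("object 12345 -> nodes %d, %d, %d\n",
           nodes[0], nodes[1], nodes[2]);
    return 0;
}
```

If a node is removed from the pool, only the objects that hashed to it get new homes, which is the property that makes this style of placement attractive for rebalancing.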
Subscribe!
https://lwn.net/Articles/379688/
man_ls:

There is a handy "Subscribe" link at the top of every LWN page, and xe.com (http://www.xe.com/) converts the $2.50 of the basic rate to 113 rupees. It should be at worst about 1/200th of the living costs in your city (20k rupees, per http://www.indiamike.com/india/moving-to-bangalore-f100/cost-of-living-in-india-bangalore-t43211/), so roughly equivalent to the highest rate ($10) for a European or US resident (assuming living costs of $2k). An excellent deal, I would say.

Sun, 21 Mar 2010 10:06:17 +0000

Subscribe!
https://lwn.net/Articles/379687/
sunr2007:

I'm from Bangalore, India. Can you please suggest the subscription rates in terms of INR? How can I subscribe here?
Warm regards,
Ravi Kulkarni

Sun, 21 Mar 2010 09:24:09 +0000

Subscribe!
https://lwn.net/Articles/379669/
man_ls:

May I suggest a great way of supporting LWN.net authors: subscribe! You get great value for your money, and can adjust your subscription level to your earnings. Note: I am not affiliated with LWN.net, I'm just a happy camper.

Sat, 20 Mar 2010 22:43:27 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379666/
cmccabe:

Does anyone know how Ceph compares with Lustre?

I can think of a few similarities: both use POSIX semantics and have separate metadata and object data storage nodes. Also, it seems like both Lustre and Ceph do striping.

Of course, Lustre is still not in the mainline.

Also, why do people still like implementing POSIX filesystem semantics on distributed filesystems? Intuitively it seems like POSIX semantics demand a really fast interconnect between nodes. In contrast, systems that implement weaker forms of consistency, like the Andrew File System, can perform well with lower-end networking gear. I think most system administrators would be satisfied with something like AFS's open-to-close consistency guarantee. I mean, most system administrators still use NFS, which has horrible consistency semantics.

Sat, 20 Mar 2010 21:01:19 +0000

Xen
https://lwn.net/Articles/379657/
antn:

Thanks for the answer.

So, even if the almost-ready Xen 4.0 release will for the first time abandon the ancient non-paravirtops dom0 patches in favor of paravirtops-based ones, inclusion in mainline is still not close to happening, contrary to what one might naively have guessed.

(FWIW, I'm currently running Slackware 13.0 as a dom0 through the SUSE patches, rebased for kernel.org 2.6.31 by A. Lyon.)

Sat, 20 Mar 2010 16:04:08 +0000

Xen
https://lwn.net/Articles/379656/
corbet:

There has been no pull request that I'm aware of for Xen stuff -- and the nature of that code is such that it's hard to be unaware of an attempt to merge it. So no, we'll not see it in 2.6.34. I'd say 2.6.35 is unlikely too, unless something is posted in the fairly near future.

Sat, 20 Mar 2010 15:12:04 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379653/
antn:

With apologies for the somewhat off-topic and pointless question, is there any hope of seeing Xen dom0 support merged in this release?
The Xen wiki paravirtops page still says that inclusion is planned for 2.6.34 or 2.6.35.

[Since this is my first post, I would like to state here my great appreciation for lwn.net authors, contents, and posters.]

Thanks

Sat, 20 Mar 2010 14:50:20 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379634/
csamuel:

Yes, and Ceph was specifically listed there, so it shouldn't be a real surprise.

Very glad to see it get in!

Sat, 20 Mar 2010 06:55:04 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379584/
gfarnum:

Didn't Linus say when he put out rc1 that there would be more stuff coming?
http://lkml.org/lkml/2010/3/8/280

Fri, 19 Mar 2010 19:16:21 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379583/
proski:

The rationale is that "the new stuff" cannot have regressions, unlike fixes to the existing code.

Fri, 19 Mar 2010 19:04:41 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379572/
dlang:

The end of the merge window has never been the end of merging new stuff; that only comes when Linus stops accepting new things to merge. From the very beginning, he has merged significant amounts of stuff between -rc1 and -rc2; it's only after about -rc2 or -rc3 that things really settle down to fixes.

Fri, 19 Mar 2010 18:12:08 +0000

Ceph distributed filesystem merged for 2.6.34
https://lwn.net/Articles/379570/
vonbrand:

Bad Linus, bad.

(The whole point of "unpredictable" merge windows was supposed to be making last-minute surprises unlikely...)

Fri, 19 Mar 2010 18:06:42 +0000