Ceph distributed filesystem merged for 2.6.34
Posted Mar 20, 2010 21:01 UTC (Sat) by cmccabe (guest, #60281)
Parent article: Ceph distributed filesystem merged for 2.6.34
I can think of a few similarities: both use POSIX semantics and have separate metadata and object data storage nodes. Also, it seems like both Lustre and Ceph do striping.
Of course, Lustre is still not in the mainline.
Also, why do people still like implementing POSIX filesystem semantics on distributed filesystems? Intuitively it seems like POSIX semantics demand a really fast interconnect between nodes. In contrast, systems that implement weaker forms of consistency, like the Andrew File System, can perform well with lower-end networking gear. I think most system administrators would be satisfied with something like AFS's open-to-close consistency guarantee. I mean, most system administrators still use NFS, which has horrible consistency semantics.
Posted Mar 21, 2010 11:14 UTC (Sun) by joib (subscriber, #8541) [Link] (1 responses)
Ceph has fault tolerance as an integral part of the design. You specify that data is to be replicated N times, and Ceph takes care of the rest; if a node fails, the data on the failed node is automatically replicated to other nodes in order to bring the system back up to the desired replication level. To get HA with Lustre, you set up nodes in active-passive pairs with "traditional" HA software; Lustre itself has nothing to do with HA.
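The basic loop is easy to picture. A toy sketch in C of the idea only (not Ceph's actual placement or recovery code; the node count and repair policy here are made up):

    /* Toy sketch: keep each object on REPLICAS of the live nodes;
     * when a node dies, re-replicate onto other nodes until the
     * desired replication level is restored. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NODES    5
    #define REPLICAS 3

    static bool alive[NODES] = { true, true, true, true, true };
    static bool has_copy[NODES];        /* which nodes hold the object */

    static int live_copies(void)
    {
        int n = 0;
        for (int i = 0; i < NODES; i++)
            if (alive[i] && has_copy[i])
                n++;
        return n;
    }

    /* Bring the object back up to REPLICAS copies after a failure. */
    static void repair(void)
    {
        for (int i = 0; i < NODES && live_copies() < REPLICAS; i++)
            if (alive[i] && !has_copy[i])
                has_copy[i] = true;     /* copy from a surviving replica */
    }

    int main(void)
    {
        has_copy[0] = has_copy[1] = has_copy[2] = true;  /* initial placement */
        alive[1] = false;                                /* node 1 fails */
        printf("copies after failure: %d\n", live_copies());
        repair();
        printf("copies after repair:  %d\n", live_copies());
        return 0;
    }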
Ceph seems to have some kind of automatic data migration to avoid hotspots. Ceph has distributed metadata; Lustre has had it on the roadmap for quite a while, but AFAIK it's not yet delivered.
Lustre, AFAIK, uses file allocation tables to specify on which nodes a file is striped; Ceph uses some kind of hash function.
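To illustrate the difference, here is a minimal sketch (my own illustration, not Ceph's actual CRUSH algorithm; the hash and node count are arbitrary): with hash-based placement any client can compute where an object lives, so there is no per-file allocation table to store, replicate, or consult.

    /* Toy hash-based placement: the location of each object replica is
     * a pure function of its name, so no lookup table is needed.  A
     * real placement function would also guarantee that the replicas
     * land on distinct nodes. */
    #include <stdio.h>

    #define NODES 8

    /* FNV-1a hash of the object name. */
    static unsigned int hash_name(const char *name)
    {
        unsigned int h = 2166136261u;
        for (; *name; name++) {
            h ^= (unsigned char)*name;
            h *= 16777619u;
        }
        return h;
    }

    /* The r-th replica of an object lands on a node derived from the hash. */
    static int place(const char *object, int replica)
    {
        return (hash_name(object) + replica * 2654435761u) % NODES;
    }

    int main(void)
    {
        for (int r = 0; r < 3; r++)
            printf("myfile.0001 replica %d -> node %d\n",
                   r, place("myfile.0001", r));
        return 0;
    }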
Lustre server components are kernel-space, using a customized ext3/4 filesystem; the Ceph servers are user-space, using a normal filesystem, though apparently the snapshot functionality requires btrfs. I think the Lustre devs are in the process of moving the servers to user space and using the ZFS DMU for storage.
Performance-wise, I suspect Lustre is very hard to beat, especially when using MPI-IO, for which it has been tuned for years on some very large systems.
Wrt. POSIX semantics, yes, I believe they might not always be a good match for distributed filesystems. Heck, even on local fs's we often relax POSIX semantics to improve performance (noatime, relatime). For concurrent operation, AFAICS the problem is not necessarily that POSIX requires a fast interconnect, but rather that strict adherence to POSIX serializes many operations due to the requirement that metadata is always consistent. E.g. if you have multiple clients writing to a file concurrently, what is stat() supposed to return? I believe both Ceph and Lustre provide, at least optionally, relaxed POSIX semantics for metadata consistency. Also, strict POSIX compliance in the face of concurrent access limits IO buffering. Anyway, many applications seem to do just fine with the close-to-open consistency that NFS provides, plus explicit locks (I believe that at least as of NFSv4, acquiring a lock forces cache revalidation).
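A concrete way to see the stat() problem (the mount point and file name below are placeholders): under strict POSIX, once any client's write() has returned, a stat() on any other client must already reflect it, which is exactly the cluster-wide agreement that costs a round trip; a relaxed "lazy stat" mode can instead answer from possibly stale cached metadata.

    /* With N clients appending to shared.log concurrently, answering
     * this "correctly" means the size is agreed on cluster-wide before
     * the call returns. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        if (stat("/mnt/dfs/shared.log", &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("size=%lld mtime=%ld\n",
               (long long)st.st_size, (long)st.st_mtime);
        return 0;
    }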
Altogether, Ceph looks very interesting IMHO. I'd love to be able to build a scalable fault-tolerant fs google-style using cheap commodity hardware.
Posted Mar 24, 2010 0:13 UTC (Wed) by cmccabe (guest, #60281) [Link]
It sounds like Ceph has some features that even Lustre doesn't have. I hope that the supercomputer guys get a chance to try it out sometime!
Posted Mar 23, 2010 18:58 UTC (Tue) by martinfick (subscriber, #4455) [Link] (8 responses)
"Also, why do people still like implementing POSIX filesystem semantics on distributed filesystems? Intuitively it seems like POSIX semantics demand a really fast interconnect between nodes. In contrast, systems that implement weaker forms of consistency, like the Andrew File System, can perform well with lower-end networking gear. I think most system administrators would be satisfied with something like AFS's open-to-close consistency guarantee. I mean, most system administrators still use NFS, which has horrible consistency semantics.
Why do people implement them? Probably because users want them! I know I do. After all, surely you can agree that POSIX semantics are useful? We like them on our local filesystems. Why would I not like them on distributed ones? Granted, they are likely to have high latencies, but that doesn't mean that I don't want POSIX semantics.
Yes, AFS/NFS semantics are fine for some applications, but there are many for which they are not. With today's virtualization trend, it becomes more and more important to be able to use shared/distributed filesystems, but this is of little use if one has to think about which virtualized applications really need POSIX and which do not, and then segregate them accordingly.
So, users want POSIX whether local or distributed (try asking the ext4/ZFS/BTRFS developers why they feel the need to implement POSIX). And, of course, the holy grail for many sysadmins would be a fast/distributed/scalable/HA/POSIX fs, which Ceph promises to be. I am happy that Ceph is at least attempting to do this. You seem to imply that there are tons of FSes doing this; who else is?
Posted Mar 26, 2010 15:07 UTC (Fri) by roblatham (guest, #1579) [Link] (7 responses)
Because you asked: distributed file systems that implement POSIX consistency semantics:
- GPFS
- Lustre
- GFS
- Panasas
- now Ceph apparently
Especially in the supercomputing domain there is a "third way" to do consistency semantics: the MPI standard defines "MPI-IO consistency semantics", which give applications assurances as to when data will be visible (no bizarro NFS client-side caching behaviors) but relax the POSIX rules in ways that only help performance.
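As a sketch of what those MPI-IO semantics look like in practice (the file name below is a placeholder): in the default non-atomic mode, data written by one rank is only guaranteed to be visible to another rank after a "sync, barrier, sync" sequence, which is much cheaper for the filesystem to provide than POSIX's immediate visibility.

    /* Rank 0 writes, rank 1 reads it back.  In MPI-IO's default
     * (non-atomic) mode the write is only guaranteed to be visible to
     * rank 1 after the sync / barrier / sync sequence below. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        if (rank == 0)
            MPI_File_write_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_sync(fh);               /* flush rank 0's data */
        MPI_Barrier(MPI_COMM_WORLD);     /* order the accesses */
        MPI_File_sync(fh);               /* drop stale cached data */

        if (rank == 1) {
            MPI_File_read_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);
            printf("rank 1 read %d\n", value);
        }

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }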
Posted Mar 26, 2010 16:58 UTC (Fri) by martinfick (subscriber, #4455) [Link] (6 responses)
- GPFS, very nice.
- Lustre, does it provide HA, or just fast (striped) access?
- GFS (I assume you mean Red Hat's GFS, not the Google File System) is not really a distributed FS; it requires a shared backend.
- Panasas, is this really POSIX (hard to figure out from their site)? pNFS, yes. Also, it seems to use their custom hardware on the backend, which happens to be distributed, a black box so to speak. So this is not really a distributed FS that others can install and use.
So, it seems like possibly only 2 or 3 solutions, and there are many, many more distributed FSes out there. I would say that overall POSIX is quite rare for distributed FSes. And since none of those rare exceptions are free/libre/open source, surely you would concede that Ceph is not filling an already crowded space?
None of the proprietary solutions are things that your average company is going to install on their desktops which likely have virtually unused TB drives (except for the possibly large mp3 collections scattered here and there). There clearly is an opportunity to harness tons of corporate wasted disk space and CPU into a shared pool instead of wasting good money on obsolete high end fileservers which typically don't even provide HA.
Posted Mar 27, 2010 1:08 UTC (Sat) by cmccabe (guest, #60281) [Link] (4 responses)
> None of the proprietary solutions are things that your average company is going to install on their desktops which likely have virtually unused TB drives (except for the possibly large mp3 collections scattered here and there). There clearly is an opportunity to harness tons of corporate wasted disk space and CPU into a shared pool instead of wasting good money on obsolete high end fileservers which typically don't even provide HA.
Modern PCs have enormous multi-core, multi-GHz CPUs connected to itty-bitty I/O channels. POSIX semantics, especially atime and the serialization constraints, tend to consume even more of the scarce network bandwidth.
POSIX semantics are useful in local filesystems because programs rely on them for IPC. Distributed filesystems are rarely used for IPC-- or if they are, the semantics of the FS are customized ahead of time to work well for that, like in Hadoop's filesystem (HDFS), Google-FS, or MPI-IO.
So just go the NFS route and make something useful, rather than something that emulates the behavior of a local filesystem circa 1976.
Posted Mar 27, 2010 1:32 UTC (Sat) by martinfick (subscriber, #4455) [Link]
> POSIX semantics, especially atime and the serialization constraints, tend to consume even more of the scarce network bandwidth.
Methinks you are confused. Those work better with (require is too strong a word) low-latency links, not high bandwidth.
> POSIX semantics are useful in local filesystems because programs rely on them for IPC. Distributed filesystems are rarely used for IPC-- or if they are, the semantics of the FS are customized ahead of time to work well for that, like in Hadoop's filesystem (HDFS), Google-FS, or MPI-IO.
I think you are stuck and cannot think beyond what is done today. The point is not to design something new to use a distributed POSIX filesystem, but to take currently working applications and entire virtual machine operating system images and simply run them over a distributed FS instead of locally, so that you get all the benefits of a distributed FS (yes, that also means you get the downsides as well, such as higher latency). But in many cases the gains will simply outweigh the downsides; not everything needs supercomputer low-latency interconnects, but almost everything can benefit from HA and larger single-namespace filesystems.
Just because things were slow yesterday does not mean they will be slow tomorrow. What is commonplace today for supercomputers will be commonplace tomorrow on desktops. I would even argue that distributed FSes should have been common yesterday; we are way behind the curve on this one. Fast networking gear is getting cheaper and cheaper and can already outpace disk bandwidth. With the growing size of hard disks, so much space is wasted, but even worse, the failure rates are going up and nothing is addressing this. The average desktop has way too big a drive, but rarely uses RAID since that requires a second drive. Why should we stick with NFS when it is a) not POSIX, b) not HA or distributed, and c) does not scale disk space easily enough?
Posted Mar 27, 2010 2:06 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)
Posted Mar 27, 2010 4:15 UTC (Sat) by martinfick (subscriber, #4455) [Link] (1 responses)
And as for the network bandwidth, this is less of a problem with a distributed FS than with NFS since at least it has the chance of being striped across various point-to-point connections instead of all points to one single connection.
Posted Apr 8, 2010 1:29 UTC (Thu) by cmccabe (guest, #60281) [Link]
> distributed FS than with NFS since at least it has the chance of being striped across various point-to-point connections instead of all points to one single connection.
You're missing a very important point here. Ethernet connections in an office are almost never point-to-point. Everyone uses a star topology, so all 30 of the computers on a floor might go through one gig-E switch.
In theory, the bandwidth of a dedicated gig-E connection is comparable to the bandwidth between one PC and another PC on a 1 gig-E switch. In practice, the switch may not be able to handle everyone talking at once.
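Rough numbers for that worst case (ignoring protocol overhead, and assuming all 30 clients are talking to the same server or funneling through one shared uplink):

    /* Back-of-the-envelope: a dedicated 1 Gb/s link is roughly
     * 125 MB/s, but 30 clients sharing one uplink or one destination
     * divide that bandwidth rather than multiply it. */
    #include <stdio.h>

    int main(void)
    {
        const double gige_MBps = 1000.0 / 8.0;   /* ~125 MB/s */
        const int clients = 30;

        printf("dedicated gig-E per client : %.1f MB/s\n", gige_MBps);
        printf("30 clients on one uplink   : %.1f MB/s each\n",
               gige_MBps / clients);             /* ~4 MB/s each */
        return 0;
    }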
There may be latency issues with network I/O too. It depends on the network.
Posted Mar 27, 2010 10:08 UTC (Sat) by joib (subscriber, #8541) [Link]
> would you ask your local filesystem developers to ignore POSIX? Would you expect them to agree that it is too restrictive.
Why yes, after all things like noatime and relatime are existence proofs that sometimes local fs developers are willing to sacrifice POSIX compliance for performance reasons. Heck, relatime is even the default nowadays.
For performance, many if not all distributed or cluster fs's have some kind of lazy statvfs() and lazy *stat(), and so forth, in order to get around some of the serialization that POSIX would otherwise force upon them.
And of course, some aspects of POSIX are outright incompatible with a distributed fs. E.g. fcntl(..., F_GETLK) (which node does the PID refer to?).
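For the curious, the F_GETLK case in miniature (the path below is just a placeholder): the conflicting lock is reported as a PID, which only identifies a process on whichever node of the cluster happens to hold it.

    /* Ask who holds a conflicting lock.  On a local fs the returned
     * l_pid is meaningful; on a distributed fs there is no guarantee
     * it refers to a process on this node. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,     /* "could I take a write lock?" */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,           /* whole file */
        };
        int fd = open("/mnt/dfs/shared.db", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (fcntl(fd, F_GETLK, &fl) == 0 && fl.l_type != F_UNLCK)
            printf("conflicting lock held by pid %ld\n", (long)fl.l_pid);
        close(fd);
        return 0;
    }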
