Ceph distributed filesystem merged for 2.6.34

Posted Mar 21, 2010 11:14 UTC (Sun) by joib (subscriber, #8541)
In reply to: Ceph distributed filesystem merged for 2.6.34 by cmccabe
Parent article: Ceph distributed filesystem merged for 2.6.34

Disclaimer: I've never used Ceph, and while I have used Lustre extensively, I've never administered it.

Ceph has fault tolerance as an integral part of the design. You specify that data be replicated N times, and Ceph takes care of the rest: if a node fails, the data that was on it is automatically re-replicated to other nodes to bring the system back up to the desired replication level. To get HA with Lustre, you set up nodes in active-passive pairs with "traditional" HA software; Lustre itself has nothing to do with HA.
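To make the "replicated N times" idea concrete, here is a toy sketch in Python. It is not how Ceph implements recovery; the function name, node names, and object names are all made up for illustration, and it only shows the bookkeeping of topping each object back up to N copies after a node disappears.

```python
def re_replicate(placement, live_nodes, n):
    """Return a new placement where each object has n replicas on live nodes.

    placement: dict mapping object name -> list of nodes holding a copy.
    live_nodes: set of nodes that are still up.
    Hypothetical toy model, not Ceph's recovery algorithm.
    """
    live = sorted(live_nodes)
    repaired = {}
    for obj, holders in placement.items():
        survivors = [h for h in holders if h in live_nodes]
        if not survivors:
            # Every copy was on failed nodes: nothing left to copy from.
            repaired[obj] = []
            continue
        # New copies go to live nodes that do not already hold the object.
        candidates = [node for node in live if node not in survivors]
        needed = max(0, n - len(survivors))
        repaired[obj] = survivors + candidates[:needed]
    return repaired


placement = {
    "obj-a": ["node1", "node2"],
    "obj-b": ["node2", "node3"],
    "obj-c": ["node3", "node1"],
}
# node2 has failed; node4 is a spare that was already in the cluster.
after_failure = re_replicate(placement, {"node1", "node3", "node4"}, n=2)
```

In the active-passive model, by contrast, nothing like this runs inside the filesystem: the passive node simply takes over the same storage when its partner dies.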

Ceph seems to have some kind of automatic data migration to avoid hotspots. Ceph has distributed metadata; Lustre has had it on the roadmap for quite a while, but AFAIK it has not yet been delivered.

Lustre, AFAIK, uses file allocation tables to specify the nodes across which a file is striped; Ceph uses some kind of hash function.
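The difference between the two approaches can be sketched roughly as follows. This is only an illustration: the table contents and node names are invented, and the hash-based variant uses plain rendezvous (highest-random-weight) hashing as a stand-in, which is not Ceph's actual placement function.

```python
import hashlib

# Table-based placement: somebody has to store and look up this mapping.
# The path and node names here are hypothetical.
allocation_table = {"/data/run1.out": ["ost0", "ost2"]}


def table_placement(name):
    """Look up where a file lives in an explicit table."""
    return allocation_table[name]


def score(name, node):
    """Deterministic pseudo-random weight for a (file, node) pair."""
    digest = hashlib.sha256(f"{name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_placement(name, nodes, replicas):
    """Pick the `replicas` nodes with the highest score for this name.

    Any client that knows the node list computes the same answer,
    so no per-file table needs to be consulted.
    """
    ranked = sorted(nodes, key=lambda node: score(name, node), reverse=True)
    return ranked[:replicas]


nodes = ["ost0", "ost1", "ost2", "ost3"]
chosen = hashed_placement("/data/run1.out", nodes, replicas=2)
```

A nice property of this particular stand-in is that removing a node that was not chosen for a file leaves that file's placement unchanged, so only data on the affected node has to move.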

Lustre server components run in kernel space on a customized ext3/4 fs; Ceph servers run in user space on a normal fs, though apparently snapshot functionality requires btrfs. I think the Lustre devs are in the process of moving the servers to user space and using the ZFS DMU for storage.

Performance-wise, I suspect Lustre is very hard to beat, especially when using MPI-IO for which it has been tuned for years on some very large systems.

Wrt. POSIX semantics, yes, I believe they might not always be a good match for distributed filesystems. Heck, even on local filesystems we often relax POSIX semantics to improve performance (noatime, relatime). For concurrent operation, AFAICS the problem is not necessarily that it requires a fast interconnect, but rather that strict adherence to POSIX serializes many operations, due to the requirement that metadata is always consistent. E.g. if you have multiple clients writing to a file concurrently, what is stat() supposed to return? I believe both Ceph and Lustre provide, at least optionally, relaxed POSIX semantics for metadata consistency. Also, strict POSIX compliance in the face of concurrent access limits IO buffering. Anyway, many applications seem to do just fine with the close-to-open consistency that NFS provides, plus explicit locks (I believe that, at least as of NFSv4, acquiring a lock forces cache revalidation).
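For anyone unfamiliar with close-to-open consistency, here is a toy model of the idea in Python. It is a deliberately simplified sketch, not NFS client code: the Server and Client classes are hypothetical, whole files are treated as single strings, and the only rules modelled are "flush on close" and "revalidate on open".

```python
class Server:
    """Holds the authoritative contents and a version counter per file."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def write(self, path, content):
        self.data[path] = content
        self.version[path] = self.version.get(path, 0) + 1


class Client:
    """Caches file contents; talks to the server only on open and close."""

    def __init__(self, server):
        self.server = server
        self.cache = {}   # path -> (version, content)
        self.dirty = {}   # path -> content not yet flushed

    def open(self, path):
        # Revalidate: refetch if the server's version differs from ours.
        current = self.server.version.get(path, 0)
        cached = self.cache.get(path)
        if cached is None or cached[0] != current:
            self.cache[path] = (current, self.server.data.get(path, ""))

    def read(self, path):
        # Reads are served locally, with no round trip to the server.
        if path in self.dirty:
            return self.dirty[path]
        return self.cache[path][1]

    def write(self, path, content):
        # Writes are buffered locally until close.
        self.dirty[path] = content

    def close(self, path):
        # Flush buffered writes so the next opener sees them.
        if path in self.dirty:
            self.server.write(path, self.dirty.pop(path))
            self.cache[path] = (self.server.version[path], self.server.data[path])


server = Server()
writer, reader = Client(server), Client(server)

writer.open("f")
writer.write("f", "v1")
reader.open("f")
before_close = reader.read("f")   # writer has not closed yet
writer.close("f")
still_cached = reader.read("f")   # reader has not reopened the file
reader.close("f")
reader.open("f")
after_reopen = reader.read("f")   # revalidated on open
```

The reader only sees "v1" after reopening the file, which is exactly the window where strict POSIX would demand more. In return, both clients get to buffer reads and writes freely between open and close.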

Altogether, Ceph looks very interesting IMHO. I'd love to be able to build a scalable, fault-tolerant fs, Google-style, using cheap commodity hardware.

Ceph distributed filesystem merged for 2.6.34

Posted Mar 24, 2010 0:13 UTC (Wed) by cmccabe (guest, #60281)

Thanks for the informative reply.

It sounds like Ceph has some features that even Lustre doesn't have. I hope that the supercomputer guys get a chance to try it out sometime!