Distributed storage
At the lowest level, the DST code implements a simple network protocol which allows block devices to be exported across a network. The number of operations supported is small: block read and write operations and a "how big is your disk?" information request is about it. But it is intended to be fast, non-blocking, and able to function without copying the data on the way through. The zero-copy nature of the code allows it to perform I/O operations with no memory allocations at all - though the underlying network subsystem might do some allocations of its own.
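As a rough illustration of how small such a protocol can be, a request header for this kind of remote block device might look like the following; the names and layout here are invented for illustration and are not DST's actual wire format.

    /* A sketch of what a request header for this kind of remote block
     * protocol might look like.  Field names and layout are illustrative
     * only; they are not DST's actual wire format. */
    #include <stdint.h>

    enum blk_net_cmd {
        BLK_NET_READ  = 1,   /* read 'size' bytes starting at 'sector' */
        BLK_NET_WRITE = 2,   /* write 'size' bytes starting at 'sector' */
        BLK_NET_SIZE  = 3,   /* the "how big is your disk?" query */
    };

    struct blk_net_request {
        uint32_t cmd;        /* one of blk_net_cmd */
        uint32_t size;       /* payload length in bytes */
        uint64_t sector;     /* starting sector on the exported device */
        uint64_t id;         /* request id, echoed back in the reply */
    };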
There is no data integrity checking built into the DST networking layer; it relies on the networking code to handle that aspect of things. There is also no real security support at all. If a block device is exported for use in DST, it is exported to anybody who can reach the host. The addition of explicit export lists could certainly be done in the future, but, for now, hosts exporting drives via DST are probably best not exposed to anything beyond an immediate local network.
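For what it's worth, a basic allow-list check of that sort would not take much code; the sketch below is entirely hypothetical and is not part of the DST patches.

    /* Hypothetical export allow-list: before serving a request, check the
     * peer's address against a list of permitted hosts.  Not part of the
     * DST patches; shown only to illustrate the idea. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stddef.h>

    static const char *allowed_hosts[] = { "10.0.0.1", "10.0.0.2" };

    static int peer_allowed(const struct sockaddr_in *peer)
    {
        for (size_t i = 0; i < sizeof(allowed_hosts) / sizeof(allowed_hosts[0]); i++) {
            struct in_addr allowed;

            if (inet_pton(AF_INET, allowed_hosts[i], &allowed) == 1 &&
                allowed.s_addr == peer->sin_addr.s_addr)
                return 1;
        }
        return 0;    /* not in the export list: refuse to serve the device */
    }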
The upper layer of the DST code enables the creation of local disks. A simple ioctl() call would create a local disk from a remote drive, essentially reproducing the functionality offered by NBD. Evgeniy claims better performance than NBD, though, with non-blocking processing, no user-space threads, and a lack of busy-wait loops. There is also a simple failure recovery mechanism which will reconnect to remote hosts which go away temporarily.
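The user-space side of such an interface is necessarily small; a sketch of what attaching a remote drive might look like follows, using a made-up control device, ioctl number, and argument structure rather than whatever the DST patches actually define.

    /* Hypothetical user-space call to attach a remote drive as a local
     * block device.  The control node, ioctl number, and structure are
     * made up for illustration; they are not DST's actual interface. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>

    struct remote_disk_ctl {
        char host[64];            /* address of the exporting server */
        unsigned short port;      /* its listening port */
        char device[32];          /* name of the local disk to create */
    };

    #define REMOTE_DISK_ATTACH _IOW('d', 1, struct remote_disk_ctl)

    int main(void)
    {
        struct remote_disk_ctl ctl;
        int fd = open("/dev/dst_ctl", O_RDWR);    /* hypothetical control node */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(&ctl, 0, sizeof(ctl));
        strcpy(ctl.host, "192.168.0.2");
        ctl.port = 1025;
        strcpy(ctl.device, "dst0");

        if (ioctl(fd, REMOTE_DISK_ATTACH, &ctl) < 0)
            perror("ioctl");
        return 0;
    }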
Beyond that, though, the DST code can be used to join multiple devices - both local and remote - into larger arrays. There are currently two algorithms available: linear and mirrored. In a linear array, each device is added to the end of what looks like a much larger block device. The mirroring algorithm replicates data across each device to provide redundancy and generally faster read performance. There is infrastructure in place for tracking which blocks must be updated on each component of a mirrored array, so if one device drops out for a while it can be quickly brought up to date on its return. Interestingly, that information is not stored on each component; this is presented as a feature, in that one part of a mirrored array can be removed and mounted independently as a sort of snapshot. Block information also does not appear, in this iteration, to be stored persistently anywhere, meaning that a crash of the DST server could make recovery of an inconsistent mirrored array difficult or impossible.
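To make the linear case concrete, the array code essentially has to remap every request onto one of its components. A minimal sketch of that remapping (illustrative only, not DST's code) is below; a mirrored array differs mainly in that writes are sent to every component while reads can be served by any one of them.

    /* Sketch of the remapping a linear array must do: walk the component
     * devices (local or remote) until the target sector falls inside one
     * of them.  Purely illustrative, not DST's code. */
    #include <stddef.h>
    #include <stdint.h>

    struct component {
        uint64_t start;      /* first logical sector mapped to this component */
        uint64_t sectors;    /* size of this component */
    };

    /* Map a logical sector to a component index and an offset within it;
     * returns -1 if the sector lies beyond the end of the array. */
    static int linear_map(const struct component *c, size_t n,
                          uint64_t sector, uint64_t *offset)
    {
        for (size_t i = 0; i < n; i++) {
            if (sector < c[i].start + c[i].sectors) {
                *offset = sector - c[i].start;
                return (int)i;
            }
        }
        return -1;
    }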
Storage arrays created with DST can, in turn, be exported for use in other arrays. So a series of drives located on a fast local network can be combined in a sort of tree structure into one large, redundant array of disks. There is no support for the creation of higher-level RAID arrays at this time. Support for more algorithms is on the "to do" list, though Evgeniy has said that the Reed-Solomon codes used for traditional RAID are not fast enough for distributed arrays. He suggests that WEAVER codes might be used instead.
At this level, DST looks much like the device mapper and MD layers already supported by Linux. Evgeniy claims that the DST code is better in that it does all processing in a non-blocking manner, works with more network protocols, has simple automatic configuration, does not copy data, and can perform operations with no memory allocations. The zero-allocation feature is important in situations where deadlocks are a worry - and they are often a worry when remote storage is in use. Making the entire DST stack safe against memory-allocation deadlocks would require some support in the network layer as well - but, predictably, Evgeniy has some ideas for how that can be done.
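The usual way to get that property is to set aside, ahead of time, everything the I/O path could possibly need. The sketch below illustrates the general technique with a preallocated free list of request structures (locking omitted); it shows the idea only and is not DST's implementation.

    /* General illustration of allocation-free request handling: a fixed
     * number of request structures is allocated up front, and the I/O
     * path only ever takes from and returns to this free list. */
    #include <stdlib.h>

    struct io_request {
        struct io_request *next;
        /* ... command, sector, buffer pointer, etc. ... */
    };

    static struct io_request *free_list;

    /* Called once at setup time, where an allocation failure is harmless. */
    static int request_pool_init(unsigned int count)
    {
        while (count--) {
            struct io_request *req = malloc(sizeof(*req));

            if (!req)
                return -1;
            req->next = free_list;
            free_list = req;
        }
        return 0;
    }

    /* The I/O path itself: no allocator calls, so no allocation deadlock. */
    static struct io_request *request_get(void)
    {
        struct io_request *req = free_list;

        if (req)
            free_list = req->next;
        return req;    /* NULL means the pool is exhausted; caller must wait */
    }

    static void request_put(struct io_request *req)
    {
        req->next = free_list;
        free_list = req;
    }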
This patch set is clearly in a very early state; quite a bit of work would be required before it would be ready for production use with data that somebody actually cares about. Like all of Evgeniy's patches, DST contains a number of interesting ideas. If the remaining little details can be taken care of, the DST code could eventually reach a point where it is seen as a useful addition to the Linux storage subsystem.
Index entries for this article
Kernel: Block layer
Kernel: Device mapper
Distributed storage
Posted Aug 23, 2007 8:36 UTC (Thu)
by rwmj (subscriber, #5474)
[Link] (1 responses)
How is this different from things like the Sistina / Red Hat GFS?
Rich.
Distributed storage
Posted Aug 23, 2007 12:09 UTC (Thu)
by nix (subscriber, #2304)
[Link]
GFS runs *atop* some distributed block device implementation (once iSCSI only, but now NBD+dm as well), and provides shared locking and so on so that lots of systems can access one filesystem at once. It could just as well run atop this (well, patches would probably be required since currently clustered LVM is utterly dependent upon dm, but I don't see why you couldn't run dm atop this as well.)
What about DRBD?
Posted Aug 23, 2007 12:17 UTC (Thu)
by osma (subscriber, #6912)
[Link] (1 responses)
How does this relate to DRBD?
http://www.drbd.org
What about DRBD?
Posted Aug 23, 2007 19:38 UTC (Thu)
by Felix_the_Mac (guest, #32242)
[Link]
I had been looking forward to DRBD being submitted and eventually included in the kernel, since I am planning to implement it at work this year.
However it looks to me (and you should take anything I say with a pinch of salt) like this proposal is going to lead to the common situation where there are 2 proposed ways of achieving some functionality.
This generally leads to a drawn out process in which each proposal is repeatedly modified and criticised until the weight of opinion causes one to be accepted or the other to give up and go home.
This may cause difficulty for the DRBD developers since they have an existing installed base of users and this may prevent them undertaking major redesigns/rewrites.
One hopes and expects that at the end of the day the kernel will end up with the best designed solution.
Lack of data integrity checks
Posted Aug 23, 2007 17:51 UTC (Thu)
by brouhaha (subscriber, #1698)
[Link] (5 responses)
>> There is no data integrity checking built into the DST networking layer; it relies on the networking code to handle that aspect of things. <<
He's living in a fool's paradise if he thinks that TCP or UDP checksums or the link-level FCS (e.g., Ethernet CRC) are going to be sufficient to guarantee data integrity. I've seen far too many times where NFS caused data corruption due to the lack of end-to-end checks.
He should define some end-to-end checking, and allow it to be disabled by people that insist on living dangerously.
The checksum/CRC/whatever should be computed over the payload data AND the block identification (device ID, block number), so as to guarantee both that the data has not been corrupted in transit, and that it really is the requested block rather than some other block.
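As a sketch of the kind of check being suggested here, the CRC could cover the block's identity along with its payload, so that a "right data, wrong block" reply is caught as well; the structure and function below are invented for illustration and use zlib's crc32().

    /* Sketch of an end-to-end check covering both the block's identity
     * (device, block number) and its payload.  Names are illustrative. */
    #include <stddef.h>
    #include <stdint.h>
    #include <zlib.h>

    struct block_ident {
        uint64_t device_id;    /* which exported device */
        uint64_t block_nr;     /* which block on that device */
    };

    static uint32_t block_crc(const struct block_ident *id,
                              const void *data, size_t len)
    {
        uLong crc = crc32(0L, Z_NULL, 0);

        crc = crc32(crc, (const Bytef *)id, sizeof(*id));
        crc = crc32(crc, (const Bytef *)data, (uInt)len);
        return (uint32_t)crc;
    }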
Lack of data integrity checks
Posted Aug 23, 2007 19:22 UTC (Thu)
by alex (subscriber, #1355)
[Link] (4 responses)
You don't protect your data with just one number....
There was a very interesting talk given by a friend of mine from Google about the sort of failures they experience. One example was a data corruption event that wasn't caught by either the TCP checksums or the filesystem's own internal checksums.
http://www.ukuug.org/events/spring2007/programme/ThatCoul...
Lack of data integrity checks
Posted Aug 24, 2007 8:46 UTC (Fri)
by intgr (subscriber, #39733)
[Link] (2 responses)
One number is fine if it is long enough; relying on a 32-bit checksum is naive indeed.
The MD5 TCP checksum feature in Linux kernels might be useful, but as it is not offloaded to the networking hardware, it's too slow for >100Mbit Ethernet. Employing a faster checksum function at the application layer sounds like a more practical idea.
Lack of data integrity checks
Posted Aug 24, 2007 16:30 UTC (Fri)
by brouhaha (subscriber, #1698)
[Link] (1 responses)
The issue isn't whether a 32-bit CRC is good enough to protect a packet. For maximum-length normal Ethernet frames, I would claim that it is good enough. We're trying to detect errors here, not to make it secure against deliberate alteration. If you need to protect against an adversary that may introduce deliberate alterations in your data, you need cryptography.
The issue for error detection is that the Ethernet FCS only applies for one hop of a route, and gets recomputed by each router along the way. Thus it does not offer end-to-end protection. The packet will have opportunities to be corrupted between hops, and the node that the packet finally arrives at can only trust the FCS to mean that it wasn't corrupted on the wire since leaving the last router.
A UDP checksum is both better and worse. It's better in that it is end-to-end, but it's far worse in that a 16 bit checksum is very weak in its error detection probability compared to a 32-bit CRC. Part of the weakness is the 16-bit size, but part of it is the nature of a checksum.
I'm not arguing that the integrity checking should be done at the application layer. Although there are certainly applications that should do that, what I'm arguing for is that the remote block device client and server code need to do end-to-end error checking at their own level in the protocol stack.
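To make the weakness of a 16-bit checksum concrete: the Internet checksum is a ones'-complement sum of 16-bit words, so it cannot even detect two words being transposed, something a 32-bit CRC would catch. The standalone snippet below (unrelated to any of the code under discussion) prints the same checksum for two different buffers.

    /* Tiny illustration of why the 16-bit Internet checksum is weak: it is
     * a ones'-complement sum of 16-bit words, so swapping two words does
     * not change it at all. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t inet_checksum(const uint16_t *words, size_t n)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < n; i++) {
            sum += words[i];
            sum = (sum & 0xffff) + (sum >> 16);    /* fold the carry back in */
        }
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t a[4] = { 0x1234, 0x5678, 0x9abc, 0xdef0 };
        uint16_t b[4] = { 0x5678, 0x1234, 0x9abc, 0xdef0 };    /* first two words swapped */

        /* Prints the same value twice: the corruption goes undetected. */
        printf("%04x %04x\n", inet_checksum(a, 4), inet_checksum(b, 4));
        return 0;
    }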
Lack of data integrity checks
Posted Aug 25, 2007 19:29 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
I'm unclear on what corruptions you're concerned about. When people say "end to end," they're pointing to the fact that something could get corrupted before or after some other integrity check is done. Are you saying there's a significant risk that the data gets corrupted inside a router (outside of Ethernet integrity checks) or inside the client or server network stack (outside of UDP integrity checks)? Are we talking about OS bugs?
Just wondering, because while all kinds of failures are possible, it wouldn't make sense to protect against some risk that we routinely accept in other areas.
You also mention the UDP checksum as simply being too weak. If that's the problem, then I would just refer to "additional integrity checks" rather than emphasize "end to end."
Lack of data integrity checks
Posted Aug 24, 2007 21:01 UTC (Fri)
by zdzichu (subscriber, #17118)
[Link]
That's what IPsec is for. AH, or ESP without encryption (hash only), will catch errors missed by TCP/UDP checksums.
GlusterFS: Distributed storage at the filesystem level
Posted Aug 23, 2007 18:36 UTC (Thu)
by exco (guest, #4344)
[Link]
If you are interested in distributed storage: GlusterFS implements many interesting concepts and keeps the whole system simple.
http://www.gluster.org/docs/index.php/GlusterFS_User_Guide
All the work is done at the filesystem level, not at the block device level.
Distributed storage
Posted Aug 26, 2007 17:28 UTC (Sun)
by JohnNilsson (guest, #41242)
[Link]
Does anyone know what happened to ddraid? [1]
[1] http://sourceware.org/cluster/ddraid/
Distributed storage
Posted Aug 30, 2007 18:13 UTC (Thu)
by renox (guest, #23785)
[Link]
>> Evgeniy has said that the Reed-Solomon codes used for traditional RAID are not fast enough for distributed arrays. He suggests that WEAVER codes might be used instead. <<
Funny, that: I just read a paper from Mr risk at Google who had a file corruption because of a bad switch, and the corruption went undetected because of a double-bit error that got through the weaver codes, so he advocates CRC instead.