
Distributed storage

By Jonathan Corbet
August 21, 2007
Evgeniy Polyakov is not an easily discouraged developer. He has been the source of a great deal of interesting kernel code - including a network channels implementation, an asynchronous crypto framework, the kevent subsystem, the "network tree" memory management layer, and the netlink connector code. Of all of those patches, only the netlink connector has made it into the mainline kernel - and that was back in 2005. Undeterred, Evgeniy has come forward with another significant patch set for consideration. His ambitions are no lower this time around: he would like to replace much of the functionality offered by the device mapper, iSCSI, and network block device (NBD) layers. He calls the new subsystem distributed storage, or DST for short. The goal is to allow the creation of high-performance storage networks in a reliable and easy manner.

At the lowest level, the DST code implements a simple network protocol which allows block devices to be exported across a network. The number of operations supported is small: block read and write operations and a "how big is your disk?" information request is about it. But it is intended to be fast, non-blocking, and able to function without copying the data on the way through. The zero-copy nature of the code allows it to perform I/O operations with no memory allocations at all - though the underlying network subsystem might do some allocations of its own.
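To make that concrete, the request header for such a protocol needs little more than a command, a sector, and a length. The sketch below is purely illustrative; the field names and layout are invented here, not taken from DST's actual wire format.

    /*
     * Hypothetical sketch of a remote-block-device request header.
     * Field names and layout are invented for illustration; this is
     * not DST's actual on-the-wire format.
     */
    #include <linux/types.h>

    enum {
            DST_CMD_READ  = 1,      /* read 'size' bytes at 'sector' */
            DST_CMD_WRITE = 2,      /* write 'size' bytes at 'sector' */
            DST_CMD_SIZE  = 3,      /* "how big is your disk?" */
    };

    struct dst_request_hdr {
            __u32 cmd;              /* one of the DST_CMD_* values */
            __u64 sector;           /* starting sector on the remote device */
            __u32 size;             /* payload length in bytes */
    };
    /*
     * For writes the payload follows the header; for reads and size
     * queries the server echoes the header back, followed by the data
     * or the device size.
     */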

There is no data integrity checking built into the DST networking layer; it relies on the networking code to handle that aspect of things. There is also no real security support at all. If a block device is exported for use in DST, it is exported to anybody who can reach the host. The addition of explicit export lists could certainly be done in the future, but, for now, hosts exporting drives via DST are probably best not exposed to anything beyond an immediate local network.

The upper layer of the DST code enables the creation of local disks. A simple ioctl() call would create a local disk from a remote drive, essentially reproducing the functionality offered by NBD. Evgeniy claims better performance than NBD, though, with non-blocking processing, no user-space threads, and a lack of busy-wait loops. There is also a simple failure recovery mechanism which will reconnect to remote hosts which go away temporarily.
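Configuration along those lines might look roughly like the following user-space sketch. The control-device path, ioctl number, and structure here are all hypothetical stand-ins, not DST's real ABI:

    /*
     * Hypothetical user-space sketch: attach a remote drive as a local
     * block device with a single ioctl().  The control-device path,
     * ioctl number, and structure are illustrative, not DST's real ABI.
     */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define DST_IOC_ADD_REMOTE 0x4000dead   /* made-up ioctl number */

    struct dst_add_remote {
        char addr[64];              /* host exporting the drive */
        unsigned short port;        /* DST service port */
        char name[32];              /* name of the local disk to create */
    };

    int main(void)
    {
        struct dst_add_remote req;
        int fd = open("/dev/dst_ctl", O_RDWR);  /* hypothetical node */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(&req, 0, sizeof(req));
        strcpy(req.addr, "192.168.0.2");
        req.port = 1025;
        strcpy(req.name, "dst0");               /* appears as /dev/dst0 */

        if (ioctl(fd, DST_IOC_ADD_REMOTE, &req) < 0)
            perror("ioctl");
        close(fd);
        return 0;
    }

After a successful call, the new disk would show up as an ordinary block device (here /dev/dst0) that could be partitioned and mounted like any local drive.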

Beyond that, though, the DST code can be used to join multiple devices - both local and remote - into larger arrays. There are currently two algorithms available: linear and mirrored. In a linear array, each device is added to the end of what looks like a much larger block device. The mirroring algorithm replicates data across each device to provide redundancy and generally faster read performance. There is infrastructure in place for tracking which blocks must be updated on each component of a mirrored array, so if one device drops out for a while it can be quickly brought up to date on its return. Interestingly, that information is not stored on each component; this is presented as a feature, in that one part of a mirrored array can be removed and mounted independently as a sort of snapshot. Block information also does not appear, in this iteration, to be stored persistently anywhere, meaning that a crash of the DST server could make recovery of an inconsistent mirrored array difficult or impossible.
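The linear case comes down to simple sector arithmetic: the components are concatenated, so mapping a logical sector means walking the table and subtracting sizes until the sector falls inside one of the devices. A minimal sketch, with names invented for illustration:

    /*
     * Sketch of the sector arithmetic behind a linear array.  The
     * structure and function names are invented for illustration.
     */
    #include <linux/types.h>

    struct dst_component {
            __u64 sectors;          /* size of this component device */
    };

    /*
     * Map a logical sector to a component index, storing the offset
     * within that component in *offset.  Returns -1 if the sector is
     * past the end of the array.
     */
    static int linear_map(const struct dst_component *c, int n,
                          __u64 sector, __u64 *offset)
    {
            int i;

            for (i = 0; i < n; i++) {
                    if (sector < c[i].sectors) {
                            *offset = sector;
                            return i;
                    }
                    sector -= c[i].sectors;
            }
            return -1;
    }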

Storage arrays created with DST can, in turn, be exported for use in other arrays. So a series of drives located on a fast local network can be combined in a sort of tree structure into one large, redundant array of disks. There is no support for the creation of higher-level RAID arrays at this time. Support for more algorithms is on the "to do" list, though Evgeniy has said that the Reed-Solomon codes used for traditional RAID are not fast enough for distributed arrays. He suggests that WEAVER codes might be used instead.

At this level, DST looks much like the device mapper and MD layers already supported by Linux. Evgeniy claims that the DST code is better in that it does all processing in a non-blocking manner, works with more network protocols, has simple automatic configuration, does not copy data, and can perform operations with no memory allocations. The zero-allocation feature is important in situations where deadlocks are a worry - and they are often a worry when remote storage is in use. Making the entire DST stack safe against memory-allocation deadlocks would require some support in the network layer as well - but, predictably, Evgeniy has some ideas for how that can be done.
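DST sidesteps allocation deadlocks by allocating nothing at all in the I/O path. For comparison, kernel code that must allocate there usually relies on a mempool, which keeps a minimum number of objects preallocated so that writeback can always make forward progress. This fragment uses the stock mempool API and is not DST code:

    /*
     * Not DST code: the stock kernel mempool API, shown for contrast.
     * A mempool keeps a minimum number of objects preallocated so the
     * I/O path can always obtain one, even under memory pressure.
     */
    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    static struct kmem_cache *req_cache;
    static mempool_t *req_pool;

    static int __init init_request_pool(void)
    {
            req_cache = kmem_cache_create("example_req", 256, 0, 0, NULL);
            if (!req_cache)
                    return -ENOMEM;
            /* Guarantee at least 16 requests even when kmalloc() would fail. */
            req_pool = mempool_create_slab_pool(16, req_cache);
            if (!req_pool)
                    return -ENOMEM;
            return 0;
    }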

This patch set is clearly in a very early state; quite a bit of work would be required before it would be ready for production use with data that somebody actually cares about. Like all of Evgeniy's patches, DST contains a number of interesting ideas. If the remaining little details can be taken care of, the DST code could eventually reach a point where it is seen as a useful addition to the Linux storage subsystem.


Distributed storage

Posted Aug 23, 2007 8:36 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

How is this different from things like the Sistina / Red Hat GFS?

Rich.

Distributed storage

Posted Aug 23, 2007 12:09 UTC (Thu) by nix (subscriber, #2304) [Link]

GFS runs *atop* some distributed block device implementation (once iSCSI only, but now NBD+dm as well), and provides shared locking and so on so that lots of systems can access one filesystem at once. It could just as well run atop this (well, patches would probably be required since currently clustered LVM is utterly dependent upon dm, but I don't see why you couldn't run dm atop this as well.)

What about DRBD?

Posted Aug 23, 2007 12:17 UTC (Thu) by osma (subscriber, #6912) [Link] (1 responses)

How does this relate to DRBD?
http://www.drbd.org

What about DRBD?

Posted Aug 23, 2007 19:38 UTC (Thu) by Felix_the_Mac (guest, #32242) [Link]

I had been looking forward to DRBD being submitted and eventually included in the kernel, since I am planning to implement it at work this year.

However, it looks to me (and you should take anything I say with a pinch of salt) like this proposal is going to lead to the common situation where there are two proposed ways of achieving the same functionality.

This generally leads to a drawn-out process in which each proposal is repeatedly modified and criticised until the weight of opinion causes one to be accepted or the other to give up and go home.

This may cause difficulty for the DRBD developers since they have an existing installed base of users and this may prevent them undertaking major redesigns/rewrites.

One hopes and expects that at the end of the day the kernel will end up with the best designed solution.

Lack of data integrity checks

Posted Aug 23, 2007 17:51 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (5 responses)

There is no data integrity checking built into the DST networking layer; it relies on the networking code to handle that aspect of things.
He's living in a fool's paradise if he thinks that TCP or UDP checksums or the link-level FCS (e.g., Ethernet CRC) are going to be sufficient to guarantee data integrity. I've seen far too many cases where NFS caused data corruption due to the lack of end-to-end checks.

He should define some end-to-end checking, and allow it to be disabled by people that insist on living dangerously.

The checksum/CRC/whatever should be computed over the payload data AND the block identification (device ID, block number), so as to guarantee both that the data has not been corrupted in transit, and that it really is the requested block rather than some other block.
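A minimal sketch of that suggestion, using zlib's crc32() purely for illustration (the helper is not part of DST, and endianness handling is omitted):

    /*
     * Sketch of an end-to-end check covering the block's identity as
     * well as its payload, so a misdirected-but-intact block is also
     * caught.  Uses zlib's crc32() for illustration; this helper is
     * not part of DST, and endianness handling is omitted.
     */
    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    uint32_t block_check(uint32_t device_id, uint64_t block_nr,
                         const unsigned char *data, size_t len)
    {
        unsigned char id[12];
        uLong crc = crc32(0L, Z_NULL, 0);

        /* Fold in the block's identity first... */
        memcpy(id, &device_id, sizeof(device_id));
        memcpy(id + 4, &block_nr, sizeof(block_nr));
        crc = crc32(crc, id, sizeof(id));
        /* ...then the payload itself. */
        crc = crc32(crc, data, len);
        return (uint32_t)crc;
    }
    /*
     * Both ends compute this independently; a mismatch means either
     * corruption in transit or delivery of the wrong block.
     */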

Lack of data integrity checks

Posted Aug 23, 2007 19:22 UTC (Thu) by alex (subscriber, #1355) [Link] (4 responses)

There was a very interesting talk given by a friend of mine from Google about the sort of failures they experience. One example was a data corruption event that wasn't caught by either the TCP checksums or the filesystem's own internal checksums.

You don't protect your data with just one number....

http://www.ukuug.org/events/spring2007/programme/ThatCoul...

Lack of data integrity checks

Posted Aug 24, 2007 8:46 UTC (Fri) by intgr (subscriber, #39733) [Link] (2 responses)

One number is fine if it is long enough; relying on a 32-bit checksum is naive indeed.

The MD5 TCP checksum feature in Linux kernels might be useful, but as it is not offloaded to the networking hardware, it's too slow for >100Mbit Ethernet. Employing a faster checksum function on the application layer sounds like a more practical idea.

Lack of data integrity checks

Posted Aug 24, 2007 16:30 UTC (Fri) by brouhaha (subscriber, #1698) [Link] (1 responses)

The issue isn't whether a 32-bit CRC is good enough to protect a packet. For maximum-length normal Ethernet frames, I would claim that it is good enough. We're trying to detect errors here, not to make it secure against deliberate alteration. If you need to protect against an adversary that may introduce deliberate alterations in your data, you need cryptography.

The issue for error detection is that the Ethernet FCS only applies for one hop of a route, and gets recomputed by each router along the way. Thus it does not offer end-to-end protection. The packet will have opportunities to be corrupted between hops, and the node that the packet finally arrives at can only trust the FCS to mean that it wasn't corrupted on the wire since leaving the last router.

A UDP checksum is both better and worse. It's better in that it is end-to-end, but it's far worse in that a 16 bit checksum is very weak in its error detection probability compared to a 32-bit CRC. Part of the weakness is the 16-bit size, but part of it is the nature of a checksum.
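One structural weakness is easy to demonstrate: because the Internet checksum is a plain ones'-complement sum of 16-bit words, it is unchanged by reordering those words, so a corruption that swaps two aligned words is invisible to it, while a CRC would catch it. A small self-contained example (RFC 1071 arithmetic, illustrative only):

    /*
     * Demonstration of a structural weakness of the 16-bit Internet
     * checksum (RFC 1071): the ones'-complement sum is unchanged by
     * reordering 16-bit words, so a corruption that swaps two words
     * goes undetected.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t inet_checksum(const uint16_t *words, size_t n)
    {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            sum += words[i];
            sum = (sum & 0xffff) + (sum >> 16);     /* fold the carry */
        }
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t a[4] = { 0x1234, 0x5678, 0x9abc, 0xdef0 };
        uint16_t b[4] = { 0x5678, 0x1234, 0x9abc, 0xdef0 }; /* words swapped */

        /* Prints the same checksum for both buffers. */
        printf("a: %04x  b: %04x\n", inet_checksum(a, 4), inet_checksum(b, 4));
        return 0;
    }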

I'm not arguing that the integrity checking should be done at the application layer. Although there are certainly applications that should do that, what I'm arguing for is that the remote block device client and server code need to do end-to-end error checking at their own level in the protocol stack.

Lack of data integrity checks

Posted Aug 25, 2007 19:29 UTC (Sat) by giraffedata (guest, #1954) [Link]

I'm unclear on what corruptions you're concerned about. When people say "end to end," they're pointing to the fact that something could get corrupted before or after some other integrity check is done. Are you saying there's a significant risk that the data gets corrupted inside a router (outside of Ethernet integrity checks) or inside the client or server network stack (outside of UDP integrity checks)? Are we talking about OS bugs?

Just wondering, because while all kinds of failures are possible, it wouldn't make sense to protect against some risk that we routinely accept in other areas.

You also mention the UDP checksum as simply being too weak. If that's the problem, then I would just refer to "additional integrity checks" rather than emphasize "end to end."

Lack of data integrity checks

Posted Aug 24, 2007 21:01 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

That's what IPsec is for. AH, or ESP without encryption (hash only), will catch errors missed by TCP/UDP checksums.

GlusterFS: Distributed storage at the filesystem level

Posted Aug 23, 2007 18:36 UTC (Thu) by exco (guest, #4344) [Link]

If you are interested in distributed storage, GlusterFS implements many interesting concepts and keeps the whole system simple.

http://www.gluster.org/docs/index.php/GlusterFS_User_Guide

All the work is done at filesystem level, not at the block device level.

Distributed storage

Posted Aug 26, 2007 17:28 UTC (Sun) by JohnNilsson (guest, #41242) [Link]

Does anyone know what happened to ddraid? [1]

[1] http://sourceware.org/cluster/ddraid/

Distributed storage

Posted Aug 30, 2007 18:13 UTC (Thu) by renox (guest, #23785) [Link]

>> Evgeniy has said that the Reed-Solomon codes used for traditional RAID are not fast enough for distributed arrays. He suggests that WEAVER codes might be used instead.<<

Funny that, I just read a paper from Mr risk at Google who had a file corruption because of a bad switch; the corruption went undetected because a double-bit error slipped through the weaver codes, so he advocates CRC instead.


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds