The btrfs filesystem is widely regarded as being the long-term future
choice for Linux. But what if btrfs is taking the wrong direction,
fighting an old war? If the nature of our storage devices changes
significantly, our filesystems will have to change as well. A lot of
attention has been paid to the increasing prevalence of flash-based
devices, but there is another upcoming technology which should be planned
for: object storage devices (OSDs). The recent posting of a new
provides a good opportunity to look at OSDs and how they might be supported
The developers of OSDs were driven by the idea that traditional,
block-based disk drives offer an overly low-level interface. With
contemporary hardware, it should be possible to push more intelligence into
storage devices, offloading work from the host while maintaining (or
improving) performance and security. So the interface offered by an OSD
does not deal in blocks; instead, the OSD provides "objects" to the host
system. Most objects will simply be files, but a few other types of
objects (partitions, for example) are supported as well. The host
manipulates these objects, but need not (and cannot) concern itself with
how those objects are implemented within the device.
A file object is identified by two 64-bit numbers. It contains whatever
data the creator chooses to put in there; an OSD does not interpret the
data in any way. Files also have a collection of attributes and metadata;
this includes much of the information stored in an on-disk inode in a
traditional filesystem - but without the block layout information, which
the OSD hides from the rest of the world. All of the usual operations can
be performed on files - reading, writing, appending, truncating, etc. -
but, again, the implementation of those operations is handled by the OSD.
One thing that is not handled by the OSD, though, is the creation of
a directory hierarchy or the naming of files. It is expected that the host
filesystem will use file objects to store its directory structure,
providing a suitable interface to the filesystem's users. One could,
presumably, also use an OSD as a sort of hardware-implemented object
database without a whole lot of high-level code, but that is not where the
focus of work with OSDs is now.
The OSD designers decided to offload
another task from the host systems: security.
The OSD protocol
[PDF] is a T10-sanctioned extension to the SCSI protocol. It is thus
expected that OSD devices will be directly attached to host systems; the
protocol has been designed to perform well in that mode. It is also
expected, though, that OSDs will be used in network-attached storage
environments. For such deployments, the OSD designers decided to offload
another task from the host systems: security.
To that end, the OSD protocol includes an extensive set of security-related
commands. Every operation on an object must be accompanied by a "capability," a
cryptographically-signed ticket which names the object and the access
rights possessed by the owner of the capability. In the absence of a
suitable capability, the drive will deny access.
It is expected that capabilities will be handed out by a security policy
daemon running somewhere on the network. That daemon may be in possession
of the drive's root key, which allows unrestricted access to the drive, or
it may have a separate, partition-level key instead. Either way, it can
use that key to sign capabilities given out to processes elsewhere in the
system. (Drives also have a "master" key, used primarily to change the
root key. Loss of the master key is probably a restore-from-backup sort of
Capabilities last for a while (they include an expiration time) and
describe all of the allowed operations. So the act of actually obtaining a
capability should be relatively rare; most OSD operations will be performed
using a capability which the system already has in hand. That is an
important design feature; adding "ask a daemon for a capability" to the
filesystem I/O path would not be a performance-enhancing move.
In theory, it should be relatively easy to make a standard Linux filesystem
support an OSD. It's mostly a matter of hacking out much of the low-level
block layout and inode management code, replacing it with the appropriate
object operations. The osdfs filesystem was created in this way; the
developers started with ext2. After taking out all the code they no longer
needed, the osdfs developers simply added code translating VFS-level
requests into operations understood by the OSD. Those requests are then
executed by way of the low-level osd-initiator
code (which was also recently submitted for consideration).
Directories are implemented as simple files containing names and
associated object IDs. There is no separate on-disk inode; all of that
information is stored as attributes to the file itself. The end result is
that the osdfs code is relatively small; it is mostly concerned with
remapping VFS operations into OSD operations.
Anybody wanting to test this code may run into one small problem: there are
few OSDs to be found in the neighborhood computer store. It would appear
that most of the development work so far has been done using OSD
simulators. The OSC software
OSD is, like osdfs, part of the open-osd project; it implements the OSD
protocol over an SQLite database. There is also an OSD simulator
hosted at IBM, but it would not appear to be under current development.
Simulator-based development and testing may not be as rewarding as having a
shiny new device implementing OSD in hardware, but it will help to insure
that both the software and the protocol are in good shape by the time such
hardware is available.
It should be noted that the success of OSDs is not entirely assured. An
OSD takes much of the work normally done in an operating system kernel and
shoves it into a hardware firmware blob where it cannot be inspected or
fixed. A poor implementation will, at best, not perform well; at worst,
the chances of losing data could increase considerably. It may yet prove
best to insist that storage devices just concentrate on placing bits where
the operating system tells them to and leave the higher-level decisions to
higher-level code. Or it may turn out that OSDs are the next step forward
in smarter, more capable hardware. Either way, it is an interesting
article at Sun for more information on how OSD works.
to post comments)