Development
An introduction to GlusterFS
Vijay Bellur, co-maintainer of GlusterFS, gave a presentation at the first-ever Vault conference that introduced the filesystem and looked at where it is headed. GlusterFS is a distributed filesystem that aggregates storage from many servers to provide a unified namespace for users' files. That data is then accessible via a wide variety of mechanisms.
Bellur began with a brief explanation of the need for GlusterFS (or simply Gluster). It comes down to the amount of data that is being generated these days—on the order of 2.5 exabytes (which is 2500 petabytes or 2.5 million terabytes) daily. In fact, 90% of the data ever generated by humans has been created in the last two years. All of that data must be stored somewhere and that storage should be commoditized and democratized, he said.
![Vijay Bellur](https://static.lwn.net/images/2015/vault-bellur-sm.jpg)
Gluster is a scale-out distributed storage system that aggregates a variety of storage devices spread across the network to present a global namespace to users. Gluster uses regular Linux filesystems that support extended attributes (e.g. ext4, XFS, Btrfs) to store the data. It provides file, object, and block interfaces to access that data.
All of Gluster is implemented as software that runs on commodity hardware, he said. It can run in virtual machines and may someday run in containers. Traditionally, distributed filesystems rely on metadata servers, but Gluster does away with those; metadata servers are a single point of failure and can be a bottleneck for scaling. Instead, Gluster uses a hashing mechanism to locate data.
Storage elasticity is another attribute of Gluster. It can scale out or scale down as needed. It is based on a modular architecture that is extensible. Most of it is implemented in user space, Bellur said.
Gluster concepts
A Gluster volume is a logical collection of exports from various storage servers, which are called "bricks". Volumes have an administrative name associated with them; users access a volume or part of a volume for their file operations (i.e. create, read, update, and delete, or CRUD).
There are several different types of volumes supported by Gluster. The first is the distributed volume, which distributes files across the bricks in the volume. When a file is created, a hash is calculated from the file name; that hash determines which brick the file will be placed on. Different clients calculate the same hash value, so they can all find the right brick when accessing the file.
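The idea behind that metadata-free placement can be sketched in a few lines of Python. This is only a toy model (Gluster's actual distribution translator assigns hash ranges to bricks, and the brick list here is hypothetical), but it shows why every client independently arrives at the same location:

```python
import hashlib

# Hypothetical brick list; a real volume gets this from its configuration.
BRICKS = ["server1:/export/brick1",
          "server2:/export/brick2",
          "server3:/export/brick3"]

def brick_for(filename: str) -> str:
    """Map a file name onto a brick by hashing it.

    Because this is a pure function of the name and the brick list,
    every client computes the same answer; no metadata server is needed.
    """
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return BRICKS[int.from_bytes(digest[:4], "big") % len(BRICKS)]

print(brick_for("report.pdf"))  # identical on every client
```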
Another volume type is the replicated volume. As the name implies, it makes multiple copies of the file and stores those copies on separate bricks. The number of copies is set at volume-creation time.
A distributed replicated volume is the one used by most Gluster deployments, he said. In those volumes, multiple copies of a file are stored within a replicated volume and distributed across those replicated volumes. It provides high availability while also allowing the storage to grow as needed. More distributed volumes can simply be added to the filesystem as needed.
A new type of volume is the dispersed volume, which became available with Gluster 3.6. It provides RAID 5-like redundancy over the network using erasure coding, which reduces the amount of storage needed compared to replication while still protecting against brick failures. A file's data is dispersed across multiple bricks. The algorithm used is a non-systematic Reed-Solomon erasure code; all of the encoding and decoding is done on the client side.
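The storage savings are easy to see: a dispersed volume with four data bricks plus two redundancy bricks survives two failures while storing only 1.5 times the data, where three-way replication stores 3 times. Reed-Solomon itself is too involved for a short example, but a single-parity XOR scheme (RAID 5's trick, a deliberate simplification of what Gluster actually does) shows the principle of rebuilding lost data from what remains:

```python
def xor_parity(chunks):
    """XOR equal-sized chunks together into one parity chunk.
    (A RAID 5-style stand-in for Gluster's real Reed-Solomon code.)"""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def recover_lost(surviving, parity):
    """Rebuild the single missing chunk from survivors plus parity."""
    return xor_parity(surviving + [parity])

stripes = [b"AAAA", b"BBBB", b"CCCC"]   # data striped over three bricks
parity = xor_parity(stripes)            # stored on a fourth brick
# If the brick holding b"BBBB" dies, its contents can be recomputed:
assert recover_lost([stripes[0], stripes[2]], parity) == b"BBBB"
```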
Access
Gluster has multiple mechanisms available for clients to access the data stored in the filesystem. The first that was developed is the Filesystem in Userspace (FUSE) implementation that uses the GlusterFS protocol to access the data in the bricks. Much of the functionality in Gluster is client-based, including replication and erasure coding. The FUSE filesystem talks directly to the servers and has built-in failover, so an additional high-availability solution is not needed.
But FUSE is not available on all platforms, and it is more mature on Linux than on other operating systems, so NFSv3 access was added. Gluster implemented its own NFS server in user space; clients talk standard NFS to the Gluster servers. In that model, distribution and replication are done by the servers.
A representational state transfer (REST) access method was also created, which allows access using web protocols. It uses the OpenStack Swift object storage API as its REST interface. Any combination of access methods can be used interchangeably; files could be created using FUSE, then accessed via REST, for example.
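In practice that means an object can be written with any Swift-compatible client. A minimal sketch using the requests library follows; the gateway URL, account name, and token are placeholders, but the PUT/GET calls are the standard Swift object API:

```python
import requests

BASE = "http://gluster-gw:8080/v1/AUTH_myvol"   # hypothetical gateway
HEADERS = {"X-Auth-Token": "TOKEN"}              # from the auth service

# Store an object; the same file then shows up as container/hello.txt
# on a FUSE mount of the volume.
requests.put(f"{BASE}/container/hello.txt",
             headers=HEADERS, data=b"hello, gluster")

# Read it back over REST.
resp = requests.get(f"{BASE}/container/hello.txt", headers=HEADERS)
print(resp.content)
```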
For those wanting to do data analysis on the data in a Gluster filesystem, there is Hadoop Distributed File System (HDFS) support. Hadoop worker processes run on the bricks and use FUSE to access the data on that server.
There is also a libgfapi that applications can use to bypass the other access methods and talk to Gluster directly. It is good for workloads that are sensitive to context switches or copies from and to kernel space. Integration with the NFS-Ganesha user-space NFS server is done using libgfapi. That allows using NFSv4 or Parallel NFS (pNFS) for Gluster file access. SMB is supported in a similar way. There is also experimental iSCSI support.
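For Python applications, the libgfapi-python bindings wrap libgfapi; a minimal sketch looks roughly like the following (host and volume names are placeholders, and the exact method names should be checked against the installed version of the bindings):

```python
from gluster import gfapi  # libgfapi-python bindings

# Connect straight to the volume: no kernel mount, no FUSE.
vol = gfapi.Volume("gluster-server", "myvol")
vol.mount()

f = vol.fopen("results.txt", "w")   # I/O stays in user space,
f.write("written via libgfapi\n")   # avoiding context switches and
f.close()                           # copies to and from the kernel

vol.umount()
```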
Features
Beyond being a scalable storage system that provides elasticity and quotas, Gluster also offers data-protection and recovery features. Volume-level and file-level snapshots are available, and users can request those snapshots directly, which means they need not bother administrators to create them. Archiving is supported with both read-only volumes and write-once-read-many (WORM) volumes.
For multi-tenancy support, Gluster has encryption for data at rest and TLS/SSL for its data connections. For better performance, Gluster does caching of data, metadata, and directory entries for readdir(). There are built-in I/O statistics and a /proc-like interface for introspection of the filesystem state.
For provisioning servers with Gluster, there is puppet-gluster. It is also integrated with the oVirt virtualization manager as well as the Nagios monitor for servers. In fact, the sheer number of open-source projects that Gluster interfaces with is rather eye-opening.
Implementation
Gluster is implemented as a series of "translators", which are shared libraries that handle some piece of the functionality. Translators are self-contained units that can be stacked to enable multiple features. For example, distribution is a translator, as is replication; stacking the two of them provides the distributed replicated behavior for those types of volumes.
Translators can be deployed on the server, client, or both because they are "deployment agnostic". There are translators to handle protocols, performance features (e.g. caching, readahead), statistics gathering, access control, and so on. During development, swapping translators in and out of the stack can usually narrow down problems to a particular translator for further debugging.
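The translator model is essentially a stack of objects sharing one interface, each either transforming a request or fanning it out before passing it down. A toy Python rendering follows; Gluster's real translators are C shared libraries wired together by the volume graph, so this only mirrors the shape of the design:

```python
class Translator:
    """Every translator exposes the same ops and wraps a child."""
    def __init__(self, child=None):
        self.child = child

    def write(self, path, data):
        return self.child.write(path, data)

class Posix(Translator):
    """Bottom of a server stack: lands data on a local filesystem."""
    def write(self, path, data):
        print(f"posix: storing {len(data)} bytes at {path}")

class Replicate(Translator):
    """Fan each write out to all children, as replication does."""
    def __init__(self, children):
        self.children = children

    def write(self, path, data):
        for child in self.children:
            child.write(path, data)

class WriteBehind(Translator):
    """A performance translator; a real write-behind would buffer and
    acknowledge early, but this one just forwards the request."""

# Stacking composes features: write-behind over replication over
# two plain POSIX bricks, the "distributed replicated" pattern.
stack = WriteBehind(Replicate([Posix(), Posix()]))
stack.write("/dir/file", b"hello")
```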
A user survey in 2014 showed the main Gluster use cases. The two biggest are file synchronization/sharing and virtual machine image storage. After those two, backup and web content delivery network (CDN) uses were the next biggest, though other uses, especially for media files, also showed up in the survey.
Future
Gluster 3.5 was released in April 2014, followed by 3.6 in October 2014. The next release, 3.7, is currently in development and is planned for release in April 2015. The project is moving to a model with two major releases per year, Bellur said.
New features coming in 3.7 include "data tiering", which is a way to provide policies for moving data to and from hot and cold storage tiers based on access patterns. For example, the hot tier could consist of SSD storage while the cold tier is on spinning disks.
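Conceptually, such a policy is a periodic loop that promotes recently used files and demotes idle ones. A deliberately simplified sketch follows; the tier paths and promotion window are invented, and Gluster's tiering operates on bricks within a volume rather than on ordinary directories:

```python
import os
import shutil
import time

HOT, COLD = "/tiers/ssd", "/tiers/hdd"   # hypothetical tier locations
WINDOW = 24 * 3600                        # "hot" = accessed within a day

def move_to(path, tier):
    shutil.move(path, os.path.join(tier, os.path.basename(path)))

def rebalance_tiers():
    now = time.time()
    for name in os.listdir(COLD):         # promote files heating up
        path = os.path.join(COLD, name)
        if now - os.stat(path).st_atime < WINDOW:
            move_to(path, HOT)
    for name in os.listdir(HOT):          # demote files going cold
        path = os.path.join(HOT, name)
        if now - os.stat(path).st_atime >= WINDOW:
            move_to(path, COLD)
```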
Bitrot detection is another feature bound for 3.7. The idea is to detect corruption while the data is at rest. A checksum is added to each object asynchronously and will be checked during periodic data scrubbing operations. Bitrot will also be detected when files are accessed.
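The mechanism is straightforward to sketch: sign each file with a checksum when it is written, then have a scrubber recompute and compare later. Gluster stores the signature in an extended attribute; the sidecar file below merely stands in for that, and the hash choice is illustrative:

```python
import hashlib

def checksum(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def sign(path):
    """Record the file's checksum (Gluster keeps this in an xattr;
    a sidecar file stands in for it here)."""
    with open(path + ".sha256", "w") as f:
        f.write(checksum(path))

def scrub(path):
    """Periodic scrubber pass: recompute and compare."""
    with open(path + ".sha256") as f:
        expected = f.read()
    if checksum(path) != expected:
        print(f"bitrot detected in {path}")
```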
A new sharding volume type is being added. Those volumes will split the data in files across multiple bricks. It will help reduce fragmentation in Gluster volumes as well as provide more parallelism for large-file workloads.
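The arithmetic behind sharding is simple: with a fixed shard size, a byte offset maps directly to a shard index, and a large I/O spans a predictable set of shards that can live on different bricks. A sketch, with a hypothetical 64 MB shard size:

```python
SHARD_SIZE = 64 * 1024 * 1024   # hypothetical 64 MB shards

def shard_for(offset: int) -> int:
    """Which shard holds the byte at this offset?"""
    return offset // SHARD_SIZE

def shards_touched(offset: int, length: int) -> list[int]:
    """All shards covered by an I/O of `length` bytes at `offset`."""
    return list(range(shard_for(offset),
                      shard_for(offset + length - 1) + 1))

# A 200 MB write at offset 0 spans shards 0 through 3, which the
# distribution layer can place on separate bricks and heal or
# rebalance independently.
print(shards_touched(0, 200 * 1024 * 1024))   # [0, 1, 2, 3]
```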
The netgroups feature that was developed at Facebook will appear in 3.7. It adds a more advanced configuration and authentication interface for NFS that is similar to /etc/exports. The patches were forward-ported from Gluster 3.4 for the upcoming release.
There are improvements to the support for NFS-Ganesha coming too, including high-availability support based on Pacemaker. Many performance improvements have been made, especially for small-file workloads. There is a TrashCan translator being added to protect from "fat finger" deletions and truncations. It also will capture deletions from system operations like self-healing (automatically resolving synchronization problems) and rebalancing (shuffling files around the bricks when new storage is added to the filesystem).
Another replication mode, arbiter replication, will keep two copies of the data and three copies of the metadata. The third metadata copy can be used to arbitrate in a "split-brain" scenario, where the two file copies get out of sync. In addition, administrative policies to resolve split-brains are coming in 3.7. The current behavior is to simply return an EIO for those files, but users will now be able to view the file versions and resolve the split-brain. There is a laundry list of other improvements coming in 3.7, including the inevitable "loads of bug fixes".
For releases beyond 3.7, the project is looking at a number of different features, including compression of data at rest and deduplication. A translator that provides overlay functionality is in the idea stage. REST interfaces for Gluster management are being planned, as is more integration with OpenStack and containers.
Gluster nodes that can also provide virtualization are on the horizon as well. This "hyperconvergence" is based on oVirt and KVM. There are also plans for a native Gluster driver for OpenStack Manila, which will provide "filesharing as a service" capabilities.
There is a long way to go before it gets there, but the project is already thinking about Gluster 4.0, Bellur said. The key features planned for that release are meant to let the filesystem scale to much larger systems; currently, limitations in the management framework prevent Gluster filesystems from growing beyond a certain size. Supporting a thousand or more nodes is part of those plans.
Beyond those features, the project would like to support heterogeneous environments better. Environments with multiple operating systems, many different types of storage, and multiple networks are being targeted. There are also plans to increase the flexibility that deployments have in choosing replication options, erasure codes, and more. There is a new style of replication being looked at, too, which is completely handled by the servers without clients being involved at all.
The feature set for Gluster 4.0 is still up in the air, though implementation of a few key features has already started. New feature ideas can still be submitted and there are plans to vote on which features will be included as part of a Gluster design summit that is tentatively planned for May 2016.
In answer to a question from the audience, Bellur compared Gluster with the Ceph distributed filesystem. The architecture of Ceph is quite different from that of Gluster: Ceph started as an object store and built file storage on top of it, while Gluster did the reverse. Thus file access is more flexible in Gluster, while object and block access are more flexible in Ceph. Gluster may be a better choice for systems that start relatively small and grow from there, while Ceph may be a good choice when the system is known to need to be huge from the outset.
It would seem that the overarching advantage that Gluster provides is its flexibility in terms of volume types, access methods, and integration with various other tools. It certainly appears to be an active project with lots of interesting plans for the future.
[I would like to thank the Linux Foundation for travel support to Boston for Vault.]
Brief items
Quotes of the week
But the people who don’t see a personal value in free software are missing a larger, more important freedom. One implied by the first four, though not specifically stated. A fifth freedom if you will, which I define as:
- Freedom 4: The freedom to have the program improved by a person or persons of your choosing, and make that improvement available back to you and to the public.
Most of the days however, I tear my hair when fixing bugs, or I try to rephrase my emails to not sound old and bitter (even though I can very well be that) when I once again try to explain things to users who can be extremely unfriendly and whining. I spend late evenings on curl when my wife and kids are asleep. I escape my family and rob them of my company to improve curl even on weekends and vacations. Alone in the dark (mostly) with my text editor and debugger.
There’s no glory and there’s no eternal bright light shining down on me. I have not climbed up onto a level where I have a special status. I’m still the same old me, hacking away on code for the project I like and that I want to be as good as possible.
Firefox 36.0.4
Firefox 36.0.4 has been released. This update includes security and bug fixes, support for the full HTTP/2 protocol, and more. The release notes contain the details.
Newscoop 4.4 released
Version 4.4 of the news-site content-management system Newscoop has been released. Updates in this version include a framework for attaching editorial notes to in-process articles, support for "featured article" lists, and a refactored topic-management interface.
GTK+ 3.16.0 released
GTK+ version 3.16 is now available. Major changes include GDK support for rendering windows with OpenGL, a completely overhauled implementation of scrolling (including support for overlaid scrollbars), and an experimental Mir backend. New widgets include GtkGLArea, GtkStackSidebar, GtkModelButton, and GtkPopoverMenu, all of which seem to implement more or less what their names indicate. Interested developers would still be advised to read the documentation, however.
GNOME 3.16 released
The GNOME 3.16 release is out. "This is another exciting release for GNOME, and brings many new features and improvements, including redesigned notifications, a new shell theme, new scrollbars, and a refresh for the file manager. 3.16 also includes improvements to the Image Viewer, Music, Photos and Videos. We are also including three new preview apps for the first time: Books, Calendar and Characters." See the release notes for more information.
LibreOffice Online announced
The LibreOffice project has announced the accelerated development of a new online offering. "Development of LibreOffice Online started back in 2011, with the availability of a proof of concept of the client front end, based on HTML5 technology. That proof of concept will be developed into a state of the art cloud application, which will become the free alternative to proprietary solutions such as Google Docs and Office 365, and the first to natively support the Open Document Format (ODF) standard." The current effort is supported by IceWarp and Collabora; see this FAQ and Michael Meeks's posting for more information. For those wanting to download it, though, note that "the availability of LibreOffice Online will be communicated at a later stage."
Newsletters and articles
Development newsletters from the past week
- What's cooking in git.git (March 20)
- What's cooking in git.git (March 23)
- LLVM Weekly (March 23)
- OCaml Weekly News (March 24)
- OpenStack Community Weekly Newsletter (March 20)
- Perl Weekly (March 23)
- PostgreSQL Weekly News (March 22)
- Python Weekly (March 19)
- Ruby Weekly (March 19)
- This Week in Rust (March 24)
- Tor Weekly News (March 25)
- Wikimedia Tech News (March 23)
Snellman: On open sourcing existing code
Juho Snellman has an interesting treatise on the oft-overlooked challenges that face developers attempting to release an existing, proprietary codebase under open-source terms. "As soon as you get outside of the 'one self-contained file or directory' level of complexity, the threshold for releasing code becomes much higher. And likewise every change to a program that was made in order to open source it will make it less likely that the two versions can really be kept in sync in the long term. In this case the core code is maybe 2k-3k lines and won't require much work. It's all the support infrastructure that's going to be an issue." Snellman also reflects on possible strategies for writing internal code that may some day be released to the public.
Windows 10 to make the Secure Boot alt-OS lock out a reality (Ars Technica)
Ars Technica is one of several news outlets to report on a change announced in Microsoft's Windows 10 plans. Though the headlines (including Ars Technica's) paint a rather bleak scenario, the details are not as clear-cut. The UEFI "Secure Boot" mechanism was introduced with Windows 8, at which time Microsoft's OEM-certification rules mandated that hardware include a means for the local user to disable Secure Boot. The Windows 10 certification rules do not include that mandated disable switch. Writes Peter Bright: "Should this stand, we can envisage OEMs building machines that will offer no easy way to boot self-built operating systems, or indeed, any operating system that doesn't have appropriate digital signatures. This doesn't cut out Linux entirely—there have been some collaborations to provide Linux boot software with the 'right' set of signatures, and these should continue to work—but it will make it a lot less easy." Note, also, that the only source for this story appears to be a presentation from a Microsoft event in Shenzhen, China. Bright adds that he has contacted Microsoft seeking clarification, but has so far received no reply.
Page editor: Nathan Willis