
Kernel development

Brief items

Kernel release status

The current development kernel is 3.13-rc7, released on January 4. Linus says: "Anyway, things have been nice and quiet, and if I wasn't travelling, this would probably be the last -rc: there isn't really anything holding up a release, even if there are a couple of patches still going through discussions and percolating through maintainers. But rather than do a real 3.13 next weekend, I'll be on the road and decidedly *not* opening the merge window, so I'll do an rc8 next week instead, needed or not."

Stable updates: 3.4.76 was released on January 8. The 3.12.7 and 3.10.26 updates are in the review process as of this writing.

Comments (none posted)

Quotes of the week

In short, you should verify your code early, often, and as intensively as you possibly can. This is not simply a case of “trust but verify,” but rather a case of “verify first and trust later, if at all.”

Perhaps this should be termed the Indiana Jones philosophy of validation.

— Paul McKenney

If you're going to do something that's horrible, try to do it well.
— Rusty Russell (at linux.conf.au)

Comments (none posted)

Kernel development news

Some 3.13 development statistics

By Jonathan Corbet
January 8, 2014
As of this writing, the current development kernel snapshot is 3.13-rc6. Linus has said that this cycle will almost certainly go to -rc8, even if things look stable (as they indeed do) to avoid opening the merge window while he is attending linux.conf.au. Your editor, wishing to avoid writing highly technical articles during that period for exactly the same reason, deems this the right time for our traditional, non-technical look at the 3.13 development cycle and where the patches came from this time around.

There have been just under 12,000 non-merge changesets pulled into the mainline kernel for 3.13 so far; the total will almost certainly exceed 12,000 by the time the final release happens. 3.13 is thus a significantly busier cycle than its immediate predecessors; indeed, only three previous cycles (2.6.25, 3.8, and 3.10) have brought in more changes. Those changes, which added 446,000 lines and deleted 241,000 for a net growth of 205,000 lines, were contributed by 1,339 developers. The most active of those developers were:

Most active 3.13 developers

By changesets:

    Sachin Kamat            361   3.0%
    Jingoo Han              323   2.7%
    Marcel Holtmann         225   1.9%
    Viresh Kumar            169   1.4%
    Lars-Peter Clausen      150   1.3%
    H Hartley Sweeten       147   1.2%
    Ville Syrjälä           145   1.2%
    Joe Perches             135   1.1%
    Mark Brown              122   1.0%
    Takashi Iwai            120   1.0%
    Lee Jones               113   0.9%
    Linus Walleij           103   0.9%
    Peter Zijlstra           92   0.8%
    Wei Yongjun              88   0.7%
    Ben Widawsky             88   0.7%
    Al Viro                  87   0.7%
    Ian Abbott               85   0.7%
    Russell King             83   0.7%
    Thierry Reding           80   0.7%
    Ingo Molnar              76   0.6%

By changed lines:

    Ben Skeggs            19014   3.5%
    Greg Kroah-Hartman    17378   3.2%
    Jovi Zhangwei         16377   3.0%
    Guenter Roeck         13013   2.4%
    Eugene Krasnikov      10082   1.8%
    Patrick McHardy        8863   1.6%
    Joe Perches            7076   1.3%
    Ralf Baechle           6687   1.2%
    Archit Taneja          6246   1.1%
    Akhil Bhansali         6214   1.1%
    Aaro Koskinen          6164   1.1%
    Ard Biesheuvel         5814   1.1%
    Dave Chinner           5311   1.0%
    David Howells          5287   1.0%
    Russell King           5125   0.9%
    Hisashi Nakamura       4605   0.8%
    Ian Abbott             4452   0.8%
    Kent Overstreet        4349   0.8%
    Thierry Escande        4236   0.8%
    Jens Axboe             3745   0.7%

Sachin Kamat's and Jingoo Han's extensive janitorial work throughout the driver subsystem put them in the top two positions for changesets merged for the second cycle in a row. Marcel Holtmann did extensive surgery in the Bluetooth layer, Viresh Kumar did a lot of cleanup work in the cpufreq subsystem, and Lars-Peter Clausen did a lot of development in the driver tree, focusing especially on industrial I/O and audio drivers.

In the "lines changed" column, Ben Skeggs's work is concentrated, as always, on the nouveau driver. Greg Kroah-Hartman and Jovi Zhangwei do not properly belong on the list this month; they show up as a result of the addition of ktap to the staging tree (by Jovi) and its subsequent removal (by Greg). Guenter Roeck removed support for the Renesas H8/300 architecture, and Eugene Krasnikov contributed a single patch adding a driver for Qualcomm WCN3660/WCN3680 wireless adapters. Patrick McHardy's #6 position, resulting from the addition of the nftables subsystem, also merits a mention.

A minimum of 217 companies supported work on the 3.13 kernel; the most active of those were:

Most active 3.13 employers

By changesets:

    Intel                        1428  11.9%
    (None)                       1323  11.1%
    Linaro                       1166   9.7%
    Red Hat                      1082   9.0%
    Samsung                       594   5.0%
    (Unknown)                     570   4.8%
    IBM                           419   3.5%
    (Consultant)                  342   2.9%
    SUSE                          328   2.7%
    Texas Instruments             263   2.2%
    Outreach Program for Women    218   1.8%
    Freescale                     206   1.7%
    Google                        198   1.7%
    NVidia                        180   1.5%
    Vision Engraving Systems      147   1.2%
    Oracle                        135   1.1%
    Renesas Electronics           123   1.0%
    Free Electrons                121   1.0%
    Huawei Technologies           119   1.0%
    ARM                           111   0.9%

By lines changed:

    Red Hat                     63583  11.7%
    Intel                       59780  11.0%
    (None)                      51458   9.4%
    Linaro                      32054   5.9%
    (Unknown)                   26712   4.9%
    Texas Instruments           20219   3.7%
    Linux Foundation            18262   3.4%
    Huawei Technologies         18182   3.3%
    IBM                         15435   2.8%
    (Consultant)                14802   2.7%
    Samsung                     14739   2.7%
    Ericsson                    13722   2.5%
    NVidia                      10884   2.0%
    Astaro                       8863   1.6%
    Wind River                   8421   1.5%
    Renesas Electronics          7337   1.3%
    SUSE                         7230   1.3%
    Fusion-IO                    6956   1.3%
    Western Digital              6590   1.2%
    Nokia                        6479   1.2%

The percentage of contributions from volunteers is up a bit this time around, but not by enough to suggest any real change in its long-term decline. Perhaps the biggest surprise here, though, is that, for the first time, Red Hat has been pushed down in the "by changesets" column by Linaro. If there were ever any doubt that the mobile and embedded industries are playing an ever larger role in the development of the kernel, this should help to dispel it. That said, if one looks at the employers of the subsystem maintainers who merged these patches, the picture looks a bit different:

Employers with the most non-author signoffs:

    Red Hat               2115  19.2%
    Intel                 1704  15.5%
    Linux Foundation      1282  11.6%
    Linaro                 912   8.3%
    Google                 553   5.0%
    Samsung                464   4.2%
    (None)                 403   3.7%
    Texas Instruments      350   3.2%
    Novell                 348   3.2%
    IBM                    289   2.6%

The situation is changing here, with the mobile/embedded sector having a bigger presence than it did even one year ago, but, for the most part, entry into subsystem trees is still controlled by developers working for a relatively small number of mostly enterprise-oriented companies.

Finally, it can be interesting to look at first-time contributors — developers whose first patch ever went into 3.13. There were 219 of these first-time contributors in this development cycle. Your editor decided to look at the very first patch from each first-time contributor and see which files were touched. These changes are spread out throughout the kernel tree, but the most common places for first-time contributors to make their first changes in 3.13 were:

    Directory          Contributors
    drivers/staging         24
    drivers/net             21
    include                 21
    net                     19
    arch/arm                14
    drivers/gpu             10
    arch/powerpc            10
    arch/x86                 7
    drivers/media            7
    Documentation            7

One of the justifications behind the staging tree was that it would serve as an entry point for new developers; these numbers suggest that it is working. That said, if one looks at longer periods, more new contributors work in drivers/net than anywhere else.

Another interesting question is: what is the employment situation for first-time contributors to the kernel? Are new kernel hackers still volunteers, or do they have jobs already? The numbers are hazy, but there are still some conclusions that can be drawn:

    Employer                      Count
    (Unknown)                       97
    Intel                           21
    Huawei Technologies              6
    Samsung                          6
    Linaro                           5
    (None)                           4
    AMD                              3
    Texas Instruments                3
    Outreach Program for Women       3

Another way to put this information is that 118 of the first-time contributors in 3.13 were working for companies, 97 of them were unknown, and four were known to be volunteers. Many (but not all) of the unknowns will eventually turn out to have been working on their own time. But, even if every single one of them were a volunteer, we would still have more first-time contributors coming from companies than working on their own. In a time when experienced kernel developers can be hard to hire, companies will have little choice but to grow their own; some companies, clearly, are working to do just that.

And that, in turn, suggests that the long-term decline in volunteer contributions may not be a big problem in the end. Getting code into the kernel remains a good way to get a job, but, it seems, quite a few developers are successful at getting the job first, and contributing afterward. With luck, that will help us to continue to have a stream of new developers coming into the kernel development community.

Comments (7 posted)

Understanding the Jailhouse hypervisor, part 2

January 7, 2014

This article was contributed by Valentine Sinitsyn

In the first part of this series, we discussed what Jailhouse is, had a look at its data structures, covered how it is enabled, and what it does to initialize CPUs. This part concludes the series with a look at how Jailhouse handles interrupts, what is done to create a cell, and how the hypervisor is disabled.

Handling interrupts

Modern x86 processors are equipped with a "local advanced programmable interrupt controller" (LAPIC) that handles delivery of inter-processor interrupts (IPIs) as well as external interrupts that the I/O APIC, which is part of the system's chipset, generates. Currently, Jailhouse virtualizes the LAPIC only; the I/O APIC is simply mapped into the Linux cell, which is not quite safe because a malicious guest (or Linux kernel module) could reprogram it to tamper with other guests' work.

The LAPIC works in one of two modes: "xAPIC" or "x2APIC". The xAPIC mode is programmed via memory mapped I/O (MMIO), while the x2APIC uses model-specific registers (MSRs). x2APIC mode is backward-compatible with xAPIC, and its MSR addresses directly map to offsets in the MMIO page. When Jailhouse's apic_init() function initializes the LAPIC, it checks to see if x2APIC mode is enabled and sets up its apic_ops access methods appropriately. Internally, Jailhouse refers to all APIC registers by their MSR addresses. For xAPIC, these values are transparently converted to the corresponding MMIO offsets (see the read_xapic() and write_xapic() functions in apic.c as examples).
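
As a rough illustration of that register-naming scheme, the conversion from an x2APIC MSR address to an xAPIC MMIO offset can be sketched as below. This is based on the architectural MSR-to-MMIO mapping, not on the actual Jailhouse source; the names here are illustrative.

    /*
     * Sketch only: registers are named by their x2APIC MSR number
     * (0x800 + offset); in xAPIC mode that number is converted to an
     * offset into the memory-mapped LAPIC page.
     */
    #include <stdint.h>

    #define X2APIC_MSR_BASE  0x800
    #define XAPIC_BASE       0xfee00000UL    /* conventional LAPIC MMIO base */

    static volatile uint32_t *xapic_page;    /* page mapped at XAPIC_BASE */

    static uint32_t read_xapic_sketch(unsigned int msr_reg)
    {
        /* MSR 0x800 + n corresponds to MMIO byte offset n << 4 */
        return xapic_page[(msr_reg - X2APIC_MSR_BASE) << 2];
    }

    static void write_xapic_sketch(unsigned int msr_reg, uint32_t value)
    {
        xapic_page[(msr_reg - X2APIC_MSR_BASE) << 2] = value;
    }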

Jailhouse virtualizes the LAPIC in both modes, but the mechanism is slightly different. For xAPIC mode, a special LAPIC access page (apic_access_page[PAGE_SIZE], defined in vmx.c) is mapped into the guest's physical address space at XAPIC_BASE (0xfee00000); this happens in vmx_cell_init(). Later, in vmcs_setup(), LAPIC virtualization is enabled; this way, every time a guest tries to access the virtual LAPIC MMIO region, a trap back to the hypervisor (a "VM exit") occurs. No data is actually read from the virtual LAPIC MMIO page or written to it, so CPUs can share this page. For x2APIC, instead, normal MSR bitmaps are used. By default, Jailhouse traps access to all LAPIC registers; however, if apic_init() detects that the host LAPIC is in x2APIC mode, the bitmap is changed so that only ICR (interrupt control register) access is trapped. This happens when the master CPU executes vmx_init().

There is a special case when a guest tries to access a virtual x2APIC on a system where x2APIC is not enabled. In this case, the MSR bitmap remains unmodified. Jailhouse intercepts accesses to all LAPIC registers and passes incoming requests to the xAPIC using the apic_ops access methods, effectively emulating an x2APIC on top of the xAPIC. Since LAPIC registers are referred to in apic.c by their MSR addresses regardless of the mode, this emulation has very little overhead.

The main reason behind Jailhouse's trapping of accesses to the ICR (and a few other registers) is isolation: a cell shouldn't be able to send an IPI to a CPU that is not in its own CPU set, and the ICR is what defines an interrupt's destination. To achieve this isolation, apic_cpu_init() is called by the master CPU during initialization; it stores the mapping from the apic_id to the associated cpu_id in an array called, appropriately, apic_to_cpu_id. When a CPU is assigned a logical LAPIC ID, Jailhouse ensures that it is equal to cpu_id. This way, when an IPI is sent to a physical or logical destination, the hypervisor is able to map it to a cpu_id and check whether that CPU is in the cell's set. See apic_deliver_ipi() for details.
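
A minimal sketch of that check might look as follows; the names and types are assumptions made for illustration, not the definitions actually used in apic.c.

    /* Sketch: map a destination LAPIC ID back to a cpu_id and verify that
     * the CPU belongs to the sending cell before delivering an IPI. */
    #define APIC_MAX_ID  255

    static unsigned int apic_to_cpu_id[APIC_MAX_ID + 1]; /* filled by apic_cpu_init() */

    struct cell_cpus {
        unsigned long bitmap[(APIC_MAX_ID + 1) / 64];     /* the cell's cpu_set */
    };

    static int cell_owns_cpu(const struct cell_cpus *cpus, unsigned int cpu_id)
    {
        return (cpus->bitmap[cpu_id / 64] >> (cpu_id % 64)) & 1;
    }

    static int ipi_allowed(const struct cell_cpus *sender_cpus,
                           unsigned int dest_apic_id)
    {
        unsigned int cpu_id = apic_to_cpu_id[dest_apic_id];

        /* refuse IPIs whose destination lies outside the sender's CPU set */
        return cell_owns_cpu(sender_cpus, cpu_id);
    }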

Now let's turn to interrupt handling. In vmcs_setup(), Jailhouse does not enable traps to the hypervisor on external interrupts and sets the exception bitmaps to all zeroes. This means that the only interrupt that results in a VM exit is a non-maskable interrupt (NMI); everything else is dispatched through the guest's IDT and handled in guest mode. Since cells assert full control over their own resources, this makes sense.

Currently, NMIs can only come from the hypervisor itself, which uses them to control guest CPUs (arch_suspend_cpu() in apic.c is an example). When an NMI occurs in a guest, that guest exits VM mode and Jailhouse re-throws the NMI in host mode. The CPU dispatches it through the host IDT and jumps to apic_nmi_handler(). It schedules another VM exit using a virtual machines extensions (VMX) feature known as a "preemption timer." vmcs_setup() sets this timer to zero, so, if it is enabled, a VM exit occurs immediately after VM entry. The reason behind this indirection is serialization: this way, NMIs (which are asynchronous by nature) are always delivered after entry into the guest system and cannot interfere with the host-to-guest transition.

Jailhouse runs with interrupts disabled, so no interrupt other than an NMI can occur. Any exception in host mode is considered to be a serious fault and results in a panic.

Creating a cell

To create a new cell, Jailhouse needs to "shrink" the Linux cell by moving hardware resources to the new cell. It also obviously needs to load the guest image and perform a CPU reset to jump to the guest's entry point. This process starts in the Linux cell with the JAILHOUSE_CELL_CREATE ioctl() command, leading to a jailhouse_cell_create() function call in the kernel. This function copies the cell configuration and guest image from user space (the jailhouse user-space tool reads these from files and stores them in memory). Then, the cell's physical memory region is mapped and the guest image is moved to the target (physical) address specified by the user.

After that, jailhouse_cell_create() calls the standard Linux cpu_down() function to offline each CPU assigned to the new cell; this is required so that the kernel won't try to schedule processes on those CPUs. Finally, the loader issues a hypercall (JAILHOUSE_HC_CELL_CREATE) using the VMCALL instruction and passes a pointer to a struct jailhouse_cell_desc that describes the new cell. This causes a VM exit from the Linux cell to the hypervisor; vmx_handle_exit() dispatches the call to the cell_create() function defined in hypervisor/control.c. In turn, cell_create() suspends all CPUs assigned to the cell except the one executing the function (if it is in the cell's CPU set) to prevent races. This is done in cell_suspend(), which indirectly signals an NMI (as described above) to each CPU and waits for the cpu_stopped flag to be set on the target's cpu_data. Then, the cell configuration is mapped from the Linux cell to a per-CPU region above FOREIGN_MAPPING_BASE in the host's virtual address space (the loader copies this structure into kernel space).
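
For reference, a VMCALL-based hypercall of this sort can be issued from the Linux cell with a small inline-assembly wrapper along the lines of the sketch below; the register conventions shown are assumptions for illustration, not the documented Jailhouse hypercall ABI.

    /* Sketch of a VMCALL hypercall wrapper; the hypervisor's VM-exit handler
     * dispatches on the code passed in RAX. Register usage is assumed here. */
    static inline long hypercall_sketch(unsigned long code, unsigned long arg)
    {
        long result;

        asm volatile("vmcall"
                     : "=a" (result)
                     : "a" (code), "D" (arg)
                     : "memory");
        return result;
    }

    /* e.g. hypercall_sketch(JAILHOUSE_HC_CELL_CREATE, config_address); */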

Memory regions are checked as with the Linux cell, and the new cell is allocated and initialized. After that, the Linux cell is shrunk: all of the new cell's CPUs are removed from the Linux cell's CPU set, the Linux cell's mappings for the guest's physical addresses are destroyed, and the new cell's I/O resources have their bits set in the Linux cell's io_bitmap, so accessing them will result in a VM exit (and a panic). Finally, the new cell is added to the list of cells (a singly linked list with linux_cell as its head) and each CPU in the cell is reset using arch_cpu_reset().

On the next VM entry, the CPU will start executing code located at 0x000ffff0 in real mode. If one is running apic-demo according to the instructions in the README file, this is where apic-demo.bin's 16-bit entry point is. The address 0x000ffff0 is different from the normal x86 reset vector (0xfffffff0), and there is a reason: Jailhouse is not designed to run unmodified guests and has no BIOS emulation, so it can simplify the boot process and skip much of the work required for a real reset vector to work.

Cell initialization and destruction

Cells are represented by struct cell, defined in x86/include/asm/cell.h. This structure contains the page table directories for use with the VMX and VT-d virtualization extensions, the io_bitmap for VMX, cpu_set, and other fields. It is initialized as follows. First, cell_init() copies a name for the cell from a descriptor and allocates cpu_data->cpu_set if needed (sets less than 64 CPUs in size are stored within struct cell in the small_cpu_set field). Then, arch_cell_create(), the same function that shrinks the Linux cell, calls vmx_cell_init() for the new cell; it allocates VMX and VT-d resources (page directories and I/O bitmap), creates EPT mappings for the guest's physical address ranges (as per struct jailhouse_cell_desc), maps the LAPIC access page described above, and copies the I/O bitmap to struct cell from the cell descriptor (struct jailhouse_cell_desc). For the Linux cell, the master CPU calls this function during system-wide initialization.
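
Condensed into a sketch, the state carried by struct cell looks roughly like the following; field names and sizes are approximations based on the description above, not the actual definitions in x86/include/asm/cell.h.

    #include <stdint.h>

    struct cpu_set_sketch {
        unsigned long max_cpu_id;
        unsigned long bitmap[1];          /* grown for sets larger than 64 CPUs */
    };

    struct cell_sketch {
        char name[32];                    /* copied from struct jailhouse_cell_desc */
        uint64_t *vmx_ept_root;           /* page table directory used with VMX (EPT) */
        uint64_t *vtd_root;               /* page table directory used with VT-d */
        uint8_t io_bitmap[2 * 4096];      /* VMX I/O port interception bitmap */
        struct cpu_set_sketch *cpu_set;   /* points at small_cpu_set for small cells */
        struct cpu_set_sketch small_cpu_set;
        struct cell_sketch *next;         /* singly linked list headed by linux_cell */
    };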

When the Linux cell is shrunk, jailhouse_cell_create() has already put the detached CPUs offline. Linux never uses guest memory pages, since they are taken from the region reserved at boot as described in part 1. However, Jailhouse currently takes no action to detach I/O resources or devices in general. If they were attached to the Linux cell, they remain attached, and that can cause a panic if a Linux driver tries to use an I/O port that has been moved to another cell. To prevent this, you should not assign these resources to the Linux cell.

As of this writing, Jailhouse has no support for cell destruction. However, this feature has recently appeared in the development branch and will likely be available soon. When a cell is destroyed, its CPUs and memory pages are reassigned back to the Linux cell, and other resources are returned to where they came from.

Disabling Jailhouse

To disable Jailhouse, the user-space tool issues the JAILHOUSE_DISABLE ioctl() command, causing a call to jailhouse_disable(). This function calls leave_hypervisor() (found in main.c) on each CPU in the Linux cell and waits for these calls to complete. Then the hypervisor_mem mapping created in jailhouse_enable() is destroyed, the function brings up all offlined CPUs (which were presumably moved to other cells), and exits. From this point on, the Linux kernel is running on bare metal again.

The leave_hypervisor() call issues a JAILHOUSE_HC_DISABLE hypercall, causing a VM exit on the given CPU, after which vmx_handle_exit() calls shutdown(). For the first Linux CPU that calls it, this function iterates over the CPUs in all cells other than the Linux cell and calls arch_shutdown_cpu() for each of them. A call to arch_shutdown_cpu() is equivalent to suspending the CPU, setting cpu_data->shutdown_cpu to true, then resuming the CPU. As described above, this sequence transfers control to apic_handle_events(), but this time that function detects that the CPU is shutting down. It disables the LAPIC and effectively executes a VMXOFF; HLT sequence to disable VMX on the CPU and halt it. This way, the hypervisor is disabled on all CPUs outside of the Linux cell.
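
That per-CPU shutdown sequence can be summarized in a short sketch; the helper names below are hypothetical stand-ins for the suspend/resume primitives mentioned above, not Jailhouse's own functions.

    /* Sketch of the arch_shutdown_cpu() flow: suspend via NMI, mark the CPU
     * as shutting down, then resume it so its event handler sees the flag
     * and executes VMXOFF followed by HLT. Helpers are illustrative only. */
    struct cpu_data_sketch {
        volatile int cpu_stopped;
        volatile int shutdown_cpu;
    };

    extern struct cpu_data_sketch *per_cpu_sketch(unsigned int cpu_id);
    extern void suspend_cpu_sketch(unsigned int cpu_id); /* NMI + wait for cpu_stopped */
    extern void resume_cpu_sketch(unsigned int cpu_id);

    static void shutdown_cpu_sketch(unsigned int cpu_id)
    {
        suspend_cpu_sketch(cpu_id);
        per_cpu_sketch(cpu_id)->shutdown_cpu = 1;
        resume_cpu_sketch(cpu_id);           /* handler disables VMX and halts */
    }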

When shutdown() returns, VT-d is disabled and the hypervisor restores the Linux environment for the CPU. First, the cpu_data->linux_* fields are copied from the VMCS guest area. Then, arch_cpu_restore() is called to disable VMX (without halting the CPU this time) and restore various register values from cpu_data->linux_*. Afterward, the general-purpose registers are popped from the hypervisor stack, the Linux stack is restored, the RAX register is zeroed, and a RET instruction is issued. For the Linux kernel, everything looks as if leave_hypervisor() has returned successfully; this happens on each CPU in the Linux cell. After that, any offlined CPUs (likely halted by arch_shutdown_cpu()) are brought back to the active state, as described earlier.

Conclusion

Jailhouse is a young project that is developing quickly. It is a lightweight system that does not intend to replace full-featured hypervisors like Xen or KVM, but this doesn't mean that Jailhouse itself is feature-limited. It is a rare project that has potential both in the classroom and in production, and we hope this article has helped you to understand it better.

Comments (5 posted)

Btrfs: Subvolumes and snapshots

By Jonathan Corbet
January 6, 2014
LWN's guide to Btrfs
The previous installment in LWN's ongoing series on the Btrfs filesystem covered multiple device handling: various ways of setting up a single filesystem on a set of physical devices. Another interesting aspect of Btrfs can be thought of as working in the opposite manner: subvolumes allow the creation of multiple filesystems on a single device (or array of devices). Subvolumes create a number of interesting possibilities not supported by other Linux filesystems. This article will discuss how to use the subvolume feature and the associated snapshot mechanism.

Subvolume basics

A typical Unix-style filesystem contains a single directory tree with a single root. By default, a Btrfs filesystem is organized in the same way. Subvolumes change that picture by creating alternative roots that function as independent filesystems in their own right. This can be illustrated with a simple example:

    # mkfs.btrfs /dev/sdb5
    # mount /dev/sdb5 /mnt/1
    # cd /mnt/1
    # touch a

Thus far, we have a mundane btrfs filesystem with a single empty file (called "a") on it. To create a subvolume and create a file within it, one can type:

    # btrfs subvolume create subv
    # touch subv/b
    # tree
    .
    ├── a
    └── subv
	└── b

    1 directory, 2 files

The subvolume has been created with the name subv; thus far, the operation looks nearly indistinguishable from having simply created a directory by that name. But there are some differences that pop up if one looks for them. For example:

    # ln a subv/
    ln: failed to create hard link ‘subv/a’ => ‘a’: Invalid cross-device link

So, even though subv looks like an ordinary subdirectory, the filesystem treats it as if it were on a separate physical device; moving into subv is like crossing an ordinary Unix mount point, even though it's still housed within the original btrfs filesystem. The subvolume can also be mounted independently:

    # btrfs subvolume list /mnt/1
    ID 257 gen 8 top level 5 path subv
    # mount -o subvolid=257 /dev/sdb5 /mnt/2
    # tree /mnt/2
    /mnt/2
    └── b

    0 directories, 1 file

The end result is that each subvolume can be treated as its own filesystem. It is entirely possible to create a whole series of subvolumes and mount each separately, ending up with a set of independent filesystems all sharing the underlying storage device. Once the subvolumes have been created, there is no need to mount the "root" device at all if only the subvolumes are of interest.

Btrfs will normally mount the root volume unless explicitly told to do otherwise with the subvolid= mount option. But that is simply a default; if one wanted the new subvolume to be mounted by default instead, one could run:

    btrfs subvolume set-default 257 /mnt/1

Thereafter, mounting /dev/sdb5 with no subvolid= option will mount the subvolume subv. The root volume has a subvolume ID of zero, so mounting with subvolid=0 will mount the root.

Subvolumes can be made to go away with:

    btrfs subvolume delete path

For ordinary subvolumes (as opposed to snapshots, described below), the subvolume indicated by path must be empty before it can be deleted.

Snapshots

A snapshot in Btrfs is a special type of subvolume — one which contains a copy of the current state of some other subvolume. If we return to our simple filesystem created above:

    # btrfs subvolume snapshot /mnt/1 /mnt/1/snapshot
    # tree /mnt/1
    /mnt/1
    ├── a
    ├── snapshot
    │   ├── a
    │   └── subv
    └── subv
        └── b

    3 directories, 3 files

The snapshot subcommand creates a snapshot of the given subvolume (the /mnt/1 root volume in this case), placing that snapshot under the requested name (/mnt/1/snapshot) in that subvolume. As a result, we now have a new subvolume called snapshot which appears to contain a full copy of everything that was in the filesystem previously. But, of course, Btrfs is a copy-on-write filesystem, so there is no need to actually copy all of that data; the snapshot simply has a reference to the current root of the filesystem. If anything is changed — in either the main volume or the snapshot — a copy of the relevant data will be made, so the other copy will remain unchanged.

Note also that the contents of the existing subvolume (subv) do not appear in the snapshot. If a snapshot of a subvolume is desired, that must be created separately.

Snapshots clearly have a useful backup function. If, for example, one has a Linux system using Btrfs, one can create a snapshot prior to installing a set of distribution updates. If the updates go well, the snapshot can simply be deleted. (Deletion is done with "btrfs subvolume delete" as above, but snapshots are not expected to be empty before being deleted). Should the update go badly, instead, the snapshot can be made the default subvolume and, after a reboot, everything is as it was before.
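
That workflow might look something like the following; the snapshot name and the subvolume ID shown are hypothetical, and the ID would actually come from "btrfs subvolume list":

    # btrfs subvolume snapshot / /pre-update
    # <apply the distribution updates>
    # btrfs subvolume list /              # suppose the snapshot shows up as ID 260
    # btrfs subvolume set-default 260 /
    # <reboot; the system comes back up in its pre-update state>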

Snapshots can also be used to implement a simple "time machine" functionality. While working on this article series, your editor set aside a Btrfs partition to contain a copy of /home. On occasion, a simple script runs:

    rsync -aix --delete /home /home-backup
    btrfs subvolume snapshot /home-backup /home-backup/ss/`date +%y-%m-%d_%H-%M`

The rsync command makes /home-backup look identical to /home; a snapshot is then made of that state of affairs. Over time, the result is the creation of a directory full of timestamped snapshots; returning to the state of /home at any given time is a simple matter of going into the proper snapshot. Of course, if /home is also on a Btrfs filesystem, one could make regular snapshots without the rsync step, but the redundancy that comes with a backup drive would be lost.

One can quickly get used to having this kind of resource available. This also seems like an area that is just waiting for the development of some higher-level tools. Some projects are already underway; see Snapper or btrfs-time-machine, for example. There is also an "autosnap" feature that has been posted in the past, though it does not seem to have seen any development recently. For now, most snapshot users are most likely achieving the desired functionality through their own sets of ad hoc scripts.

Subvolume quotas

It typically will not take long before one starts to wonder how much disk space is used by each subvolume. A naive use of a tool like du may or may not produce a useful answer; it is slow and unable to take into account the sharing of data between subvolumes (snapshots in particular). Beyond that, in many situations, it would be nice to be able to divide a volume into subvolumes but not to allow any given subvolume to soak up all of the available storage space. These needs can be met through the Btrfs subvolume quota group mechanism.

Before getting into quotas, though, a couple of caveats are worth mentioning. One is that "quotas" in this sense are not normal, per-user disk quotas; those can be managed on Btrfs just like with any other filesystem. Btrfs subvolume quotas, instead, track and regulate usage by subvolumes, with no regard for the ownership of the files that actually take up the space. The other thing worth bearing in mind is that the quota mechanism is relatively new. The management tools are on the rudimentary side, there seem to be some performance issues associated with quotas, and there's still a sharp edge or two in there waiting for unlucky users.

By default, Btrfs filesystems do not have quotas enabled. To turn this feature on, run:

    # btrfs quota enable path

A bit more work is required to retrofit quotas into an older Btrfs filesystem; see this wiki page for details. Once quotas are established, one can look at actual usage with:

    # btrfs qgroup show /home-backup
    qgroupid rfer        excl       
    -------- ----        ----       
    0/5      21184458752 49152      
    0/277    21146079232 2872635392 
    0/281    20667858944 598929408  
    0/282    20731035648 499802112  
    0/284    20733419520 416395264  
    0/286    20765806592 661327872  
    0/288    20492754944 807755776  
    0/290    20672286720 427991040  
    0/292    20718280704 466567168  
    0/294    21184458752 49152      

This command was run in the time-machine partition described above, where all of the subvolumes are snapshots. The qgroupid is the ID number (actually a pair of numbers — see below) associated with the quota group governing each subvolume, rfer is the total amount of data referred to in the subvolume, and excl is the amount of data that is not shared with any other subvolume. In short, "rfer" approximates what "du" would indicate for the amount of space used in a subvolume, while "excl" tells how much space would be freed by deleting the subvolume.

...or, something approximately like that. In this case, the subvolume marked 0/5 is the root volume, which cannot be deleted. "0/294" is the most recently created snapshot; it differs little from the current state of the filesystem, so there is not much data that is unique to the snapshot itself. If one were to delete a number of files from the main filesystem, the amount of "excl" data in that last snapshot would increase (since those files still exist in the snapshot) while the amount of free space in the filesystem as a whole would not increase.

Limits can be applied to subvolumes with a command like:

    # btrfs qgroup limit 30M /mnt/1/subv

One can then test the limit with:

    # dd if=/dev/zero of=/mnt/1/subv/junk bs=10k
    dd: error writing ‘junk’: Disk quota exceeded
    2271+0 records in
    2270+0 records out
    23244800 bytes (23 MB) copied, 0.0334957 s, 694 MB/s

One immediate conclusion that can be drawn is that the limits are somewhat approximate at best; in this case, a limit of 30MB was requested, but the enforcement kicked in rather sooner than that. This happens even though the system appears to have a clear understanding of both the limit and current usage:

    # btrfs qgroup show -r /mnt/1
    qgroupid rfer     excl     max_rfer 
    -------- ----     ----     -------- 
    0/5      16384    16384    0        
    0/257    23261184 23261184 31457280 

The 0/257 line corresponds to the subvolume of interest; the current usage is shown as being rather less than the limit, but writes were limited anyway.

There is another interesting complication with subvolume quotas, as demonstrated by:

    # rm /mnt/1/subv/junk
    rm: cannot remove ‘/mnt/1/subv/junk’: Disk quota exceeded

In a copy-on-write world, even deleting data requires allocating space, for a while at least. A user in this situation would appear to be stuck; little can be done until somebody raises the limit for at least as long as it takes to remove some files. This particular problem has been known to the Btrfs developers since 2012, but there does not yet appear to be a fix in the works.

The quota group mechanism is somewhat more flexible than has been shown so far; it can, for example, organize quotas in hierarchies that apply limits at multiple levels. Imagine one had a Btrfs filesystem to be used for home directories, among other things. Each user's home could be set up as a separate subvolume with something like this:

    # cd /mnt/1
    # btrfs subvolume create home 
    # btrfs subvolume create home/user1
    # btrfs subvolume create home/user2
    # btrfs subvolume create home/user3

By default, each subvolume is in its own quota group, so each user's usage can be limited easily enough. But if there are other hierarchies in the same Btrfs filesystem, it might be nice to limit the usage of home as a whole. One would start by creating a new quota group:

    # btrfs qgroup create 1/1 home

Quota group IDs are, as we have seen, a pair of numbers; the first of those numbers corresponds to the group's level in the hierarchy. At the leaf level, that number is zero; IDs at that level have the subvolume ID as the second number of the pair. All higher levels are created by the administrator, with the second number being arbitrary.

The assembly of the hierarchy is done by assigning the bottom-level groups to the new higher-level groups. In this case, the subvolumes created for the user-level directories have IDs 258, 259, and 260 (as seen with btrfs subvolume list), so the assignment is done with:

    # btrfs qgroup assign 0/258 1/1 .
    # btrfs qgroup assign 0/259 1/1 .
    # btrfs qgroup assign 0/260 1/1 .

Limits can then be applied with:

    # btrfs qgroup limit 5M 0/258 .
    # btrfs qgroup limit 5M 0/259 .
    # btrfs qgroup limit 5M 0/260 .
    # btrfs qgroup limit 10M 1/1 .

With this setup, any individual user can use up to 5MB of space within their own subvolume. But users as a whole will be limited to 10MB of space within the home subvolume, so if user1 and user2 use their full quotas, user3 will be entirely out of luck. After creating exactly such a situation, querying the quota status on the filesystem shows:

    # btrfs qgroup show -r .
    qgroupid rfer     excl     max_rfer 
    -------- ----     ----     -------- 
    0/5      16384    16384    0        
    0/257    16384    16384    0        
    0/258    5189632  5189632  5242880  
    0/259    5189632  5189632  5242880  
    0/260    16384    16384    5242880  
    1/1      10346496 10346496 10485760 

We see that the first two user subvolumes have exhausted their quotas; that is also true of the upper-level quota group (1/1) that we created for home as a whole. As far as your editor can tell, there is no way to query the shape of the hierarchy; one simply needs to know how that hierarchy was built to work with it effectively.

As can be seen, subvolume quota support still shows signs of being relatively new code; there is still a fair amount of work to be done before it is truly ready for production use. Subvolume and snapshot support in general, though, has been around for years and is in relatively good shape. All told, subvolumes offer a highly useful feature set; in the future, we may well wonder how we ran our systems without them.

At this point, our survey of the major features of the Btrfs filesystem is complete. The next (and final) installment in this series will cover a number of loose ends, the send/receive feature, and more.

Comments (46 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Janitorial

Memory management

Virtualization and containers

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds