
Kernel development

Brief items

Kernel release status

The current development kernel is 4.8-rc3, released on August 21. According to Linus: "It all looks pretty sane, I'm not seeing anything hugely scary here."

Stable updates: 4.7.2, 4.4.19, and 3.14.77 were released on August 21.


Quotes of the week

The latest Jason Bourne movie was sufficiently bad that I spent time thinking how the tree_lock could be batched during reclaim.
Mel Gorman shows how memory-management development is done.

I talked a lot with Linus about design at this time, but never really participated in the kernel work (partly because disagreeing with Linus is a high-stress thing).
Lars Wirzenius looks back

I want the code, and I want the company that produced that code to join our community. So far we are doing really well in achieving that goal.
Greg Kroah-Hartman


Kernel development news

Restartable sequences restarted

By Jonathan Corbet
August 24, 2016
"Restartable sequences" is starting to look a bit like one of those bright ideas that float around on the kernel list for years but never quite seem to make it into the mainline. In this case, the idea was first proposed over one year ago but has yet to make appreciable progress toward merging; activity on the patch set died down after a while. Development on restartable sequences has now picked up again under a new developer, who has come up with yet another API for the feature.

As in the kernel, scalability pressures are driving some user-space applications toward the use of lockless algorithms. In kernel space, such algorithms tend to be based on either disabling preemption or retrying an operation after contention is detected. Disabling preemption in user space is not an option, so retries are the primary option remaining. That is where restartable sequences come in; they combine a kernel-facilitated mechanism for detecting possible contention with a means to quickly force a retry when contention happens.

The current version of restartable sequences, as posted by Mathieu Desnoyers, retains the core idea of its predecessors. A restartable sequence is based around a short segment of code; only the final instruction of that segment is allowed to have side effects visible outside of the current thread. There is also an abort sequence, called to clean up and retry should the thread be preempted while executing the sequence. The specifics have changed, though.

Code using restartable sequences needs to start with an rseq structure:

    struct rseq {
        int32_t cpu_id;
        uint32_t event_counter;
        struct rseq_cs *rseq_cs;
    };

(The actual structure is a bit more complex; various architecture-specific details have been omitted here in the interest of readability.) The cpu_id field always contains the number of the CPU on which the thread is running; event_counter is incremented whenever the thread is preempted — but only if rseq_cs is not null. The purpose of rseq_cs will be discussed below.

This structure must be registered with the kernel before restartable sequences can be used; the relevant system call is:

    int rseq(struct rseq *rseq, int flags);

Only one rseq structure can be registered at a time in any given thread, but that structure can be registered multiple times, and the kernel will keep track of how many registrations (and unregistrations) there have been. The flags argument must be zero when registering a new structure. Unregistration is done by passing a null pointer for the rseq structure; setting flags to RSEQ_FORCE_UNREGISTER will cause the immediate removal of the structure, even if it has been registered multiple times.

In the past there have been concerns about how the restartable sequences feature would work when there are multiple users within an application (libraries, for example) that do not know about each other. If those users fight over which rseq structure is used, there will be problems with this interface as well; if, instead, they can all agree on the same structure, all will be well. Restartable sequences must be simple, so it makes no sense for code running within one to call another function at all, much less one that would start its own sequence. So there can only be a single sequence running at any given time.

To ensure that all users share a single rseq structure, the documentation recommends that each user declare it as a weak symbol and name it __rseq_abi. The linker will then ensure that, if there are multiple declarations within a given program, they will all refer to the same structure.
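Concretely, each participant might declare the structure along these lines. This is only a sketch using the simplified layout shown above (the real structure has more fields), with `__attribute__((weak))` as the GCC/Clang spelling of a weak symbol:

```c
#include <stdint.h>

struct rseq_cs;

/* Simplified layout from above; the real structure has more fields. */
struct rseq {
    int32_t cpu_id;
    uint32_t event_counter;
    struct rseq_cs *rseq_cs;
};

/*
 * Each library or application provides the same weak definition; if
 * several of them are linked into one program, the linker collapses
 * the definitions into a single shared object in memory, so all users
 * end up registering and consulting the same structure.
 */
__attribute__((weak)) struct rseq __rseq_abi;
```

With that declaration in every user, any one of them can register the shared structure with a call like rseq(&__rseq_abi, 0), and the kernel's registration counting handles the fact that several users may do so.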

The other half of the puzzle is the rseq_cs structure pointed to from within the rseq structure above. This structure looks like (again, with some simplification applied):

    struct rseq_cs {
        void *start_ip;
        void *post_commit_ip;
        void *abort_ip;
    };

This structure describes an actual critical section that runs in the restartable mode. Here, start_ip is the address of the first instruction in the section, and post_commit_ip is the first instruction beyond the end of the section; any code running between those two instructions is running within the critical section. The abort_ip pointer is the address of the cleanup code to be executed should the thread be preempted while executing within the section.

With those pieces, a restartable sequence is run using something like this sequence of steps (assuming that the rseq structure is already registered):

  1. The event_counter field from the rseq structure is read and saved.
  2. The rseq_cs pointer in the rseq structure is set to point to the rseq_cs structure describing the critical section to be executed.
  3. The event_counter is read again and compared to the value read previously; if the values do not match, the rseq_cs field should be cleared and the process must be restarted from the beginning.
  4. The critical section can now be executed. In most cases, only the final instruction in the critical section should have visible side effects.
  5. The rseq_cs field should be set to NULL.

If execution makes it past the end of the section, then all is well. If, instead, the thread is preempted while running within the critical section, the kernel will cause it to jump to the abort_ip address. The code found there should clean up and prepare to retry.

In principle, that is all there is to it. In practice, applications using this feature must still include some assembly code to set up the various instruction pointers; there is some complexity involved in making it all work properly. Those interested in examples can have a look at the self-tests included with the patch and, in particular, the rather frightening assembly-in-CPP code found here and here.

There have not been many comments on the implementation this time around; it seems that, perhaps, things are finally getting to a point where the developers who are paying attention are reasonably happy. The next obstacle, though, may be Linus, who wants more evidence that this is a feature that will actually be used. Convincing him is likely to require demonstrating some real-world code that benefits from the feature and benchmarks to prove that it is all worthwhile. Since restartable sequences are said to have been in use in places like Google for some time, that proof should be possible to come by. If the developers involved follow through, perhaps this sequence of patches will not need to be restarted too many more times.


Btrfs and high-speed devices

By Jake Edge
August 24, 2016

LinuxCon North America

At LinuxCon North America in Toronto, Chris Mason relayed some of the experiences that his employer, Facebook, has had using Btrfs, especially with regard to its performance on high-speed solid-state storage devices (SSDs). While Mason was the primary developer early on in the history of Btrfs, he is one of a few maintainers of the filesystem now, and the project has seen contributions from around 70 developers throughout the Linux community in the last year.

[Chris Mason]

He is on the kernel team at Facebook; one of the main reasons the company wanted to hire him was because it wanted to use Btrfs in production. Being able to use Btrfs in that kind of environment is also the primary reason he chose to take the job, he said. As the company is rolling Btrfs out, it is figuring out which features it wants to use and finding things that work well and not so well.

Mason went through the usual list of high-level Btrfs features, including efficient writable snapshots, internal RAID with restriping, online device management, online scrubbing to check in the background if the CRCs are the same as when the data was written, and so on. The CRCs for both data and metadata are a feature that "saved us a lot of pain" at Facebook, he said.

The Btrfs CRC checking means that a read from a corrupted sector will cause an I/O error rather than return garbage. Facebook had some storage devices that would appear to store data correctly in a set of logical block addresses (LBAs) until the next reboot, at which point reads to those blocks would return GUID partition table (GPT) data instead. He did not name the device maker because it turned out to actually be a BIOS problem. In any case, the CRCs allowed the Facebook team to quickly figure out that the problem was not in Btrfs when it affected thousands of machines as they were rebooted for a kernel upgrade.

Volume management in Btrfs is done in terms of "chunks", which are normally 1GB in size. That is part of what allows the filesystem to handle differently sized devices for RAID volumes, for example. Volumes can have specific chunks reserved for data or metadata and different RAID levels can be applied to each (e.g. RAID-1 for the metadata and RAID-5 for the data).

But Btrfs has had some lock-contention problems; it still has some of them, he said, though there are improvements coming. The filesystem is optimized for use on SSDs, but he ran an fs_mark benchmark in a virtual machine (for comparative rather than hard numbers) creating zero-length files and found that XFS could create roughly four times the number of files per second (33,000 versus 9,000). That was "not good", but before he started tuning Btrfs, he wanted to make XFS go as fast as he could.

To that end, he looked at what XFS was blocked on, which turned out to be locks for allocating filesystem objects. By increasing the allocation groups in the filesystem when it was created (from four to sixteen to match the number of CPUs in his test system), he could increase its performance to 200,000 file-creations per second. At that point, it was mostly CPU bound and the function using the most CPU was one that could not be easily tweaked away with a mkfs option.

So then he turned to Btrfs. Using perf, he was able to see that there was lock contention on the B-tree locks. The Btrfs B-tree stores all of its data in the leaves of the tree; when it is updating the tree, it has to lock non-leaf nodes on the way to the leaf, starting with the root node. For some operations, those locks have to be held as it traverses the tree. Hopefully only the leaf needs to be locked, but sometimes that is not the case and, since everything starts at the root, it is not surprising that there is contention for that lock.

As an experiment to make Btrfs go faster, he used the subvolume feature to effectively create more root nodes. Instead of the usual one volume (with one root node), he created sixteen subvolumes so that there was one per CPU, each with its own root node and lock. That allowed Btrfs to get close to the XFS performance at 175,000 file-creations per second.

But the goal was to make the filesystem faster without resorting to subvolumes, which led to a new B-tree locking scheme. By default, Btrfs has 16KB nodes, which is not changing, but instead of being treated as a single group, each node will now be broken up into sixteen groups, each with its own lock.

He has not yet picked the best number of groups for each node, but the change allows a default Btrfs filesystem to create 90,000 files per second. There are a lot of assumptions in Btrfs that there is only one lock per node, which he is working on removing. In addition, Btrfs switched to reader/writer locks a while back and it turns out that those perform worse than expected, so he will be looking into that.

By some other measures, though, Btrfs compares favorably with XFS on the benchmark. XFS writes 120MB/second and does 3000 I/O operations/second (IOPS) for the benchmark, while Btrfs does 50MB/second and 300 IOPS to accomplish the same amount of work. That means that Btrfs is ordering things better and doing less I/O, Mason said.

The Gluster workloads at Facebook, which use rotational storage, are extremely sensitive to metadata latency to the point where one node's high latency can make the entire cluster slower than it should be. In the past, the company has used flashcache (which is similar to bcache) for both XFS and Btrfs to cache some data and metadata on SSDs, which improves the metadata latencies, but not enough.

To combat that, he has a set of patches to automatically put the Btrfs metadata on SSDs. The block layer provides information on whether the storage is rotational; for now, his patch assumes that if it is not rotational then it is fast. The patch has made a huge difference in the latencies and requires relatively little flash storage (e.g. 450GB for a 40TB filesystem) for Facebook's file workload, which consists of a wide variety of file sizes. "You will need a lot more metadata if you have all 2KB files", he said.

That patch set is small (73 lines of code added), which is nice, he said. It is not entirely complete, though, as btrfs-utils needs changes to support it, but that should be a similarly sized change.

Another bottleneck he has encountered is in using the trim (or discard) command to tell SSDs about blocks that are no longer in use by the filesystem. That allows the flash translation layer to ignore those blocks when it is doing garbage collection and should, in theory, provide better performance. But many devices are slow when handling trim commands. Both XFS and Btrfs keep lists of blocks to trim, submit them as trim commands, and then must wait for those commands to complete during transaction commits, which stalls new filesystem operations. Those stalls can be huge, on the order of "tens of seconds", he said.

Ric Wheeler spoke up to say that trim is simply a request that the drive is free to ignore. He suggested that trim should not be performed during regular filesystem operations. Ted Ts'o agreed and said that the best practice for ext4 and probably other filesystems was to run the fstrim batch-trimming command regularly out of cron.
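A batch-trimming cron job of the sort Ts'o described might look like the following /etc/crontab fragment; the weekly schedule is only illustrative, and sites would pick timing and targets to suit their workload:

```
# Hypothetical example: batch-discard all mounted filesystems that
# support trim, early Sunday morning, instead of trimming inline.
0 3 * * 0  root  /sbin/fstrim --all
```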

In answer to a question, Mason said that the disadvantages of not trimming are device-dependent. In some cases, it may reduce the lifetime of the device or add latencies during garbage collection, but it may also do nothing. Wheeler pointed out that if you are using thin provisioning, though, failing to trim could cause the storage to run out of space when there is actually space available.

Though it is not a flash-specific change, there have been some problems with large (> 16TB) Btrfs filesystems because of the free-space cache. Originally, free extents were not tracked, but that required scanning the entire filesystem at mount time, which was slow. When the free-space cache was added, it was per-block-group, and large filesystems have a lot of block groups, which meant more cache updating on each commit. In the 4.5 kernel, Omar Sandoval added a new free-space cache (which can be enabled with -o space_cache=v2) that is "dramatically faster", with commit latencies dropping from four seconds to zero.

For the near future, he plans to finalize the new B-tree locking and improve some fsync() bottlenecks, though he thinks that the new space cache will help there. There are also some other spinlocks slowing things down that he wants to look at.

He mentioned a few of the tools that he uses to find bottlenecks. Perf is the right tool when processing is "pegged in the CPU", but finding problems when things are blocking is much harder. For that, he recommended BPF and BCC. In particular, Brendan Gregg's offcputime BPF script is useful to show both kernel and application stack traces to help show the reasons why a process is blocked. In fact, Facebook likes offcputime so much that fellow Btrfs maintainer Josef Bacik has created a way to aggregate the output of the program across multiple systems.

There were a few questions at the end of the session. One person asked whether Mason had seen any uptake of Btrfs for smaller devices. Mason said that the filesystem "needs love and care" when it is being used, which is why Facebook can use it. Someone with an ARM background would need to be working on Btrfs upstream in order to provide that kind of care if it were to be adopted on ARM-powered devices, he said.

Another asked how much faster the current design of Btrfs could go. Mason seemed quite optimistic that it could go "much faster". The metadata format is flexible, so "if things are broken, we can fix them".

The last two questions regarded two different benchmarks, both of which are interesting, but neither of which Mason has run. Flashcache versus bcache would likely provide similar numbers, he thought, but flashcache worked for Facebook so there was no need to try bcache. He also has not run benchmarks against ZFS. When he started Btrfs, ZFS was not available. There is no reason not to do so now, he said, but he hasn't, though he would be interested in the results.

[I would like to thank the Linux Foundation for travel assistance to Toronto for LinuxCon North America.]


Network filtering for control groups

By Jonathan Corbet
August 24, 2016
Control groups (cgroups) perform two basic functions in the kernel: they allow the hierarchical grouping of processes, and they enable the use of controllers to apply resource limits to the processes in each group. Now there is interest in extending cgroups to allow for the control of network traffic as well, but there is a significant difference of opinion over the best way to implement this control. Naturally, the discussion involves another kernel technology that seems to be spreading out into all areas: the Berkeley packet filter (BPF) virtual machine.

The objective is to be able to apply a filter to network traffic going to or from any process contained within a given cgroup. The intent may be to improve security, by restricting the traffic that a particular system service or application (contained within its own cgroup) can generate. Or it could be a desire for simple resource control or accounting. Either way, the point is to have this control at the cgroup level, something that the kernel does not support now.

One possible solution, posted by Daniel Mack, is to allow a BPF program to be attached to a cgroup. To that end, the bpf() system call is extended with a new BPF_PROG_ATTACH operation. Exactly what the program is attached to depends on the type of the program; for now the only type supported is BPF_PROG_TYPE_CGROUP_SOCKET_FILTER, but the possibility exists that other types (to make other sorts of policy decisions for cgroups) could be supported in the future. Programs may be attached as either an ingress or an egress filter, controlled by a flag passed to the bpf() call. Naturally, there is also a BPF_PROG_DETACH operation to remove a BPF program from a cgroup.

Once the program is attached, it will be run on each packet sent to or from a process in the cgroup, depending on how it was attached — though only the ingress side is implemented in the current patch set. If the program returns one, the packet will be allowed to pass; otherwise it will be dropped.

The idea is thus relatively straightforward; it is similar to the socket filters that an individual process can apply to a socket it owns now. Cgroup maintainer Tejun Heo had some quibbles with the implementation, but had no real objection to the overall design. It seems like something that could be added without a whole lot of trouble — except that one developer has different ideas.

That developer is Pablo Neira Ayuso, the maintainer of the netfilter subsystem. Perhaps unsurprisingly, he thinks that the proper solution is based on netfilter rather than BPF; in particular, he would like to see the establishment of a special table of rules that could be attached to a cgroup. In his opinion, a set of rules that can be queried with existing tools would be easier for administrators to deal with than a relatively opaque BPF program. Multiple sets of netfilter rules can be composed, while the BPF approach only allows for a single program to be attached to a cgroup, limiting flexibility in situations where more than one entity wants to add filtering rules. A netfilter-based approach could also take advantage of the connection tracking that, likely, is already being done, speeding the processing of most packets. Those reasons, he says, make netfilter the better tool for this particular job.

Daniel acknowledged the downsides of the BPF implementation, though he was less convinced about the importance of some of them. It seems that this project was looking at a netfilter-based solution early on, but chose to refocus on BPF. There were concerns that the netfilter developers did not actually want a cgroup-level hook, and that the performance of the netfilter system might not be up to the task. He summarized things this way:

The whole 'eBPF in cgroups' idea was born because through the discussions over the past months we had on all this, it became clear to me that netfilter is not the right place for filtering on local tasks. I agree the solution I am proposing in my patch set has its downsides, mostly when it comes to transparency to users, but I considered that acceptable. After all, we have eBPF users all over the place in the kernel already, and seccomp, for instance, isn't any better in that regard.

Even so, he said, he would be willing to look again at a solution based on netfilter, especially if Pablo were willing to help with the implementation — something that Pablo said he could do. BPF developer Alexei Starovoitov was rather less impressed, suggesting that a netfilter-based solution should be considered as a separate facility in the future, if a way can be found to implement it without slowing things down too much.

And that is where the discussion stands as of this writing. In a sense, netfilter and BPF were always destined to come into conflict at some point; both are, in essence, mechanisms for loading packet-filtering policy into the kernel. Even if this particular disagreement is solved without undue drama, this question is likely to come up again in other contexts. Thus far, there seem to be few bounds on places where BPF may be applicable but, perhaps, it still isn't the solution to every policy problem that comes along.


Semantics of MMIO mapping attributes across architectures

August 24, 2016

This article was contributed by Paul E. McKenney, Will Deacon, and Luis R. Rodriguez

Although both memory-mapped I/O (MMIO) and normal memory (RAM) are ultimately accessed using the same CPU instructions, they are used for very different purposes. Normal memory is used to store and retrieve data, of course, while MMIO is instead primarily used to communicate with I/O devices, to initiate I/O transfers and to acknowledge interrupts, for example. And while concurrent access to shared memory can be complex, programmers need not worry about what type of memory is in use, with only a few exceptions. In contrast, even in the single-threaded case, understanding the effects of MMIO read and write operations requires a detailed understanding of the specific device being accessed by those reads and writes. But the Linux kernel is not single-threaded, so we also need to understand MMIO ordering and concurrency issues.

This article looks under the hood of the Linux kernel's MMIO implementation, covering a number of topics:

  1. MMIO introduction
  2. MMIO access primitives
  3. Memory types
  4. x86 implementation
  5. ARM64 implementation
  6. PowerPC implementation
  7. Summary and conclusions

MMIO introduction

[MMIO write]

MMIO offers both read and write operations. MMIO writes are used for one-way communication, causing the device to change its state, as shown in the diagram on the right. The MMIO write operation transmits the data and a portion of the address to the device, and the device uses both quantities to determine how it should change its state.

Quick quiz 1: Why can't the device make use of the full address?
Answer
The size of the MMIO write is also significant; in fact, the device might react completely differently to a single-byte MMIO write than to (say) a four-byte MMIO write. The size of the access can therefore be thought of as additional bits feeding into the device's state-change logic.

MMIO reads are used for two-way communication, causing the device to return a value based on its current state. [MMIO read] The MMIO read operation transmits a portion of the address to the device, which the device can use to determine how to query its state in order to compute the return value. Interestingly enough, the device can also change its state based on the read, and many devices do exactly that. For example, the MMIO read operation that reads a character from a serial input device would be expected to also remove that character from the device's internal queue, so that the next MMIO read would read the next input character. As with writes, the size of the MMIO read is significant.

An MMIO read operation signals its completion by returning the value from the device. In contrast, the only way to determine when an MMIO write operation has completed is to do an MMIO read to poll for completion. Such polling is completely and utterly device-dependent, sometimes even requiring time delays between the initial MMIO write and the first subsequent MMIO read.

Given that both MMIO reads and writes can change device state, ordering is extremely important and, as will be discussed below, many of the Linux kernel's MMIO access functions provide strong ordering, both with each other and with locking primitives and value-returning atomic operations. However, for devices such as frame buffers, it is not helpful to provide strict ordering, as the order of writes to independent pixels is irrelevant. In fact, write combining (WC) is an important frame-buffer performance optimization, and this optimization explicitly ignores the order of non-overlapping writes. This means that the hardware and the Linux kernel need some way of specifying which MMIO locations can and cannot tolerate reordering.

The x86 family responded to this need with memory type range registers (MTRRs), which were used to set WC cache attributes for VGA memory. MTRRs have been used to enable different caching policies for memory regions on different PCI devices. When an x86 system boots, the default cache attributes for physical memory are set by setting the model-specific register that sets the default memory type (MSR_MTRRdefType), and then MTRRs are used to modify memory in other ranges to other cache attributes, for example, uncached (UC) or WC for MMIO regions. Some BIOSes set MSR_MTRRdefType to writeback, which is a common default for DRAM. Other BIOSes might set MSR_MTRRdefType to UC, and then use MTRRs to set DRAM to writeback. One of the biggest issues with MTRRs is the limited number of them. In addition, using MTRRs on x86 requires the use of the heavyweight stop_machine() call whenever the MTRR configuration changes.

Quick quiz 2: Can you set up UC access to normal (non-MMIO) RAM?
Answer

The page attribute table (PAT) relies on paging to lift this limitation. Unfortunately, the BIOS runs in real mode, in which paging is not available, which means that the BIOS must continue to use MTRRs so x86 systems will continue to have them. However, Linux kernels can use paging, and can therefore use PAT when running on hardware providing it.

In short, MMIO reads and writes can be thought of as a message-passing communication mechanism for interacting with devices; they can be uncached for traditional device access or write combining for access to things like frame buffers and InfiniBand, and they require special attention in order to interact properly with synchronization primitives such as locks. The Linux kernel therefore provides architecture-specific primitives that implement MMIO accesses, as described in the next section.

MMIO access primitives

The readX() function does MMIO reads, with the X specifying the size of the read, so that readb() reads one byte, readw() reads two bytes, readl() reads four bytes, and, on some 64-bit systems, readq() reads eight bytes. These functions are all little-endian, but in some cases, big-endian behavior can be specified using an additional "_be" component to the X suffix.

The writeX() function does MMIO writes, with the X specifying write size as above, and again in some cases with an additional "_be" component to the X suffix.

There are also inX() and outX() functions that map back to the x86 in and out instructions, respectively. The X suffix contains the size in bits and a "be" or "le" component to specify endianness. These are sometimes mapped to MMIO on non-x86 systems. The ioreadX() and iowriteX() functions, where X is the number of bits to operate on, can also be used to read and write MMIO; they were added to hide the differences between in/out operations and MMIO.

Linus Torvalds created the following example on a Kernel Summit whiteboard to illustrate ordering requirements:

     1 unsigned long global = 0;
     2
     3 void locked_device_output(void)
     4 {
     5   spin_lock(&a);
     6   i = global++;
     7   writel(i, dev_slave_address);
     8   spin_unlock(&a);
     9 }

Line 5 acquires lock a, line 6 increments a global variable global under that lock, line 7 writes the previous value of global to an MMIO location at dev_slave_address, and finally line 8 releases the lock. In an ideal world, both lines 6 and 7 would be protected by the lock when locked_device_output() is invoked concurrently.

Of course, the normal variable global is protected by the lock, that being what locks are for. However, for the MMIO write to dev_slave_address, such protection requires that the implementations of spin_lock(), spin_unlock(), and writel() cooperate so as to provide the ordering required to force the writel() to dev_slave_address of the value 0 from the first locked_device_output() call to happen before the second call writes the value 1. This is what x86 does and what most developers would expect to happen. Weakly ordered systems must therefore insert whatever memory barriers are required to enforce this ordering.

Quick quiz 3: What do weakly ordered systems do to enforce this ordering?
Answer

Providing the required ordering can be expensive on weakly ordered systems. Because there are a number of situations where ordering is not required (for example, frame buffers), the Linux kernel provides relaxed variants (readX_relaxed() corresponding to readX() and writeX_relaxed() corresponding to writeX()) that do not guarantee strong ordering, which can be used as follows:

     1 unsigned long global = 0;
     2
     3 void locked_device_output(void)
     4 {
     5   spin_lock(&a);
     6   i = global++;
     7   writel_relaxed(i, dev_slave_address);
     8   spin_unlock(&a);
     9 }

Because this example uses writel_relaxed() instead of writel(), the writel_relaxed() can be reordered with the spin_unlock(), so that the write of the value 1 might well precede the write of the value 0. An MMIO write memory barrier, called mmiowb(), may be used to prevent MMIO writes from being reordered with each other or with locking primitives and value-returning atomic operations. This mmiowb() primitive can be used as shown below:

     1 unsigned long global = 0;
     2
     3 void locked_device_output(void)
     4 {
     5   spin_lock(&a);
     6   i = global++;
     7   writel_relaxed(i, dev_slave_address);
     8   mmiowb();
     9   spin_unlock(&a);
    10 }

Again, without the mmiowb(), the writel_relaxed() call might be reordered with its counterpart from a later instance of this critical section that was running on a different CPU.

Quick quiz 4: Why can't mmiowb() be used with writel()?
Answer

A more useful version of this example might do several writel_relaxed() invocations in the critical section followed by a final mmiowb(). It is worth noting that mmiowb() is a no-op on most architectures.
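This batching pattern can be sketched as follows. This is a user-space illustration, not kernel code: writel_relaxed(), mmiowb(), and spin_unlock() are replaced by hypothetical stubs that merely record the order of operations, so the shape of the pattern can be seen in isolation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel primitives; they only record
 * the order in which they are invoked. */
enum op { OP_WRITE_RELAXED, OP_MMIOWB, OP_UNLOCK };

static enum op oplog[8];
static size_t oplen;

static void fake_writel_relaxed(unsigned long v, unsigned long addr)
{
	(void)v;
	(void)addr;
	oplog[oplen++] = OP_WRITE_RELAXED;
}

static void fake_mmiowb(void)
{
	oplog[oplen++] = OP_MMIOWB;
}

static void fake_spin_unlock(void)
{
	oplog[oplen++] = OP_UNLOCK;
}

/* Several relaxed MMIO writes followed by a single mmiowb(): the
 * barrier cost is paid once, yet all of the writes are confined to
 * the critical section that the final unlock ends. */
static void locked_device_output_batch(unsigned long addr)
{
	fake_writel_relaxed(0, addr);
	fake_writel_relaxed(1, addr);
	fake_writel_relaxed(2, addr);
	fake_mmiowb();
	fake_spin_unlock();
}
```

In a real driver, only the single mmiowb() pays the barrier cost, rather than each writel() paying it individually.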

However, _relaxed() accesses from a given CPU to a specific device are guaranteed to be ordered with respect to each other. Tighter semantics can of course be used: per-bus or even global, for example.

Nevertheless, the _relaxed() functions are not primitives that most device driver developers normally consider using and, even if they did, there are still some kernel calls, such as locking calls, that might nullify the relaxed effects. It is unclear whether these implications have been thought through across the entire kernel. For instance, it is now understood that PowerPC's default kernel writel() uses a memory barrier; although one would typically expect write combining to happen in user space for frame buffers, kernel writes could nullify the write-combining effects.

Asking more developers to use the relaxed primitives where write combining matters might be the first instinct, but there are other possible issues that still need to be considered. For instance, would using a spin_lock() nullify any write-combining effects on some architectures even if relaxed primitives were used? If so, which architectures would be affected? Are we nullifying write combining in some areas of the kernel, even on x86, if locks are used? To answer these questions, we must review each architecture's MMIO and locking primitives and their implications for ordering.

Memory types

As noted earlier, this article covers the two most common flavors of MMIO, uncached MMIO and write-combining MMIO. In uncached MMIO, each read and write is independent and in some sense atomic, with no combining, prefetching, or caching of any kind. The ioremap_nocache() function is used to map uncached MMIO registers. In write-combining MMIO, both reads and writes can be both coalesced and reordered, even the non-_relaxed() reads and writes. Memory that is write combining is also normally "prefetchable", and these terms sometimes appear to be used interchangeably. The ioremap_wc() function is used to map write-combining MMIO registers.

For any given architecture, there are some questions about write combining:

  • What prevents reordering and combining? (Presumably mmiowb() and perhaps also mb().)
  • What operations flush the write buffers? (Hardware dependent, but reads from a given device typically flush prior writes to that same device.)
Five additional per-architecture questions will be addressed in tabular form:
  1. Must non-relaxed accesses to MMIO regions be confined to lock-based critical sections? (Presumably the answer is "yes".)
  2. Must relaxed accesses to MMIO regions be confined to lock-based critical sections? (Prudence would suggest "yes" as the answer, at least in the UC case.)
  3. Must reads from MMIO regions be ordered with each other? (Presumably the answer is "no" for _relaxed() primitives.)
  4. Must reads from MMIO regions be ordered with writes to other locations within the region? (Presumably the answer is "no".)
  5. Must accesses to specific locations in MMIO regions be ordered with other accesses to that same location? (Presumably the answer is "yes", even for accesses to WC MMIO regions, at least for completely overlapping updates. Otherwise you would get old pixels on your display, after all.)

It is natural to wonder what happens if a given range of MMIO registers is mapped as write combining at one virtual address and as uncached (non-write-combining) at some other address. The answer varies across both architectures and devices, so that the current Linux-kernel stance is "don't do that" unless absolutely necessary.

Regardless of what the answers are, they clearly need to be better documented.

Existing practice

There are more than 2,000 uses of writel_relaxed() and more than 1,000 uses of readl_relaxed(), so existing practice must be taken into account: changes might be made, but not lightly. Many uses of these primitives are in architecture-specific code, but there are common-code uses in some drivers. We took a look at a few of them:

  • drivers/ata/ahci_brcmstb.c: This driver uses brcm_sata_readreg() and brcm_sata_writereg() to wrap readl_relaxed() and writel_relaxed(), respectively. The code appears to expect that relaxed reads and writes from/to the same device will be ordered.
  • drivers/crypto/atmel-aes.c: This driver uses atmel_aes_read() and atmel_aes_write() to wrap readl_relaxed() and writel_relaxed(), respectively. The code appears to expect that relaxed reads and writes from/to the same device will be ordered.
  • drivers/crypto/img-hash.c: This driver uses img_hash_read() and img_hash_write() to wrap readl_relaxed() and writel_relaxed(), respectively. The code appears to expect that relaxed reads and writes from/to the same device will be ordered.
  • drivers/crypto/ux500/cryp/cryp.c appears to expect that relaxed reads and writes from/to the same device will be ordered. At present, this driver does not seem to be used outside of ARM, but crypto IP blocks are not necessarily tied to ARM.
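
The wrapper idiom these drivers share is worth making concrete. The sketch below is self-contained and hypothetical: the "device" is just an array standing in for an MMIO region, and the accessors are stubs, where a real driver would call readl_relaxed()/writel_relaxed() on an ioremap()ed address.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in MMIO region; all names here are hypothetical. */
static uint32_t fake_regs[4];

static uint32_t fake_readl_relaxed(unsigned int off)
{
	return fake_regs[off];
}

static void fake_writel_relaxed(uint32_t val, unsigned int off)
{
	fake_regs[off] = val;
}

/* Driver-local wrappers, in the style of brcm_sata_readreg() and
 * brcm_sata_writereg(): every register access funnels through one
 * place, so switching between relaxed and non-relaxed accessors
 * later is a two-line change. */
static uint32_t mydev_read(unsigned int reg)
{
	return fake_readl_relaxed(reg);
}

static void mydev_write(unsigned int reg, uint32_t val)
{
	fake_writel_relaxed(val, reg);
}
```

The funneling is the point: if the relaxed accessors' semantics were ever tightened or loosened, such drivers would need changes in only one spot.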

Although the bulk of the uses of the relaxed I/O accessors are confined to one architecture or another, it would not necessarily be wise to define CPU-family-specific changes to their semantics. Such changes are likely to cause serious problems should one of the corresponding hardware IP blocks ever be used by an implementation for some other CPU family.

x86 implementation

The x86 mapping is as follows:

API | Implementation | Ordering
mmiowb() | barrier() | Provided by x86 ordering
spin_unlock() | arch_spin_unlock() | Provided by x86 ordering
inb(), inw(), inl() | inb, inw, inl instructions | See table below
outb(), outw(), outl() | outb, outw, outl instructions | See table below
readb(), readb_relaxed(), ioread8(); readw(), readw_relaxed(), ioread16(); readl(), readl_relaxed(), ioread32(); readq(), readq_relaxed() | MMIO read | See table below
writeb(), writeb_relaxed(), iowrite8(); writew(), writew_relaxed(), iowrite16(); writel(), writel_relaxed(), iowrite32(); writeq(), writeq_relaxed() | MMIO write | See table below

The readX() and writeX() definitions are built by the build_mmio_read() and build_mmio_write() macros, respectively.

The x86 answers to the other questions appear to be as follows, based on a scan through "Intel 64 and IA-32 Architectures Software Developer's Manual V3":

  • What prevents reordering and combining? For non-write-combining MMIO regions, everything. For write-combining MMIO regions, mmiowb(), smp_mb(), an access to a non-write-combining MMIO region, an interrupt, or a locked instruction (one bearing the LOCK prefix, which makes it an atomic read-modify-write instruction).
  • What operations flush the write buffers? The same operations that prevent reordering and combining.
x86 | Within ioremap_wc() | Within ioremap_nocache() | Against normal memory
_relaxed() | Unordered, but ordered to the same location | Ordered | Ordered
non-relaxed() | Unordered, but ordered to the same location | Ordered | See [*] below

[*]: Accesses to ioremap_wc() memory are not ordered with accesses to normal memory unless:
  1. Either there is an intervening smp_mb(), or
  2. The normal-memory access uses the lock prefix, and
  3. The I/O fabric is "sane" in that it avoids reordering and buffering invisible to the CPU. I/O fabrics that have multiple layers of I/O bus are all too often not sane.

Quick quiz 5: But if x86 always uses the same instructions for MMIO, how can the ordering semantic differ for ioremap_wc() and ioremap_nocache() regions?
Answer

To reiterate that last point, note that this all assumes sane hardware. It is possible to construct x86 systems with I/O bus structures that do not follow the above rules. Drivers written for such systems typically need to "confirm" prior MMIO writes by doing a later MMIO read that either forces ordering or verifies the state changes caused by the write. The exact confirmation method will depend on the details of the I/O device in question.
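The read-back idiom mentioned here is common enough to be worth sketching. In a real driver it is simply a writel() followed by a readl() of a register on the same device; below, the accessors are hypothetical stubs over an array (with a counter standing in for the flush side effect) so that the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in MMIO region and accessors; real code would use writel()
 * and readl() on an ioremap()ed address. */
static uint32_t fake_regs[8];
static int flushes; /* counts how often a read "flushed" prior writes */

static void fake_writel(uint32_t val, unsigned int off)
{
	fake_regs[off] = val;
}

static uint32_t fake_readl(unsigned int off)
{
	flushes++; /* a read from the device pushes out buffered writes */
	return fake_regs[off];
}

#define MYDEV_CTRL 0 /* hypothetical register offset */

/* Write a control register, then read it back.  On hardware with
 * buffering invisible to the CPU, the read cannot complete until the
 * write has reached the device, so the write is "confirmed". */
static uint32_t mydev_write_confirm(uint32_t val)
{
	fake_writel(val, MYDEV_CTRL);
	return fake_readl(MYDEV_CTRL);
}
```

Whether the returned value must also be checked against the written value depends on the device; some registers read back differently than written.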

On older MTRR-only x86 systems, some frame-buffer drivers must also use arch_phys_wc_add(), because on such systems ioremap_wc() would otherwise produce an uncached non-write-combining mapping for the corresponding device. This inability of ioremap_wc() to Do The Right Thing can be due to limited numbers of MTRRs, limited MTRR size, I/O-mapping alignment constraints, page aliasing (for example, to provide both kernel- and user-mode access to MMIO registers), and because some old hardware simply cannot be shoehorned into the nice new PAT-based Linux-kernel APIs.

For more details, see commits "drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()", "drivers/video/fbdev/atyfb: Clarify ioremap() base and length used", and "drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper". Commit "drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC" is particularly instructive, as it describes how old hacks were replaced by newer less-hacky hacks based on the new API.

Those interested in page aliasing should refer to Documentation/ia64/aliasing.txt, particularly the "POTENTIAL ATTRIBUTE ALIASING CASES" section. Fortunately, most device manufacturers now dedicate one full PCI base address register (BAR) to MMIO and another for frame-buffer use, which means that developers writing drivers for modern devices can for the most part simply use the ioremap_nocache() and ioremap_wc() APIs.

One important last note: On x86 systems, spinlock-release primitives usually use a plain store instruction. This will not order accesses within ioremap_wc() regions. Although this might seem strange at first glance, it has the advantage that the effectiveness of write combining is not limited by spin_unlock() invocations.

ARM64 implementation

The arm64 mapping is as follows:

API | Implementation | Ordering
mmiowb() | do { } while (0) | Provided by ARM64 ordering
spin_unlock() | arch_spin_unlock() | llsc or lse instruction
readb(), ioread8(); readw(), ioread16(); readl(), ioread32(); readq() | MMIO read | Follow MMIO read by rmb()
writeb(), iowrite8(); writew(), iowrite16(); writel(), iowrite32(); writeq() | MMIO write | Precede MMIO write with wmb()
readb_relaxed(), readw_relaxed(), readl_relaxed(), readq_relaxed() | MMIO read | See table below
writeb_relaxed(), writew_relaxed(), writel_relaxed(), writeq_relaxed() | MMIO write | See table below

Note that although ARM does distinguish between WC and non-WC flavors of MMIO regions in terms of ordering, the type of accessor (_relaxed() vs. non-_relaxed()) also has a big role to play. ARM64's non-_relaxed() accessors have ordering properties similar to total store order (TSO); that is, they order prior reads against later reads and writes, and also order prior writes against later writes, but they do not order prior writes against later reads.

The ARM64 answers to the questions are as follows:

  • What prevents reordering and combining? A non-relaxed MMIO access (aside from not ordering prior writes against later reads) or either mb(), rmb(), or wmb().
  • What operations flush the write buffers for write-combining regions? Either mb() or wmb(). But please note that this flushing has effect only within the CPU; these memory barriers do not necessarily affect any write buffers that might reside on external I/O buses.
ARM64 | Within ioremap_wc() | Within ioremap_nocache() | Against normal memory
_relaxed() | Unordered, but fully ordered for accesses to the same address | Unordered, but fully ordered for accesses to the same device | Unordered
non-relaxed() | "TSO", but fully ordered for accesses to the same address | "TSO", but fully ordered for accesses to the same device | See below

In the above table, "TSO" allows prior writes to be reordered with later reads, but prevents any other reordering.

The lower right-hand cell's rules are as follows:

Prior Access | Next Access | Ordering
Non-relaxed read | Plain read | Ordered (useful for reading from a DMA buffer)
Non-relaxed read | Plain write | Ordered
Non-relaxed write | Plain read | Unordered
Non-relaxed write | Plain write | Unordered
Plain read | Non-relaxed read | Unordered (departure from TSO)
Plain read | Non-relaxed write | Unordered (departure from TSO)
Plain write | Non-relaxed read | Unordered
Plain write | Non-relaxed write | Ordered (useful for triggering DMA)
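
The last row of this table is the one drivers rely on to trigger DMA: fill a buffer in normal memory with plain writes, then ring the device's doorbell with a non-relaxed writel(), which is ordered after those plain writes. The sketch below models this with hypothetical stubs (an array for the DMA buffer, a flag plus a snapshot for the doorbell) so the shape of the pattern is visible in isolation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a DMA buffer in "normal memory" and a doorbell
 * register.  In a real arm64 driver the doorbell write would be a
 * non-relaxed writel(), ordered after the prior plain writes to the
 * buffer. */
static uint8_t dma_buf[64];
static int doorbell_rung;
static uint8_t seen_by_device[64];

static void fake_writel_doorbell(void)
{
	/* By the time the doorbell write is visible, the buffer
	 * contents must be too; model that by snapshotting here. */
	memcpy(seen_by_device, dma_buf, sizeof(dma_buf));
	doorbell_rung = 1;
}

static void submit_dma(const uint8_t *data, unsigned int len)
{
	memcpy(dma_buf, data, len);  /* plain writes to normal memory */
	fake_writel_doorbell();      /* non-relaxed MMIO write: ordered after them */
}
```

Had the doorbell been a writel_relaxed(), the table says this ordering guarantee would vanish, and the device could see a stale buffer.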

Just as with x86, it is possible to construct ARM systems with I/O bus structures that do not follow the above rules. Drivers written for such systems typically need to "confirm" prior MMIO writes by doing a later MMIO read that either forces ordering or verifies the writes' state changes. The exact confirmation method will depend on the details of the I/O device in question.

PowerPC implementation

Finally, the PowerPC mapping uses an ->io_sync field in the Linux kernel's PowerPC-specific per-CPU data. This field is set by PowerPC MMIO writes, and tested at unlock time. If this field is set, the unlock primitive executes a heavyweight sync instruction, which forces the last MMIO write to be contained within the critical section.
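This flag-based scheme can be sketched in a few lines. The code below is a hypothetical user-space model, not the PowerPC implementation: the heavyweight sync instruction is replaced by a counter, so the "sync at unlock only if an MMIO write occurred" behavior can be observed directly.

```c
#include <assert.h>

/* Model of PowerPC's ->io_sync scheme.  sync_count stands in for
 * executions of the expensive sync instruction. */
static int io_sync;    /* per-CPU flag in the real kernel */
static int sync_count;

static void fake_sync(void)
{
	sync_count++;
}

/* An MMIO write: sync first, do the write, then record that an MMIO
 * write has happened on this CPU. */
static void fake_out_le32(void)
{
	fake_sync();
	/* ... the MMIO write itself would go here ... */
	io_sync = 1;
}

/* Unlock executes sync only if an MMIO write occurred, keeping that
 * write inside the critical section without penalizing MMIO-free
 * critical sections. */
static void fake_spin_unlock(void)
{
	if (io_sync) {
		fake_sync();
		io_sync = 0;
	}
}
```

The payoff of the flag is visible in the model: a critical section containing no MMIO writes unlocks without paying for a sync at all.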

The mapping is as follows:

API | Implementation | Ordering
mmiowb() | sync and clear ->io_sync |
spin_unlock() | arch_spin_unlock() | If ->io_sync is set, sync and clear ->io_sync
in_8(), in_be16(), in_be32(), in_le16(), in_le32() | MMIO read | sync followed by read followed by twi;isync
out_8(), out_be16(), out_be32(), out_le16(), out_le32() | MMIO write | sync followed by write followed by set ->io_sync
readb(), inb(), ioread8(); readw(), inw(), ioread16(); readw_be(), ioread16be(); readl(), inl(), ioread32(); readl_be(), ioread32be(); readq(); readq_be() | in_8(), in_le16(), in_be16(), in_le32(), in_be32(), in_le64(), in_be64() | sync followed by read followed by twi;isync
writeb(), outb(), iowrite8(); writew(), outw(), iowrite16(); writew_be(), iowrite16be(); writel(), outl(), iowrite32(); writel_be(), iowrite32be(); writeq(); writeq_be() | out_8(), out_le16(), out_be16(), out_le32(), out_be32(), out_le64(), out_be64() | sync followed by write followed by set ->io_sync

The alert reader will note the duplication of some names in the "API" and "Implementation" columns, for example, in_8(). The definitions are in arch/powerpc/include/asm/io.h and arch/powerpc/include/asm/io-defs.h.

Other implementation strategies are possible, of course. One approach would be for mmiowb() and arch_spin_unlock() to both unconditionally execute the sync instruction and to dispense with the ->io_sync flag. Another approach would be to make mmiowb() a no-op, eliminate the test and sync instruction from arch_spin_unlock(), and replace the setting of ->io_sync with a sync instruction. However, both of these approaches would greatly increase the number of executions of the expensive sync instruction, so, for PowerPC, the implementation in the above table is preferred.

Currently (v4.3) PowerPC's _relaxed() interfaces operate exactly the same as do their non-relaxed counterparts. Part of the motivation for the MMIO discussion during the technical day at the 2015 Linux Kernel Summit was to determine how and to what extent PowerPC could actually relax the _relaxed() implementations. However, this article limits itself to documenting current reality.

The PowerPC answers to the questions appear to be as follows:

  • What prevents reordering and combining? Any MMIO access, mmiowb(), or smp_mb(). This is not a good thing, as it makes for slow frame buffers.
  • What operations flush the write buffers? The same operations that prevent reordering and combining.
PowerPC | Within ioremap_wc() | Within ioremap_nocache() | Against normal memory
_relaxed() | Fully ordered | Fully ordered | Fully ordered
non-relaxed() | Fully ordered | Fully ordered | Fully ordered

Summary and conclusions

MMIO should be thought of as a message-passing mechanism that communicates with hardware rather than a variant of normal memory. As such, MMIO is not only device-specific, but also specific to the hardware path between the CPU and the device. In the general case, which includes ill-considered hardware designs, even memory barriers cannot always order accesses: in some cases, the device's state must be polled to determine when a prior access has completed.

Nevertheless, the Linux kernel offers a rich set of primitives with which to interact with MMIO devices, and this article has given a brief overview of how they work and how they may be used.

Acknowledgments

We are grateful to Michael Ellerman, Gautham Shenoy, Peter Zijlstra, Andy Lutomirski, and Boqun Feng for their review and comments. We owe thanks to Toshimitsu Kani, Dave Airlie, Christoph Hellwig, and Matt Fleming for a number of important discussions, and to Jim Wasko for his support of this effort.

Answers to Quick quizzes

Quick quiz 1: Why can't the device make use of the full address?

Answer: Because part of the address is used to select the device.

Back to Quick quiz 1.

Quick quiz 2: Can you set up UC access to normal (non-MMIO) RAM?

Answer: You can, and this is in fact used for GPU memory on GPUs that cannot snoop the CPU caches. One example may be found in ati_create_page_map(), which uses __get_free_page() to allocate a page of DRAM and then later uses set_memory_uc() to change the cache attribute. There is also a set_memory_wc(). Although set_memory_uc() and set_memory_wc() may also be used to set up MMIO, such use is likely to be strongly discouraged. In addition, it is quite possible that the set_memory_uc() and set_memory_wc() APIs will change.

Back to Quick quiz 2.

Quick quiz 3: What do weakly ordered systems do to enforce this ordering?

Answer: They enforce this ordering by a combination of hardware and software ordering constraints. Please read on for more information, leading up to descriptions of the ARM64 and the PowerPC implementations.

Back to Quick quiz 3.

Quick quiz 4: Why can't mmiowb() be used with writel()?

Answer: Actually, they really can be used together. But there is little point in doing so because writel() already provides strong ordering. Therefore, placing an mmiowb() after a writel() has no effect other than to slow things down.

Of course, in this case it would be simpler to just use writel() instead of both writel_relaxed() and mmiowb(). However, mmiowb() is quite useful when there are multiple writel_relaxed() calls, all of which need to be contained within the critical section. A single mmiowb() placed between the last writel_relaxed() and the unlock will contain all of them, with the memory-barrier overhead incurred only once at mmiowb() time instead of once for each and every writel().

Back to Quick quiz 4.

Quick quiz 5: But if x86 always uses the same instructions for MMIO, how can the ordering semantic differ for ioremap_wc() and ioremap_nocache() regions?

Answer: Because the ordering is controlled not by the instructions, but rather by the MTRR settings (in older systems) or by PAT (in newer systems).

Back to Quick quiz 5.

Comments (4 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.8-rc3 Aug 21
Greg KH Linux 4.7.2 Aug 20
Greg KH Linux 4.4.19 Aug 20
Ben Hutchings Linux 3.16.37 Aug 23
Greg KH Linux 3.14.77 Aug 20
Ben Hutchings Linux 3.2.82 Aug 23

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Zhiyong Tao AUXADC: Mediatek auxadc driver Aug 18
Minghsiu Tsai Add MT8173 MDP Driver Aug 19
Suravee Suthikulpanit iommu/AMD: Introduce IOMMU AVIC support Aug 18
Stanimir Varbanov Venus remoteproc driver Aug 19
Benjamin Tissoires Synaptics RMI4 over SMBus Aug 18
Stanimir Varbanov Qualcomm video decoder/encoder driver Aug 22
Martin Blumenstingl meson: Meson8b and GXBB DWMAC glue driver Aug 20
Raghu Vatsavayi liquidio CN23XX support Aug 21
Thierry Reding Initial Tegra186 support Aug 19
Noralf Trønnes drm: add SimpleDRM driver Aug 22

Device driver infrastructure

Guenter Roeck Type-C Port Manager Aug 17
Rob Herring UART slave device bus Aug 17
Marek Szyprowski New feature: Framebuffer processors Aug 22

Documentation

Filesystems and block I/O

Andreas Gruenbacher Xattr inode operation removal Aug 22
Ross Zwisler re-enable DAX PMD support Aug 23

Networking

Security-related

Jens Wiklander generic TEE subsystem Aug 22

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds