Kernel development
Brief items
Kernel release status
The current development kernel is 4.6-rc1, released on March 26. "So I'm closing the merge window a day early, partly because I have some upcoming travel, but partly because this has actually been one of the bigger merge windows in a while, and if somebody was planning on trying to sneak in any last-minute features, I really don't want to hear about it any more."
Stable updates: none have been released in the last week.
SystemTap 3.0 released
Version 3.0 of the SystemTap kernel tracing system has been released. Significant changes include an interactive script-building mechanism, a "monitor" mode allowing ongoing display of accumulated statistics, much faster associative arrays, function overloading, a lot of tapset improvements, and more.

Linux at 25: Q&A With Linus Torvalds (Spectrum)
IEEE Spectrum interviews Linus Torvalds. "The kernel is actually doing very well. People continue to worry about things getting too complicated for people to understand and fix bugs. It’s certainly an understandable worry. But at the same time, we have a lot of smart people involved. The fact that the system has grown so big and complicated and so many people depend on it has forced us to have a lot of processes in place. It can be very challenging to get big and have invasive changes accepted, so I wouldn’t call it one big happy place, but I think kernel development is working."
Kernel development news
The end of the 4.6 merge window
Linus released 4.6-rc1 and closed the merge window on March 26, one day earlier than might have been expected; at that time, he cited the size of this development cycle as one of the reasons for the early release. This merge window ended with 12,172 non-merge changesets being pulled into the mainline repository, which is a significant number; in fact, it is the busiest merge window in the history of the kernel project. The only other cycles that came close are 4.2 (12,092 changesets during the merge window) and 3.15 (12,034). Given that some significant cleanup patches may still go in shortly, 4.6 has a good chance of being the busiest development cycle ever.

Just over 1,000 of those changesets were merged after last week's summary; as was suggested at the time, the pace slowed toward the end. Still, some significant changes were merged, including:
- The infrastructure for tracing histogram triggers has been merged, though the complete feature is still waiting for a bit more testing.
- The pNFS subsystem (and the NFS server in particular) now supports a "SCSI layout" mode. See pnfs-scsi-server.txt for more information.
- The out-of-memory reaper patches have been merged. This patch set makes the kernel more aggressively take memory away from processes being terminated by the out-of-memory killer, hopefully making the out-of-memory response faster and more reliable.
- After a long time and many rounds of review, the OrangeFS distributed filesystem has been merged. There are seemingly a few issues still to be worked on, but the code was deemed to be in good enough shape to go into the mainline.
- New hardware support includes:
Qualcomm IPQ4019 global clock controllers,
Qualcomm NAND flash controllers,
Texas Instruments dm814x ADPLL clocks, and
Texas Instruments Message Manager hardware mailboxes.
Additionally, last week's summary missed the addition of a number of system-on-chip platforms and boards, including Allwinner A83T, Annapurna Labs Alpine v2, Axis Artpec-6, Broadcom Vulcan, NXP i.MX 6QuadPlus, ARM Juno R2 development boards, Marvell Armada 3700, 7XXX and 8XXX, MediaTek MT7623, Amlogic S905 (Meson GXBaby), Qualcomm Snapdragon MSM8996, TI KeyStone K2G, and SocioNext UniPhier PH1.
- The slab memory allocator now has support for the KASAN debugging tool.
If the usual (for recent years) 63-day pattern holds, the final 4.6 release can be expected on May 15, though unforeseen problems or opportunities to do some good diving can always cause delays.
Advanced usage of last branch records
Last branch records (LBRs) are hardware registers on Intel CPUs that allow sampling branches. These registers hold a ring buffer of the most recent branch decisions that can be used to reconstruct the program's behavior. Last week, we examined the basics of LBRs using Linux perf. Now we look at some more advanced uses.
Transactional memory
The Transactional Synchronization Extension (TSX) is a hardware transactional memory implementation, available in Intel Broadwell or later, that can improve performance for critical regions by speculatively executing them in parallel. For tuning TSX programs, the goal is usually to reduce unnecessary transaction aborts. The perf --branch-history option can also be useful to see why TSX transactions aborted. Normally we cannot see into TSX transactions because any profiling interrupt aborts the transaction. But LBRs can log branches even inside transactions, so --branch-history can show why the abort happened.
This is particularly useful for internal aborts caused by the code inside the transaction itself, such as those triggered by transaction-aborting instructions like system calls; it allows seeing what led to the system call being executed. Conflict aborts caused by other threads may be visible if the abort happens to be near the instruction that touched the conflicting cache line. There is no guarantee of that, though, as the conflict could hit at any point in the transaction.
The LBRs also have flags that show that a particular branch was in a transaction or is an abort. This is currently not displayed by --branch-history, but can be examined manually through perf report or perf script. For more details see the perf TSX tutorial.
Branch mispredictions
In addition to the "from" and "to" addresses of branches, LBRs also provide fine-grained information on branch misprediction. Modern CPUs have a long pipeline to execute instructions, and rely on accurately predicting branch targets to keep the pipeline filled. This is done by a branch predictor. When branches are often mispredicted, performance is typically poor, as the CPU has to throw away a lot of work. Often, code that is difficult to predict can be restructured to make it easier for the branch predictor to handle (the Intel optimization manual [644-page PDF] describes some techniques for doing this).
Mispredicted branches can be sampled directly using CPU-performance-counter events, but deriving them from LBRs gives better coverage and does not require an additional performance counter. perf report can directly display the branch misprediction state, but currently only at function level.
Here is an example based on a Stack Overflow question, asking why processing a sorted array is faster than processing an unsorted array. It fills an array with random numbers and then executes a loop with conditional jumps depending on the random data. These jumps are random and often mispredicted.
    /* mispredict.c */
    #include <stdlib.h>

    #define N 200000000

    int main(void)
    {
        int *data = malloc(N * sizeof(int));
        int i, k;

        for (i = 0; i < N; i++)
            data[i] = rand() % 256;

        volatile int sum = 0;
        for (k = 0; k < 50; k++)
            for (i = 0; i < N; i++)
                if (data[i] >= 128)
                    sum += data[i];
        return 0;
    }
    % gcc -g -o mispredict mispredict.c
    % perf record -b perf stat ./mispredict

     Performance counter stats for './mispredict':

        23,293,062,786  branches                                    (100.00%)
         5,008,912,919  branch-misses    # 21.50% of all branches   (100.00%)

    % perf report --sort symbol_from,symbol_to,mispredict --stdio
    ...
    # Overhead  Source Symbol  Target Symbol  Branch Mispredicted
    # ........  .............  .............  ...................
       53.46%   [k] main       [k] main       N
       27.21%   [.] main       [.] main       N
       17.22%   [.] main       [.] main       Y
Now we know that 17% of the total branches are mispredicted in main(). We could attempt to fix them to speed up the program, such as by sorting the array first to make the if statement more predictable. Unfortunately, perf doesn't yet tell us directly on which source lines the mispredictions occur.
LBR filtering
Previously we looked at all branches using LBRs, but the CPU also supports filtering the branch types, so that not all branch types are logged. This can be useful to get more context. perf record supports branch filtering using the -j option with the following types:
| Name | Meaning |
| --- | --- |
| any_call | any function call or system call |
| any_ret | any function return or system call return |
| ind_call | any indirect branch |
| u | only when the branch target is at the user level |
| k | only when the branch target is in the kernel |
| hv | only when the target is at the hypervisor level |
| in_tx | only when the target is in a hardware transaction |
| no_tx | only when the target is not in a hardware transaction |
| abort_tx | only when the target is a hardware transaction abort |
| cond | conditional branches |

For example:

    % perf record -j any_call,any_ret ...

would record any function calls and returns. Note that not all filter types are supported by all CPUs; if a filter is missing, perf will fall back to filtering in software, which does not increase the effective size of the LBR. The hypervisor option is currently only supported on POWER CPUs.
Filtering is mostly useful when LBRs are being used for debugging; for profiling, we usually want to see all branches. There is one exception, which is described next.
LBR call graph
In many cases while profiling, we want the function call graph for each sample (perf record -g); otherwise it is not clear why a function was executed. Traditionally this was implemented by following the frame-pointer chain that the compiler sets up on the stack. Since frame pointers can be somewhat expensive on some x86 CPUs, 64-bit binaries often don't include them; GCC does not generate them by default. This results in incomplete call graphs.
Newer perf versions also support using the DWARF unwind information to get the call graph, but this is quite expensive, since the stack needs to be copied for each sample, and it also doesn't always work.
Since Haswell, the LBRs have a new mode in which the CPU logs every call and return into the LBR and treats it as a stack, effectively keeping track of the current call graph. Since version 3.19, perf has supported using this mode to collect an LBR-based call graph; in this case, it also saves and restores the LBRs on context switches.
    % perf record --call-graph lbr program
    % perf report

There are some limitations; for example, C++ exceptions or user-space threading can corrupt the call stack. But for typical usage it works quite well. This feature is especially useful with just-in-time (JIT) compilers that cannot generate frame pointers.
Timed LBR
The Skylake CPUs, beyond just extending the number of LBRs to 32 entries, also added a new "timed LBR" feature: the CPU logs how many cycles occurred between the branches logged in the LBR.

In an aggressive out-of-order CPU like Skylake, the time when something "occurred" can be a somewhat fuzzy concept: the CPU pipeline executes many instructions in parallel, and parts of the instructions in a basic block may have been executed before or after its branches. The LBR cycles are logged when the branches are issued. Still, the cycles provide a useful, rough indication of how long the code between the branches took.
This allows much more fine-grained accounting of cycles than is normally possible with sampling. Also, unlike manual instrumentation of a program with timing calls, it imposes only the sampling overhead, which can be reduced by lowering the sampling frequency.
![Cycle annotation](https://static.lwn.net/images/2016/lbr-cycles-sm.png)
Starting with version 4.2, perf uses the timed LBR information in perf report and in perf annotate to report "instructions per cycle" and the average number of cycles for specific basic blocks. The aggregated average cycles per basic block are reported by perf report in the branch view at a function level. The interactive version of perf annotate (available when browsing samples through perf report) can also show the average cycles and instructions per cycle directly with the source and assembler code (see screen shot at right):
    % perf record -b ./div
    % perf report
    <navigate to a sample>
    <press right cursor key>
    <select annotate>

This example uses the div.c program from last week's article. The first column shows the average number of cycles for a block. In this case, the generated code jumps into the middle of a block, so we have an overlapping short and long block, but we see that the long block, which includes the two divisions, takes ~25 cycles on average. The short block takes about 6 cycles. The second column is the IPC (instructions per cycle) for the block.
This allows analysis of how long it takes to execute blocks of instructions in real programs without having to write custom micro-benchmarks.
Virtualization
LBRs are a model-specific feature and are normally not available in virtual machines, since most hypervisors do not virtualize them. There is some work in KVM to support LBR virtualization (not yet merged), but other hypervisors, such as Xen and VMware, do not support it. That means that, to use LBRs, the workload currently has to run in a non-virtualized environment. LBRs work fine in containers, though.
Virtual LBR with Processor Trace
When using LBRs, it would often be useful to have more entries than the 8-32 currently available, in order to see more context. The CPU, however, cannot provide more than it implements in hardware. There is a way of using Processor Trace (PT) to generate virtual, arbitrary-sized LBRs: PT records all jumps for a particular area in memory, and the PT decoder can generate virtual perf samples from such a stream. PT has more overhead than normal LBRs, and it requires a CPU with Processor Trace support (grep intel_pt /proc/cpuinfo), such as Broadwell or Skylake.
    % start program
    # capture 1 second of execution
    % perf record -e intel_pt//u -p $(pidof program) sleep 1
    % perf report --itrace=10usl60c --branch-history

This example samples the PT stream every 10μs with a call graph, and attaches the last 60 branches as LBR entries to each sample. The result is normally analyzed using --branch-history, which allows seeing much longer paths. Note that it is often not feasible to record long program executions, as tracing may generate data faster than the disk can keep up with. Virtual LBRs with PT were added in perf version 4.5.
Debugging
Last branch records can also be used for debugging, to find out what happened before a crash. One problem is that the crash handler often "pollutes" the LBRs before their contents can be logged. The perf code is able to save the LBRs on each context switch, but there is currently no interface for a user-space debugger like GDB to access that information.

Using LBRs for system-wide debugging typically requires LBR filtering or custom kernel changes to disable and re-enable them on exceptions. They work fairly well from JTAG debuggers (such as the Intel System Debugger), because the JTAG debugger does not itself execute branches that would pollute the LBRs. Generally, Processor Trace is more versatile for debugging, because its traces provide much larger windows into program execution.
Conclusion
LBRs are a powerful mechanism that can help with performance tuning and debugging. They can be used to look into TSX transactions and to get call stacks without the need for frame pointers. On recent CPUs they allow fine-grained timing of instruction blocks, often doing away with the need to write custom micro-benchmarks to understand performance at the instruction level.
Perf already has good support for many performance tuning uses of LBRs. Some improvements, especially better support for resolving source lines and better display of hot paths, will hopefully be implemented in the future.
Blurred boundaries in the storage stack
It has been said that an important part of a maintainer's role is to say "no". Just how this "no" is said can define the style and effectiveness of a maintainer. Linus Torvalds recently displayed just how effective his style can be when saying "no" to a pair of fairly innocuous patches to add a new ioctl() command for block devices — patches in their fifth revision that had already received "Reviewed-by" tags from Christoph Hellwig.

It became clear that Torvalds only had a fairly general understanding of the underlying functionality and didn't much care about it anyway. What he cared about, as he said, was the interface. It seemed both "too specific" and too generic; "too 'future-proofing'".
These complaints led to a wide-ranging discussion that brought out a number of underlying issues, drew parallels between disparate parts of the storage stack, and resulted in a new interface proposal that gives quite a different flavor to the same basic operations.
The heart of the matter
Modern storage devices can do a lot more with stored data than simply read or write arbitrary blocks. Of the other operations the best known is doubtlessly "discard". This operation, named TRIM in the ATA protocol and UNMAP in SCSI, tells the storage device that the data in some data blocks is no longer needed. It is well-known because it is both valuable and problematic. Some SSDs work better if unused data is regularly trimmed, but trim implementations work differently on different devices, both in terms of efficiency and effectiveness. This variation means that users often need to know precise details of their hardware to achieve the best performance.
There is an operation that is the inverse of discard that is important for thin-provisioned devices. Thin provisioning allows a storage array to appear to be extremely large, while only having physical capacity for a much smaller amount of storage. As data is written, the available storage is allocated to the target addresses. As the free space shrinks, the device administrator is alerted and action can be taken, which could include acquiring extra physical capacity.
A particularly useful operation when using a thin-provisioned device is to request that storage space be allocated before actually writing data to it. This makes it possible to report allocation problems earlier and to avoid unpleasant surprises. The SCSI spec refers to these unwritten allocations as ANCHORED blocks, and supports anchoring with the WRITE SAME SCSI command, which writes a particular block of data (often zeros) to multiple locations over a given range of addresses.
The Linux block layer has an interface, blkdev_issue_zeroout(), that combines both the de-allocation of discard and the pre-allocation of WRITE SAME with the more generic goal of zeroing out a range of blocks on a device. Depending on the capabilities of the device and on the "discard" flag that is passed to the function as a hint, it will issue a discard request (i.e. TRIM or UNMAP), a WRITE SAME request, or write a zeroed page of memory to every block in the range. Future reads are guaranteed to return zeros, but pre-allocation or de-allocation happens on a best-effort basis.
The "discard" hint flag and the possible issuing of a discard request is a relatively recent addition and is, importantly, different from the similar blkdev_issue_discard() interface. The latter will issue a discard even if the result might be that subsequent reads return random data. blkdev_issue_zeroout() will only issue a discard if future reads will reliably return zeros.
Simple patches for a simple problem
The pair of patches that Darrick Wong posted does two things. Primarily, it adds a new ioctl() command so that the "discard" flag can be set from user space; the existing BLKZEROOUT ioctl() calls blkdev_issue_zeroout() but always sets the "discard" flag to zero. Hoping not to have to create yet another command if more functionality is ever added to blkdev_issue_zeroout(), Wong defined the new BLKZEROOUT2 with room for expansion: 32 flags, of which only one was used, and even some "padding" fields that must be zero now but could be defined later.
The other effect of these patches is to purge parts of the page cache for the block device when blocks are zeroed. Normal reads and writes on a block device (e.g. /dev/sda) are cached in the page cache. An O_DIRECT write is instead sent directly to the device, which could make it inconsistent with the page cache. To avoid such inconsistency, the corresponding pages of the page cache are removed when an O_DIRECT write happens. BLKZEROOUT is much like an O_DIRECT write, so, with the patches applied, both it and BLKZEROOUT2 will purge the page cache.
Torvalds's response seems to be based on an intuitive "it doesn't feel right" rather than clear logical reasoning. One flaw he identified was not actually present in the code; it boiled down to "I absolutely detest code that tries to be overly forward-thinking", which is a little surprising given the problems there have been with system calls not having a suitable flags argument. Most of the rest is summed up by his comment: "So the whole patch looks pointless." He did approve of purging the page cache, though.
As the discussion progressed and requirements were more explicitly stated, the source of Torvalds's discomfort became clearer. The operations of interest deserved to be thought about at a much higher level than just ioctl() commands for a block device. They are much more like operations on a file — to allocate and de-allocate backing store.
The Linux fallocate() system call has a flag, FALLOC_FL_PUNCH_HOLE, which is a lot like TRIM, particularly the style of TRIM that causes future reads to return zeros. fallocate() also has a FALLOC_FL_ZERO_RANGE flag, which is a good match for WRITE SAME or writing zeros. Rather than providing an ioctl() command focused on matching the low-level functionality of particular hardware, using fallocate() would reuse an existing high-level interface that is described in terms of the needs of applications. Existing fallocate() implementations already purge the page cache as appropriate. Had this approach been used instead of the initial BLKZEROOUT ioctl() command, those implementations would likely have served as a guide, and we would not have the current situation where zeros can be written without any purge.
Wong provided a new patch set that added fallocate() support for block devices; this received much warmer support from Torvalds. He found a few little nits, but admitted that "on the whole, I like it".
This was a fitting close to a maintainership interaction done really well: Torvalds followed his intuition and complained about things that bothered him, despite not having a full picture of the problem space. Wong responded directly, called Torvalds out where he was clearly wrong, and attempted to justify other choices with extra details. A more complete picture was formed, against which preferences could be explained more coherently. Finally a resolution was found, implemented, and approved — apparently to everyone's satisfaction. This is a model worth following.
An enlightening tangent
While the conclusion to the main thread of discussion was that treating block devices a bit more like files could make it easier to work with new hardware, there was a sub-thread that seemed to head in a complementary direction.
There appear to be a number of user-space file servers — Ceph was given as an example — that use a local filesystem to store data, but aren't really interested in many of the traditional semantics of a filesystem. A good example of this is the O_NOMTIME flag that was discussed last year. These file servers really just want space to store data and want reads and writes to that space to be passed down to the device with minimal friction from the filesystem.
In much the same way as described earlier for thin provisioning, these file servers need to be able to allocate space and write to it later. While they wouldn't object to that space being filled with zeros, they really don't care about the contents of the space, but they do care about the allocation and subsequent writes being fast.
Filesystems do support pre-allocating space with fallocate(), but they typically do so by recording which blocks have been written and which have only been anchored. This means that each subsequent write needs to spend time updating metadata: extra work that brings no value to the file server.
At the beginning of the sub-thread, Ted Ts'o mentioned in passing that he had out-of-tree patches that provide a flag, FALLOC_FL_NO_HIDE_STALE, that would do exactly what the file servers want: allocate space so that future writes happen with no further metadata updates. In general, this can be a security issue since reading data from those ranges could return potentially sensitive data belonging to some other user.
Ts'o's patches restrict this operation to a single privileged group ID. There were suggestions that a mount option should be used instead of, or maybe as well as, a special group ID. There were also observations that using the flag in containers could lead to unexpected information leaks. Possibly the most vocal critic was Dave Chinner, who was blunt: "it is dangerous and compromises system security. As such, it does not belong in upstream kernels."
An example he gave of possible information leaks was automated backups: while the application that pre-allocated the space may be trusted never to look at the stale data, once that data leaks out into backups it is considerably more exposed.
Torvalds wasn't convinced by Chinner's fears; his only requirement is that it isn't too easy to do something dangerous. He has always been in favor of providing functionality if people are actually going to use it, so the fact that Ts'o has this out-of-tree patch that is widely used within Google does carry weight. It was also noted that the presence of these performance issues has already caused Ceph developers to give up on using a local filesystem and to instead start using block devices directly, so the issues are clearly real. If performance benefits can be clearly demonstrated and application developers affirm that they would use the functionality, then remaining barriers are unlikely to stand for long.
If we step back for a moment to grasp the big picture, what we see here is the cluster filesystem using a local filesystem a lot like a logical volume manager. It wants storage space of arbitrary size with the ability to expand later. It doesn't care about any metadata except the size, and doesn't care about the initial contents, which in practice could be stale data. This sounds exactly like the logical volumes that LVM2 can provide, though by being embedded in a filesystem they are much easier to manage than LVM2 volumes. In a mirror image of the decision to treat block devices more like files so as to meet the needs of low-level hardware, it seems that we might want to treat files more like block devices so as to meet the needs of high-level filesystems.
As Chinner himself noted, there are synergies here with the "splitting filesystems in two" idea that he floated at the Linux Storage, Filesystem, and Memory Management Summit in 2014. While nothing appears to have come of that yet, it is valuable food for thought and something may yet arise as needs and options become clearer. The distinction that Chinner made between "names" and "storage" certainly seems stronger than the distinction between "files" and "block devices", which is showing its weakness. If the old lines are going to blur, it might be useful to have new lines to focus our thoughts on a clearer overall picture. That way, we might not need to depend so much on the intuition of experienced maintainers.
Page editor: Jonathan Corbet