LWN.net Weekly Edition for November 2, 2017
Welcome to the LWN.net Weekly Edition for November 2, 2017
This edition contains the following feature content:
- GStreamer: state of the union: A talk by Tim-Philipp Müller at the GStreamer Conference.
- Using eBPF and XDP in Suricata: At Kernel Recipes, Eric Leblond described how the Suricata IDS/IPS uses eBPF and XDP.
- The state of the realtime union: Thomas Gleixner updated attendees of the Realtime Summit on the state of the project.
- The 2017 Kernel and Maintainers Summits: This year, the Kernel Summit has a new format with open Kernel Summit talks followed by a half-day invite-only Maintainers Summit.
- Another attempt to address the tracepoint ABI problem: What can be done to reduce maintainers' concerns about tracepoints as a kernel ABI.
- Restartable sequences and ops vectors: Restartable sequences, which provide a mechanism for lockless concurrency control in user space, have been percolating in the kernel community for some time; a new version may be close to finally getting merged.
- Kernel regression tracking, part 1: Tracking of kernel regressions has been picked up by Thorsten Leemhuis over the last year, but he reported that it is rather harder to do than he expected.
- Improving printk(): The venerable printk() utility function is suffering from scalability problems that Steven Rostedt (and others) would like to fix.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
GStreamer: state of the union
The annual GStreamer Conference took place October 21-22 in Prague, (unofficially) co-located with the Embedded Linux Conference Europe. The GStreamer project provides a library for connecting media elements such as sources, encoders and decoders, filters, streaming endpoints, and output sinks of all sorts into a fully customizable pipeline. Its features include cross-platform support, a large set of plugins, support for modern streaming and codec formats, and hardware acceleration. Kicking off this year's conference was Tim-Philipp Müller with his report on the last 12 months of development and what we can look forward to next.
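For readers less familiar with the project, a minimal, self-contained example (not from the talk; the choice of elements is purely illustrative) shows the basic model: a textual pipeline description is parsed into connected elements, set to the playing state, and run until end-of-stream or an error.
    /* Minimal GStreamer example: build a pipeline from a textual description
     * and run it until end-of-stream or error.  Build with:
     *   gcc gst-demo.c -o gst-demo $(pkg-config --cflags --libs gstreamer-1.0)
     */
    #include <gst/gst.h>

    int main(int argc, char *argv[])
    {
        GstElement *pipeline;
        GstBus *bus;
        GstMessage *msg;

        gst_init(&argc, &argv);

        /* A test source, a converter, and an automatically chosen video sink. */
        pipeline = gst_parse_launch(
            "videotestsrc num-buffers=300 ! videoconvert ! autovideosink", NULL);
        gst_element_set_state(pipeline, GST_STATE_PLAYING);

        /* Block until the pipeline reports end-of-stream or an error. */
        bus = gst_element_get_bus(pipeline);
        msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                                         GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
        if (msg)
            gst_message_unref(msg);

        gst_object_unref(bus);
        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(pipeline);
        return 0;
    }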
The core team has been sticking to a more or less six-month release schedule, adjusted somewhat for other timelines. The project is aiming to land the next 1.14 release far enough ahead of the Ubuntu 18.04 long-term support release that, for the next few years, there will be a relatively recent version for developers to base their work on.
Project components
There is a system of categorization (worth a read for the amusing descriptions) for plugins based on the film The Good, the Bad and the Ugly. These plugins comprise most of the useful functionality of GStreamer. "gst-plugins-good" is made up of plugins with solid documentation, tests, and well-written code that should be used as examples for writing new plugins. "gst-plugins-ugly" is similar in quality to "good" but may pose distribution issues because of patents or licenses. "bad" is for all the rest: plugins of varying quality that may not be well documented and are not recommended for production use or as a basis for new plugins.
There is an ongoing mission in the GStreamer project to consolidate the platform by trying to promote plugins and pieces of code from the "bad" repository into "good" by fixing whatever may be at issue. There is now an effort to clearly document why a plugin remains in the "bad" category so that contributors know what needs to be fixed and maintainers can remember why a plugin was considered unfit and re-assess it at a future date.
Patents on the MP3 and AC-3 audio codecs have expired. The mpg123 decoder and the lame MP3 encoder have been moved to "good", though the GPL liba52-based a52dec decoder for AC-3 must remain in "ugly". GStreamer itself is released under the LGPL, so GPL plugins that could pose problems for distributors wind up in "ugly".
Performance and multiprocess improvements
Ongoing efforts to improve parallelism have paid off; the operations of video scaling and conversion are now multithreaded, and an upcoming ipcpipeline can be used to split pipelines across multiple processes. Multiprocessing can be used to isolate potentially dangerous demuxers, parsers, and decoders. This is a concern for anyone parsing user input in the form of media files, which is a longstanding source of application vulnerabilities.
There have been fruitful efforts to use an abstracted zero-copy mechanism in DMABuf and other operations involving passing buffers between sources, elements, and sinks. Memory allocation queries for the tee pipeline element are now aggregated for zero-copy.
High-speed playback in the DASH HTTP adaptive streaming format, which is used by Chromecast among other things, is being enhanced. Playing a media file at faster than the normal rate, such as listening to a podcast at 2x speed, normally consumes more bandwidth than regular playback. New support has landed to reduce those bandwidth requirements for DASH.
Hardware acceleration support continues to improve in the form of improvements to the integration with the video acceleration API (VA API). Encoders now have ranks so that hardware-accelerated ones can be preferred. Support for libva 2.0 (VA API 1.0) has been enhanced. Static analysis issues found by the Coverity tool were all fixed. There is a new low-latency mode for H.264 decoding. Constant bit rate for VP9, variable bit rate for H.265, and a "quality" parameter for encoding are all now supported.
Other new features
![Tim-Philipp Müller](https://static.lwn.net/images/2017/gst-muller-sm.jpg)
There is now comprehensive support in the upstream gst-plugins-bad for using Intel's Media SDK, which is offered for recent Intel chip platforms such as Apollo Lake, Haswell, Broadwell, and Skylake. This enables hardware acceleration for encoding and decoding common video formats, video post-processing, and rendering. The goal is to make it easy for developers to "use MSDK in their GStreamer-based applications with minimal knowledge of the MSDK API".
Work to allow x264enc (using the GPL-licensed libx264 encoder) to be used with multiple bit depths was described as "very hard" because the bit depth must be specified at compile time for the library. Now multiple versions of the library can be compiled and then selected at run time.
Timed Text Markup Language support has been added. This is part of the SMPTE closed-captioning standard for online video, and has potential to be a general intermediary representation for text subtitles and captions.
rtpbin has been enhanced to accept and demultiplex bundled RTP streams for the purpose of constructing a WebRTC pipeline. This greatly simplifies the process of doing W3C-standards-based live video streaming and conferencing in a web browser using GStreamer; bundled streams will soon be mandated by WebRTC. More on new WebRTC features below.
A rewrite of splitmuxsink has made it more deterministic about splitting output streams based on time or size. Typical uses of this element involve segmenting recordings of things such as surveillance video inside older container formats (e.g. classic MP4) that cannot be split at arbitrary locations without properly finalizing the file by writing out headers and tables and beginning a new file.
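As a rough illustration (dropped into a program like the earlier example; the surrounding elements are arbitrary, while the location and max-size-time properties come from the element's documentation), a segmented recording pipeline might be set up as follows.
    /* Sketch: record test video as H.264 in MP4 segments of roughly ten
     * seconds each, with every segment properly finalized.  The pipeline
     * string is illustrative and error handling is omitted. */
    GstElement *rec = gst_parse_launch(
        "videotestsrc is-live=true ! x264enc ! h264parse ! "
        "splitmuxsink location=segment%05d.mp4 max-size-time=10000000000",
        NULL);
    gst_element_set_state(rec, GST_STATE_PLAYING);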
In the upcoming release, the hlssink2 element will take elementary streams as input and output chunks for HTTP Live Streaming (HLS), making use of the splitmuxsink element.
In addition, GstShark, which enables rich pipeline tracing and graphing, was demonstrated at the conference. It is particularly useful for pinpointing causes of poor frame rates or high latency.
A casual mention was made of the fact that GStreamer now has the first implementation of the Real Time Streaming Protocol (RTSP) version 2.0, both for client and server. RTSP is in wide use for controlling live media streams from devices such as IP cameras.
There is interest in using the systems programming language Rust for GStreamer development to improve memory and thread safety and to use modern language features. In another talk, Sebastian Dröge described the current state of the GStreamer Rust bindings. They have been in development for quite some time and many people are actively developing with them. The bindings provide a mostly complete set of functionality for both application and plugin development, and no longer require the use of "unsafe" sections by users of the bindings. They are mostly auto-generated now via GObject introspection and have a native Rust feel while retaining the same API usage patterns familiar to anyone used to working with GStreamer.
Future work
Debugging improvements slated for the coming 1.14 release include debug-log storage in a ring buffer, which is useful for keeping recent logs for long-running tasks or in disk-constrained environments, Müller said. A "new more reliable" leak tracer that is more thread-safe and supports snapshots and dumping a list of live objects is also planned. The leak tracer is currently very Unix-specific because it relies on Unix signals, so work is needed to come up with a suitable mechanism on Windows as well.
Plans of varying concreteness were mentioned for future work and improvements:
- Adaptive streaming with multiple bit rates could be improved for DASH and HLS.
- The internal stream-handling API could be implemented in more demuxers, along with better handling of stream deactivation.
- More native sink elements for output on Windows, iOS, and Android are needed along with native UI toolkits.
- Windows code needs to be updated to use newer APIs and legacy support for Windows XP should be dropped.
- Android, iOS, and macOS "need some love" to catch up with the latest versions.
- Adding support for the ONVIF surveillance camera interoperability standard, including for things like audio back channel and special modes.
- Better OpenCV integration for the popular computer vision library.
- Support for virtual reality formats.
The conference kickoff concluded with an open question about how the project can better interact with users and contributors. The existing workflow of attaching patch files to Bugzilla entries may feel cumbersome to some contributors (such as the author) compared to modern pull-request workflows. There is a desire to move to an open-source solution such as GitLab, which would provide pull requests and help track sets of changes that may span multiple repositories. GitHub was explicitly mentioned as a platform that the project will not be moving to because of its proprietary nature, which is at odds with the project's free-software values. While there is already a mailing list and an excellent IRC channel, a "proper" forum may be coming soon as well for people to have discussions and post multimedia.
GStreamer and WebRTC
![Matthew Waters](https://static.lwn.net/images/2017/gst-waters-sm.jpg)
A hot topic right now, both at the recent IETF meeting and at the GStreamer Conference, was WebRTC support; Müller mentioned that he was getting asked 30 times a day "how do I stream to my web browser?" WebRTC is a draft standard being worked on by the W3C and IETF to enable live video and audio streaming in a web browser, something that until recently was only achievable in practice with Flash, or in limited server-side form with HLS, for cross-platform and cross-browser compatibility. WebRTC makes peer-to-peer videoconferencing in a web browser possible, although it has other use cases, such as simplifying the streaming of live video to or from a web browser and even telephony, as in several existing WebRTC-to-SIP gateways.
Development has been active and ongoing for rich WebRTC support in GStreamer. Matthew Waters came to Prague all the way from Australia to talk about building a selective-forwarding server that can support multi-party conferencing with GStreamer. WebRTC is a peer-to-peer protocol, so a multi-party conference without a server in the middle handling media streams becomes prohibitively expensive as the number of users grows, because each and every peer in the call must stream to every other peer in a mesh-style network. Another possible design for multi-party WebRTC is a central mixing server, also called an MCU (Multipoint Control Unit), which moves most of the cost to the provider instead. A middle ground where the server only forwards the media streams to the other peers (a Selective Forwarding Unit) is a good compromise for sharing the computational and bandwidth costs between the user and the provider.
Waters was able to achieve this by creating a new element, webrtcbin, that provides the necessary network transport pieces for a WebRTC session, such as DTLS and SRTP (encrypted datagram formats for streaming media) and trickle ICE for network traversal and shorter call setup times, along with the trusty rtpbin element for RTP, all wrapped up with an API similar to the W3C JavaScript PeerConnection API.
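A sketch of what using the element might look like, assuming a GStreamer build that includes the in-development webrtcbin; the callbacks are empty stubs, since a real application must implement SDP and ICE signaling over its own channel.
    /* Sketch: send a test video stream through webrtcbin.  The signal names
     * follow the element's API; the callbacks below are stubs where a real
     * application would create an SDP offer and exchange ICE candidates. */
    #include <gst/gst.h>

    static void on_negotiation_needed(GstElement *webrtc, gpointer user_data)
    {
        /* Create an offer via the "create-offer" action signal here. */
    }

    static void on_ice_candidate(GstElement *webrtc, guint mlineindex,
                                 gchar *candidate, gpointer user_data)
    {
        /* Send the candidate to the remote peer over signaling here. */
    }

    int main(int argc, char *argv[])
    {
        gst_init(&argc, &argv);

        GstElement *pipe = gst_parse_launch(
            "webrtcbin name=webrtc videotestsrc ! videoconvert ! vp8enc ! "
            "rtpvp8pay ! application/x-rtp,media=video,encoding-name=VP8,payload=96 "
            "! webrtc.", NULL);
        GstElement *webrtc = gst_bin_get_by_name(GST_BIN(pipe), "webrtc");

        g_signal_connect(webrtc, "on-negotiation-needed",
                         G_CALLBACK(on_negotiation_needed), NULL);
        g_signal_connect(webrtc, "on-ice-candidate",
                         G_CALLBACK(on_ice_candidate), NULL);

        gst_element_set_state(pipe, GST_STATE_PLAYING);
        g_main_loop_run(g_main_loop_new(NULL, FALSE));
        return 0;
    }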
While it can be complicated to write a server to handle the many moving parts required to do WebRTC well, GStreamer makes it eminently practical to construct fully customized client and server applications with this relatively new protocol. As is frequently the case, the GStreamer project is not only at the forefront of emerging media technologies; its talented and dedicated community is also quick to showcase examples and demonstrations of how to use new features in non-trivial applications, and to make it all look easy.
[Videos of this year's talks can be found online.]
Using eBPF and XDP in Suricata
Much software that uses the Linux kernel does so at a comparative arm's length: when it needs the kernel, perhaps for a read or write, it performs a system call, then (at least from its point of view) continues operation later with whatever the kernel chooses to give it in reply. Some software, however, gets intimately involved with the kernel as part of its normal operation, for example by using eBPF for low-level packet processing. Suricata is such a program; Eric Leblond spoke about it at Kernel Recipes 2017 in a talk entitled "eBPF and XDP seen from the eyes of a meerkat".
Suricata is a network Intrusion Detection System (IDS), released under the GPLv2 (Suricata is also the genus of the meerkat, hence the title of the talk). An IDS is a system designed to sit parallel to a router, examining all the traffic that the router is being asked to pass, to decide if any of it resembles the patterns of any known malware and to alert if so. This means it has to do something considerably more memory- and CPU-intensive than the router is doing, so efficient performance is crucial. Making a decision about the danger represented by a packet involves extracting the TCP segments from those packets, reassembling the data stream (or flow) from the TCP segments, and considering that stream from a protocol-aware standpoint; that work is performed by what Suricata refers to as its worker threads.
![Eric Leblond](https://static.lwn.net/images/2017/kr-leblond-sm.jpg)
Suricata makes decisions about the traffic at all layers of this analysis, and is currently capable of doing this at 10Gbps, or sometimes even faster. It is also capable of operating as an Intrusion Prevention System, or IPS, where it is actually routing the network traffic (or choosing not to route it, depending on the result of the analysis). But Leblond made it clear that it works best when analyzing traffic, then reporting to a greater infrastructure that, in turn, advises the actual router on when to stop routing packets from a given stream.
Suricata starts by capturing network packets using an AF_PACKET socket, which is the method introduced in the 2.2 kernel for getting raw packets off the wire. It is used in fanout mode, which allows each successive packet from a single interface to be sent to one of a set of sockets. The policy for choosing which socket to select is configured when the mode is engaged.
By running multiple worker threads, each on its own CPU and each listening to one of those sockets, Suricata is able to parallelize the work of packet processing, flow reconstruction, and consequent analysis. This parallelization is crucial to Suricata's ability to process data at high speed, but it only works if packets from a given stream are always fed to the same worker thread. Fanout's default mode guarantees this by using a hash of the packet's network parameters to choose which socket to feed with any given packet. This hash is supposed to be constant for all packets in any given flow, but should generally be different for packets in different data flows.
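For reference, the kernel interface involved looks roughly like this; the sketch below is a generic illustration of PACKET_FANOUT with the default hash policy, not Suricata's actual capture code, and error handling is minimal.
    /* Sketch: create one AF_PACKET socket per worker and join them into a
     * single fanout group that load-balances on a hash of the flow. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <net/if.h>
    #include <arpa/inet.h>

    static int open_fanout_socket(const char *ifname, int group_id)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0)
            return -1;

        /* Bind the socket to the capture interface. */
        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof(sll));
        sll.sll_family = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex = if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
            return -1;

        /* Join the fanout group; PACKET_FANOUT_HASH spreads packets across
         * the group's sockets based on a hash of the flow parameters. */
        int fanout = group_id | (PACKET_FANOUT_HASH << 16);
        if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout, sizeof(fanout)) < 0)
            return -1;
        return fd;
    }

    int main(void)
    {
        int i, fds[4];

        /* One socket per worker thread; all join fanout group 42 on eth0. */
        for (i = 0; i < 4; i++) {
            fds[i] = open_fanout_socket("eth0", 42);
            if (fds[i] < 0) {
                perror("fanout socket");
                return 1;
            }
        }
        /* Workers would now each read packets from their own socket. */
        return 0;
    }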
Unfortunately, unrelated changes in kernel 4.4 broke the symmetry of the hash function used by the kernel to make this decision. After that change, packets with source address S and destination address D all returned one hash, H1, but those with source D and destination S returned a different hash, H2. The effect of this was that traffic from client to server tended to end up in a different worker thread than responses from server to client. If the second worker thread was less heavily loaded than the first, responses could end up being considered by Suricata before it had even seen the traffic that prompted them. The effect of this, said Leblond, was interesting: the count of processed packets remained high, but the count of detected attacks fell off dramatically. So, he said, "you are in a total mess, and it's not working", noting with Gallic understatement that "users did start to complain". David S. Miller fixed this in 4.7, with the fix also appearing in 4.4.16 and 4.6.5.
A similar complication came from Receive-Side Scaling (RSS), where network interface cards (NICs) that implement it try to do something similar themselves; as the Suricata documentation notes, the result is a scenario much like the issue above. Leblond said it was known to be a problem with the Intel XL510; its brother card, the XL710, is capable of using a symmetric hash, but the Linux driver does not currently allow telling it to do so. A patch by Victor Julien to enable this was initially rejected; Leblond is keen that it should make it in, eventually.
This discussion brought Leblond to the extended Berkeley Packet Filter, or eBPF. As noted earlier, this provides an environment where programs, written in the eBPF virtual-machine language, can be attached from user space to the kernel for various purposes. Since 4.3, Leblond said, one such purpose is providing the hash function that allows fanout mode to decide to which socket to send any given packet. He showed a slide with an example of such an eBPF program, which in 18 lines extracted source and destination IPv4 addresses (treating each as a 32-bit binary number) and returned a simple hash that was the sum of the two.
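The program from the slide is not reproduced here, but a minimal equivalent in restricted C (compiled with clang -target bpf and using libbpf's helper headers) might look like the following; it assumes plain IPv4 over Ethernet and ignores VLANs, IPv6, and IP options.
    /* Sketch of an eBPF fanout load-balancer: loaded as a socket-filter
     * program and attached with the PACKET_FANOUT_DATA socket option after
     * selecting the PACKET_FANOUT_EBPF mode.  The kernel reduces the return
     * value modulo the number of sockets in the group, so a symmetric sum
     * of the addresses keeps both directions of a flow on the same worker. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("socket")
    int fanout_hash(struct __sk_buff *skb)
    {
        __u32 saddr = 0, daddr = 0;

        /* IPv4 source and destination addresses live at offsets 26 and 30
         * of the frame (14-byte Ethernet header + 12/16 into the IP header). */
        if (bpf_skb_load_bytes(skb, 26, &saddr, sizeof(saddr)) < 0)
            return 0;
        if (bpf_skb_load_bytes(skb, 30, &daddr, sizeof(daddr)) < 0)
            return 0;

        return saddr + daddr;
    }

    char _license[] SEC("license") = "GPL";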
When using eBPF you have to parse the packet yourself to extract the information you want but, he said, the advantages significantly outweighed this cost. One such advantage was in dealing with data tunneled through protocols such as L2TP (for VPNs) and GTP (a 4G protocol). Because these protocols are not known to the kernel or NIC, if you ask either of those to do the hashing, all the data through a given tunnel will go into a single worker even though it probably represents multiple flows. That can overload the worker. With eBPF one could strip the tunnel headers and load-balance on the inner packets, distributing them much more fairly.
Because one worker thread can handle at most some 500-1000Mbps, it's a problem if a single worker gets overloaded. Once that happens, the kernel's ring buffers fill up, packets start to get dropped, and you're no longer monitoring as comprehensively as you should be. The avoidance of this is therefore a priority for Suricata development. As we've seen above, many elegant techniques can be brought to bear to prevent this happening by accident. Sometimes, however, it's going to be unavoidable: when your traffic is completely dominated by a single flow, it's all going to go into a single worker thread because it needs to. Not only will that worker thread struggle to analyze this "big flow", but any other flows that by accident of hashing get sent to the same worker may also not get properly inspected.
To mitigate this, the developers introduced the concept of bypass, which relies on the observation that in most cases attacks are done at the start of a TCP session; for many protocols, multiple requests on the same TCP session are not even possible. Looking only at the start of a flow thus gets you 99% of the coverage you need. Suricata can detect big flows with a simple counter, then either drop the packets from the worker queues (local bypass), or instruct the kernel not to bother capturing them (capture bypass). Suricata can be configured to only reassemble a flow for deep protocol analysis until it is a certain (configurable) length, after which time that flow, too, will be bypassed. It can also be configured to bypass intensive but likely-safe traffic, such as that coming from Netflix's servers.
To support capture bypass, Suricata needs the ability to maintain an in-kernel list of flows not to be captured. This can't be done using nftables, because AF_PACKET capture comes before nftables processing. Unsurprisingly, it turns out that here, again, there are eBPF hooks in the kernel. Leblond showed a 16-line eBPF program that did a lookup against an eBPF map of flows to be bypassed, and returned a capture decision based on the flow's presence in, or absence from, that map. Code for maintaining the map and timing out old entries was also presented.
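The idea, in heavily simplified form, looks something like the sketch below; the flow-key layout and map size are illustrative rather than Suricata's actual definitions, and a real filter must also handle the reverse direction of each flow and non-IPv4 traffic.
    /* Sketch of a capture-bypass socket filter: user space inserts flows it
     * no longer wants to see into the map; the program drops matching
     * packets (return 0) and passes everything else in full (return -1). */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct flow_key {
        __u32 src;
        __u32 dst;
        __u16 sport;
        __u16 dport;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 32768);
        __type(key, struct flow_key);
        __type(value, __u64);
    } flow_table SEC(".maps");

    SEC("socket")
    int capture_bypass(struct __sk_buff *skb)
    {
        struct flow_key key = {};

        /* Offsets assume Ethernet + a 20-byte IPv4 header + TCP/UDP ports. */
        if (bpf_skb_load_bytes(skb, 26, &key.src, sizeof(key.src)) < 0 ||
            bpf_skb_load_bytes(skb, 30, &key.dst, sizeof(key.dst)) < 0 ||
            bpf_skb_load_bytes(skb, 34, &key.sport, sizeof(key.sport)) < 0 ||
            bpf_skb_load_bytes(skb, 36, &key.dport, sizeof(key.dport)) < 0)
            return -1;                /* unparseable: let user space see it */

        /* If this flow has been listed for bypass, count the packet and
         * drop it before it ever reaches the capture socket. */
        __u64 *pkts = bpf_map_lookup_elem(&flow_table, &key);
        if (pkts) {
            __sync_fetch_and_add(pkts, 1);
            return 0;                 /* drop: bypass this flow */
        }
        return -1;                    /* keep the full packet */
    }

    char _license[] SEC("license") = "GPL";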
Suricata's capture bypass testing was done with live data at 1-2Gbps, for an hour; as Leblond noted, that's good, because it's a real-world test, but it's bad because it's not reproducible. Nevertheless, he presented the test results, and big flows could clearly be seen being bypassed, with commensurate lowering of system load.
eBPF has given us the ability to drop packets earlier in the capture process than would otherwise be possible, but the kernel still has to spend some time and memory on each packet before a drop decision is made. Suricata would like to reduce this work as much as possible, and the project is looking at XDP (eXpress Data Path, or eXtreme Data Path, depending on who you ask, he said) to do this. With XDP, an eBPF program can be run to make a decision about each packet inside the NIC driver's "receive" code; available decisions include an early (and therefore low-cost) drop, passing the packet up to the kernel's network stack (with or without modification), immediate retransmission out from the receiving NIC (again, with or without modification), and redirection for transmission from another NIC (the choice of NIC being made with more help from eBPF). Leblond is interested in the use of that last capability to make Suricata's IPS mode closer in speed to its IDS mode, by replacing the decision to bypass a packet with the decision to fast-route it into the enterprise network.
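A minimal XDP program (again illustrative, not Suricata's code) shows the shape of such a filter: the packet is validated against data_end and a verdict such as XDP_DROP or XDP_PASS is returned before the kernel allocates any of its usual packet-handling structures.
    /* Minimal XDP sketch: drop IPv4 packets from one hard-coded source
     * address at the earliest possible point; pass everything else up the
     * stack.  A real bypass program would consult a flow map instead. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_early_drop(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        /* 192.0.2.1: a documentation address used purely as an example. */
        if (iph->saddr == bpf_htonl(0xC0000201))
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";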
The downside to XDP is that it requires drivers that support it, which few currently do. Although he hasn't been able to test this himself, he reported results from others who have; using a single CPU and a Mellanox NIC running at 40Gbps, they were able to drop packets at 28 million packets per second (Mpps), or modify packets and retransmit them from the receiving NIC at 10Mpps. Leblond outlined the possibility of implementing not just packet bypass but packet capture via XDP, with consequent performance improvements.
With lunch fast approaching, Leblond wound up his talk, which was a fascinating exposition of how much more can be achieved in user space if it is willing to work more tightly with the kernel than is usual.
[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]
The state of the realtime union
The 2017 Realtime Summit was held October 21 at Czech Technical University in Prague to discuss all manner of topics related to realtime Linux. Nearly two years ago, a collaborative project was formed with the goal of mainlining the realtime patch set. At the summit, project lead Thomas Gleixner reported on the progress that has been made and the plans for the future.
The Real-Time Linux (RTL) project has the goal of getting the patch set into the upstream kernel, but there is more to it than that, Gleixner said. Other facets of the project's work include documentation, establishing a community, and long-term maintenance. He often gets asked about the "community" goal; it is one of the more important pieces, actually.
![Thomas Gleixner](https://static.lwn.net/images/2017/rts-gleixner-sm.jpg)
Gleixner has talked with Linus Torvalds about the realtime patches along the way; Torvalds is adamant that he will not merge them if there is no community of users and developers that can maintain the code going forward. Drivers for different kinds of hardware can be merged into the mainline without impacting other parts of the kernel much or at all. The realtime patches touch many places in the core of the kernel itself, so it will be important to have a community that can handle its maintenance over the long haul.
Funding for the project for 2017 is secured, from ten member companies at three different membership levels. There are five paid developers on the project working on code, testing infrastructure, documentation, and the like. For many years, Gleixner worked on the project as something of a hobby, but it is "much more fun to get paid for things", he said.
Work on the project started around 21 months ago, he said, and much has been accomplished. The two things that have occupied the most time were the CPU hotplug rework (much of which he also detailed in a talk at Kernel Recipes) and the hotplug locking rework. He went into a bit more detail about the latter piece at the summit.
Hotplug locking
Hotplug locking caused a lot of headaches for realtime Linux. It used a counting semaphore that was mostly not covered by the lockdep lock validator. Since it "evaded lockdep almost completely", it is not a surprise that there were various bugs and deadlocks hidden in that code. Gleixner removed the counting semaphore and replaced it with a per-CPU reader/writer semaphore.
That moved the hotplug locking under the purview of lockdep, which resulted in complaints that were also addressed as part of the realtime work. There are various subsystems outside of hotplug that take the hotplug lock (e.g. tracing, watchdog), which is generally where deadlocks could occur. There were some kernel developers who complained about false positives from lockdep once that change was made, which was similar to the complaints made with the recent crossrelease feature for lockdep (and those from when lockdep was first introduced); annotating the code to teach it about false positives is the right way forward. Peter Zijlstra noted that he got his start at Red Hat by doing lockdep annotations to avoid false positives, which is "how I got sucked into all of this".
The locking rework is mostly done with its "curing" now, Gleixner said. There are some watchdog problems that still need to be dealt with, however. He learned some lessons in the process, which sounded strangely familiar—unsurprisingly, no new "lessons learned" arose in the short time between his talk at Kernel Recipes (linked-to above) and the summit.
Accomplishments
He then went on to list some other accomplishments beyond the hotplug work. That includes the timer wheel rework, which allowed the CONFIG_NO_HZ and CONFIG_NO_HZ_FULL modes to start working with the realtime patches again. There is some reworking of the high-resolution timers that was posted and reviewed; there are plans to post another version soon. An overhaul of the futex and RTmutex code was done as well.
The realtime patches have long had a way to independently disable preemption and the handling of page faults, which the s390 architecture needed in mainline. So the RTL project helped s390 get it upstream. There was also a large cleanup of abuses of the task affinities; there was a lot of code doing so for "obscure reasons", Gleixner said. Lastly, the project extended the debug facilities in the kernel to catch the "abuse of various things in the kernel".
Some 700 patches have been merged or queued for 4.15, and there are 50 patches under review. The project has fixed 40 real and 80 latent bugs. The latter are "impossible to trigger", he said, but the real ones are actually relatively easy to trigger; they just have either not been hit, due to luck, or not been reported.
He then listed the various releases that are being maintained with the realtime patch set. There are several 3.x releases (3.2, 3.10, 3.16) that are rarely updated; the 4.x series (4.1, 4.4, 4.9, and 4.14 is planned) mostly tracks the long-term stable releases of the mainline. Much of that work is done by Steven Rostedt, "and Steven does not scale", Gleixner said. Figuring out what to do about that would be one topic for the project meeting that was held on October 23.
The current development version is based on 4.13; it "sort of works". It will be abandoned in favor of the 4.14 version sometime soon.
Remaining work
The most complex remaining development tasks are locking for the dentry cache (dcache), some softirq modifications ("still some rough edges" to work out there), interaction with the memory management subsystem (pretty straightforward), and handling local locks and annotations, which is something he has wanted to get into the mainline for a long time.
Of those, the dcache locking issue is the most complex. Zijlstra pointed out that the code in question is in an area that Linus Torvalds "really cares about", so he suggested taking any proposed solutions to Torvalds first. The problem for realtime is that the dcache uses trylock loops; in the mainline, trylock is "pretty cheap", Gleixner said, but it doesn't work for realtime. If the code holding the lock gets preempted by a task on the same CPU that needs the lock, it will lead to a livelock.
Right now, the realtime patches are using a "butt-ugly workaround" of just sleeping for one millisecond if the trylock fails, which will allow the preempted task to make progress and eventually release the lock. It is the "cheapest solution for the problem" and if realtime-relevant tasks are doing "filesystem operations using the dcache, I can't help them anyway". Rostedt suggested using his "trylock and boost" approach, which Gleixner admitted worked, but said was even uglier. Rostedt replied that at least it was deterministic, thus it was "deterministically ugly", which elicited some laughter.
Gleixner said that he had been talking to some filesystem developers who think that the trylock loops may not be needed. There has been some experimentation using RCU; those experiments "kind of work" but there are corner cases that are, as yet, unsolved. This is the most challenging issue facing realtime right now, he said, so if anyone has any ideas that don't involve trylock boosting (which he does not believe can be sold to Torvalds), he would love to hear them.
He concluded his talk with a call for help. There are needs for testing and documentation. He has also recently updated the task list for the project for any developers who want to help. He asked that people "grab the code and fix it" and then to send him the patches.
Gleixner has presented various roadmaps for the realtime patch set over the years. This year's edition was textual: "Due to the evolutionary nature of Linux the roadmap will be published after the fact, but it will be a very precise roadmap." As with the others, this roadmap gives some insight into Gleixner's feelings about regularly being asked to provide one.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Prague for the Realtime Summit.]
The 2017 Kernel and Maintainers Summits
The 2017 Kernel and Maintainers Summits were held in Prague, Czechia, in late October, co-located with the Open Source Summit Europe. As usual, LWN was there, and has put together coverage of the topics that were discussed at these meetings.
The format of the Kernel Summit was changed significantly this year. The bulk of the schedule was moved into a completely open set of talks that ran alongside the rest of the OSS tracks; as a result, the attendance at these discussions was larger than in past years and included more people from outside the core kernel community. The invitation-only discussion has been made much smaller (about 30 core maintainers) and turned into a half-day event.
The Kernel Summit
Topics discussed in this year's Kernel Summit include:
- The tracepoint ABI problem: another attempt to find a way to instrument the kernel without creating ABI issues in the future.
- Restartable sequences and ops vectors: patches implementing restartable sequences have been circulating on the lists for years; this discussion covered another attempt to get this feature into the mainline.
- Regression tracking, part 1: the kernel project has a regression tracker again, but he is finding the job to be challenging.
- Improving printk(): a possible redesign of the internals of this important utility function to address some performance issues.
- A kernel self-testing update: the kernel self-tests are growing, but there is more to do there.
The Maintainers Summit
- Regression tracking, part 2: a continuation of the discussion in the smaller group.
- Bash the kernel maintainers: a discussion about feedback from the community about working with kernel maintainers.
- An update on the Android problem. The Android ecosystem is full of out-of-tree code, but it would appear that things are getting better.
- The state of Linus: the traditional session where Linus Torvalds gives his view of the state of the development community.
- SPDX, cross-subsystem development, and conclusion: a few short topics to close out the discussion.
[LWN would like to thank the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event].
Another attempt to address the tracepoint ABI problem
Tracepoints provide a great deal of visibility into the inner workings of the kernel, which is both a blessing and a curse. The advantages of knowing what the kernel is doing are obvious; the disadvantage is that tracepoints risk becoming a part of the kernel's ABI if applications start to depend on them. The need to maintain tracepoints could impede the ongoing development of the kernel. Ways of avoiding this problem have been discussed for years; at the 2017 Kernel Summit, Steve Rostedt talked about yet another scheme.
The risk of creating a new ABI has made some maintainers reluctant to add instrumentation to their parts of the kernel, he said. They might be willing to add new interfaces to provide access to specific information but, in the absence of tools that use this information, it is hard to figure out which information is needed or what a proper interface would be. The solution might be to adopt an approach that is similar to the staging tree, where not-ready-for-prime-time drivers can go until they are brought up to the necessary level of quality.
People talk about "tracepoints", but there are actually two mechanisms in
the kernel. Internally, a tracepoint is a simple marker in the code, a
hook to which a kernel function can be attached. What user space sees as a
tracepoint is actually a "trace event", which is a specific interface that
is implemented using the internal tracepoints. Without trace events, there
is no interface visible to user space.
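For those unfamiliar with the distinction, the following simplified sketch (using a made-up foo_update event, with the usual trace-header boilerplate omitted) shows both forms: DECLARE_TRACE() creates only the internal hook, while TRACE_EVENT() also generates the user-visible event and its format description.
    /* Sketch only: a bare tracepoint versus a full trace event, using a
     * hypothetical "foo_update" rather than any real kernel event.  These
     * definitions normally live in a trace header with the surrounding
     * TRACE_SYSTEM boilerplate, which is omitted here. */

    /* A bare tracepoint: an internal hook that kernel code (or a module)
     * can attach a callback to, with no user-space interface of its own. */
    DECLARE_TRACE(foo_update,
        TP_PROTO(int cpu, unsigned long value),
        TP_ARGS(cpu, value));

    /* A trace event: built on a tracepoint, but also generating the
     * user-visible event with its self-describing format. */
    TRACE_EVENT(foo_update,
        TP_PROTO(int cpu, unsigned long value),
        TP_ARGS(cpu, value),
        TP_STRUCT__entry(
            __field(int, cpu)
            __field(unsigned long, value)
        ),
        TP_fast_assign(
            __entry->cpu = cpu;
            __entry->value = value;
        ),
        TP_printk("cpu=%d value=%lu", __entry->cpu, __entry->value)
    );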
The proposed solution to the ABI problem is to place a tracepoint at locations of interest, but not bother with the trace event. Making the tracepoint available to user space would then require loading a kernel module; this module would be kept out of the mainline tree. It would be, he said, a development space to try out interfaces for the more sensitive tracepoints. Since it is not a part of the mainline kernel, it could not become part of the kernel ABI. But distributors could ship this module, making the tracepoints available to user-space developers.
Ben Hutchings, a Debian kernel maintainer, said that this approach would not work in a number of cases. There are many situations where it's not possible to just load a random module into the kernel; many customers use module signing, for example, to prevent exactly that from happening. Even if distributions ship this module, users of different distributions would have different modules and the tracepoints would be incompatible; that would make it harder to write tools to use them.
Another member of the audience expressed skepticism, saying that if every distributor ships this module, it will become an ABI that has to be maintained anyway. Ben Herrenschmidt agreed and suggested that the right solution was to make the tracepoints self-describing. As Rostedt pointed out, though, they are already self-describing; changing the availability of information will still break things. Tools may depend on specific information that is no longer available, or they may simply ignore the format information for the tracepoint. That makes it hard to remove obsolete tracepoints which, since they each occupy about 5KB of memory, is unfortunate.
Matthew Wilcox asked whether the proposed scheme would have solved the problem with powertop, which broke some years ago when a variable was removed from a tracepoint. Rostedt said that it would have; Ted Ts'o noted that the powertop problem shows that self-describing formats do not work as a solution to this problem.
Much of the current work is being pushed by developers within Facebook, who use a vast library of tracepoints to diagnose performance problems. They are willing to deal with their tools breaking when the kernel changes. That led Andrew Morton to ask whether Linus Torvalds made the right call by including tracepoints in the kernel ABI. Rostedt said he disagrees with that decision, but it doesn't matter, since Torvalds has the final say. David Woodhouse complained that the group was talking about "arbitrary technical nonsense"; perhaps the loaded module should just set a flag to make the tracepoints available. Morton agreed that the module idea "sounds like bullshit" and suggested that perhaps it was time to get the rule changed. But Rostedt has tried that before, he said, and he still bears the scars that resulted.
Chris Mason said that, while Facebook can handle tracepoint changes that break its tools, there is a need to know when such a change has happened. Just moving the ABI to a loadable module will not solve that problem; it just pushes the problem onto the distributors instead.
Ts'o then launched into a discussion of the growing set of tools that work by attaching BPF scripts to tracepoints. These tools are becoming popular and soon will be as popular as powertop; that will result in the same kinds of problems when they break. The problem is here now and needs to be addressed.
Doing so will be hard, he said. The topic had been suggested for the invitation-only Maintainers Summit, since it is "fundamentally a Linus problem", but Torvalds had vetoed it. Torvalds wants to make a guarantee to user-space tools that works in 99% of the cases, but it is hard to live up to for tools that are closely tied to the kernel. So the powertop problem will come again, only worse; BPF will "turn it into a trainwreck". Rostedt added that Linux started off as "a desktop toy", but it is no longer a simple system. Nobody knows the whole thing, so they are relying more on tooling to know what is going on.
The conversation came to an end about here, but the topic did return at the Maintainers Summit later that week, after Torvalds and Rostedt had discussed it. The solution that was arrived at for now, as related by Torvalds, is to hold off on adding explicit tracepoints to the kernel. Instead, support will be added to make it easy for an application to attach a BPF script to any function in the kernel, with access to that function's arguments. That should give tools access to the information they need, and may make it possible to (eventually) remove many of the existing explicit tracepoints.
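That is roughly what BPF-based tools already do today via kprobes; the sketch below (the choice of do_sys_open() and the libbpf conventions are illustrative, not part of the plan discussed at the summit) attaches a small program to an arbitrary kernel function and counts calls per process.
    /* Sketch: attach a BPF program to an arbitrary kernel function through
     * a kprobe and count calls per process.  The target function and map
     * layout are illustrative only. */
    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);     /* process ID */
        __type(value, __u64);   /* call count */
    } calls SEC(".maps");

    SEC("kprobe/do_sys_open")
    int count_open(struct pt_regs *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;
        __u64 one = 1, *val;

        val = bpf_map_lookup_elem(&calls, &pid);
        if (val)
            __sync_fetch_and_add(val, 1);
        else
            bpf_map_update_elem(&calls, &pid, &one, BPF_ANY);
        return 0;
    }

    char _license[] SEC("license") = "GPL";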
Arnd Bergmann asked what would happen if a popular script breaks as the result of the removal of a function; Torvalds replied that he would not see it as a regression that needs to be fixed. But, he said, if that happens it should be seen as a sign that the kernel should be providing that information in a more straightforward manner. A tracepoint or other interface could be added at that time.
Whether this solution provides what the tools need will take time to determine. But if it does, it may just be possible that a multi-year debate has finally come to some sort of conclusion that all of the parties involved can live with.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].
Restartable sequences and ops vectors
Some technologies find their way into the kernel almost immediately; others need to go through multiple iterations over a number of years first. Restartable sequences, a mechanism for lockless concurrency control in user space, fall into the latter category. At the 2017 Kernel Summit, Mathieu Desnoyers discussed yet another implementation of this concept — but this one may not be the last word either.
The core idea behind restartable sequences has not changed. An application defines a special region of code that, it is hoped, will run without interruption. This code performs some operation of interest on a per-CPU data structure that can be committed with a single instruction at the end. For example, it may prepare to remove an item from a list, with the final instruction setting a pointer that actually effects this change and makes it visible to other threads running on the same CPU. If the thread is preempted in the middle of this work, it may contend with another thread working on the same data structure. In this case, the kernel will cause the thread to jump to an abort sequence once it runs again; the thread can then clean up and try again (the "restart" part of the name). Most of the time, though, preemption does not happen, and the restartable sequence will implement a per-CPU, atomic operation at high speed.
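The pattern can be expressed as a rough C sketch: a real implementation registers a struct rseq with the kernel and writes the critical section in assembly so that the commit is a single store; the rseq_current_cpu() and rseq_cmpeqv_storev() helpers used here are modeled on the librseq API but should be read as illustrative.
    /* Conceptual sketch of the restartable-sequence pattern: a per-CPU
     * counter increment that commits with a single store and retries if
     * the sequence was interrupted. */
    #include <stdint.h>
    #include <rseq/rseq.h>           /* librseq; assumed available */

    #define MAX_CPUS 256

    struct percpu_counter {
        intptr_t count;
    } __attribute__((aligned(64)));  /* one slot per CPU, cache-line aligned */

    static struct percpu_counter counters[MAX_CPUS];

    static void counter_inc(void)
    {
        for (;;) {
            int cpu = rseq_current_cpu();        /* CPU we are running on */
            intptr_t old = counters[cpu].count;

            /* Commit old + 1 with a single store, but only if the sequence
             * was not interrupted by preemption, migration, or a signal;
             * on abort, retry on whatever CPU we now find ourselves on. */
            if (rseq_cmpeqv_storev(&counters[cpu].count, old, old + 1, cpu) == 0)
                return;
        }
    }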
Restartable sequences have been around for some time and are evidently in use at Google and elsewhere. But there are some rough edges, one of which is debugging. Single-stepping through a restartable sequence with a debugger will be an exercise in frustration, since the sequence will never run uninterrupted and, thus, always abort. Fixing this problem requires the implementation of some way to execute the sequence as a single step.
The solution in the current patch set is a new system call:
int cpu_opv(struct cpu_op *ops, int opcount, int cpu, int flags);
The purpose of this system call is to accept a sequence of operations (an "ops vector") and execute it atomically. Each entry in the ops array is a single operation; the array has a maximum length of sixteen operations. The available operations include comparisons, memory copies, and basic arithmetic. The amount of data that can be operated on is bounded (to limit the maximum execution time of the vector), and all of that data is locked into memory before execution of the ops vector begins. The vector is run on the CPU indicated by cpu; the flags argument must be zero in the current implementation.
The ops vector is meant to be used as a fallback when a restartable sequence aborts; it can be run during single-stepping or in any other situation where the sequence itself is unable to complete successfully (it is not a suitable replacement for the sequence entirely; as a system call, it will be quite a bit slower). Users of restartable sequences would thus need to create a second implementation of their algorithm in this new language and run it when the original sequence fails.
This idea, Desnoyers said, came to him in the shower one day. It is, he said, a relatively simple solution to the problem.
This was the point where your editor was unable to resist raising his hand and asking whether, rather than adding yet another interpreter to the kernel, Desnoyers could use the existing BPF language and interpreter. The existing BPF verifier could likely be adapted to the needs of the ops-vector mechanism. Desnoyers replied that BPF carries a lot of weight that is not needed here and, in any case, the ops vector should almost never actually run in real-world use. But then he went on to say that ops vectors could also be employed for simple housekeeping tasks that may not need a full restartable-sequence implementation.
Andy Lutomirski jumped in to say that BPF seemed like a reasonable solution to the problem; the BPF interpreter's context mechanism could be used to manage operands to the vector, for example. Peter Zijlstra pointed out that BPF programs have a large kernel-space context associated with them; a program might have one hundred restartable sequences, which would add up to a lot of overhead.
Lutomirski then said that he has his own version of restartable sequences that he has been working on. Rather than abort when preemption occurs, it aborts when an actual data conflict happens. Single-stepping works in this implementation, he said. Desnoyers replied that such an approach would make the implementation more complex, but Lutomirski said that it is still better than requiring every user to implement their algorithms twice. The slow path will be poorly tested at best, and developers will often get it wrong, he said. Zijlstra replied that there would be a library that would take care of the details for most uses, though, and Ben Herrenschmidt said that only developers who truly care about restartable sequences will deal with things at that level.
Desnoyers moved on to the use cases for restartable sequences — an important topic, since Linus Torvalds has made it clear that he will not merge this code without clear evidence that it will be used. The LTTng tracing code can use this feature for fast user-space tracing across processes, Desnoyers said; he would also like it for his user-space read-copy-update implementation. The jemalloc and GNU C Library malloc() implementations can speed things up with restartable sequences. There is a use case for per-CPU statistics counters. Matthew Wilcox added that the developers of the DPDK user-space driver system also want this mechanism. Herrenschmidt said that, in the end, all of the concurrency issues that apply to the kernel also apply to user space.
The final part of the discussion wandered over various topics, including the details of how multiple, independent users can share a restartable-sequences region and whether maybe the classic BPF interpreter might be a better tool for the ops-vector job than extended BPF. Desnoyers said that he would look into the BPF option; expect the conversation to continue on the mailing lists.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].
Kernel regression tracking, part 1
The kernel development community has run for some years without anybody tracking regressions; that changed one year ago when Thorsten Leemhuis stepped up to the task. Two conversations were held on the topic at the 2017 Kernel and Maintainers Summits in Prague; this article covers the first of those, held during the open Kernel-Summit track.
Leemhuis began by pointing out that he started doing this work even though he does not work for a Linux company; he is, instead, a journalist for the largest computer magazine in Germany. He saw a mention of the gap that was left after Rafael Wysocki stopped tracking regressions, and thought that he might be a good fit for the job. This work is being done in his spare time. When he started, he had thought that the job would be difficult and frustrating; in reality, it turned out to be even worse than he expected.
Why is it so hard? The first problem is that nobody actually tells him about regressions, so he has to hunt them down himself. That means digging through a lot of mailing lists and bug trackers. Wysocki noted that things are worse than they were years ago when he did the job; there are a lot more information sources now. It is more, Wysocki said, than any one person can follow.
Leemhuis went on to say that a lot of regressions are also fixed without him even noticing. Nobody tells him about progress toward fixing regressions, so that, too, must be tracked manually. He had asked developers to include a special identifier in discussions on regressions, but nobody has done it. That is unfortunate, since he had thought it would be a useful mechanism; perhaps, he said, he should have tried harder. Ben Herrenschmidt agreed, saying that it can be hard to get people to change their established workflow to incorporate a new mechanism. James Bottomley noted that maintainers would, in general, rather avoid having their bugs termed "regressions", since that increases the pressure for an immediate fix.
Leemhuis raised the idea of creating a dedicated mailing list for regressions, with reporters asked to copy their reports there. Wysocki agreed that this might be useful, but said that the information on how to report regressions properly needs to be better communicated. Laura Abbott concurred, saying that the documentation in this area should be improved.
Herrenschmidt noted that most bug reports come from distributor kernels rather than the mainline. For distributions like Fedora, which ships something close to a current mainline kernel, these reports can be relevant, though they are still a version or two behind the current development kernels. Reports of bugs in enterprise kernels, instead, have little value. Bottomley added that Linus Torvalds is mostly interested in mainline regressions; the resources just don't exist to track regressions in distributor kernels as well.
There was general agreement that only mainline regressions should be tracked, but Ted Ts'o said that the community could look for volunteers to track regressions in older kernel versions. The work is still useful, he said, and would train others to help with regression tracking. The problem with this idea, Bottomley replied, is that one has to be an idiot to want to do this work — an idea that Leemhuis seemed to concur with. There won't, Bottomley added, be a flood of volunteers in this area. Matthew Wilcox's suggestion that the situation could change because there are a lot of journalists being laid off was not seen as entirely helpful.
Abbott said that, in her role as a Fedora kernel maintainer, she sees a lot of bug reports, but many of them are of low quality. They need to be filtered before being passed on to any sort of core regression list. Arnd Bergmann added that Linaro has been doing more testing recently and finding regressions in linux-next. But Leemhuis said he is really only interested in regressions that make it to the mainline at this point.
Leemhuis went on to say that, while Wysocki used the kernel's Bugzilla tracker to handle regressions, it "looks like double-entry accounting" to him and he has avoided it. There is a lot of overhead associated with working in Bugzilla, and kernel developers tend not to like it. So he has been using the mailing lists instead, but perhaps that was the wrong decision?
Wysocki replied that he used Bugzilla because it was suitable for him; it provided a useful archive of the discussions around regressions. Ts'o said that the real problem is that Torvalds will not dictate a single bug-tracking system for the kernel, so the information is scattered around the community. The kernel Bugzilla is not perfect, he said, but it has the advantage of actually existing and being available. Wysocki added that there needs to be a database somewhere; it should be possible to point people to a definitive entry for a regression. Takashi Iwai said that, for distributors, the most important thing to have is an overview of the situation; that is missing now. There is no comprehensive list of problems, so distributors must go through the time-consuming task of polling a number of different bug trackers.
Wilcox asked if distributors use the regression list for decisions on which kernel versions to ship, or whether those decisions are purely based on time. Abbott replied that Fedora tries to ship the latest mainline kernel, but the decision on pushing a specific kernel does depend on the current regressions. A significant Intel or AMD graphics regression will cause a kernel to be held back, she said, while "an obscure USB dongle" problem will not.
Ben Hutchings said that the situation at Debian is similar, at least outside of the long-term support releases. Iwai said that openSUSE Tumbleweed ships the latest kernel, meaning that regression reports are relevant to the current mainline release, not the development kernel that the kernel developers are working on currently. There are, he said, not many people testing the -rc kernels. Jiri Kosina added that SUSE tracks the "Fixes" tags in patches to see which bug fixes are relevant to the kernels they have shipped; those fixes will be backported if needed. That has led to a reduction in the regressions reported with openSUSE kernels.
Leemhuis asked if he should query developers via email more often the way Wysocki did; Wysocki replied that he didn't do that — his scripts did. Mark Brown said that was a good thing, since the scripts were more polite than their author. Overall, there didn't appear to be any opposition to more email if that's what is needed to improve regression tracking.
As the discussion came to a close, it was noted that regression reporting is hard for most users. They don't know where to send their reports, and there is little information out there to help them. The noise on the mailing lists does not help. The kernel Bugzilla is especially problematic since it is the wrong place to report many bugs, but it's not clear which ones or where they should actually go. Ts'o said that, if it were up to him, he would designate the Bugzilla as being for all kernel bugs, and that subsystem maintainers would simply be told to cope with it. In the absence of such a policy, users will continue to struggle.
The final suggestion came from Abbott, who said that perhaps users who send email to the linux-kernel list (and nobody else) should get an automatic response. That response would inform them that email sent only to the list is unlikely to be read by many people and would thus probably not get a response. It would include suggestions regarding how to more successfully report bugs. This idea was generally well received.
This topic was revisited during the invitation-only Maintainers Summit two days later.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].
Improving printk()
When a kernel developer wants to communicate a message to user space, be it for debugging or to report a serious problem with the system, the venerable printk() function is usually the tool of choice. But, as Steve Rostedt (accompanied by Petr Mladek and Sergey Senozhatsky) noted during a brief session at the 2017 Kernel Summit, printk() has not aged well. In particular, it can affect the performance of the system as a whole; the roots of that problem and a possible solution were discussed, but a real solution will have to wait for the appearance of the code.
The problem, Rostedt said, has to do with the management of the console lock, which serializes access to the console device where messages are printed. Deep within printk(), one will find interesting call sequences like:
    if (console_trylock())
        console_unlock();
The first call will attempt to acquire the console lock but may not succeed; the second, on its surface, releases that lock immediately after it was acquired. It is the work involved in releasing the console lock that can create problems for the system.
printk() must proceed regardless of the availability of the console lock; since printk() is called from all over the kernel, waiting for any sort of lock risks deadlocking the system. So, if a particular call is unable to obtain the console lock, it simply buffers its output and returns, in the expectation that somebody else will flush that output to the console. That task falls to the thread that holds the console lock; that thread is expected to flush out all buffered output as part of the process of releasing the lock.
On a large system with a lot of CPUs, there can be multiple threads calling printk() at any given time. They can leave behind a lot of work for the unlucky thread that holds the console lock; indeed, in the worst case, output can continue to pile up while the buffer is being flushed, leaving the lock holder with a job of indefinite duration. That is bad for system performance and the latency of anything that needs to run on the affected CPU.
Peter Zijlstra jumped in to say that, whenever this problem comes up, he just removes printk() calls until it goes away. Andrew Morton, instead, asked for forgiveness for creating this mechanism in the first place; it was, he said, something he came up with at 3AM. Rostedt went on to say that, in the worst case, flushing printk() output can take so long that the watchdog fires and the system crashes. If there are 100 CPUs in the system, one of them can end up flushing printk() output forever.
There are, he said, a couple of possible solutions to the problem. One of them is to remove printk() calls as Zijlstra suggested, but that is a game of whack-a-mole that is never really done. The alternative is a new locking scheme where the second thread attempting to obtain the console lock spins and waits for it to become available. The current holder of the lock will see that there is a waiter and release the lock; the second thread will then acquire it and continue flushing the output buffer. If multiple CPUs are generating output, the lock will circulate between them, and none will end up printing output for too long.
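A purely conceptual sketch of that hand-off idea (this is not the posted code; more_records_buffered() and print_one_record() are hypothetical stand-ins for the real log-buffer machinery) might look like this:
    /* Conceptual sketch of the hand-off scheme described above: a CPU that
     * wants the console lock announces itself; the current flusher notices
     * the waiter after printing a record and hands the lock over, so no
     * single CPU is stuck flushing forever. */
    static atomic_t console_waiter = ATOMIC_INIT(0);

    void console_flush_or_wait(void)
    {
        if (!console_trylock()) {
            /* Somebody else is flushing: announce ourselves and spin until
             * the current holder hands the console lock over to us. */
            atomic_set(&console_waiter, 1);
            while (atomic_read(&console_waiter))
                cpu_relax();
            /* Ownership was transferred; fall through and keep flushing. */
        }

        while (more_records_buffered()) {
            print_one_record();
            if (atomic_read(&console_waiter)) {
                /* Hand off: the spinning CPU becomes the new owner and
                 * continues the flush; this CPU returns immediately. */
                atomic_set(&console_waiter, 0);
                return;
            }
        }
        console_unlock();
    }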
Jan Kara said that he had once tried to implement a similar scheme, but he ran into a lot of special cases and finally gave up on it. Mathieu Desnoyers suggested deferring any excess printing work to a workqueue rather than pushing it out immediately; Ben Herrenschmidt concurred, saying that there is no real need to flush output right away. But Rostedt answered that Linus Torvalds insists that crash dumps must go out immediately, so any scheme that can delay output will not fly. The entire printk() buffer must be printed out as soon as possible.
There was some unstructured discussion on the details of the new approach, but no real conclusions were reached. This is a conversation that will have to resume once the code to implement the new mechanism has been posted.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to the Kernel Summit.]
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Canonical & GNOME; Plasma Mobile; Free Software Awards; Quotes; ...
- Announcements: Newsletters; CFPs; Events; Security updates; Kernel patches.