LWN.net Weekly Edition for November 10, 2016
Portable system services
In the refereed track of the 2016 Linux Plumbers Conference, Lennart Poettering presented a new type of service for systemd that he calls a "portable system service". It is a relatively new idea that he had not talked about publicly until systemd.conf in late September. Portable system services borrow some ideas from various container managers and projects like Docker, but are targeting a more secure environment than most services (and containers) run in today.
There is no real agreement on what a "container" is, Poettering said, but most accept that they combine a way to bundle up resources and to isolate the programs in the bundle from the rest of the system. There is also typically a delivery mechanism for getting those bundles running in various locations. There may be wildly different implementations, but they generally share those traits.
![Lennart Poettering [Lennart Poettering]](https://static.lwn.net/images/2016/lpc-poettering-sm.jpg)
Portable system services are meant to provide the same kind of resource bundling, but to run the programs in a way that is integrated with the rest of the system. Sandboxing would be used to limit the problems that a compromised service could cause.
If you look at the range of ways that a service can be run, he said, you can put it on an axis from integrated to isolated. The classic system services, such as the Apache web server or NGINX, are fully integrated with the rest of the system. They can see all of the other processes, for example. At the other end of the scale are virtual machines, like those implemented by KVM, which are completely isolated. In between, moving from more integrated to more isolated, are portable system services, Docker-style micro-services, and full operating system containers such as LXC.
Portable system services combine the traditional, integrated services with some ideas from containers, Poettering said. The idea is to consciously choose what gets shared and what doesn't. Traditional services share everything: the network, filesystems, process IDs, init system, devices, and logging. Some of those things will be walled off for portable system services.
This is the next step for system services, he said. The core idea behind systemd is service management; not everything is Docker-ized yet, but everything has a systemd service file. Administrators are already used to using systemd services, so portable services will just make them more powerful. In many cases, users end up creating super-privileged containers by dropping half of the security provided by the container managers and mostly just using the resource bundling aspect. He wants to go the other direction and take the existing services and add resource bundling.
Integration is good, not bad, Poettering said; having common logging and networking is often a good thing. Systemd currently recognizes two types of services: System V and native. A new "portable" service type will be added to support this new idea. It will be different from the other service types by having resource bundling and sandboxing.
To start with, unlike Docker, systemd does not want to be in the business of defining its own resource bundling format, so it will use a simple directory tree in a tarball, subvolume, or GUID Partition Table (GPT) image with, say, SquashFS inside. Services will run directly from that directory, which will be isolated using chroot().
The sandboxing he envisions is mostly concerned with taking away the visibility of, and preventing access to, various system resources. He went through a long list of systemd directives that could be used to effect the sandboxing. For example, PrivateDevices and PrivateNetwork are booleans that, respectively, restrict access to all but a minimal set of devices in /dev (e.g. /dev/null, /dev/urandom) and provide only a loopback interface for networking. PrivateTmp gives the service its own /tmp, which removes a major attack surface. There is a setting for a private user database containing only three users: root, a service-specific user, and nobody; all other user IDs are mapped to nobody. Some other settings protect various directories in the system by mounting them read-only for the service; there is a setting to disallow realtime scheduling priority, another to restrict kernel-module loading, and so on.
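These directives are ordinary unit-file settings; a hardened service unit using several of them might look like the following sketch (the service name and binary path are made up for illustration, but the directive names are real systemd settings):

```ini
# Hypothetical hardened unit; "exampled" is an invented service.
[Unit]
Description=Example sandboxed service

[Service]
ExecStart=/usr/lib/example/exampled
PrivateDevices=yes     ; only minimal devices: /dev/null, /dev/urandom, ...
PrivateNetwork=yes     ; loopback interface only
PrivateTmp=yes         ; service-private /tmp and /var/tmp
ProtectSystem=full     ; /usr, /boot, and /etc mounted read-only
ProtectHome=yes        ; /home and /root made inaccessible
RestrictRealtime=yes   ; no realtime scheduling priority
NoNewPrivileges=yes    ; no privilege gain via setuid binaries and the like
```

Turning several of these on by default, as Poettering hopes distributions will do, costs nothing for services that never touch the restricted resources.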
Many of those are already present in systemd and more will be added. The systemd project has been working to make sandboxing more useful for services, Poettering said. He would like to see a distribution such as Fedora turn these features on for the services that it ships. Another area systemd will be working on is per-service firewalls and accounting.
Unlike native or System V services, portable services will have to opt out of these sandboxing features if they don't support them. In fact, he said, if systemd were just starting out today, native services would have been opt-out for the sandboxing options, but it is too late for that now.
There are some hard problems that need to be solved to make all of this work. One is that Unix systems are not ready to handle dynamic user IDs. When a portable service gets started, an unprivileged user for the service gets created, but is not put into the user database (e.g. passwd file). If a file is created by this user and then the service dies, the file lingers with a user ID that is unknown to the system.
One way to handle that is to restrict services with a dynamic user ID from being able to write to any of the filesystems, so using that feature will require the ProtectSystem feature that mounts the system directories read-only. Those services will get a private /tmp and a directory under /run that they can use, but those will be tied to the lifecycle of the service. That way, the files with unknown (to the system) user IDs will go away when the service and user ID are gone.
Dynamic users are currently working in systemd, which is a big step forward, Poettering said. Right now, new users are installed when an RPM for a service is installed, which doesn't really scale. Dynamic users make that problem go away since the user ID is consumed only while the service is running.
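The dynamic-user behavior described above maps onto a handful of unit settings; a minimal sketch (again with a hypothetical service name) might be:

```ini
# Hypothetical unit using a dynamic user; directive names are real.
[Service]
ExecStart=/usr/lib/example/exampled
DynamicUser=yes          ; user/group allocated at start, released at stop
ProtectSystem=strict     ; the whole filesystem is read-only for the service
RuntimeDirectory=example ; writable /run/example, removed with the service
PrivateTmp=yes           ; private /tmp tied to the service's lifecycle
```

Because the writable locations are all tied to the service's lifecycle, no files owned by the transient user ID can outlive it.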
Another problem that people encounter is that if they try to install a service in a chroot() environment, they will need to copy the user database into the chroot(). The idea behind the PrivateUsers setting is to make chroot() work right. That setting will restrict the service to three users: the dynamic user for the service, plus root and nobody. Most distributions agree on the user IDs for root and nobody, so that will help make portable services able to run on various distributions.
D-Bus is incompatible with a chroot() environment because there is a need to drop policy files into the host filesystem. For now, that is an unsolved problem, but the goal is to move D-Bus functionality into systemd itself. That is something the project should have done a long time ago, Poettering said. The systemd D-Bus server would then use a different policy mechanism that doesn't require access to the host filesystem.
He stressed that systemd is not building its own Docker-like container manager; it is, instead, providing building blocks to take a native service and turn it into a portable one. So systemd will only have a simple delivery mechanism that is meant to be used by developers for testing, not for production use. Things like orchestration and cluster deployment are out of scope for systemd, Poettering declared.
He showed a few examples of his vision of using the systemctl command to start, stop, and monitor portable services on the local host or a remote one, though it was not a demo. It was not entirely clear from the talk how far along things are for portable system services. The overall goal is to take existing systemd services, add resource bundling and sandboxing, and make "natural extensions to service management" to support the portable services idea, he said.
Various questions were asked at the end. For example, updating the services is something that is out of scope for systemd. That can be handled with some other tool, but the low-level building blocks will be provided by systemd. Another question concerned configuration. Different configurations of a service would require building a different bundle with those changes, as the assumption is that configuration gets shipped in the bundle.
Having the security be opt-out is useful, but how will additions to the security restrictions be handled? Existing services could break under stricter rules. Poettering said that it was something he was aware of, but had not come up with a solution for yet. He wants to start with a powerful set of restrictions out of the box, but perhaps defining a target security level for a particular service could help deal with this backward incompatibility problem.
[ Thanks to LWN subscribers for supporting my travel to Santa Fe for LPC. ]
Making WiFi fast
Dave Täht has been working to save the Internet for the last six years (at least). Recently, his focus has been on improving the performance of networking over WiFi — performance that has been disappointing for as long as anybody can remember. The good news, as related in his 2016 Linux Plumbers Conference talk, is that WiFi can be fixed, and the fixes aren't even all that hard to do. Users with the right hardware and a willingness to run experimental software can have fast WiFi now, and it should be available for the rest of us before too long.

Networking, Täht said, has been going wrong for over a decade; it turns out that queuing theory has not properly addressed the problem of matching data rates to the bandwidth that the hardware can provide. Developers have tended to optimize for the fastest rates possible, but those rates are rarely seen in the real world when WiFi is involved. The "make WiFi fast" effort, involving a number of developers, seeks to change the focus and to optimize both throughput and latency at all data rates.
He has been working on the bufferbloat problem for the last six years. Hundreds of people have been involved in this effort, which was spearheaded by the Linux networking stack. Many changes were merged, starting with byte queue limits in 3.3 and culminating (so far) with the BBR congestion-control algorithm, which was merged for 4.8. At this point, all network protocols can be debloated — with the exception of WiFi and LTE. But, he said, a big dent has just been made in the WiFi problem.
For the rest of the talk, Täht enlisted the aid of Ham the mechanical monkey. Ham, it seems, works in the marketing department. He only cares about benchmarks; if the numbers are big, they will help to sell products. Ham has been Täht's nemesis for years, driving the focus in the wrong direction. The right place to focus is on use cases, where the costs of bufferbloat are felt. That means paying much more attention to latency, and focusing less on the throughput numbers that make Ham happy.
As an example, he noted that the Slashdot home page can, when latency is near zero, be loaded in about eight seconds (the LWN page, he said, was too small to make an interesting example). If the Flent tool is used to add one second of latency to the link, that load takes nearly four minutes. We have all been in that painful place at one point or another. The point is that latency and round-trip times matter more than absolute throughput.
Unfortunately, the worst latency-causing bufferbloat is often found on high-rate connections deep within the Internet service provider's infrastructure. That, he said, should be fixed first, and WiFi will start to get better for free. But that is only the start. WiFi need not always be slow; its problems are mostly to be found in its queuing, not in external factors like radio interference. The key is eliminating bufferbloat from the WiFi subsystem.
To get there, Täht and his collaborators had to start by developing a better set of benchmarks to show what is going on in real-world situations. The most useful tool, he said, is Flent, which is able to do repeatable tests under network load and show the results in graphical form. Single-number benchmark results are not particularly helpful; one needs to look at performance over time to see what is really going on. It is also necessary to get out of the testing lab and test in the field, in situations with lots of stations on the net.
What they found was that the multiple-station case is where things fall down in the WiFi stack. If you have a single device on a WiFi network, things will work reasonably well. But as soon as there is contention for air time, the problems show up.
How to improve WiFi
The WiFi stack in current kernels has four major layers of interest, when it comes to queuing:
- At the top, the queuing discipline accepts packets and feeds them into the driver layer. The amount of buffering there is huge; it can hold ten seconds of WiFi data.
- The mac80211 layer does high-level WiFi work, and adds some queuing and latency of its own.
- The driver for the WiFi adapter maintains several queues of its own, perhaps holding several seconds of data. This level is where aggregation is done; aggregation groups a set of packets into a single transmitted frame to improve throughput, at the cost of increased latency.
- The firmware in the adapter itself can hold another ten seconds of data in its queues.
That adds up to a lot of queuing in the WiFi subsystem, with all of the associated problems. The good news is that fixing it required no changes to the WiFi protocols at all. So those fixes can be applied to existing networks and existing adapters.
The first step was to add a "mac80211 intermediate queue" that handles all packets for a given device, reducing the amount of queuing overall, especially since the size of this queue is strictly limited. It is meant to hold no more data than can be sent in two "transmission opportunities" (slots in which an aggregate of packets can be transmitted). The fq_codel queue management algorithm was generalized to work well in this setting.
The queuing discipline layer was removed entirely, eliminating a massive amount of buffering. Instead, there is a simple per-station queue, and round-robin fair queuing between the stations. The goal is to have one aggregated frame in the hardware for transmission, and another one queued, ready to go as soon as the hardware gets to it. Only having two packets queued at this layer may not scale to the very highest data rates, he said, but, in the real world, nobody ever sees those rates anyway.
There should be a single aggregate under preparation in the mac80211 layer; all other packets should be managed in the (short) per-station queues. In current kernels, mac80211 pushes packets into the low-level driver, where they may accumulate. In the new model, instead, the driver calls back into the mac80211 layer when it needs another packet; that gives mac80211 a better view into when transmission actually happens. The total latency imposed by buffering in this scheme is, he said, limited to 2-12ms, and there is no need for intelligence in the network hardware.
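The per-station round-robin scheme described above can be sketched in a few lines; this is a toy model, not the actual mac80211 code, with invented names and a stand-in "aggregate" size:

```python
from collections import deque

class AirScheduler:
    """Toy sketch of the described design: short per-station queues,
    round-robin between stations, and at most two aggregates queued
    toward the hardware (one transmitting, one ready to go)."""
    HW_SLOTS = 2

    def __init__(self):
        self.stations = {}    # station -> deque of pending packets
        self.rr = deque()     # round-robin order of stations
        self.hw_queue = deque()  # at most HW_SLOTS aggregates

    def enqueue(self, station, packet):
        if station not in self.stations:
            self.stations[station] = deque()
            self.rr.append(station)
        self.stations[station].append(packet)

    def refill_hardware(self, agg_size=4):
        # Called when the hardware asks for more data; build at most
        # one aggregate per station, visiting stations round-robin.
        tried = 0
        while len(self.hw_queue) < self.HW_SLOTS and tried < len(self.rr):
            station = self.rr[0]
            self.rr.rotate(-1)
            tried += 1
            q = self.stations[station]
            agg = [q.popleft() for _ in range(min(agg_size, len(q)))]
            if agg:
                self.hw_queue.append((station, agg))
                tried = 0  # made progress; give every station another look
```

With one busy station and one light one, each gets an aggregate in the hardware queue, which is the fairness property the new design is after; buffering beyond those two slots stays in the per-station queues where fq_codel can manage it.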
Results and future directions
The result of all this work is WiFi latencies that are less than 40ms, down from a peak of 1-2 seconds before they started, and much better handling of multiple stations running at full rate. Before the changes, a test involving 100 flows all starting together collapsed entirely, with at most five flows getting going; all the rest failed due to TCP timeouts caused by excessive buffering latency. Afterward, all 100 could start and run with reasonable latency and bandwidth. All this work, in the end, comes down to a patch that removes a net 200 lines of code.
There are some open issues, of course. The elimination of the queuing discipline layer took away a number of useful network statistics. Some of these have been replaced with information in the debugfs filesystem. There is, he said, some sort of unfortunate interaction with TCP small queues; Eric Dumazet has some ideas for fixing this problem, which only arises in single-station tests. There is an opportunity to add better air-time fairness to keep slow stations from using too much transmission time. Some future improvements, he said, might come at a cost: latency improvements might reduce the peak bandwidth slightly. But latency is what almost all users actually care about, so that bandwidth will not be missed — except by Ham the monkey.
At this point, the ath9k WiFi driver fully supports these changes; the code can be found in the LEDE repository and daily snapshots. Work is progressing on the ath10k driver; it is nearly done. Other drivers have not yet been changed. Expanding the work may well require some more thought on the driver API within the kernel but, for the most part, the changes are not huge.
WiFi is, Täht said, the only wireless technology that is fully under our control. We should be taking more advantage of that control to make it work as well as it possibly can; he wishes that there were more developers working in this area. Even a relatively small group has been able to make some significant progress in making WiFi work as it should, though; we will all be the beneficiaries of this work in the coming years.
[Your editor thanks LWN subscribers for supporting his travel to LPC.]
Security
A trio of fuzzers
In the testing and fuzzing microconference at the 2016 Linux Plumbers Conference, three separate kernel fuzzing projects were presented and discussed. It didn't take long to see that there are a few different opportunities for the projects to collaborate even though they have different focuses and operational modes. Some of that collaboration got started the next day in a session on creating a formal specification of the Linux user-space APIs.
Fuzzing is a testing technique aimed at finding bugs by providing random inputs to programs and APIs. Often, the bugs found are exploitable security vulnerabilities, so there is a strong push to apply fuzzing to the kernel.
syzkaller
First up was Dmitry Vyukov, who was presenting about syzkaller, which is a coverage-guided fuzzer. The idea is to take a corpus of "interesting inputs" and to record the code paths that are executed for those inputs. The inputs are then mutated in various ways, and if a new code path is taken, that input gets added to the corpus. This has been done before in user space, but he wanted to apply it to the kernel.
![Dmitry Vyukov [Dmitry Vyukov]](https://static.lwn.net/images/2016/lpc-vyukov-sm.jpg)
One problem is in how to get the coverage information out of the kernel. Over the last year or so, GCC has gained the ability to insert a function call into every basic block in the code. The syzkaller team added the CONFIG_KCOV option to the kernel to build it using that GCC mode and to provide a way for user space to retrieve the coverage information from debugfs.
Vyukov spent a bit of time explaining the system-call descriptions that are used to guide the mutations of the inputs. It is mostly about argument and return types for the function, but goes beyond simply the C types of those elements. For example, it can describe things like the map file descriptors that are created by the bpf() system call (when using the MAP_CREATE flag), which are placed into a structure that gets passed to other bpf() operations.
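As a rough illustration of what such descriptions look like, here is a simplified sketch in the style of syzkaller's description language; it is not copied from the real definitions (those live in the syzkaller tree), but it shows how typed resources let the fuzzer thread one call's result into another:

```
# Simplified sketch, not the actual syzkaller definitions.
resource fd_pipe[fd]

pipefd_pair {
	rfd	fd_pipe
	wfd	fd_pipe
}

pipe(pipefd ptr[out, pipefd_pair])
read(fd fd_pipe, buf buffer[out], count len[buf])
write(fd fd_pipe, buf buffer[in], count len[buf])
```

Because read() and write() are declared to take an fd_pipe rather than a bare integer, generated programs tend to pass file descriptors that pipe() actually produced, which reaches far deeper into the kernel than random values would.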
The system-call descriptions allow syzkaller to generate and mutate programs that make system calls. The algorithm used is to start with an empty corpus; each iteration either generates a new program or chooses one from the corpus to mutate. The program is executed and the coverage is collected; if new parts of the kernel code have been reached, the program is added into the corpus.
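The loop itself is simple at its core; a toy Python version (with a caller-supplied coverage function standing in for real KCOV data, and "programs" reduced to lists of small integers) might look like:

```python
import random

def mutate(prog):
    # Flip one random "system call" in the program.
    prog = list(prog)
    prog[random.randrange(len(prog))] = random.randrange(16)
    return prog

def fuzz(execute, iterations=2000):
    """Toy coverage-guided fuzzing loop in the style Vyukov described.

    'execute' runs one program and returns the set of code blocks it
    covered; in syzkaller that information comes from the kernel via
    KCOV, here it is whatever the caller supplies."""
    corpus = []   # programs that reached new code
    seen = set()  # all coverage observed so far
    for _ in range(iterations):
        if corpus and random.random() < 0.5:
            prog = mutate(random.choice(corpus))           # mutate old
        else:
            prog = [random.randrange(16) for _ in range(4)]  # generate new
        cov = execute(prog)
        if cov - seen:        # reached something new: keep the program
            seen |= cov
            corpus.append(prog)
    return corpus, seen
```

The corpus only ever grows with programs that expanded coverage, so mutation is always applied to inputs already known to be "interesting", which is the guiding idea behind the approach.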
There is a threaded execution mode to handle blocking system calls. For example, a read() on a pipe will block until a write() is done on the other end of the pipe, so each system call is done in its own thread. There is also a "collider" that tries to simultaneously run consecutive system calls in the program, which finds "a lot of data races", he said. It is currently implemented in a basic form; it is an open question how to make it better and how hard to collide the calls.
Another area that is a work in progress is a way to provide "external stimulus" to the programs. For example, a program may need to have some network packets delivered to it. The idea is to have hooks in the kernel to inject data in the right places, but it is not completely clear where those hooks should be. It needs to be done synchronously in the thread so that the code coverage information can be collected. It also needs to be reproducible so that problems can be tracked down. Part of that is that he needs to ensure that the kernel is not keeping state information between calls, such as dynamic cookie values.
He is also looking at doing smarter program mutation; the program space is "enormous". Right now, there is some basic prioritization, so that a program that works with TCP sockets will add more calls that use sockets. But he thinks there is more work to be done; using the knowledge of what the program is doing, there should be a way to mutate the programs better. He doesn't have an exact solution, but hopes to get there.
But a small group of people cannot describe all of the system calls, he said. Sometimes a good description requires domain knowledge for the system call and underlying subsystem. Also, new system calls, flags, and fields in structures are constantly being added, so the syzkaller team cannot keep up. The previous day he had gotten together with the BPF folks and made the description of that system call better.
Currently, those descriptions are part of syzkaller itself, but it would be nice to have them in the kernel tree, Vyukov said. There are two proposed locations for those descriptions: include/uapi or Documentation. Sasha Levin suggested that a machine-readable description of the system calls could be used by other tools to, for example, validate system calls at runtime. And Thomas Gleixner objected that "random descriptions" as documentation is not what is needed; a formal specification of the user-space API should be created instead. There is a group in Germany working on safety certifications for Linux that would be willing to help with that effort, he said, as would some kernel developers. That idea was generally well-received and would come up again in the microconference.
Each program that is generated for testing is self-contained, Vyukov said. They run in one process with separate directories for each. The programs can become arbitrarily long if needed; if there is a resource that the program needs to use, it must create it as there are no facilities for passing things into the program.
In less than a year, though, syzkaller has found more than 300 runtime failures. Many of those were security issues, he said, so reports from the syzkaller team should not be ignored.
Gleixner suggested that syzkaller exercise the signal code in the kernel because it is a code path that has not had much coverage over the years. If one thread is blocked in a system call and another thread sends it a signal that will get into code paths that are rarely executed. Vyukov said that syzkaller has support for some of the signal-related system calls, but not to do what Gleixner proposed. It should be possible to do so, though.
Trinity
Dave Jones then stepped forward to talk about the Trinity system-call fuzzer. Jones gave more details on how Trinity functions in a talk at linux.conf.au in 2013.
![Dave Jones [Dave Jones]](https://static.lwn.net/images/2016/lpc-jones-sm.jpg)
Over the last year, he has been trying to get the child process that is making the system calls to run longer before it fails from bugs in Trinity itself. It used to be on the order of 50 system calls before a crash, but is now around 5000 calls. Longer-running processes meant that he needed to add garbage collection for the mmap() regions so that the processes do not get a visit from the out-of-memory (OOM) killer.
A key feature that he has added in the last year is to alter the access patterns for the mmap() regions. Now a test run can access the first page, last page, every other page, or a random page (as it had done previously) in the region. That, he said, turned up a lot of bugs.
He has also been improving the BPF support in Trinity. In the last six months, he has added better BPF generation. The plan is to attach random BPF programs to tracepoints and "see what falls out". There have also been stability and debuggability improvements made.
For a new system call, Trinity can find trivial bugs in the implementation pretty quickly. The others that it finds are those that take a long time to run and are hard to reproduce. In addition, there is no way to get a real handle on what caused the crash. He does dump the last system calls made in all of the threads, but that may not really help. He is going to add network logging, which would allow Trinity to store more state before the crash.
Mathieu Desnoyers asked if Trinity could use the tracing facilities in the kernel, such as the ftrace function tracer or LTTng, to gather its state. Jones said that was a good idea that he wanted to explore. Steve Rostedt suggested that choosing only certain functions to trace could be one way to reduce the size of the trace. The ftrace ring buffer will be part of what gets dumped in a kernel crash dump, so that could be used to extract the tracepoints that were hit before a crash.
At Facebook, where Jones works, he has machines running Trinity constantly. He would like to make it more parallelizable across machines so that crashes that take 24 hours to reproduce could be done in an hour on, say, 50 machines. He wants to find a way to reduce the randomness in Trinity when trying to reproduce problems so that crashes happen more predictably and can hopefully be diagnosed more quickly.
Jones noted that he wanted to talk to Desnoyers and Rostedt about tracing, as well as Vyukov about the system-call descriptions that syzkaller uses. Vyukov said that it might make sense for Trinity to use the syzkaller descriptions, and Jones agreed. Right now, it is a lot of manual work to maintain that information, which is something Jones hates doing. He also often misses new system calls and flags that get added. There is a lot of value in having two fuzzers that work differently sharing infrastructure, he said.
The conversation turned back to the idea of a formal system-call specification. Mike Frysinger noted that user space could use that information in a variety of ways; strace could use it, for example. Jones said it is surprising that "we haven't gotten there already", but it appears that a critical mass is forming around the idea at this point.
Someone asked if the macros that define system calls in the kernel could be extended with more metadata to automatically generate the specification. Vyukov said there is a lot more that is needed than could be put into the macros; for example, valid flag values and crypto algorithms would need to be part of the specification. Levin said that it was important to write the specification rather than get the information automatically; the kernel should "chase the specification", instead of the other way around.
Jones concluded by saying that he plans to get back to the networking code in Trinity. He hasn't really been finding networking bugs over the last year and would like to change that. He wants to talk with Vyukov about his ideas for injecting network traffic into processes. There is a huge opportunity to collaborate on that, he said.
perf_fuzzer
The last fuzzer presented in the track is a special-purpose one that simply fuzzes a single system call: perf_event_open(). Vince Weaver spoke about his perf_fuzzer tool, which he has also written about for LWN. Weaver is a professor at the University of Maine and said that he had wanted to give his presentation using a Raspberry Pi with a custom operating system that he wrote, but had recently broken the frame buffer support, so he went without slides (as had Jones before him).
![Vince Weaver [Vince Weaver]](https://static.lwn.net/images/2016/lpc-weaver-sm.jpg)
He set out to fuzz perf because he wanted to show that it is safe for use in high-performance computing (HPC). Users of HPC systems hate crashes, but want to use the performance counters to monitor their workloads. perf_event_open() has different paranoia levels that can be set via a sysctl parameter (kernel.perf_event_paranoid). The default is 2, which does not allow regular users access to some of the more interesting perf features. He set out to show that it would be safe to reduce that value to 0, so that users could access those features.
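The paranoia level is exposed under /proc; a small helper to read it might look like this (a sketch; it simply returns None where the file does not exist, e.g. on non-Linux systems):

```python
from pathlib import Path

PARANOID = Path("/proc/sys/kernel/perf_event_paranoid")

def perf_paranoid():
    """Return the current kernel.perf_event_paranoid level, or None if
    it cannot be read (non-Linux system, or perf not available).

    Roughly: 2 (the default discussed in the talk) keeps the more
    interesting perf features away from unprivileged users, 0 opens
    most of them up, and -1 removes the restrictions entirely."""
    try:
        return int(PARANOID.read_text())
    except (OSError, ValueError):
        return None
```

An administrator changes the level with `sysctl kernel.perf_event_paranoid=0`; Weaver's goal was to show that doing so is safe.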
Unfortunately, his results have shown the opposite. A value of 0 is not safe at this point, which has led to calls to further restrict perf_event_open(). That is not at all what he wanted to see, he said.
perf_event_open() has a complicated interface that involves a structure with lots of entries that can interact in different ways. The man page is not that great, which he can say since he wrote it. perf_event_open() also interacts with other system calls.
At this point, all of the low-hanging fruit has been caught in perf_event_open(). Other fuzzers, such as Trinity and syzkaller, don't find any problems, but perf_fuzzer still does. The perf subsystem has "tentacles all over the kernel"; it is entangled with the proc filesystem, sysfs, fork(), BPF, tracepoints, and more. It also generates a lot of non-maskable interrupts (NMIs). Beyond that are the hardware interactions that make it hard to have deterministic results. Things like cache misses are not deterministic at the hardware level, for example. Also, performance counters cannot be virtualized.
Over the previous two weeks, he had run a bunch of tests with perf_fuzzer, which was rather tedious to do, Weaver said. He wanted to run with a recent kernel from Git, which is easy to do on x86, but not so easy for ARM. He has difficulty getting Git kernels to run on his boards.
When only using the facilities available with a paranoid value of 2, a Pentium 4 will crash in eight hours, while a Core 2 will run for a week without any problems. Others, such as Haswell, will crash in two or three days. A Raspberry Pi will run for more than a week without problems, but that may be due to how slow the processor is.
Most of the bugs are either RCU or NMI stalls and it is not clear what the underlying problem actually is. Sometimes the systems lock up with nothing on the serial console. Many of the problems are not reproducible either.
For a paranoid value of 1, the Core 2 will crash in one day, while a Haswell will crash in two hours. At a paranoid level of 0, the Core 2 takes 21 hours to crash, and the Haswell will crash in one or two hours. He has run into something he suspects is a hardware bug on Haswell as well. A paranoid value of -1 (no restrictions) is "not recommended", he said. So, after years of work finding and fixing bugs in perf_event_open(), you can still crash a machine by running perf_fuzzer long enough. Those crashes will be hard to debug, which is frustrating.
Weaver likes the idea of a more formalized specification for system calls and other user-space APIs. It would be nice to have a way to find out about new things added to the interfaces, and it is great to see more people get interested in these problems. He has tried, without much success, to get funding and to have his research published. In addition, perf_fuzzer is being used to find bugs and collect bug bounties, but he rarely gets any code contributions from others—or even hears from them at all.
For a long time, work on improving perf_fuzzer has been stalled because it was so easy to crash machines with it. There simply has been no impetus to add more features. He would like to add support for control groups and BPF to the fuzzer, possibly by reusing the Trinity BPF fuzzing code. He finished by noting that he has a paper [PDF] on his web site that goes into some detail on how perf_fuzzer operates.
Vyukov wondered if Weaver had tried using the KernelAddressSanitizer (KASAN) while fuzzing with perf_fuzzer. Weaver said that he had not done so, but there was no real reason for that, so it is probably worth trying. Vyukov also asked what the advantage was for a specialized fuzzer, and perf developer Peter Zijlstra quickly spoke up to say that perf_fuzzer is still finding bugs, while syzkaller can't crash perf any longer.
Part of the difference may be that syzkaller runs in a virtual machine, normally, while perf_fuzzer runs on bare metal. Weaver also pointed out that perf_event_open() has specialized needs for things like its mmap() region that Trinity and syzkaller may not be creating in the right way because they lack the specialized knowledge of the interface that perf_fuzzer has. Jones suggested that both Trinity and perf_fuzzer might benefit from some coverage guidance to help in reproducing bugs when they are encountered.
That's where the conversation wound down, but it is clear there are several areas of collaboration that could, and seemingly will, be pursued. The system-call (or user-space API) specification gained quite a bit of traction in the microconference; we will have to see where that ends up. Meanwhile, these three fuzzers (and perhaps others that might crop up) can benefit from each other's work, which should hopefully lead to more kernel bugs being found and fixed.
[ Thanks to LWN subscribers for supporting my travel to Santa Fe for LPC. ]
New vulnerabilities
389-ds-base: two vulnerabilities
Package(s): 389-ds-base
CVE #(s): CVE-2016-5405 CVE-2016-5416
Created: November 3, 2016
Updated: November 21, 2016
Description: From the Red Hat advisory:
* It was found that 389 Directory Server was vulnerable to a remote password disclosure via timing attack. A remote attacker could possibly use this flaw to retrieve directory server password after many tries. (CVE-2016-5405)
* It was found that 389 Directory Server was vulnerable to a flaw in which the default ACI (Access Control Instructions) could be read by an anonymous user. This could lead to leakage of sensitive information. (CVE-2016-5416)
ansible: two vulnerabilities
Package(s): ansible
CVE #(s): CVE-2016-8614 CVE-2016-8628
Created: November 8, 2016
Updated: November 21, 2016
Description: From the Red Hat bugzilla:
* CVE-2016-8614: It was found that the apt_key module does not properly verify key fingerprints, allowing a remote adversary to create an OpenPGP key that matches the short key ID and inject this key instead of the correct key.
* CVE-2016-8628: It was found that it is possible for an attacker who has taken over a managed server to inject code and gain remote code execution by setting the ansible_ssh_executable variable.
chromium: memory leak
Package(s): chromium
CVE #(s): CVE-2016-5198
Created: November 7, 2016
Updated: November 9, 2016
Description: From the openSUSE advisory:
- CVE-2016-5198: out of bounds memory access in v8 (boo#1008274)
curl: insufficient validation
Package(s): curl
CVE #(s): CVE-2016-8625
Created: November 3, 2016
Updated: November 9, 2016
Description: From the Arch Linux advisory:
- CVE-2016-8625 (insufficient validation): When curl is built with libidn to handle International Domain Names (IDNA), it translates them to puny code for DNS resolving using the IDNA 2003 standard, while IDNA 2008 is the modern and up-to-date IDNA standard. This misalignment causes problems with for example domains using the German ß character (known as the Unicode Character 'LATIN SMALL LETTER SHARP S') which is used at times in the .de TLD and is translated differently in the two IDNA standards, leading to users potentially and unknowingly issuing network transfer requests to the wrong host. For example, straße.de is translated into strasse.de using IDNA 2003 but is translated into xn--strae-oqa.de using IDNA 2008. Needless to say, those host names could very well resolve to different addresses and be two completely independent servers. IDNA 2008 is mandatory for .de domains. This name problem exists for DNS-using protocols in curl, but only when built to use libidn.
jasper: multiple vulnerabilities
Package(s): jasper
CVE #(s): CVE-2016-8884 CVE-2016-8885 CVE-2016-8887
Created: November 7, 2016
Updated: November 9, 2016
Description: From the openSUSE bug reports:
* CVE-2016-8884: AddressSanitizer: SEGV on unknown address 0x000000000000 0x7f90527a18fd in bmp_getdata ... jasper-1.900.5/src/libjasper/bmp/bmp_dec.c:394:5
* CVE-2016-8885: AddressSanitizer: SEGV on unknown address 0x000000000000 0x7f888b2f5a43 in bmp_getdata ... jasper-1.900.5/src/libjasper/bmp/bmp_dec.c:398:5
* CVE-2016-8887: AddressSanitizer: SEGV on unknown address 0x000000000000 0x7f8dcb5bc940 in jp2_colr_destroy ... jasper-1.900.5/src/libjasper/jp2/jp2_cod.c:443:3
jasper: multiple vulnerabilities
Package(s): jasper
CVE #(s): CVE-2016-8690 CVE-2016-8691 CVE-2016-8692 CVE-2016-8693 CVE-2016-8880 CVE-2016-8881 CVE-2016-8882 CVE-2016-8883 CVE-2016-8886
Created: November 4, 2016
Updated: November 9, 2016
Description: From the openSUSE advisory:
- CVE-2016-8690: Null pointer dereference in bmp_getdata triggered by crafted BMP image (bsc#1005084).
- CVE-2016-8691, CVE-2016-8692: Missing range check on XRsiz and YRsiz fields of SIZ marker segment (bsc#1005090).
- CVE-2016-8693: The memory stream interface allowed for a buffer size of zero. The case of a zero-sized buffer was not handled correctly, as it could lead to a double free (bsc#1005242).
- CVE-2016-8880: Heap overflow in jpc_dec_cp_setfromcox() (bsc#1006591).
- CVE-2016-8881: Heap overflow in jpc_getuint16() (bsc#1006593).
- CVE-2016-8882: Null pointer access in jpc_pi_destroy (bsc#1006597).
- CVE-2016-8883: Assert triggered in jpc_dec_tiledecode() (bsc#1006598).
- CVE-2016-8886: Memory allocation failure in jas_malloc (jas_malloc.c) (bsc#1006599).
java-1.8.0-openjdk-aarch32: multiple vulnerabilities
Package(s): java-1.8.0-openjdk-aarch32
CVE #(s):
Created: November 8, 2016
Updated: February 10, 2017
Description: From the Oracle advisory:
This Critical Patch Update contains 7 new security fixes for Oracle Java SE. All of these vulnerabilities may be remotely exploitable without authentication, i.e., may be exploited over a network without requiring user credentials.
kernel: three vulnerabilities
Package(s): kernel
CVE #(s): CVE-2015-8746 CVE-2015-8844 CVE-2016-3699
Created: November 3, 2016
Updated: November 9, 2016
Description: From the CVE entries:
* fs/nfs/nfs4proc.c in the NFS client in the Linux kernel before 4.2.2 does not properly initialize memory for migration recovery operations, which allows remote NFS servers to cause a denial of service (NULL pointer dereference and panic) via crafted network traffic. (CVE-2015-8746)
* The signal implementation in the Linux kernel before 4.3.5 on powerpc platforms does not check for an MSR with both the S and T bits set, which allows local users to cause a denial of service (TM Bad Thing exception and panic) via a crafted application. (CVE-2015-8844)
* The Linux kernel, as used in Red Hat Enterprise Linux 7.2 and Red Hat Enterprise MRG 2 and when booted with UEFI Secure Boot enabled, allows local users to bypass intended Secure Boot restrictions and execute untrusted code by appending ACPI tables to the initrd. (CVE-2016-3699)
kernel: two vulnerabilities
Package(s): kernel
CVE #(s): CVE-2016-9084 CVE-2016-9083
Created: November 8, 2016
Updated: November 9, 2016
Description: From the Red Hat bugzilla:
* CVE-2016-9084: The use of a kzalloc with an integer multiplication allowed an integer overflow condition to be reached in vfio_pci_intrs.c.
* CVE-2016-9083: The vfio driver allows direct user access to devices. The VFIO_DEVICE_SET_IRQS ioctl for vfio PCI devices has a state machine confusion bug where specifying VFIO_IRQ_SET_DATA_NONE along with another bit in VFIO_IRQ_SET_DATA_TYPE_MASK in hdr.flags allows integer overflow checks to be skipped for hdr.start/hdr.count. This might allow memory corruption later in vfio_pci_set_msi_trigger() with user access to an appropriate vfio device file, but it seems difficult to usefully exploit in practice.
libreswan: denial of service
Package(s): libreswan
CVE #(s): CVE-2016-5361
Created: November 3, 2016
Updated: December 15, 2016
Description: From the Red Hat advisory:
A traffic amplification flaw was found in the Internet Key Exchange version 1 (IKEv1) protocol. A remote attacker could use a libreswan server with IKEv1 enabled in a network traffic amplification denial of service attack against other hosts on the network by sending UDP packets with a spoofed source address to that server.
libvirt: privilege escalation
Package(s): libvirt
CVE #(s): CVE-2015-5160
Created: November 3, 2016
Updated: November 11, 2016
Description: From the Red Hat advisory:
It was found that the libvirt daemon, when using RBD (RADOS Block Device), leaked private credentials to the process list. A local attacker could use this flaw to perform certain privileged operations within the cluster.
libwebp: integer overflows
Package(s): libwebp
CVE #(s): CVE-2016-9085
Created: November 4, 2016
Updated: January 24, 2017
Description: From the Red Hat bugzilla:
Multiple integer overflows were found in the libwebp library.
libxslt: code execution
Package(s): libxslt
CVE #(s): CVE-2016-4738
Created: November 7, 2016
Updated: November 23, 2016
Description: From the Debian advisory:
A heap overread bug was found in libxslt, which can cause arbitrary code execution or denial of service.
mariadb: unspecified vulnerability
Package(s): mariadb mysql
CVE #(s): CVE-2016-5630
Created: November 9, 2016
Updated: November 9, 2016
Description: From the CVE entry:
Unspecified vulnerability in Oracle MySQL 5.6.31 and earlier and 5.7.13 and earlier allows remote administrators to affect availability via vectors related to Server: InnoDB.
nvidia-graphics-drivers-367: privilege escalation
Package(s): nvidia-graphics-drivers-367
CVE #(s): CVE-2016-7382 CVE-2016-7389
Created: November 3, 2016
Updated: January 10, 2017
Description: From the Ubuntu advisory:
It was discovered that the NVIDIA graphics drivers incorrectly sanitized user mode inputs. A local attacker could use this issue to possibly gain root privileges.
openjpeg2: code execution
Package(s): openjpeg2
CVE #(s): CVE-2016-8332
Created: November 3, 2016
Updated: November 9, 2016
Description: From the CVE entry:
A buffer overflow in OpenJPEG 2.1.1 causes arbitrary code execution when parsing a crafted image. An exploitable code execution vulnerability exists in the jpeg2000 image file format parser as implemented in the OpenJpeg library. A specially crafted jpeg2000 file can cause an out of bound heap write resulting in heap corruption leading to arbitrary code execution. For a successful attack, the target user needs to open a malicious jpeg2000 file. The jpeg2000 image file format is mostly used for embedding images inside PDF documents and the OpenJpeg library is used by a number of popular PDF renderers making PDF documents a likely attack vector.
oracle-jre-bin: unspecified vulnerability
Package(s): oracle-jre-bin
CVE #(s): CVE-2016-5568
Created: November 4, 2016
Updated: November 9, 2016
Description: From the CVE entry:
Unspecified vulnerability in Oracle Java SE 6u121, 7u111, and 8u102 allows remote attackers to affect confidentiality, integrity, and availability via vectors related to AWT.
pacemaker: privilege escalation
Package(s): pacemaker
CVE #(s): CVE-2016-7035
Created: November 3, 2016
Updated: December 15, 2016
Description: From the Red Hat advisory:
An authorization flaw was found in Pacemaker, where it did not properly guard its IPC interface. An attacker with an unprivileged account on a Pacemaker node could use this flaw to, for example, force the Local Resource Manager daemon to execute a script as root and thereby gain root access on the machine.
pacemaker: denial of service
Package(s): pacemaker
CVE #(s): CVE-2016-7797
Created: November 3, 2016
Updated: December 15, 2016
Description: From the Red Hat advisory:
It was found that the connection between a pacemaker cluster and a pacemaker_remote node could be shut down using a new unauthenticated connection. A remote attacker could use this flaw to cause a denial of service.
python-imaging: two vulnerabilities
Package(s): python-imaging
CVE #(s): CVE-2016-9189 CVE-2016-9190
Created: November 8, 2016
Updated: November 18, 2016
Description: From the CVE entries:
* Pillow before 3.3.2 allows context-dependent attackers to obtain sensitive information by using the "crafted image file" approach, related to an "Integer Overflow" issue affecting the Image.core.map_buffer in map.c component. (CVE-2016-9189)
* Pillow before 3.3.2 allows context-dependent attackers to execute arbitrary code by using the "crafted image file" approach, related to an "Insecure Sign Extension" issue affecting the ImagingNew in Storage.c component. (CVE-2016-9190)
qemu: multiple vulnerabilities
Package(s): qemu
CVE #(s): CVE-2016-9101 CVE-2016-9102 CVE-2016-9103 CVE-2016-9104 CVE-2016-9105 CVE-2016-9106
Created: November 3, 2016
Updated: November 9, 2016
Description: From the Debian LTS advisory:
* CVE-2016-9101: Quick Emulator (QEMU) built with the i8255x (PRO100) NIC emulation support is vulnerable to a memory leakage issue. It could occur while unplugging the device, and doing so repeatedly would result in leaking host memory, affecting other services on the host. A privileged user inside the guest could use this flaw to cause a DoS on the host and/or potentially crash the QEMU process on the host.
* CVE-2016-9102, CVE-2016-9105, CVE-2016-9106: QEMU built with VirtFS, host directory sharing via the Plan 9 File System (9pfs), is vulnerable to several memory leakage issues. A privileged user inside the guest could use these flaws to leak host memory bytes, resulting in a DoS for other services.
* CVE-2016-9103: QEMU built with VirtFS is vulnerable to an information leakage issue. It could occur when accessing an xattribute value before it is written to. A privileged user inside the guest could use this flaw to leak host memory bytes.
* CVE-2016-9104: QEMU built with VirtFS is vulnerable to an integer overflow issue. It could occur when accessing xattribute values. A privileged user inside the guest could use this flaw to crash the QEMU process instance, resulting in a DoS.
resteasy-base: code execution
Package(s): resteasy-base
CVE #(s): CVE-2016-7050
Created: November 3, 2016
Updated: December 15, 2016
Description: From the Red Hat advisory:
It was discovered that under certain conditions RESTEasy could be forced to parse a request with SerializableProvider, resulting in deserialization of potentially untrusted data. An attacker could possibly use this flaw to execute arbitrary code with the permissions of the application using RESTEasy.
spip: multiple vulnerabilities
Package(s): spip
CVE #(s): CVE-2016-7980 CVE-2016-7981 CVE-2016-7982 CVE-2016-7998 CVE-2016-7999
Created: November 3, 2016
Updated: November 9, 2016
Description: From the Debian LTS advisory:
* CVE-2016-7980: Nicolas Chatelain of Sysdream Labs discovered a cross-site request forgery (CSRF) vulnerability in the valider_xml action of SPIP. This allows remote attackers to make use of potential additional vulnerabilities such as the one described in CVE-2016-7998.
* CVE-2016-7981: Nicolas Chatelain of Sysdream Labs discovered a reflected cross-site scripting (XSS) vulnerability in the valider_xml action of SPIP. An attacker could take advantage of this vulnerability to inject arbitrary code by tricking an administrator into opening a malicious link.
* CVE-2016-7982: Nicolas Chatelain of Sysdream Labs discovered a file enumeration / path traversal attack in the valider_xml action of SPIP. An attacker could use this to enumerate files in an arbitrary directory on the file system.
* CVE-2016-7998: Nicolas Chatelain of Sysdream Labs discovered a possible PHP code execution vulnerability in the template compiler/composer function of SPIP. In combination with the XSS and CSRF vulnerabilities described in this advisory, a remote attacker could take advantage of this to execute arbitrary PHP code on the server.
* CVE-2016-7999: Nicolas Chatelain of Sysdream Labs discovered a server side request forgery in the valider_xml action of SPIP. Attackers could take advantage of this vulnerability to send HTTP or FTP requests to remote servers that they don't have direct access to, possibly bypassing access controls such as a firewall.
subscription-manager: information disclosure
Package(s): subscription-manager
CVE #(s): CVE-2016-4455
Created: November 3, 2016
Updated: January 10, 2017
Description: From the Red Hat advisory:
It was found that subscription-manager set weak permissions on files in /var/lib/rhsm/, causing an information disclosure. A local, unprivileged user could use this flaw to access sensitive data that could potentially be used in a social engineering attack.
sudo: information disclosure
Package(s): sudo
CVE #(s): CVE-2016-7091
Created: November 3, 2016
Updated: December 15, 2016
Description: From the Red Hat advisory:
It was discovered that the default sudo configuration preserved the value of INPUTRC from the user's environment, which could lead to information disclosure. A local user with sudo access to a restricted program that uses readline could use this flaw to read content from specially formatted files with elevated privileges provided by sudo.
tomcat: multiple vulnerabilities
Package(s): tomcat
CVE #(s): CVE-2016-0762 CVE-2016-5018 CVE-2016-6794 CVE-2016-6796 CVE-2016-6797
Created: November 7, 2016
Updated: November 23, 2016
Description: From the Mageia advisory:
* The Realm implementations did not process the supplied password if the supplied user name did not exist. This made a timing attack possible to determine valid user names. Note that the default configuration includes the LockOutRealm which makes exploitation of this vulnerability harder (CVE-2016-0762).
* A malicious web application was able to bypass a configured SecurityManager via a Tomcat utility method that was accessible to web applications (CVE-2016-5018).
* When a SecurityManager is configured, a web application's ability to read system properties should be controlled by the SecurityManager. Tomcat's system property replacement feature for configuration files could be used by a malicious web application to bypass the SecurityManager and read system properties that should not be visible (CVE-2016-6794).
* A malicious web application was able to bypass a configured SecurityManager via manipulation of the configuration parameters for the JSP Servlet (CVE-2016-6796).
* The ResourceLinkFactory did not limit web application access to global JNDI resources to those resources explicitly linked to the web application. Therefore, it was possible for a web application to access any global JNDI resource whether an explicit ResourceLink had been configured or not (CVE-2016-6797).
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The current development kernel is 4.9-rc4, released on November 5. Linus said: "So I'm not going to lie: this is not a small rc, and I'd have been happier if it was. But it's not unreasonably large for this (big) release either, so it's not like I'd start worrying. I'm currently still assuming that we'll end up with the usual seven release candidates, assuming things start calming down. We'll see how that goes as we get closer to a release."
The November 6 regression report shows 17 known regressions in the 4.9 kernel.
Stable updates: none have been released in the last week. The 4.8.7 and 4.4.31 updates are in the review process as of this writing; they can be expected on or after November 11.
Kernel development news
A discussion on virtual-memory topics
The Kernel Summit "technical topics" day was a relatively low-key affair in 2016; much of the interesting discussion had been pulled into the adjoining Linux Plumbers Conference instead. That day did feature one brief discussion involving several core virtual-memory (VM) developers, though. While a number of issues were discussed, the overall theme was clear: the growth of persistent memory is calling into question many of the assumptions on which the kernel's VM subsystem was developed.

The half-hour session began with self-introductions from the four developers at the head of the room. Mel Gorman, who works in the performance group at SUSE, said that his current interest is in improving small-object allocation, a task which he hopes will be relatively straightforward. Rik van Riel has spent much of his time on non-VM topics recently, but has a number of projects he wants to get back to. Those include using the persistent-memory support in QEMU to shift the page cache out of virtualized guest systems and into the host, where it can be better managed and shared. Johannes Weiner has been working on scalability issues, and is working on improving page-cache thrashing detection. Vlastimil Babka is concerned with high-order (larger than one page) allocations. With the merging of the virtually mapped kernel stacks work, stack allocation no longer requires high-order allocations, but there are other users in the kernel. So his current focus is on making the compaction and anti-fragmentation code work better.
Mel opened the discussion by asking whether anybody in the room had problems with the VM subsystem. Laura Abbott replied that she is working on slab sanitization as a way of hardening the kernel; it zeroes memory when it is freed to prevent information leaks. The initial code came from the PaX/grsecurity patch set, but that code is not acceptable in mainline due to the performance hit it adds to the memory allocator's hot paths. She has been pursuing a suggestion to use the SLUB allocator's debug path, but there is little joy to be found there; SLUB's slow path is very slow.
Mel replied that he would expect that the performance hit from sanitization would be severe in any case; thrashing the memory caches will hurt even if nothing else does. If a particular environment cares enough about security to want sanitization, it will have to accept the performance penalty; this would not be the first time that such tradeoffs have had to be made. That said, he has no fundamental opposition to the concept. Laura does believe that the hit can be reduced, though; much of the time spent in the slow path goes to lock contention, so perhaps lockless algorithms should be considered. Mel concurred, noting that the slow path was misnamed; it should be the "glacial path."
Swap woes
The next topic was brought up by a developer who is working on next-generation memory, and swapping to hardware-compressed memory in particular. In the absence of the actual hardware, he is working on optimizing swapping to a RAM disk using zram. There are a number of problems he is running into, including out-of-memory kills while there is still memory available. But he's concerned about performance; with zram, about ⅔ of the CPU's time is spent on compression, while the other ⅓ is consumed by overhead. When the compression moves to hardware, that ⅓ will be the limiting factor, so he would like to find ways to improve it.
Johannes replied that there are a lot of things to fix in the swap path when fast devices are involved, starting with the fact that the swapout path uses global locks. Because swapping to rotational devices has always been terrible, the system is biased heavily against swapping in general. A workload can be thrashing the page cache, but the VM subsystem will still only reclaim page-cache pages and not consider swapping. He has been working on a patch set to improve the balance between swapping and the page cache; it tries to reclaim memory from whichever of the two is thrashing the least. There are also problems with the swapout path splitting huge pages on the way out, with a consequent increase in overhead. Adding batching to the swap code will hopefully help here.
Mel suggested the posting of profiles showing where the overhead is in the problematic workload. Getting representative workloads is hard for the VM developers; without those workloads, they cannot easily reproduce or address the problems. In general, he said, swapping is "running into walls" and needs to be rethought. Patience will be required, though; it could be 6-24 months before the problems are fixed.
Shrinker shenanigans
Josef Bacik is working, as he has for years, on improving the Btrfs filesystem. He has observed a problematic pattern: if the system is using vast amounts of slab memory, everything will bog down. He has workloads that can fill 80% of memory with cached inodes and dentries. The size of those caches should be regulated by the associated shrinkers, but that is not working well. Invocation of shrinkers is tied to page-cache scanning, but this workload has almost no page cache, so that scanning is not happening and the shrinkers are not told to free as much memory as they should. As more subsystems use the shrinker API, he said, we will see more cases where it is not working as desired.
Ted Ts'o said that he has seen similar problems with the extent status slab cache in the ext4 filesystem. That cache can get quite large; it can also create substantial spinlock contention when multiple shrinkers are running concurrently. The problems are not limited to Btrfs, he said.
Rik asked whether it would make sense to limit the size of these caches to some more reasonable value. There are quite a few systems out there now that do not really have a page cache, and their number will grow as the use of persistent memory spreads. Persistent memory is nice in that it can make terabytes worth of file data instantly accessible, but that leads to the storing of a lot of metadata in RAM.
Christoph Hellwig replied that blindly limiting the size of metadata caches is not a good solution; it might be a big hammer that is occasionally needed, but it should not be relied upon in a well-functioning system. What is needed is better balancing, he said, not strict limits. The VM subsystem has been built around the idea that filesystems store much of their metadata in the page cache, but most of them have shifted that metadata over to slab-allocated memory now. So, he said, there needs to be more talk between the VM and filesystem developers to work out better balancing mechanisms.
Rik answered that the only thing the VM code can do now is to call the shrinkers. Those shrinkers will work through a global list of objects and free them, but there is a problem. Slab-allocated objects are packed many to a page; all objects in a page must be freed before the page itself can be freed. So, he said, a shrinker may have to clear out a large fraction of the cache before it is able to free the first whole page. The cache is wiped out, but little memory is made available to the rest of the system.
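The packing problem Rik describes can be illustrated with a toy model (the page and object counts here are illustrative assumptions, not numbers from the session): even after a shrinker frees a large random fraction of slab objects, very few pages end up entirely empty and thus reclaimable.

```python
# Toy model of slab reclaim: objects are packed OBJS_PER_PAGE to a page,
# and a page can only be returned to the system once *every* object on it
# has been freed.
import random

OBJS_PER_PAGE = 32      # e.g. 128-byte objects in a 4KiB page (assumed)
NUM_PAGES = 1000

def pages_freed(fraction_freed: float, seed: int = 42) -> int:
    """Free a random fraction of all objects; count fully empty pages."""
    rng = random.Random(seed)
    total = NUM_PAGES * OBJS_PER_PAGE
    freed = set(rng.sample(range(total), int(total * fraction_freed)))
    # A page is reclaimable only if all of its objects were freed.
    return sum(
        all(page * OBJS_PER_PAGE + i in freed for i in range(OBJS_PER_PAGE))
        for page in range(NUM_PAGES)
    )
```

With 32 objects per page, a page survives 80% random freeing fully empty only with probability 0.8^32, roughly 0.08%, which is why a shrinker can wipe out most of a cache while returning almost no whole pages to the system.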
Christoph said that shrinkers are currently designed around a one-size-fits-all model. There needs to be a way to differentiate between clean objects (which can be freed immediately) and dirty objects (that must be written back to persistent store first). There should also be page-based shrinkers that can try to find pages filled with clean objects that can be quickly freed when the need arises.
Mel suggested that there might be a place for a helper that a shrinker can call to ask for objects that are on the same page; it could then free them all together. The problem of contention for shrinker locks could be addressed by limiting the number of threads that can be running in direct reclaim at any given time. Either that, or shrinkers should back off quickly when locks are unavailable on the assumption that other shrinkers are running and will get the job done.
Ted said that page-based shrinkers could make use of a shortcut by which they could indicate that a particular object is pinned and cannot be freed. The VM subsystem would then know that the containing page cannot be freed until the object is unpinned. Jan Kara suggested that there could be use in having a least-recently-used (LRU) list for slab pages to direct reclaim efforts, but Linus Torvalds responded that such a scheme would not work well for the dentry cache, which is usually one of the largest caches in the system.
The problem is that some objects will pin others in memory; inodes are pinned by their respective dentries, and dentries can pin the dentries corresponding to their parent directories. He suggested that it might make sense to segregate the dentries for leaves (ordinary files and such) from those for directories. Leaf dentries are much easier to free, so keeping them together will increase the chances of freeing entire pages. There's just one little problem: the kernel often doesn't know which type a dentry will be when it is allocated, so there is no way to know where to allocate it. There are places where heuristics might help, but it is not an easy problem. Mel suggested that the filesystem code could simply allocate another dentry and copy the data when it guesses wrong; Linus said that was probably doable.
Some final details
Linus said that there is possible trouble coming with the merging of slab caches in the SLUB allocator. SLUB normally does that merging for objects of similar size, but many developers don't like it. Slab merging would also obviously complicate the task of freeing entire pages. That merging currently doesn't happen when there is a shrinker associated with a cache, but that could change in the future; disabling merging increases the memory footprint considerably. We need to be able to do merging, he said, but perhaps need to be more intelligent about how it is done.
Tim Chen talked briefly about his swap optimization work. In particular, he is focused on direct access to swap when persistent memory is used as the swap device. Since persistent memory is directly addressable, the kernel can map swapped pages into a process's address space, avoiding the need to swap them back into RAM. There will be a performance penalty if the pages are accessed frequently, though, so some sort of decision needs to be made on when a page should be swapped back in. Regular RAM has the LRU lists to help with this kind of decision, but all that is available for persistent memory is the "accessed" bit in the page-table entry.
Johannes pointed out that the NUMA code has a page-table scanner that uses the accessed bit; perhaps swap could do something similar, but Rik said that this mechanism is too coarse for swap use. Instead, he said, perhaps the kernel could use the system's performance-monitoring unit (PMU) to detect situations where pages in persistent memory are being accessed too often. The problem with that approach, Andi Kleen pointed out, is that developers generally want the PMU to be available for performance work; they aren't happy when the kernel grabs the PMU for its own use. So it's not clear what form the solution to this problem will take.
All of the above was discussed in a mere 30 minutes. Mel closed the session by thanking the attendees, noting that some good information had been shared and that there appeared to be some low-hanging fruit that could be addressed in the near future.
The perils of printk()
One might be tempted to think that there is little to be said about the kernel's printk() function; all it does, after all, is output a line of text to the console. But printk() has its problems. In a Kernel Summit presentation, Sergey Senozhatsky said that he is simply unable to use printk() in its current form. The good news, he said, is that it is not unfixable — and that there are plans for addressing its problems.
Locking the system with printk()
One of the biggest problems associated with printk() is deadlocks, which can come about in a couple of ways. One of those is reentrant calls. Consider an invocation of printk() that is preempted by a non-maskable interrupt (NMI). The handler for that NMI will, likely as not, want to print something out; NMIs are extraordinary events, after all. If the preempted printk() call holds a necessary lock, the second call will deadlock when it tries to acquire the same lock. That is just the sort of unpleasantness that operating system developers normally go far out of their way to avoid.
This particular problem has been solved; printk() now has a special per-CPU buffer that is used for calls in NMI context. Output goes into that buffer and is flushed after the NMI completes, avoiding the need to acquire the locks normally needed by a printk() call.
Unfortunately, printk() deadlocks do not end there. It turns out that printk() calls can be recursive, the usual ban on recursion in the kernel notwithstanding. Recursive calls can happen as the result of warnings issued from deep within the kernel; lock debugging was also listed as a way to create printk() calls at inopportune times. If something calls printk() at the wrong time, the result is a recursive call that can deadlock in much the same way as preempted calls.
The problem looks similar to the NMI case, so it should not be surprising that the solution is similar as well. Sergey has a proposal to extend the NMI idea, creating more per-CPU buffers for printk() output. Whenever printk() wanders into a section of code where recursion could happen, output from any recursive calls goes to those buffers, to be flushed at a safe time. Two new functions, printk_safe_enter() and printk_safe_exit(), mark the danger areas. Perhaps confusingly, printk_safe_enter() does not mark a safe area; instead, it marks an area where the "safe" output code must be used.
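The pattern can be illustrated with a small user-space analogue: a counter marks the danger region, and any output generated inside it is diverted to a buffer that is flushed once the region is exited. The function names echo the proposal, but the implementation below is invented and greatly simplified; the real code uses per-CPU buffers inside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SAFE_BUF_SIZE 4096

/* Illustration only: one buffer standing in for the proposal's
 * per-CPU buffers, plus a depth counter marking "danger" regions. */
static char safe_buf[SAFE_BUF_SIZE];
static size_t safe_len;
static int safe_depth;

static void my_printk_safe_enter(void) { safe_depth++; }
static void my_printk_safe_exit(void)  { safe_depth--; }

/* Inside a marked region, output is stored rather than printed, so a
 * recursive call cannot deadlock on the normal path's locks. */
static void my_printk(const char *msg)
{
	if (safe_depth > 0) {
		size_t n = strlen(msg);

		if (safe_len + n + 1 <= SAFE_BUF_SIZE) {
			memcpy(safe_buf + safe_len, msg, n);
			safe_len += n;
			safe_buf[safe_len++] = '\n';
		}
		return;
	}
	printf("%s\n", msg);	/* normal, lock-taking path */
}

/* Flush deferred output once it is safe to take the locks again;
 * returns the number of bytes flushed. */
static size_t my_printk_flush(void)
{
	size_t n = safe_len;

	fwrite(safe_buf, 1, safe_len, stdout);
	safe_len = 0;
	return n;
}
```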
Given that the per-CPU buffers are required in an increasing number of situations, Peter Zijlstra wondered whether printk() should just use the per-CPU buffer always. Sergey responded that this approach is under consideration.
Hannes Reinecke said that part of the problem results from the two distinct use cases for printk(): "chit-chat" and "the system is about to die now." The former type of output can go out whenever, while the latter is urgent. In the absence of better information, printk() must assume that everything is urgent, but a lot of problems could be solved by simply deferring non-urgent output to a safe time. Linus Torvalds pointed out that the log level argument should indicate which output is urgent, but Peter said that just deferring non-urgent output is not close to a full solution. The real problem, he said, is in the console drivers; this subject was revisited later in the session.
One problem with deferring non-urgent output, Sergey said, is that the ordering of messages can be changed and it can be hard to sort them out again. Peter suggested that this was not much of a problem; Hannes said, rather forcefully, that printk() output has timestamps on it, so placing it back into the proper order should not be difficult. The problem there, according to Linus, is that timestamps are not necessarily consistent across CPUs; if a thread migrates, the ordering of its messages could be wrong.
Petr Mladek, who joined Sergey at the front of the room, said that there is a problem with per-CPU buffers: they will almost necessarily be smaller than a single, global buffer, and can thus limit the amount of output that can be accumulated. So a system using per-CPU buffers is more likely to lose messages. It was pointed out that the ftrace subsystem has had a solution to this problem for a long time, but also that the cost of that solution is a lot of complicated ring-buffer code. Linus said that the one thing that must be carefully handled is oops messages resulting from a kernel crash; those must make it immediately to the console.
Sergey went on to say that there is a larger set of printk() deadlocks that needs to be dealt with. Thus far, the conversation had concerned "internal" locks that are part of printk() itself. But printk() often has to acquire "external" locks in other parts of the kernel. The biggest problem area appears to be sending output to the console; there are locks and related problems in various serial drivers that can, once again, deadlock the system. Unlike internal locks, external locks are not controlled by printk(), so the problem is harder to solve.
The kernel already has a printk_deferred() function that goes out of its way to avoid taking external locks, once again deferring output to a safer time. Sergey's proposal is to make printk() always behave like printk_deferred(), eliminating the distinction between the two and enabling the eventual removal of printk_deferred() itself. The only exception would be for emergency output, which will always go directly to the console. Linus suggested going one step further, and taking the deferred path even in emergencies, but then flushing the buffers immediately thereafter.
Console troubles and more
Locks are not the only problem with printk(), though. To output its messages, it must call into the console drivers and, at completion, it must call console_unlock() which will, among other things, flush any pending output to the console. This function has some unfortunate properties: it can loop indefinitely, it may not be preemptible, and the time it takes depends on the speed of the console — which may not be fast at all. As a result, nobody knows how long a printk() call will take, so it's not really safe to call it in any number of situations, including atomic context, RCU critical sections, interrupt context, and more.
To get around this kind of problem, Jan Kara has proposed making printk() completely asynchronous. Once again, output would be directed to a buffer and sent to the console later, but, with this proposal, the actual writing to the console would be done in a dedicated kernel thread. A call to printk() would simply store the message, then use the irq_work mechanism to kick off that thread. This suggestion passed by without much in the way of complaints from the group in the room.
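A user-space sketch of the asynchronous idea follows, using a pthread and a condition variable where the kernel would use a dedicated kernel thread and irq_work. All names and details are invented for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define QLEN 64
#define MSGLEN 128

static char queue[QLEN][MSGLEN];
static int q_head, q_tail;		/* ring-buffer indices */
static int done;
static int written;			/* messages the writer completed */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

/* Fast path: store the message and wake the writer; never perform
 * the (slow) console I/O here. */
static void async_log(const char *msg)
{
	pthread_mutex_lock(&q_lock);
	if ((q_head + 1) % QLEN != q_tail) {	/* drop when full */
		snprintf(queue[q_head], MSGLEN, "%s", msg);
		q_head = (q_head + 1) % QLEN;
	}
	pthread_cond_signal(&q_cond);
	pthread_mutex_unlock(&q_lock);
}

/* The dedicated writer thread: drains the ring and does the output. */
static void *writer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&q_lock);
	for (;;) {
		while (q_tail == q_head && !done)
			pthread_cond_wait(&q_cond, &q_lock);
		if (q_tail == q_head && done)
			break;
		fprintf(stderr, "%s\n", queue[q_tail]); /* "console" */
		q_tail = (q_tail + 1) % QLEN;
		written++;
	}
	pthread_mutex_unlock(&q_lock);
	return NULL;
}

/* Ask the writer to drain any remaining messages and exit. */
static void async_log_shutdown(pthread_t t)
{
	pthread_mutex_lock(&q_lock);
	done = 1;
	pthread_cond_signal(&q_cond);
	pthread_mutex_unlock(&q_lock);
	pthread_join(t, NULL);
}
```

The key property is that async_log() never touches the console itself, so slow console hardware can no longer stall the caller.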
Then, there is the problem of pr_cont(), a form of printk() used to print a single line using multiple calls. This function is not safe on SMP systems, with the result that output generated with it can be mixed up and corrupted. There is a strong desire to get rid of the "continuation line" style of printing but, as Sergey pointed out, the number of pr_cont() calls in the kernel is growing rapidly. The problem, as Linus pointed out, is that there is no other convenient way to output variable-length lines in the kernel. Changing pr_cont() to use a per-CPU buffer, for example, is possible, but one would want to create a well-thought-out helper function. Then, perhaps, pr_cont() users could be easily fixed up with a Coccinelle script.
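One possible shape for such a helper is sketched below, with a per-thread buffer standing in for a per-CPU one: fragments accumulate locally and the completed line is emitted in one write, so fragments from different CPUs cannot interleave. The interface is invented here, not a proposal from the session.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 256

/* Per-thread pending line; the kernel analogue would be per-CPU. */
static _Thread_local char line_buf[LINE_MAX_LEN];
static _Thread_local size_t line_len;

/* Append a fragment to this thread's pending line; over-long lines
 * are silently truncated in this sketch. */
static void line_cont(const char *frag)
{
	size_t n = strlen(frag);

	if (line_len + n < LINE_MAX_LEN) {
		memcpy(line_buf + line_len, frag, n);
		line_len += n;
	}
}

/* Emit the whole line in one piece and reset the buffer; returns the
 * length of the line that was emitted. */
static size_t line_flush(void)
{
	size_t n = line_len;

	line_buf[line_len] = '\0';
	puts(line_buf);		/* one write, no interleaving */
	line_len = 0;
	return n;
}
```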
Ted Ts'o asked how much of a problem interleaved output really is on a production system; the consensus seemed to be that it was rarely a problem. Linus said that, on occasion, he sees ugly oops output as a result of continuation lines. Andy Lutomirski said, with a grin, that his algorithm for dealing with interleaved oops output is to wait for Linus to straighten it out for him. That solution seemed to work for the group as a whole; there does not seem to be any work planned in this area in the immediate future.
The final topic, covered in a bit of a hurry at the end of the session, was the console_sem semaphore. This semaphore covers access to all consoles in the system, so it is a global contention point. But there are paths that acquire console_sem that do not need to modify the console list or even write to a console. For example, simply reading /proc/consoles from user space will acquire that semaphore. That can cause unpleasant delays, including in printk() itself. And releasing this semaphore, once again, results in a call to console_unlock(), with the same associated problems.
Sergey suggested that console_sem should be turned into a reader/writer lock. That way, any path that does not need to modify the console list itself can acquire the lock in reader mode, increasing parallelism. That still won't help direct callers of console_unlock(), who will still be stuck flushing output to the device. For them, there was discussion of splitting console_unlock() into synchronous and asynchronous versions; the latter would just wake the printk() thread rather than flushing any pending console output itself. There does not appear to be any urgency to this work, though.
That is where time ran out and the session ended. Sergey's slides are available for those who are interested.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Distributions
Qubes OS 3.2
Qubes OS is unlike any other desktop operating system in that it takes digital security to a much higher level than most of its competition. The extra layers of security lock down individual components through multiple virtual machines (VMs, which are also known as "qubes") that are managed by the Xen hypervisor, which makes it difficult for attackers or malicious code to compromise the entire system. Since our previous review of the first 3.0 release candidate of Qubes OS in May 2015, this brainchild of Polish security researcher Joanna Rutkowska has been revamped significantly.
Background
The initial, privileged Xen domain, Dom0, manages the other domains which, by default, consist of VMs designed solely for internet access ("sys-net" and "sys-firewall"), as well as separate "app" qubes for personal and work uses. Other default app qubes include "vault", which is isolated from the internet and designed to store data such as that from password managers, and an "untrusted" qube where any suspect devices or files can be opened.
The Qubes OS project introductory page explains that it is a form of security by compartmentalization (or security by isolation as described on the architecture page). This allows you to divide the various parts of your digital life into securely isolated qubes. By way of an example, this would mean that malware downloaded via a web browser in the "personal" qube would not be able to access sensitive data in the "work" qube. Similarly, a BadUSB device attached to the "untrusted" qube will not be able to access any sensitive files in the other qubes.
Qubes OS uses a template system as the basis for its qubes; each VM in Qubes OS runs a copy of the template operating system. As of version 3.2, the VMs use Fedora 23 by default but can easily be switched to run Debian 8 ("Jessie") out of the box. There are also community-supported templates for Whonix, Ubuntu, and Arch Linux. Qubes OS 3.2 comes with a Whonix template pre-installed to allow anonymous web browsing through Tor. Other templates can be installed, but need to be built manually from the template configuration using Qubes Builder. To save space, the template system uses a read-only root filesystem for qubes; private data is stored in a separate block device, which protects each qube's personal data from the others.
All of that complexity makes it easy to believe that Qubes OS is rather convoluted, but all of the qubes can be accessed from a single desktop with a GUI that runs in Xen's Dom0. The trusted window manager assigns a dedicated color frame to each qube. For instance, all "personal" qube windows are yellow by default. This also applies to panel icons so, for example, the "sys-net" VM displays a red "network connections" icon at the top right of the screen to show that it has internet access.
This setup can cause problems for users trying to connect legitimate devices. By default, only Dom0 has access to USB storage devices. The USB pass-through feature is listed as an install option but is currently experimental. That said, the Qubes VM Manager can easily be used to connect a USB storage device. Other devices such as USB webcams can be more problematic, as only one qube can access the hardware at a time. This usually isn't an issue for devices such as USB keyboards and mice, which affect the machine globally. However, for security reasons, the Qubes OS web site recommends setting up a VM specifically for managing USB devices.
New features in 3.2
The template VMs in Qubes OS 3.2 are based on the 4.4.14 kernel for Fedora 23. The default desktop environment has been updated to Xfce 4 but its predecessor, KDE/Plasma 5, is still supported; it also now supports the i3 tiling window manager. A new policy option for services, which will allow restricting access to specific devices, for example, has been added. A more detailed description of the updates to version 3.2 can be found in the release notes and in the project's GitHub issues.
Interested readers should pay careful attention to the minimum hardware requirements before trying to install Qubes OS 3.2. A 64-bit Intel or AMD processor is required, as is at least 4GB of RAM. The OS also requires at least 32GB of disk space. Further help for those interested in installing Qubes OS is available from the hardware compatibility list, which contains various hardware reports submitted by users over the years.
Qubes OS cannot be installed in a VM, nor will it boot from a live CD. There is a live USB version that is still in the alpha-testing stage, however; with it, only the three default application qubes (personal, work, and untrusted) will load, and UEFI isn't supported. Qubes OS will install to a USB stick, provided there's at least 32GB available, and the web site recommends doing so in order to boot and test the operating system's hardware compatibility on multiple machines.
The reason that the hardware should be checked so painstakingly is that some systems (especially laptops) are incapable of using Intel's VT-d (virtualization technology for directed I/O). Without VT-d, the CPU will still isolate the VMs, but their memory remains open to DMA (direct memory access) attacks via components like network devices or GPUs. The Qubes OS web site offers a helpful link to a Google Groups forum post on this topic with some useful tips on finding a compatible laptop wherein both the CPU and its chipset support VT-d.
To simplify matters even further, the Qubes OS web site maintains a web page for Qubes-certified laptops. Currently only the Purism Librem 13 is certified for Qubes OS 3.x although other hardware manufacturers are encouraged to use the Qubes OS hardware certification page to validate their own products.
Looking ahead
The same page also mentions that there are currently no certified hardware devices that meet the certification requirements for the upcoming Qubes OS 4.x series. Reading through these requirements provides some insight into what the future holds for Qubes OS. In the first instance, given recent advances to CPUs and GPUs, there are plans to use nested paging (also known as SLAT or EPT) to "ditch paravirtualization (PV) technology and replace it with hardware-enforced memory virtualization".
This should help compartmentalize VMs further in system memory, making them much less vulnerable to DMA-based attacks. This is especially important given that a security bug in the paravirtualization used by Xen affected Qubes OS and was called out as a reason to move away from that technology. A more in-depth explanation of this bug, designated XSA-148, is available on the Qubes GitHub page.
Qubes OS 4.x certified hardware will also only run open-source boot firmware; the BIOS must be non-proprietary, such as coreboot. However, the Qubes OS team has made an exception for correctly authenticated, vendor-supplied CPU "blobs" such as the Intel Firmware Support Package (FSP). These blobs are software included by a vendor without disclosure of the source code, which makes it nearly impossible to verify that they contain no bugs or hidden backdoors. Most Linux distributions (and the Qubes OS project) take a pragmatic approach to binary blobs, since proprietary firmware is often necessary for some hardware to function.
In the announcement of the 4.x hardware requirements, Rutkowska clarified the project's stance on these blobs.
Examining the roadmap for Qubes OS offers a stark perspective on how vulnerable even a compartmentalized operating system can be against an adversary who can physically access a machine and install malware as part of an "Evil Maid" attack. Currently the recommended hardware requirements for Qubes OS 3.2 include machines that have a Trusted Platform Module (TPM) with BIOS support. The TPM uses a microcontroller along with encryption keys to allow trusted code to verify that a device's firmware has not been modified since the last boot, reducing the chance that an attacker could have installed spyware such as a key logger.
Users whose hardware meets these requirements can install the Qubes OS Anti Evil Maid (AEM), which will display customized text or an image on the boot screen when unlocking the hard drive to reassure users that their boot loader hasn't been compromised. More information is available on AEM's GitHub page.
The Qubes OS documentation doesn't make any vaunted claims about offering a security "magic box". The security guidelines page points out, for example, that the Firefox ESR browser that is available by default in each of the application qubes is in itself no more secure than the same version of the browser on a standalone Linux machine. The only difference is the Qubes OS architecture, which keeps other data safe as long as it is stored in a separate qube.
In particular, users are encouraged not to use the main Dom0 or any of the template VMs to run applications since this would potentially affect the private areas of the "work" and "personal" qubes. The only exception to this is running updates for Dom0, which is necessary to fix the latest security bugs. The risk of opening up potentially harmful files is further reduced by offering templates for disposable VMs. If a file is found to be safe, it can then be safely copied to one of the application qubes. PDFs, which can contain malicious data, can be converted to trusted PDFs with a simple right click.
Overall, Qubes OS is a solid effort to provide usability while counterbalancing the work of juggling various VMs, and its template approach uses a minimum of resources. The truly paranoid may relish isolating the various parts of their lives into separate domains. However, the elusive Holy Grail of computer security, as noted in the Qubes OS project's own roadmap, remains a device that uses both compartmentalization and entirely open-source firmware. In the meantime, Qubes OS represents one of the strongest software-only approaches to digital security.
Brief items
Distribution quotes of the week
[...skip...]
Finally I get everything installed correctly and triumphantly reboot into Linux.
Of course now Windows doesn't work again...
Red Hat Enterprise Linux 7.3
Red Hat has announced the release of Red Hat Enterprise Linux 7.3. "This update to Red Hat’s flagship Linux operating system includes new features and enhancements built around performance, security, and reliability. The release also introduces new capabilities around Linux containers and the Internet of Things (IoT), designed to help early enterprise adopters use existing investments as they scale to meet new business demands."
SUSE Linux Enterprise 12 SP2
The second service pack for SUSE Linux Enterprise Server, Desktop, and other products has been released. Highlights include software-defined networking and network function virtualization, the new SUSE Package Hub for package updates, the ability to skip service pack releases (e.g. upgrading from SLES 12 directly to SLES 12 SP2), architecture support for AArch64 and Raspberry Pi, and much more.
Distribution News
Debian GNU/Linux
call for participation - Debian contributors survey, 1st ed.
All Debian contributors are invited to take part in the first edition of the Debian contributors survey. "This is the first instance of what we hope will become a recurring annual survey of Debian contributors. The survey is intended to help the Debian project and community by enabling them to understand and document the evolution of the project's population over time, through the lenses of common demographics." The deadline for participation is December 4.
Ubuntu family
Mythbuntu: So Long and Thanks for All the Fish
Mythbuntu, the Ubuntu derivative aimed at integrating MythTV packages, is getting out of the game. "Mythbuntu as a separate distribution will cease to exist. We will take the necessary steps to pull Mythbuntu specific packages from the repositories (17.04 and later) unless someone steps up to take these packages over. MythTV packages in the official repositories and the Mythbuntu PPA will continue to be available and updated at their current rate."
Ubuntu Budgie joins the Ubuntu family
The budgie-remix team has announced that the Ubuntu Technical Board has granted official community flavor status to the distribution. "We now move full steam ahead and look forward to working with the Ubuntu Developer Membership Board to examine and work through the technical aspects. Working together will allow us to adhere to community standards that other flavors follow. 17.04 will be our first official release under the new name."
Ubuntu Online Summit: Call for sessions
The next Ubuntu online summit will be held November 15-16.
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 686 (November 7)
- Lunar Linux weekly news (November 4)
- openSUSE news (November 3)
- openSUSE Tumbleweed – Review of the Week (November 4)
- Ubuntu Weekly Newsletter, Issue 486 (November 6)
Maru OS 0.3 released (Liliputing)
Liliputing takes a look at the 0.3 release of Maru OS. "The move to Android 6.0 means the operating system gets new security features and patches and improved power management, among other things. But there are also some tweaks to the way Debian Linux runs on the machine. You can now start the Maru Desktop even if the phone isn’t plugged into an HDMI monitor. You won’t see the Linux-based operating system on the phone’s screen, but you’ll be able to run it as a headless server with support for ssh." LWN looked at an early beta release of Maru last April.
Q4OS+Trinity Gives New Meaning to Lightweight (LinuxInsider)
LinuxInsider reviews Q4OS. "Q4OS version 1.6.1 'Orion,' released this summer, has as its main claim to fame the developing Trinity desktop. Trinity is a breakaway fork from the KDE 3 community."
Page editor: Rebecca Sobol
Development
A year with Notmuch mail
For a little longer than a year now, I have been using Notmuch as my primary means of reading email. Though the experience has not been without some annoyances, I feel that it has been a net improvement and expect to keep using Notmuch for quite some time. Undoubtedly it is not a tool suitable for everyone though; managing email is a task where individual needs and preferences play an important role and different tools will each fill a different niche. So before I discuss what I have learned about Notmuch, it will be necessary to describe my own peculiar preferences and practices.
Notmuch context
I can identify three driving forces in my attitude to email. First there is a desire to be in control. I want my email to be stored primarily on my hardware, preferably in my home. For this reason, various popular hosted email services are of little interest to me. Second is the difficulty I have with throwing things away; I'm sure I'll never want to look at 99.999% of my email a second time, but I don't know which 0.001% will interest me again, and I don't want to risk deleting something I may someday want. Finally, I am somewhat obsessive about categorization. "A place for everything, and everything in its place" is a goal, but rarely a reality, for me. Email is one area where that goal seems achievable, so I have a broad collection of categories for filing incoming mail, possibly more than I need.
My most recent experience before committing to Notmuch was to use Claws Mail as an IMAP client that accessed email using the Dovecot IMAP server running on a machine in my home. This was sufficient for many years, mostly because I work at home with good network connectivity between client and server. On those rare occasions when I traveled to the other side of the world, latency worsened and the upstream bandwidth over my home ADSL connection was not sufficient to provide acceptable service; it was bearable, but that is all. Using procmail to filter my email into different folders met most of my obsessive need to categorize, but when an email sensibly belonged in multiple categories, there wasn't really any good solution.
I think the frustration that finally pushed me to commit to the rather large step of transitioning to Notmuch was the difficulty of searching. Claws doesn't have a very good search interface and Dovecot (at least in the default configuration) doesn't have very good performance. I knew Notmuch could do better, so I made the break.
A close second in the frustration stakes was that I had to use a different editor for composing email than I used for writing code. Prior to choosing Claws, I used the Emacs View Mail mode; it had been difficult giving up that seamless integration between code editor and email editor. Notmuch offered a chance to recover that uniformity.
Notmuch of a mail system
Notmuch has been introduced and reviewed (twice) in these pages previously, so only a brief recap will be provided here. Notmuch describes itself as "not much of an email program"; it doesn't aim to provide a friendly user interface, just a back-end engine that indexes and retrieves email messages. Most of the user-interface tasks are left to a separate tool such as an Emacs mode that I use. In this vein of self-deprecation, the web site states that even "for what it does do, that work is provided by an external library, Xapian". This is a little unfair, as Notmuch does contain other functionality. It decodes MIME messages in order to index the decoded text with the help of libgmime. It manages a configuration file with the help of g_key_file from GLib. And it will decrypt encrypted messages, using GnuPG. It even has some built-in functionality for managing tags and tracking message threads.
The genius of Notmuch is really the way it combines all these various libraries together into a useful whole that can then be used to build a user interface. That interface can run the Notmuch tool separately, or can link with the libnotmuch library to perform searches and access email messages.
Notmuch need for initial tagging
Notmuch provides powerful functionality but, quite appropriately, does not impose any particular policy for how this functionality should be used. It quickly became clear to me that there is a tension between using tags and using saved searches as the primary means of categorizing incoming email. Tags are simple words, such as "unread", "inbox", "spam", or "list-lkml", that can be associated with individual messages. Saved searches were not natively supported by Notmuch before version 0.23, which was released in early October (and which calls them "named queries"), but are easily supported by user-interface implementations.
Using tags as a primary categorization is the idea behind the "Approaches to initial tagging" section of the Notmuch documentation. This page provides some examples of how a "hook" can be run when new mail arrives to test each message against a number of rules and then to possibly add a selection of tags to that message. The user interface can then be asked to display all messages with a particular tag.
I chose not to pursue this approach, primarily because I want to be able to change the rules and have a new rule apply equally to old emails, which doesn't work when rules are applied at the moment of mail delivery. The alternative is to use fairly complex saved searches. This ran into a problem when I wanted one saved search to make reference to another, as neither the Emacs interface nor the Notmuch back-end had a syntax for including one saved search in another search. For example, I have one saved search to identify email from businesses (that I am happy to receive email from) whose mail otherwise looks like spam. So my "spam" saved search is something like:
tag:spam and not saved:commercial
The new "named queries" support should make this easy to handle but, until I upgrade my Notmuch installation, I have a wrapper script around the "notmuch" tool that performs simple substitutions to interpolate saved searches as required.
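The substitution such a wrapper performs might look something like the following toy C version, which expands a saved: token into its stored query text, parenthesized so that it composes correctly with the rest of the expression. The real wrapper is a script around the notmuch tool; the names, table contents, and example query below are all invented for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A toy table of saved searches; the real ones would be read from
 * a configuration file. */
struct saved_search {
	const char *name;
	const char *query;
};

static const struct saved_search searches[] = {
	{ "commercial", "from:shop@example.com or from:news@example.org" },
};

/* Expand the first "saved:" token in q into out, parenthesizing the
 * stored query; returns 1 on success, 0 if the name is unknown. */
static int expand_saved(const char *q, char *out, size_t outlen)
{
	const char *tok = strstr(q, "saved:");
	const char *name;
	size_t nlen;

	if (!tok) {
		snprintf(out, outlen, "%s", q);
		return 1;
	}
	name = tok + strlen("saved:");
	nlen = strcspn(name, " )");	/* token ends at space or ')' */
	for (size_t i = 0; i < sizeof(searches) / sizeof(searches[0]); i++) {
		if (strlen(searches[i].name) == nlen &&
		    strncmp(searches[i].name, name, nlen) == 0) {
			snprintf(out, outlen, "%.*s(%s)%s",
				 (int)(tok - q), q,
				 searches[i].query, name + nlen);
			return 1;
		}
	}
	return 0;
}
```

Applied to the "spam" search above, this would hand notmuch a single, fully expanded query string.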
It also causes a minor problem in that I have several saved searches that are intermediaries that I'm not directly interested in, but which still appear in my list of saved searches. Those tend to clutter up the main screen in the Emacs interface.
Unfortunately, the indexing that Notmuch performs is not quite complete, so some information is not directly accessible to saved searches, resulting in the need for some limited handling at mail delivery time. Notmuch does not index all headers; two missed headers that are of interest to me are "X-Bogosity" and "References".
I use bogofilter to detect spam, which adds the "X-Bogosity" header to messages to indicate their status. Further, when someone replies to an email that I sent out, I like that reply to be treated differently from regular email, and particularly to get a free pass through my spam filter. I can detect replies by simple pattern matching on the References or In-reply-to headers. While Notmuch does include these values in the index so that threads can be tracked, it does not index them in a way that allows pattern matching, so there is no way for Notmuch to directly find replies to my emails.
To address this need, I have a small procmail filter that runs bogofilter and then files email in one of the folders "spam", "reply", or "recv" depending on which headers are found. Notmuch supports "folder:" queries for searches, so that my saved search can now differentiate based on these headers that Notmuch cannot directly see.
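The filter itself is written in procmail; the same decision can be sketched as a small shell function. Here "myhost.example.com" is a hypothetical stand-in for the domain in my outgoing Message-IDs, and folded header lines are ignored to keep the sketch simple:

```shell
#!/bin/sh
# Shell sketch of the delivery-time filing decision that the article
# implements with procmail. Reads one message on stdin and prints the
# folder ("spam", "reply", or "recv") it should be filed into.

choose_folder() {
    headers=$(sed '/^$/q')   # the header block ends at the first blank line
    # Replies to my own mail get a free pass past the spam filter.
    if echo "$headers" | grep -Ei '^(References|In-Reply-To):' \
            | grep -q 'myhost\.example\.com'; then
        echo reply
    elif echo "$headers" | grep -qi '^X-Bogosity: Spam'; then
        echo spam
    else
        echo recv
    fi
}
```

A delivery script would run bogofilter first (to add the X-Bogosity header) and then append the message to whichever folder this function names.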
I find that tags still are useful, but that use is largely orthogonal to classification based on content. When new mail arrives, it is automatically tagged as both "unread" and "inbox". When I read a message, the "unread" tag is cleared; when I archive it, the "inbox" tag is cleared. I would like an extra tag, "new", which would be cleared as soon as I see the subject in a list of new email, but the Emacs interface I use doesn't yet support that.
There are other uses for tags, such as marking emails that I should review when submitting my tax return or that need to be reported to bogofilter because it guessed wrongly about their spam status, but they all reflect decisions that I consciously make rather than decisions that are made automatically.
Notmuch remote access
Remote access via IMAP can be slow, but that is still faster than not having remote access at all, which is the default situation when the mail store only provides local access. I have two mechanisms for remote access that work well enough for me.
When I am in my home city, I only need occasional remote access; this is easily achieved by logging in remotely with SSH and running "emacsclient -t" in a terminal window. This connects to my running Emacs instance and gives me a new window through which I can access Notmuch almost as easily as on my desktop. A few things don't work transparently, viewing PDF files and other non-text attachments in particular, but as this is only an occasional need, lack of access to non-text content is not a real barrier. Here we see again the genius of Notmuch in making use of existing technology rather than inventing everything itself. Notmuch isn't contributing at all to this remote access but, since it supports Emacs as a user interface, all the power of Emacs is immediately available.
For times when I am away from home and need more regular and complete remote access, there is muchsync, a tool that synchronizes two Notmuch mail stores. All email messages are stored one per file, so synchronizing those simply requires determining which files have been added or removed since the last synchronization and copying or deleting them. Tags are stored in the Xapian database, so a little more effort is required there but, again, muchsync just looks to see what has changed since the last sync and copies the relevant tags. I don't know yet if muchsync will synchronize the named queries and other configuration that can be stored in the database in the latest Notmuch release. Confirming that is a major prerequisite to upgrading.
Before discovering muchsync, I had used rsync to synchronize mail stores; I was happy to find that muchsync was significantly faster. While rsync is efficient when there are small changes to large files, it is not so efficient when there are small changes to a large list of files. The first step in an rsync transaction is to exchange a complete list of file names, which can be slow when there are tens of thousands of them. Muchsync doesn't waste time on this step as it remembers what is known to be on the replica, so it can deduce changes locally.
With muchsync, reading email on my notebook is much like reading email on my desktop. Unfortunately, I cannot yet read email on my phone, though I don't personally find that to be a big cost. There is a web interface for Notmuch written in Haskell, but I have not put enough effort into that to get it working so I don't know if it would be a usable interface for me.
When Notmuch mail is too much
As noted above, I don't like deleting email because I'm never quite sure what I want to keep. Notmuch allows me to simply clear the inbox tag; thereafter I'll never see the message again unless I explicitly search for older messages, as my saved searches all include that tag. As a result, I haven't deleted email since I started using Notmuch and have over 600,000 messages at present (528,000 in the last year, over half of that total from the linux-kernel mailing list). The mail store and associated index consume nearly ten gigabytes. I'm hoping that Moore's law will save me from ever having to delete any of this. This large store allows me to see whether very large amounts of email are too much or whether, as the program claims, "that's not much mail".
As far as I can tell, the total number of messages has no effect on operations that don't try to access all of those messages, so extracting a message by message ID, listing messages with a particular tag, or adding or clearing a tag, for example, are just as fast in a mail store with 100,000 messages as in one with 100 messages. A lot of mail can seem to be too much when a search matches thousands of messages or more; there are two particular situations where I find this noticeable.
As you might imagine, given my need for categorization, I have quite a few saved searches. The Emacs front end for Notmuch has a "hello" page that helpfully lists all the saved searches together with the number of matching messages. Some of these searches are quite complex and, while the complexity doesn't seem to be a particular problem, the number of matches does. Counting the 217,952 linux-kernel messages still marked as being in my inbox takes four to eight seconds, depending on the hardware. It only takes a few saved searches that take more than a couple of seconds for there to be an irritating lag when Emacs wants to update the "hello" page. Similarly, generating the list of matches for a large search can take a couple of seconds just to start producing the list, and much longer to create the whole list.
None of these delays need to be a problem. Having precise up-to-the-moment counts for each search is not really necessary, so updating those counts asynchronously would be perfectly satisfactory and rarely noticeable. Unfortunately, the Notmuch Emacs mode updates them all synchronously and (in the default configuration) does so every time the "hello" window is displayed. This delay can become tiresome.
When displaying the summary lines for a saved search, the Emacs interface is not synchronous, so there is no need to wait for the full list to be generated, but one still needs to wait the second or two for the first few entries in a large list to be displayed. If the condition "date:-1month.." is added to a search, only messages that arrived in the last month will be displayed, but they will normally be displayed without any noticeable delay as there are far fewer of them. The user interface could then collect earlier months asynchronously so they can be displayed quickly if the user scrolls down. The Emacs interface doesn't yet support this approach.
Notmuch locking
As a general rule, those Notmuch operations that have the potential to be slow can usually be run asynchronously, thus removing much of the cost of the slowness. Putting this principle into practice causes one to quickly run up against the somewhat interesting approach to locking that Xapian uses for the indexing database.
When Xapian tries to open the database for write access and finds that it is already being written to, its response is to return an error. As I run "notmuch new" periodically in the background to incorporate new mail, attempts to, for example, clear the "inbox" flag sometimes fail because the database cannot be updated, and I have to wait a moment and try again. I'd much rather Notmuch did the waiting for me transparently.
If one process has the database open for read access and another process wants write access, the writer gets the access it wants and the reader will get an error the next time that it tries to retrieve data. This may be an appropriate approach for the original use case for Xapian but seems poorly suited for email access. It was sufficient to drive me to extend my wrapper script to take a lock on a file before calling the real Notmuch program, so that it would never be confronted with unsupported concurrency.
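A lock-taking wrapper of that kind can be sketched with flock(1); the lock-file path here is an assumption, and in practice the wrapper would be installed earlier in $PATH than the real notmuch binary:

```shell
#!/bin/sh
# Sketch of serializing notmuch invocations with a file lock, so that
# concurrent callers wait rather than seeing Xapian lock errors.
# The lock-file path is an illustrative assumption.

with_notmuch_lock() {
    # flock(1) creates the lock file if needed, blocks until the lock
    # is free, runs the command, and releases the lock when it exits.
    flock "${LOCK:-${TMPDIR:-/tmp}/notmuch.lock}" "$@"
}

# The wrapper script itself would end with:
#   with_notmuch_lock /usr/bin/notmuch "$@"
```

Because every caller goes through the same lock, Xapian is never confronted with the concurrent writers it handles so poorly.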
The most recent version of Xapian, the 1.4 series released in recent months, adds support for blocking locks, and Notmuch 0.23 makes use of these to provide a more acceptable experience when running Notmuch asynchronously.
Working with threads
One feature of Notmuch that I cannot quite make my mind up about is the behavior of threads. In clear contrast to my finding with JMAP, the problem is not that the threads are too simplistic, but that they are so rich that I'm not sure how best to tame them.
As I never delete email, every message in a thread remains in the mail store indefinitely. When Notmuch performs a search against the mail store it will normally list all the threads in which any message matches the search criteria. The information about the thread includes the parent/child relationship between messages, flags indicating which messages matched the search query, and what tags each individual message has.
The Emacs interface uses the parent/child information to display a tree structure using indenting. It uses the "matched" flag to de-emphasize the non-matching messages, either greying them out in the message summary list or collapsing them to a single line in the default thread display, which concatenates all messages in a thread into a single text buffer. It uses some of the tags to adjust the color or font, for example to highlight unread messages.
This all makes perfect sense and I cannot logically fault it, yet working with threads sometimes feels a little clumsy and I cannot say why. The most probable answer is that I haven't made the effort to learn all the navigation commands that are available; a rich structure will naturally require more subtle navigation and I'm too lazy to learn more than the basics until they prove insufficient. Maybe a focus on some self-education will go a long way here. Certainly I like the power provided by Notmuch threads, I just don't feel that I personally have tamed that power yet.
Notmuch of a wish list
Though I am sufficiently happy with Notmuch to continue using it, I always seem to want more. The need for sensible locking and for native saved searches should be addressed once I upgrade to the latest release, so I expect to be able to cross them off my wish list soon.
Asynchronous updates of the match-counts for saved searches and for the messages in a summary is the wish that is at the top of my list, but my familiarity with Emacs Lisp is not sufficient to even begin to address that, so I expect to have to live without it for a while yet.
One feature that is close to the sweet spot for being both desirable and achievable is support for an outgoing mail queue. Usually when I send email it is delivered quite promptly, though not instantly, to the server for my email provider. Sometimes it takes longer, possibly due to a network outage, or possibly due to a configuration problem. I would like outgoing email to be immediately stored in the Notmuch database with a tag to say that it is queued. Then some Notmuch hook could periodically try to send any queued messages, and update the tag once the transmission was successful. This would mean that I never have to wait while mail is sent, but can easily see if there is anything in the outgoing queue, and can investigate at my leisure.
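The periodic queue-flushing step wished for here might be sketched as follows. Everything in the sketch is hypothetical: the "queued" and "sent" tags, and the send command passed in as $1, which must submit one message given its ID (for example by piping "notmuch show --format=raw" to sendmail):

```shell
#!/bin/sh
# Sketch of a periodic hook flushing an outgoing queue kept as a
# notmuch tag. The tag names and the injectable send command are
# hypothetical, not an existing Notmuch feature.

flush_queue() {
    send="$1"
    # notmuch prints queued message IDs in "id:..." form, which can be
    # passed straight back to "notmuch tag".
    notmuch search --output=messages tag:queued |
    while read -r id; do
        # Retag only after submission succeeds; a failure leaves the
        # message queued for the next periodic attempt.
        if "$send" "$id"; then
            notmuch tag -queued +sent -- "$id"
        fi
    done
}
```

Because failed sends simply keep their tag, the queue is visible at any time as the ordinary saved search "tag:queued".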
There are plenty of other little changes I would like to see in the user interface, but none really interesting enough to discuss here. The important aspect of Notmuch is that the underlying indexing model is sound and efficient and suits my needs. It is a good basis on which to experiment with different possibilities in the user interface.
Brief items
Development quotes of the week
Somewhere along the way, in the last OH GOD TWENTY YEARS, we – along with a bunch of vulture capitalists and wacky Valley libertarians and government spooks and whoever else – built this whole big crazy thing out of that 1990s Internet and…I don’t like it any more.
digiKam 5.3.0 is published
The digiKam Software Collection 5.3.0 has been released. This version is available as an AppImage bundle. "AppImage is an open-source project dedicated to provide a simple way to distribute portable software as compressed binary file, that standard user can run as well, without to install special dependencies. All is included into the bundle, as last Qt5 and KF5 frameworks. AppImage use Fuse file-system, which is de-compressed into a temporary directory to start the application. You don't need to install digiKam on your system to be able to use it. Better, you can use the official digiKam from your Linux distribution in parallel, and test the new version without any conflict with one used in production. This permit to quickly test a new release without to wait an official package dedicated for your Linux box. Another AppImage advantage is to be able to provide quickly a pre-release bundle to test last patches applied to source code, outside the releases plan."
Paperwork 1.0
Paperwork 1.0, code-named "it's about time !", has been released. Some of the main changes include a switch to Python 3, generated PDFs now include the text from the OCR, 'paperwork-chkdeps' has been replaced by 'paperwork-shell', an option has been added to automatically simplify the content for export, an option has been added to automatically adjust the colors, scrolling using the middle click is now possible, and more. LWN covered Paperwork last March. (Thanks to Martin Michlmayr)
RPM 4.13.0 released
RPM 4.13.0 has been released. Notable changes include support for file triggers and boolean dependency expressions.
systemd 232
Systemd 232 has been released with "many new features and even more fixes". There is a new RemoveIPC= option that can be used to remove IPC objects owned by the user or group of a service when that service exits, the new ProtectKernelModules= option can be used to disable explicit load and unload operations of kernel modules by a service, the ProtectSystem= option gained a new value "strict", and much more.
Trac 1.2 Released
Trac 1.2 has been released. It is the first major release of Trac in more than four years. Highlights of the release include an extensible notification system, a notification preference panel, usernames replaced with full names, a restyled ticket changelog, workflow controls on the New Ticket page, editable wiki page version comments, and datetime custom fields.
Newsletters and articles
Development newsletters
- Emacs News (November 7)
- This week in GTK+ (November 7)
- OCaml Weekly News (November 8)
- OpenStack Developer Mailing List Digest (November 4)
- Perl Weekly (November 7)
- Python Weekly (November 3)
- Ruby Weekly (November 3)
- This Week in Rust (November 8)
- Wikimedia Tech News (November 7)
First 64-bit Orange Pi slips in under $20 (HackerBoards.com)
HackerBoards takes a look at the 64-bit Orange Pi. "Shenzhen Xunlong is keeping up its prolific pace in spinning off new Allwinner SoCs into open source SBCs, and now it has released its first 64-bit ARM model, and one of the cheapest quad-core -A53 boards around. The Orange Pi PC 2 runs Linux or Android on a new Allwinner H5 SoC featuring four Cortex-A53 cores and a more powerful Mali-450 GPU."
The iconic text editor Vim celebrates 25 years (Opensource.com)
Opensource.com celebrates 25 years of Vim. "Vim is a flexible, extensible text editor with a powerful plugin system, rock-solid integration with many development tools, and support for hundreds of programming languages and file formats. Twenty-five years after its creation, Bram Moolenaar still leads development and maintenance of the project—a feat in itself! Vim had been chugging along in maintenance mode for more than a decade, but in September 2016 version 8.0 was released, adding new features to the editor of use to modern programmers."
Move over Raspberry Pi, here is a $4, coin-sized, open-source Linux computer (ZDNet)
ZDNet takes a look at the VoCore2, a coin-sized computer. "VoCore2 is an open source Linux computer and a fully-functional wireless router that is smaller than a coin. It can also act as a VPN gateway for a network, an AirPlay station to play lossless music, a private cloud to store your photos, video, and code, and much more. The Lite version of the VoCore2 features a 580MHz MT7688AN MediaTek system on chip (SoC), 64MB of DDR2 RAM, 8MB of NOR storage, and a single antenna slot for Wi-Fi that supports 150Mbps."
Page editor: Rebecca Sobol
Announcements
Brief items
Results from the Linux Foundation Technical Advisory Board election
The 2016 Linux Foundation Technical Advisory Board election was held November 2 at the combined Kernel Summit and Linux Plumbers Conference events. Incumbent members Chris Mason and Peter Anvin were re-elected to the board; they will be joined by new members Olof Johansson, Dan Williams, and Rik van Riel. Thanks are due to outgoing members Grant Likely, Kristen Accardi, and John Linville.
New inclusivity course for Linux Foundation event speakers
The Linux Foundation and the National Center for Women & Information Technology (NCWIT) have announced a partnership to develop an Inclusive Speaker Orientation Course to help prepare event presenters and public speakers with background knowledge and practical skills to promote inclusivity in their presentations, messaging, and other communications. "The goal of the course is to present content in a simple and practical way applied to the specialized needs of presenters, including crafting presentation messages, scripting discussions, presented media and subconscious communications. The course will be offered in three 20-minute modules defined around NCWIT’s 'Unconscious Bias' messaging, which encompasses the ideas of 'Realize, Recognize, and Respond.' The Inclusive Speaker Orientation Course will be available for free online. Beginning in 2017, speakers at all Linux Foundation events will be required to complete the course."
Articles of interest
FSFE Newsletter - November 2016
This edition of the Free Software Foundation Europe's monthly newsletter covers a proposal to deprecate the Fellowship, European Interoperability Framework v.3, news from the community, and several other topics.
Internet Archive turns 20, gives birthday gifts to the world (Opensource.com)
Opensource.com covers the Internet Archive's 20th birthday celebration. "Of all the projects announced during the event though, by far one of the most exciting and impressive is the newly released ability to search the complete contents of all text items on the Internet Archive. Nine million text items, covering hundreds of years of human history, are now searchable in an instant."
Calls for Presentations
CFP Deadlines: November 10, 2016 to January 9, 2017
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
November 11 | November 11–November 12 | Linux Piter | St. Petersburg, Russia |
November 11 | January 27–January 29 | DevConf.cz 2017 | Brno, Czech Republic |
November 13 | December 10 | Mini Debian Conference Japan 2016 | Tokyo, Japan |
November 15 | March 2–March 5 | Southern California Linux Expo | Pasadena, CA, USA |
November 15 | March 28–March 31 | PGConf US 2017 | Jersey City, NJ, USA |
November 18 | February 18–February 19 | PyCaribbean | Bayamón, Puerto Rico, USA |
November 20 | December 10–December 11 | SciPy India | Bombay, India |
November 21 | January 16 | Linux.Conf.Au 2017 Sysadmin Miniconf | Hobart, Tas, Australia |
November 21 | January 16–January 17 | LCA Kernel Miniconf | Hobart, Australia |
November 28 | March 25–March 26 | LibrePlanet 2017 | Cambridge, MA, USA |
December 1 | April 3–April 6 | ‹Programming› 2017 | Brussels, Belgium |
December 10 | February 21–February 23 | Embedded Linux Conference | Portland, OR, USA |
December 10 | February 21–February 23 | OpenIoT Summit | Portland, OR, USA |
December 31 | March 2–March 3 | PGConf India 2017 | Bengaluru, India |
December 31 | April 3–April 7 | DjangoCon Europe | Florence, Italy |
January 1 | April 17–April 20 | Dockercon | Austin, TX, USA |
January 3 | May 17–May 21 | PyCon US | Portland, OR, USA |
January 6 | July 16–July 23 | CoderCruise | New Orleans et. al., USA/Caribbean |
January 8 | March 11–March 12 | Chemnitzer Linux-Tage | Chemnitz, Germany |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Events: November 10, 2016 to January 9, 2017
The following event listing is taken from the LWN.net Calendar.
Date(s) | Event | Location |
---|---|---|
November 9–November 11 | O’Reilly Security Conference EU | Amsterdam, Netherlands |
November 11–November 12 | Seattle GNU/Linux Conference | Seattle, WA, USA |
November 11–November 12 | Linux Piter | St. Petersburg, Russia |
November 12–November 13 | T-Dose | Eindhoven, Netherlands |
November 12–November 13 | Mini-DebConf | Cambridge, UK |
November 12–November 13 | PyCon Canada 2016 | Toronto, Canada |
November 13–November 18 | The International Conference for High Performance Computing, Networking, Storage and Analysis | Salt Lake City, UT, USA |
November 14–November 16 | PGConfSV 2016 | San Francisco, CA, USA |
November 14–November 18 | Tcl/Tk Conference | Houston, TX, USA |
November 14 | The Third Workshop on the LLVM Compiler Infrastructure in HPC | Salt Lake City, UT, USA |
November 16–November 17 | Paris Open Source Summit | Paris, France |
November 16–November 18 | ApacheCon Europe | Seville, Spain |
November 17 | NLUUG (Fall conference) | Bunnik, The Netherlands |
November 18–November 20 | GNU Health Conference 2016 | Las Palmas, Spain |
November 18–November 20 | UbuCon Europe 2016 | Essen, Germany |
November 19 | eloop 2016 | Stuttgart, Germany |
November 21–November 22 | Velocity Beijing | Beijing, China |
November 24 | OWASP Gothenburg Day | Gothenburg, Sweden |
November 25–November 27 | Pycon Argentina 2016 | Bahía Blanca, Argentina |
November 29–November 30 | 5th RISC-V Workshop | Mountain View, CA, USA |
November 29–December 2 | Open Source Monitoring Conference | Nürnberg, Germany |
December 3 | NoSlidesConf | Bologna, Italy |
December 3 | London Perl Workshop | London, England |
December 6 | CHAR(16) | New York, NY, USA |
December 10 | Mini Debian Conference Japan 2016 | Tokyo, Japan |
December 10–December 11 | SciPy India | Bombay, India |
December 27–December 30 | Chaos Communication Congress | Hamburg, Germany |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol