
LWN.net Weekly Edition for April 4, 2019

Welcome to the LWN.net Weekly Edition for April 4, 2019

This edition contains the following feature content:

  • How to (not) fix a security flaw: Cisco's botched response to vulnerabilities in two of its small-business routers.
  • The return of the lockdown patches: the kernel lockdown patch set returns in a form that may finally be mergeable.
  • Working with UTF-8 in the kernel: a look at the API for handling Unicode strings in the kernel.
  • Improving the performance of the BFQ I/O scheduler: recent changes bring higher throughput and better responsiveness.
  • Some slow progress on get_user_pages(): work toward fixing a problematic core kernel primitive.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

How to (not) fix a security flaw

By Jake Edge
April 3, 2019

A pair of flaws in the web interface for two small-business Cisco routers make for a prime example of the wrong way to go about security fixes. These kinds of flaws are, sadly, fairly common, but the comedy of errors that resulted here is, thankfully, rather rare. Among other things, it shows that vendors may be better off waiting for a real fix rather than releasing a small, ineffective band-aid in an attempt to close a gaping hole.

RedTeam Pentesting GmbH found the flaws in September 2018 and notified Cisco shortly thereafter. The original disclosure date was planned for January 9, but that was postponed until January 23 at Cisco's request. On the latter date, Cisco issued advisories for CVE-2019-1652 and CVE-2019-1653; RedTeam Pentesting released its own advisories, with lots more detail, for CVE-2019-1652 and CVE-2019-1653.

The flaws are bog-standard web-application vulnerabilities. CVE-2019-1652 is a command injection that allows authenticated users (in the web interface) to execute arbitrary Linux commands as root. CVE-2019-1653 allows anyone to request the configuration page from the router, which contains all sorts of interesting information, including user names with hashed passwords, VPN and IPsec secrets, and more. In addition, password hashes are all that is needed to log into the web interface—no cracking required.

Beyond that, an additional information disclosure flaw, related to CVE-2019-1653, was reported; it uses a debug interface to retrieve a .tgz (gzipped tar) file, encrypted with a known, hard-coded password, from the device. That file contained even more configuration and debugging information as well as etc.tgz and var.tgz with the contents of those directories from the router. In all of the RedTeam Pentesting advisories, curl was used for the proof of concept, though there are lots of other ways to perform the same actions, of course.

The flaws were found, reported, and fixed; so far, so good—or so it would seem. But on February 7, RedTeam Pentesting found that the fixes shipped by Cisco were, at a minimum, insufficient. Once again, the problems were reported to Cisco, with a disclosure date of March 27. Despite a last-minute request to postpone the disclosure, three new advisories (command injection, configuration information disclosure, and even more configuration information disclosure) were released by RedTeam Pentesting on March 27.

It turns out that in the four months between the report and the "fix", Cisco decided that simply blocking curl was sufficient. In addition, the blocking was done based on the HTTP User-Agent header that curl sends. It can hardly have escaped the notice of someone (or, likely, multiple someones) that curl -A (or --user-agent) will helpfully change that header to anything the user (or attacker) wants. There are a lot of options to curl, which can be seen in its man page, but User-Agent strings are nothing magic—and curl is hardly the only way to perform an HTTP GET or POST.
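To make the futility concrete, here is the "fix" reduced to its essence; the filter function below is a hypothetical stand-in for Cisco's User-Agent check, written purely for illustration, not actual device code:

```shell
# Hypothetical stand-in for the router's User-Agent filter; any request
# that does not advertise itself as curl sails straight through.
block_by_user_agent() {
    case "$1" in
        curl/*) echo "blocked" ;;
        *)      echo "allowed" ;;
    esac
}

block_by_user_agent "curl/7.64.0"   # curl's default User-Agent header
block_by_user_agent "Mozilla/5.0"   # what 'curl -A "Mozilla/5.0"' sends
```

In other words, an attacker needs exactly one extra command-line option to restore the exploit.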

Another two months on, nearly six months after the original disclosure, Cisco still does not have a real fix for the problems. Not enabling internet (WAN) access to the web interface (which is the default state), or disabling it if it has been enabled, is a workaround for the flaws, but obviously impedes the remote management feature that some customers may have been relying on.

The original disclosures in January set off a flurry of exploit attempts. By chaining the two vulnerabilities together, root access is easily available to any attacker. Some detailed proof-of-concept exploit code is available—and has been since the disclosures. It does not use curl, so it presumably always worked, even on updated routers. There are some 20,000 affected routers that can be found using the Shodan search engine; roughly half of them were exploitable at the end of January, and most of those were still exploitable at the end of March.

Given how close the second set of disclosures was to April 1, and the almost comical nature of the "fix", one might be forgiven for thinking it is all some elaborate hoax. Of course, the comedic effect is much diminished for those who have the RV320 and RV325 routers installed. At least from the outside, the problems do not seem so difficult to solve that it would make sense to try to slip in a broken hackaround instead of making a real fix. One suspects that Cisco has de-prioritized those router models, so there is no one to work on them. But did anyone with a technical clue at Cisco really think this kind of thing would fly underneath the radar?

One can only imagine how quickly a fix would have been mooted had the web interface in question been open source. There is probably not much in the way of proprietary "secret sauce" in the web interface, but an open-source release might be problematic for other reasons; Cisco would also have to provide ways for users to upload changes to the router, which may have its own set of challenges.

One amusing outcome is a suggestion from Florian Obser to apply the Cisco "fix" to the OpenBSD HTTP server. His "httpd(8): Adapt to industry wide current best security practices" proposal is unlikely to be acted upon, however.

This episode is pretty much a textbook example of the perils of being at the mercy of a vendor when a security flaw is found. It is also yet another example of a device that is meant to be on the internet, but has apparently had little or no thought to security baked in. Debugging interfaces are useful, for debugging; they should be removed from shipping products. Any vendor shipping a web interface should either internally do penetration tests (pentests) on it, hire that job out, or both. It is rather amazing to see this kind of flaw—and response—from a major vendor in 2019—but there are surely others out there that we will hear about in due course.

Comments (29 posted)

The return of the lockdown patches

By Jake Edge
April 3, 2019

It's been a year since we looked in on the kernel lockdown patches; that's because things have been fairly quiet on that front since there was a loud and discordant dispute about them back then. But Matthew Garrett has been posting new versions over the last two months; it would seem that the changes that have been made might be enough to tamp down the flames and, perhaps, even allow them to be merged into the mainline.

The idea behind kernel lockdown is to supplement secure boot mechanisms to limit the ability of the root user to cause unverified, potentially malicious code to be run. The most obvious way to do that is to use the kexec subsystem to run a new kernel that has not been vetted by the secure boot machinery, but there are lots of other ways that root can circumvent the intent of (some) secure boot users. While the support for UEFI secure boot has been in the kernel for years, providing a way to restrict the root user after that point has always run aground.

A renewed push

The latest round began with a pull request from Garrett at the end of February. He noted that he had taken over shepherding the patch set from David Howells, who is "low on cycles at the moment". There were just a few changes from the previous version that caused the ruckus a year ago.

The main change was to remove the tie-in between secure boot and lockdown mode. The main complaint that Linus Torvalds and Andy Lutomirski had a year ago was about that linkage; they felt that it was unreasonable to force those using secure boot into having a locked-down kernel—and vice versa. At a minimum, kernel developers might well want the flexibility to have one without the other. Changing the fundamental behavior of the kernel based on a BIOS setting that might not be under the control of the user was also seen as highly problematic.

Beyond that big ticket item, there were two other changes. A CONFIG_KERNEL_LOCK_DOWN_FORCE option was added that will build a kernel that always enforces lockdown. Integration with the Integrity Measurement Architecture (IMA) was also dropped, though IMA maintainer Mimi Zohar questioned that plan. There were enough comments that needed addressing to cause Garrett to send a second pull request to security maintainer James Morris in early March.

Zohar was still unhappy with the (lack of) IMA integration, however. Garrett worked on a solution to that, which showed up as a patch in a third pull request on March 25. The patch will use the IMA architecture-specific mechanism to verify a kernel image before allowing it to be booted via kexec:

Systems in lockdown mode should block the kexec of untrusted kernels. For x86 and ARM we can ensure that a kernel is trustworthy by validating a PE [Portable Executable] signature, but this isn't possible on other architectures. On those platforms we can use IMA digital signatures instead.

A patch that disables the use of the bpf() system call in locked-down kernels was also discussed. There are some BPF functions that can read kernel memory, which would allow BPF programs to extract private keys (e.g. the hibernation image signing key) and to alter kernel memory, so the patch simply disabled bpf() entirely. But, given the ever-increasing use of BPF in the kernel, that was seen as a draconian restriction by some. Jordan Glover pointed out that disabling the system call would break some systemd functionality, making locked-down systems less secure.

Disabling BPF was one of the problems that Lutomirski saw with Garrett's approach to decoupling secure boot and lockdown mode. In particular, Lutomirski wanted to see three possible states for lockdown:

Lockdown mode becomes three states, not a boolean. The states are: no lockdown, best-effort-to-protect-kernel-integrity, and best-effort-to-protect-kernel-secrecy-and-integrity. And this BPF mess illustrates why: most users will really strongly object to turning off BPF when they actually just want to protect kernel integrity. And as far as I know, things like Secure Boot policy will mostly care about integrity, not secrecy, and tracing and such should work on a normal locked-down kernel. So I think we need this knob.

The code for disabling direct model-specific register (MSR) writes on x86 systems was also questioned. Writing to MSRs can "lead to execution of arbitrary code in kernel mode", which is why it should be disabled for locked-down kernels. At the behest of Alan Cox, log messages were added to someday facilitate a whitelist of allowed MSR writes, but Thomas Gleixner was not a fan:

Maintaining a whitelist for this is a horrible idea as you will get a gazillion of excuses why access to a particular MSR is sane. And I'm neither interested in these discussions nor interested in adding the whitelist to this trainwreck.

Gleixner would much rather see direct access to /dev/msr go away entirely: "The right thing to do is to provide sane interfaces and that's where we are moving to."

Another complaint came from Greg Kroah-Hartman, who said that the heuristic-based patch that restricted debugfs operations for locked-down kernels should instead simply disable debugfs completely. Garrett noted that previous attempts to do so had resulted in "strong pushback from various maintainers", but Kroah-Hartman said he was willing to handle any of those that come along.

Version 31

He got a chance to do just that after Garrett posted version 31 of the patch set. It addressed the complaints, starting with lockdown state:

Based on Andy's feedback, lockdown is now a tristate and can be made stricter at runtime. The states are "none", "integrity" and "confidentiality". "none" results in no behavioural change, "integrity" enables features that prevent untrusted code from being run in ring 0, and "confidentiality" is a superset of "integrity" that also disables features that may be used to extract secret information from the kernel at runtime.

[...]

In the general case, I'd expect distributions to opt for nothing stricter than "integrity" - "confidentiality" seems more suitable for more special-case scenarios.
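The resulting policy can be pictured as a simple ordered comparison; the enum and helper below are an illustrative userspace sketch with invented names, not the kernel's actual identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names for the three lockdown states described above;
 * the real kernel symbols may differ. */
enum lockdown_level {
	LOCKDOWN_NONE,            /* no behavioural change */
	LOCKDOWN_INTEGRITY,       /* block untrusted code in ring 0 */
	LOCKDOWN_CONFIDENTIALITY, /* integrity, plus no secret extraction */
};

/* An operation restricted at a given level is denied once the running
 * level reaches it; lockdown can only be made stricter at runtime. */
static bool lockdown_permits(enum lockdown_level running,
			     enum lockdown_level restricted_at)
{
	return running < restricted_at;
}
```

Under this model, tracing facilities that merely read kernel memory would be cut off at "confidentiality" but keep working at "integrity", which is exactly the distinction Lutomirski asked for.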

Beyond that, he removed the logging from the MSR-disabling code and disabled opening files in debugfs when in integrity mode. Perhaps predictably, that latter part led to a complaint. Lutomirski said that reading debugfs files should still be allowed for integrity mode. Kroah-Hartman, who doesn't think much of the lockdown idea in general, said that there are legitimate worries about what kinds of information debugfs provides:

Reading a debugfs file can expose loads of things that can help take over a kernel, or at least make it easier. Pointer addresses, internal system state, loads of other fun things. And before 4.14 or so, it was pretty trivial to use it to oops the kernel as well (not an issue here anymore, but people are right to be nervous).

Personally, I think these are all just "confidentiality" type things, but who really knows given the wild-west nature of debugfs (which is as designed). And given that I think this patch series [is] just crazy anyway, I really don't care :)

Garrett seems amenable to changing integrity mode to use the previous scheme and to block all reads in confidentiality mode, but doesn't want to "spend another release cycle arguing about it". That previous scheme would only allow opening "safe" debugfs files for read: those with a 0444 mode and lacking .ioctl() and .mmap() methods.
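That heuristic is easy to express; the sketch below models it in userspace with simplified stand-in types (the real check inspects the file's struct file_operations and the dentry's mode bits):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's file_operations structure. */
struct mock_fops {
	void *unlocked_ioctl;
	void *mmap;
};

/* A debugfs file is "safe" to open for read if it is read-only for
 * everyone (mode 0444) and offers no ioctl or mmap methods that could
 * smuggle in other operations. */
static int debugfs_file_is_safe(unsigned int mode,
				const struct mock_fops *fops)
{
	return mode == 0444 &&
	       fops->unlocked_ioctl == NULL &&
	       fops->mmap == NULL;
}
```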

Overall, the comments seem to be fairly minor problems that can be—have been—addressed easily. While some don't buy the whole idea behind lockdown, and there will always be ways around any of its restrictions due to bugs if nothing else, it is something that some kernel users want. Distributions have been shipping with some form of lockdown for quite some time, so it is pretty hard to argue that it is completely useless.

But, of course, the elephant in the room is Torvalds. He has not commented on any of the recent postings. One might guess that most of his concerns were addressed by the decoupling of secure boot and lockdown mode, but that remains to be seen. Morris has not yet said he will merge the lockdown patches either, which would also seem to be a prerequisite. Reducing out-of-tree patches that distributions feel they need to carry is a good goal, though, so one way or another it seems likely that lockdown will get merged before too long.

Comments (19 posted)

Working with UTF-8 in the kernel

By Jonathan Corbet
March 28, 2019
In the real world, text is expressed in many languages using a wide variety of character sets; those character sets can be encoded in a lot of different ways. In the kernel, life has always been simpler; file names and other string data are just opaque streams of bytes. In the few cases where the kernel must interpret text, nothing more than ASCII is required. The proposed addition of case-insensitive file-name lookups to the ext4 filesystem changes things, though; now some kernel code must deal with the full complexity of Unicode. A look at the API being provided to handle encodings illustrates nicely just how complicated this task is.

The Unicode standard, of course, defines "code points"; to a first approximation, each code point represents a specific character in a specific language group. How those code points are represented in a stream of bytes — the encoding — is a separate question. Dealing with encodings has challenges of its own, but over the years the UTF-8 encoding has emerged as the preferred way of representing code points in many settings. UTF-8 has the advantages of representing the entire Unicode space while being compatible with ASCII — a valid ASCII string is also valid UTF-8. The developers implementing case independence in the kernel decided to limit it to the UTF-8 encoding, presumably in the hope of solving the problem without going entirely insane.
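A tiny illustration of that compatibility: ASCII characters keep their one-byte encodings, while anything beyond ASCII spills into multi-byte sequences whose continuation bytes all share the bit pattern 10xxxxxx. The helper below (written for this article, not part of the kernel code) counts code points by skipping those continuation bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a (valid) UTF-8 string by skipping continuation
 * bytes, which always match the bit pattern 10xxxxxx. */
static size_t utf8_count_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For pure ASCII input the count equals the byte length; "é" (U+00E9) occupies the two bytes 0xC3 0xA9 but counts as a single code point.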

The API that resulted has two layers: a relatively straightforward set of higher-level operations and the primitives that are used to implement them. We'll start at the top and work our way down.

The high-level UTF-8 API

At a high level, the operations that will be needed can be described fairly simply: validate a string, normalize a string, and compare two strings (perhaps with case folding). There is, though, a catch: the Unicode standard comes in multiple versions (version 12.0.0 was released in early March), and each version is different. The normalization and case-folding rules can change between versions, and not all code points exist in all versions. So, before any of the other operations can be used, a "map" must be loaded for the Unicode version of interest:

    struct unicode_map *utf8_load(const char *version);

The given version number can be NULL, in which case the latest supported version will be used and a warning will be emitted. In the ext4 implementation, the Unicode version used with any given filesystem is stored in the superblock. The latest version can be explicitly requested by obtaining its name from utf8version_latest(), which takes no parameters. The return value from utf8_load() is a map pointer that can be used with other operations, or an error-pointer value if something goes wrong. The returned pointer should be freed with utf8_unload() when it is no longer needed.

UTF-8 strings are represented in this interface using the qstr structure defined in <linux/dcache.h>. That reveals an apparent assumption that the use of this API will be limited to filesystem code; that is true for now, but could change in the future.

The single-string operations provided are:

    int utf8_validate(const struct unicode_map *um, const struct qstr *str);
    int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
		       unsigned char *dest, size_t dlen);
    int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
		      unsigned char *dest, size_t dlen);

All of the functions require the map pointer (um) and the string to be operated upon (str). utf8_validate() returns zero if str is a valid UTF-8 string, non-zero otherwise. A call to utf8_normalize() will store a normalized version of str in dest and return the length of the result; utf8_casefold() does case folding as well as normalization. Both functions will return -EINVAL if the input string is invalid or if the result would be longer than dlen.

Comparisons are done with:

    int utf8_strncmp(const struct unicode_map *um,
		     const struct qstr *s1, const struct qstr *s2);
    int utf8_strncasecmp(const struct unicode_map *um,
		     const struct qstr *s1, const struct qstr *s2);

Both functions will compare the normalized versions of s1 and s2; utf8_strncasecmp() will do a case-independent comparison. The return value is zero if the strings are the same, one if they differ, and -EINVAL for errors. These functions only test for equality; there is no "greater than" or "less than" testing.
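Put together, a filesystem comparing two names case-insensitively would use the API roughly as follows. This is a kernel-side sketch based on the functions described above; the version string is illustrative (in ext4 it would come from the superblock), name1 and name2 are hypothetical qstr values, and error handling is abbreviated:

```c
	struct unicode_map *um = utf8_load("12.0.0");

	if (IS_ERR(um))
		return PTR_ERR(um);
	if (utf8_strncasecmp(um, &name1, &name2) == 0)
		/* names are equal under normalization and case folding */;
	utf8_unload(um);
```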

Moving down

Normalization and case folding require the kernel to gain detailed knowledge of the entire Unicode code point space. There are a lot of rules, and there are multiple ways of representing many code points. The good news is that these rules are packaged, in machine-readable form, with the Unicode standard itself. The bad news is that they take up several megabytes of space.

The UTF-8 patches incorporate these rules by processing the provided files into a data structure in a C header file. A fair amount of space is then regained by removing the information for decomposing Hangul (Korean) code points into their base components, since this is a task that can be done algorithmically as well. There is still a lot of data that has to go into kernel space, though, and it's naturally different for each version of Unicode.

The first step for code wanting to use the lower-level API is to get a pointer to this database for the Unicode version in use. That is done with one of:

    struct utf8data *utf8nfdi(unsigned int maxage);
    struct utf8data *utf8nfdicf(unsigned int maxage);

Here, maxage is the version number of interest, encoded in an integer form from the major, minor, and revision numbers using the UNICODE_AGE() macro. If only normalization is needed, utf8nfdi() should be called; use utf8nfdicf() if case folding is also required. The return value will be an opaque pointer, or NULL if the given version cannot be supported.
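The version encoding itself is just bit-packing; the macro below mirrors the idea of UNICODE_AGE(), though the actual field widths in the kernel header may differ:

```c
#include <assert.h>

/* Illustrative packing of major.minor.revision into one integer;
 * the kernel's UNICODE_AGE() may use different shift widths. */
#define UNICODE_AGE(maj, min, rev) \
	(((unsigned int)(maj) << 16) | \
	 ((unsigned int)(min) << 8)  | \
	  (unsigned int)(rev))
```

One useful property of such a packing is that the integers order the same way as the versions themselves, so "is this version no newer than maxage" is a single comparison.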

Next, a cursor should be set up to track progress working through the string of interest:

    int utf8cursor(struct utf8cursor *cursor, const struct utf8data *data,
	           const char *s);
    int utf8ncursor(struct utf8cursor *cursor, const struct utf8data *data,
		    const char *s, size_t len);

The cursor structure must be provided by the caller, but is otherwise opaque; data is the database pointer obtained above. If the length of the string (in bytes) is known, utf8ncursor() should be used; utf8cursor() can be used when the length is not known but the string is null-terminated. These functions return zero on success, nonzero otherwise.

Working through the string is then accomplished by repeated calls to:

    int utf8byte(struct utf8cursor *u8c);

This function will return the next byte in the normalized (and possibly case-folded) string, or zero at the end. UTF-8-encoded code points can take more than one byte, of course, so individual bytes do not, on their own, represent code points. Due to decomposition, the return string may be longer than the one passed in.

As an example of how these pieces fit together, here is the full implementation of utf8_strncasecmp():

    int utf8_strncasecmp(const struct unicode_map *um,
		         const struct qstr *s1, const struct qstr *s2)
    {
	const struct utf8data *data = utf8nfdicf(um->version);
	struct utf8cursor cur1, cur2;
	int c1, c2;

	if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
	    return -EINVAL;

	if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
	    return -EINVAL;

	do {
	    c1 = utf8byte(&cur1);
	    c2 = utf8byte(&cur2);

	    if (c1 < 0 || c2 < 0)
		return -EINVAL;
	    if (c1 != c2)
		return 1;
	} while (c1);

	return 0;
    }

There are other functions in the low-level API for testing validity, getting the length of strings, and so on, but the above captures the core of it. Those interested in the details can find them in this patch.

That is quite a bit of complexity when one considers that it is all there just to compare strings; we are now far removed from the simple string functions found in Kernighan & Ritchie. But that, it seems, is the world that we live in now. At least we get some nice emoji for all of that complexity 👍.

Comments (62 posted)

Improving the performance of the BFQ I/O scheduler

March 29, 2019

This article was contributed by Paolo Valente

BFQ is a proportional-share I/O scheduler available for block devices since the 4.12 kernel release. It associates each process or group of processes with a weight, and grants a fraction of the available I/O bandwidth proportional to that weight. BFQ also tries to maximize system responsiveness and to minimize latency for time-sensitive applications. Finally, BFQ aims at boosting throughput and at running efficiently. A new set of changes has improved BFQ’s performance with respect to all of these criteria. In particular, they increase the throughput that BFQ reaches while handling the most challenging workloads for this I/O scheduler. A notable example is DBENCH workloads, for which BFQ now provides 150% more throughput. These changes also improve BFQ’s I/O control — applications start about 80% more quickly under load — and BFQ itself now runs about 10% faster.

Let’s start with throughput improvements and, to introduce them, let’s examine the main cause of throughput loss with BFQ.

I/O-dispatch plugging: a necessary evil that lowers throughput

In BFQ, I/O requests from each process are directed into one of a set of in-scheduler queues, called "bfq-queues". Multiple processes may have their requests sent to a shared bfq-queue, as explained in more detail later. A bfq-queue is tagged as being either synchronous or asynchronous if the I/O requests it contains are blocking or non-blocking for the process that issues them, respectively. Read requests tend to be blocking, since the reading process cannot continue without that data; writes are often non-blocking and, thus, asynchronous.

BFQ serves each bfq-queue, one at a time, with a frequency determined by its associated weight. If "Q" is a synchronous bfq-queue then, to preserve Q’s allotted bandwidth, BFQ cannot switch to serving a new bfq-queue when Q becomes temporarily empty while in service. Instead, BFQ must plug the dispatching of other I/O, possibly already waiting in other bfq-queues, until a new request arrives for Q (or until a timeout occurs).

With fast drives, this service scheme creates a critical shortcoming. Only one core at a time can insert I/O requests into a bfq-queue; a single core may easily be slower to insert requests than a fast drive can serve the same requests. This results in Q often becoming empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty then, during the servicing of Q, there will be frequent time intervals during which Q is empty and the device can only consume the I/O already submitted to its hardware queues (possibly even becoming idle). This easily causes considerable loss of throughput.

The new changes address this issue in two ways: by improving how BFQ tries to fill the resulting service holes and by reducing the cases where I/O dispatching is actually plugged. We will only look at the main new improvement concerning the second countermeasure.

Improving extra-service injection

BFQ implements an I/O-injection mechanism that tries to fill the idle times occurring during the servicing of a bfq-queue with I/O requests taken from other, non-in-service bfq-queues. The hard part is finding the right amount of I/O to inject so as to both boost throughput and not break bandwidth and latency guarantees for the in-service bfq-queue. Before the changes described in this section, the mechanism tried to compute this amount as follows. First, it measured the bandwidth enjoyed by a given bfq-queue when it was not subjected to any extra-service injection. Then, while that bfq-queue was in service, BFQ tried to inject the maximum possible number of extra requests that did not cause the target bfq-queue's bandwidth to decrease too much.

This solution had an important shortcoming: for bandwidth measurements to be stable and reliable, a bfq-queue must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not happen with many workloads. With the new changes, the service times of single I/O requests, rather than the bandwidth experienced by a bfq-queue, are measured. Injection is then tuned as a function of how much it increases service times. Single-request service times are meaningful even if a bfq-queue completes few I/O requests while it is in service.
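In pseudocode terms, the new heuristic amounts to comparing observed per-request service times against a baseline; the sketch below illustrates the idea with made-up names and a made-up tolerance, not BFQ's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Allow injection only while the in-service queue's single-request
 * service time stays within some tolerance of its baseline (the 10%
 * figure here is invented for illustration). */
static bool may_inject(unsigned int service_time_us,
		       unsigned int baseline_us)
{
	/* tolerate up to 10% inflation of the service time */
	return service_time_us * 10 <= baseline_us * 11;
}
```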

The throughput boost on SSDs is now about 50% on the hardest workloads for BFQ: those that trick BFQ into doing a lot of often unnecessary plugging of I/O dispatches. We’ll see this result on a graph in a moment, combined with the throughput boost provided by the following improvement.

Disable queue merging on flash storage with internal queuing

Some applications, such as QEMU, spawn a set of processes that perform interleaved I/O. Such an I/O, taken in isolation (per process), appears random, but it becomes sequential when merged with that of all the other processes in the set. To boost throughput with these processes, BFQ performs queue merging; it redirects the I/O of these processes into a common, shared bfq-queue.

Since they are ordered by I/O-request position, the I/O requests in the shared bfq-queue are sequential. On devices like rotational disks, serving such a sequential I/O definitely boosts throughput compared with serving the random I/O generated by the processes separately. But that is not the case on flash storage devices with internal queuing, which enqueue many I/O requests and serve them in parallel, possibly reordering requests so as to maximize throughput. Thanks to these optimizations and their built-in parallelism, these devices reach the same throughput for interleaved I/O, with or without BFQ reordering. In view of this fact, the new changes disable queue merging altogether on these devices.

As counter-intuitive as it may seem, disabling queue merging actually boosts throughput on these devices; queue merging tends to make many workloads artificially more uneven. Consider the case where one of the bfq-queues in a set of merged bfq-queues has a higher weight than a normal bfq-queue, and where the shared bfq-queue inherits that high weight. I/O dispatching must be plugged while serving the shared bfq-queue, to preserve the higher bandwidth demands of this bfq-queue. In addition, the bfq-queue is filled by several processes, so it tends to remain active for a longer time than normal bfq-queues. In the end, it may force BFQ to perform I/O plugging for a lot of time, hurting overall throughput.

To evaluate the benefits of this improvement, we measured the throughput with DBENCH for the configuration causing the highest throughput loss with BFQ: six clients on a filesystem with journaling, with the journaling daemon enjoying a higher weight than normal processes, and with all other parameters configured as in the DBENCH test in the MMTests suite. The throughput grew by about 50% on SSDs. The combined effect of this and the service-injection improvement is shown below:

[Figure: DBENCH throughput]

This plot shows throughput on a PLEXTOR PX-256M5 SSD, compared with the maximum throughput reached by the mq-deadline scheduler (which, in turn, achieves the highest throughput among non-BFQ I/O schedulers).

Improving responsiveness

When waiting for the arrival of a new I/O request for the in-service bfq-queue, a timeout needs to be set to avoid waiting forever if the processes associated with the bfq-queue have stopped doing I/O. Even if the timeout avoids infinite waiting, the drive is still not fed with new I/O until the timer fires (in the absence of injection). This lowers throughput and inflates latencies. For this reason, the timeout is kept relatively low; 8ms is the current default.

Unfortunately, such a low value may cause a violation of bandwidth guarantees for processes that happen to issue new I/O requests just a little too late. The higher the system load, the higher the probability that this will happen; it is a problem in scenarios where service guarantees matter more than throughput. One important case is when, for example, an application is being started, or is performing some interactive task (such as opening a file). To provide a high level of responsiveness to the application, its I/O requests must be served quickly. This implies that, in the presence of other workloads competing for storage bandwidth, the bfq-queue for the application must be granted a high fraction of the available storage bandwidth. To reach this goal, BFQ tries to automatically detect such queues and raise their weight. But the benefit of this higher weight will be lost in case of late I/O arrivals.

To address this issue, BFQ now places a 20ms lower bound on the dispatch-plugging timeout for weight-raised bfq-queues. This simple change reduces application start-up times under load by up to 80%. This plot shows the start-up times of GNOME Terminal on a PLEXTOR PX-256M5S drive while ten files are being read sequentially in parallel (10r-seq), or while five files are being read sequentially in parallel, and five more files are being written sequentially in parallel (5r5w-seq).

[GNOME terminal startup times]
As a reference, start-up times with KYBER are reported too; they are the second lowest start-up times after those with BFQ.

Reducing execution time

Handling queue merging costs CPU time, so disabling it reduced the execution time of BFQ by about 10%, as shown below for an Intel Core i7-2760QM system:

[BFQ execution time]

As a reference, the figure also shows the execution time of mq-deadline, the simplest available I/O scheduler in the kernel.

To provide more detail: the total times in the figure are the sums of the execution times over three request-processing events: enqueue, dispatch, and completion. So the amortized cost of BFQ, per event, decreased to about 0.6µs, against 0.2µs for mq-deadline.

This improvement reduced the number of CPU and drive configurations for which BFQ cannot currently be used (but mq-deadline can) due to its execution cost. The remaining configurations are those in which switches between user and kernel context, plus 0.2µs of I/O-scheduling overhead, are feasible, but for which an extra 0.4µs per event is not tolerable.

Conclusion

Thanks to these new improvements, BFQ now seems to be on par with the other I/O schedulers in terms of throughput, even with workloads that previously "fooled" its heuristics. The execution time of BFQ is still higher than that of the other I/O schedulers, but it is not higher than single-request service times on fast drives. The remaining problem is that BFQ uses a single, per-device scheduler lock. Stay tuned for future work, which will increase parallelism within the BFQ scheduler itself.

[I wish to thank Alessio Masola for making the very first version of the patch that disables queue merging, Francesco Pollicino for patiently testing the various versions of these patches thousands of times, and Mathieu Poirier for carefully revising the first draft of this article.]

Comments (11 posted)

Some slow progress on get_user_pages()

By Jonathan Corbet
April 2, 2019
One of the surest signs that the Linux Storage, Filesystem, and Memory-Management (LSFMM) Summit is approaching is the seasonal migration of memory-management developers toward the get_user_pages() problem. This core kernel primitive is necessary for high-performance I/O to user-space memory, but its interactions with filesystems have never been reliable — or even fully specified. There are currently a couple of patch sets in circulation that are attempting to improve the situation, though a full solution still seems distant.

get_user_pages() is a way to map user-space memory into the kernel's address space; it will ensure that all of the requested pages have been faulted into RAM (and locked there) and provide a kernel mapping that, in turn, can be used for direct access by the kernel or (more often) to set up zero-copy I/O operations. There are a number of variants of get_user_pages(), most notably get_user_pages_fast(), which trades off some flexibility for the ability to avoid acquiring the contended mmap_sem lock before doing its work. The ability to avoid copying data between kernel and user space makes get_user_pages() the key to high-performance I/O.

If get_user_pages() is used on anonymous memory, few problems result. Things are different when, as is often the case, file-backed memory is mapped in this way. Filesystems are generally responsible for the state of file-backed pages in memory; they ensure that changes to those pages are written back to permanent storage, and they make sure that the right thing happens when a file's layout on that storage changes. Filesystems are not designed to expect that file-backed pages can be written to without their knowledge, but that is exactly what can happen when those pages are mapped with get_user_pages().

Most of the time, things happen to work anyway. But if an I/O operation writes to a page while the filesystem is, itself, trying to write back changes to that page, data corruption can result. In some cases, having a page unexpectedly become dirty can cause filesystem code to crash. And there is a whole new range of problems that can turn up for filesystems on nonvolatile memory devices, where writing to a page directly modifies the underlying storage. Filesystems implementing this sort of direct access (a mode called "DAX") can avoid some problems by being careful to not move file pages around while references to them exist, but that leads to different kinds of problems if pages mapped with get_user_pages() remain mapped for long periods of time. Naturally, certain subsystems (notably the RDMA layer) do exactly that.

Tracking get_user_pages() mappings

When memory is mapped into kernel space with get_user_pages() the reference count for each page is incremented; among other things, that prevents those pages from being evicted from RAM for as long as the mapping persists. When the kernel is done with those pages, the references are released by calling put_page() on each page; put_page() is a generic function that is used to release any reference to a page. There is currently no infrastructure for tracking references resulting specifically from get_user_pages() calls, so there is no way for any other kernel subsystem to know when such references exist.

John Hubbard is trying to change that situation with a simple patch adding put_user_page(), which is intended to be called instead of put_page() when releasing references created by get_user_pages(). In this patch set, the new function is defined as:

    static inline void put_user_page(struct page *page)
    {
	put_page(page);
    }

In other words, put_user_page() simply turns into a call to put_page(), with no other changes. It clearly is not solving any problems in its own right, but it is a first step in a larger strategy.

The next step is to locate all get_user_pages() callers in the kernel and convert them to use put_user_page(); there are quite a few of those, so this process is expected to take a while. Once that has been done, though, those functions can be changed to allow for separate tracking of references created by get_user_pages(). According to Jérôme Glisse, the plan is to increment the page reference counts by a large value (called GUP_BIAS) rather than by one. Any page with a reference count greater than GUP_BIAS can then be assumed to have references created by get_user_pages(), meaning that it might be written to, without warning, by some peripheral device on the system.

The next step appears to be a little fuzzier; Glisse describes it as "decide what to do for GUPed page". The thoughts seem to include keeping such pages in a dirty state at all times; writeback by filesystems would also be performed using bounce pages in an attempt to avoid corruption problems. Keeping pages dirty would disable a lot of filesystem features (such as copy-on-write) but, Glisse said, "it seems to be the best thing we can do". Another idea, suggested by Hubbard, is to introduce a "file lease" mechanism that would allow for coordination between user space and the kernel when filesystem code needs to shuffle pages around.

This patch has found its way into the -mm tree, and thus into linux-next, so it seems likely to be merged for 5.2.

FOLL_LONGTERM

When get_user_pages() was first added to the kernel, it was assumed that pages would be kept mapped for short periods of time. Over the years, that assumption has become increasingly invalid; now mappings from subsystems like RDMA can literally last for days, and the new io_uring interface can also create mappings with an indefinite lifetime. Such long-lived mappings can be a stress for any filesystem implementation, but they are especially problematic for those that implement DAX. A file page that is referenced will block a number of important operations, from copy-on-write to basic housekeeping when a file is deleted.

In fact, long-lived references have been deemed to be fundamentally incompatible with DAX filesystems. As a result, a variant called get_user_pages_longterm() was merged for the 4.15 kernel release; it functions like get_user_pages() with the exception that it will refuse to create a mapping on filesystems where DAX is enabled. Creators of long-lived mappings can use this function to avoid causing problems for filesystems on nonvolatile-memory devices. This helps to head off one problem with long-lived mappings, but creates another: even though these mappings can last a long time, users like RDMA would still rather use get_user_pages_fast() to create them efficiently — and there is no get_user_pages_fast_longterm(). Creators of long-lived mappings are thus stuck using the slower interface.

Ira Weiny is trying to address this limitation with a patch set adding a new FOLL_LONGTERM flag to get_user_pages(). This flag requests the same functionality as a call to get_user_pages_longterm() does now; indeed, that function is reimplemented to use the new flag as part of the patch set. But making it available in the core of get_user_pages() means that this flag can now be passed to get_user_pages_fast(); that, in turn, means that creators of long-lived mappings can do so more quickly.
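As a rough sketch of how a driver might use the new flag, consider the fragment below; this is illustrative kernel-side code that cannot be built outside a kernel tree, the signature follows the 5.2-era API, and details may differ from the merged code:

```c
/* Hedged sketch: pinning a long-lived user buffer with FOLL_LONGTERM
 * through the fast path; not an excerpt from the actual patch set. */
#include <linux/mm.h>

static int pin_longterm_buffer(unsigned long start, int nr_pages,
			       struct page **pages)
{
	/*
	 * FOLL_LONGTERM makes the call fail on DAX-backed mappings
	 * rather than create a pin that could block the filesystem
	 * indefinitely; FOLL_WRITE allows the device to write to the
	 * pages.  Returns the number of pages pinned, or a negative
	 * error code.
	 */
	return get_user_pages_fast(start, nr_pages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}
```

Each page pinned this way would eventually be released with put_user_page() rather than put_page(), per the conversion described earlier.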

This patch set, too, is currently present in linux-next, and is thus likely to be in the 5.2 release.

While both patch sets improve the situation, they are both just nibbling at a big problem that has been vexing memory-management and filesystem developers for years. There will doubtless be a lot of discussion on the topic at the upcoming LSFMM Summit and afterward as well. get_user_pages() may make things fast, but the process of making it play well with filesystems in all settings is not.

Comments (1 posted)

Program names and "pollution"

By Jake Edge
April 2, 2019

A Linux user's $PATH likely contains well over a thousand different commands that were installed by various packages. It's not immediately obvious which package is responsible for a command with a generic name, like createuser. There are ways to figure it out, of course, but perhaps it would make sense for packages like PostgreSQL, which is responsible for createuser, to give their commands names that are less generic—and more easily disambiguated—such as pg_createuser. But renaming commands down the road has "backward compatibility problems" written all over it, as a recent discussion on the pgsql-hackers mailing list shows.

Someone with the unlikely name of "Fred .Flintstone" started things off with a post complaining that PostgreSQL "pollutes the file system" with generic program names. The post suggested that names either be prefixed with "pg_" or that they become subcommands of a wrapper command, à la Git: postgresql createuser. It is not the first time that the topic has been raised; Andreas Karlsson pointed to this thread from 2008, and Tom Lane reached further back to a discussion in 1999.

At issue are a handful of commands that come with PostgreSQL and are potential sources of confusion for users: createdb, dropuser, vacuumdb, and so on. As Lane pointed out, though, the outcomes from the previous discussions make it pretty clear what will probably happen this time as well:

If we didn't pull the trigger twenty years ago, nor ten years ago, we're not likely to do so now. Yeah, it's a mess and we'd certainly do it differently if we were starting from scratch, but we're not starting from scratch. There are decades worth of scripts out there that know these program names, most of them not under our control.

One command that was not mentioned in the early going, perhaps because it is so widely used in scripts, is initdb. Julien Rouhaud thought its name was more confusing than some of the others that had been mentioned, but Lane disagreed:

Meh. The ones with "db" in the name don't strike me as mortal sins; even if you don't recognize them as referring to a "database", you're not likely to guess wrongly that you know what they do. The two that seem the worst to me are createuser and dropuser, which not only have no visible connection to "postgres" or "database" but could easily be mistaken for utilities for managing operating-system accounts.

That led Alvaro Herrera to suggest making symbolic links from pg_* to, at least, createuser and dropuser. That would cause no change but, at some point, a deprecation warning could be printed for the unadorned versions and, eventually, they could perhaps be dropped entirely. But Tomas Vondra wondered what problem was truly being solved:

Can someone describe a scenario where this (name of the binary not clearly indicating it's related postgres) causes issues in practice? On my system, there are ~1400 binaries in /usr/bin, and for the vast majority of them it's rather unclear where do they come from.

He went on to note that there are multiple ways for users to figure out what some random binary does, including man, -h or --help flags, or asking the package manager. Most seem to agree that some of the names are too generic (createuser and dropuser in particular), but there are not likely to be name conflicts with other tools since PostgreSQL has 20+ years of seniority at this point. Even though there is support for some kind of rename, doing so will cause pain—and not for the PostgreSQL project. As Lane put it:

The whole thing is unfortunate, without a doubt, but it's still unclear that renaming those programs will buy anything that's worth the conversion costs. I'd be happy to pay said costs if it were all falling to this project to do so ... but most of the pain will be borne by other people.

A suggestion from Chris Travers that perhaps createuser and dropuser should just be removed led to concerns that leaving it to users to write their own shell scripts might result in security problems. The psql command can be used to create users, but the way to do so is somewhat non-obvious—more obvious alternatives could lead to SQL injection holes. But Lane wondered what the overarching plan is; will createuser actually be removed at some point, especially given that the postmaster command has been deprecated for more than 12 years but still has not been removed? Peter Eisentraut agreed that deprecation was not the project's strong suit, so: "How about we compromise in this thread and remove postmaster and leave everything else as is. ;-)"

Herrera argued that clearing up the confusion should be done as a service to future users. "The implicit argument here is that existing users are a larger population than future users. I, for one, don't believe that." Some seem to think that simply adding symbolic links for the pg_* variants might help things down the road, however. Herrera suggested adding the links and leaving it for a fictional future AI to do the deprecation. David Steele concurred with that plan: "+1 to tasking Skynet with removing deprecated features. Seems like it would save a lot of arguing."

Jokes aside, it doesn't really seem like the idea is going any further than it did in 2008 or 1999. As Lane repeatedly said, if the project were starting from scratch, surely other choices would be made; at this point, though, there are two decades of precedent, scripts, and muscle memory to overcome. That sentiment likely affects plenty of other projects, especially those that have been around for many years—free software grew up in a much smaller pool.

For the future, though, we will probably see fewer of these kinds of problems. New projects are generally thinking about "pollution" and finding ways to make it clear which binaries go with a particular package/project. That is a good thing, since the number of packages we install is only going to grow.

Comments (51 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Chef goes open source; Guix & Software Heritage; Linux Vendor Firmware Service; VMware suit ends; LJ turns 25; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...
Next page: Brief items>>

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds