LWN.net Weekly Edition for April 26, 2018 [LWN.net]

Welcome to the LWN.net Weekly Edition for April 26, 2018

This edition contains the following feature content:

A successful defense against a copyright troll: a report on Patrick McHardy's GPL-enforcement loss in Germany.
Finding Spectre vulnerabilities with smatch: a free-software tool for finding Spectre gadgets, finally.
Zero-copy TCP receive: accelerating TCP data reception for some specialized use cases.
Rhashtables: under the hood: Neil Brown's documentation of how rhashtables work.
Rewiring x86 system-call dispatch: a low-level change in how system calls are called on x86.
The 2018 Linux Storage, Filesystem, and Memory-Management Summit: the first set of articles from LSFMM 2018.

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.
Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A successful defense against a copyright troll

By Jake Edge
April 23, 2018

LLW

At the 2018 Legal and Licensing Workshop (LLW), which is a yearly gathering of lawyers and technical folks organized by the Free Software Foundation Europe (FSFE), attendees got more details on a recent hearing in a German GPL enforcement case. Marcus von Welser is a lawyer who represented the defendant, Geniatech, in a case that was brought by Patrick McHardy. In the presentation, von Welser was joined by Armijn Hemel, who helped Geniatech in its compliance efforts. The hearing was of interest for a number of reasons, not least because McHardy withdrew his request for an injunction once it became clear that the judge was leaning in favor of the defendants—effectively stopping this case dead in its tracks.

Copyright trolling

Von Welser began by noting some of the reasons that Germany is an attractive venue for copyright trolls. It is inexpensive to file cases and the process is relatively quick. The practice of the courts in Germany is to issue an interim injunction without an oral hearing. In order to avoid that interim injunction, defendants must sign a cease-and-desist declaration that then subjects them to contractual fines. There is also a lack of transparency in the German court system, which allows trolls to operate largely anonymously.

The tactics used by McHardy are to first notify the company of the GPL violation and ask for a cease-and-desist declaration that would subject the company to a flexible contractual fine. If that is signed, it is followed up with another letter, pointing to a different GPL violation, that asks for a second cease-and-desist declaration with a fixed contractual fine. After that is signed, further violations are alleged, each of which comes with a request for the fines (which can be a five-figure value per infraction), along with further cease-and-desist declarations with even higher fines.

A timeline of the McHardy versus Geniatech case was then presented. The company received a letter from McHardy on July 17, 2017 requesting the first cease-and-desist declaration in order to avoid an interim injunction. Instead of signing it, the company replied a week later questioning some of McHardy's claims, which appeared to anger him. On July 26, he replied by terminating Geniatech's GPLv2 license on McHardy's netfilter contributions in the kernel. Hemel noted that there were two elements here that were quite different from most of McHardy's cases: the defendant did not sign the first warning and McHardy explicitly terminated his license to the code for the defendant.

In his filings, McHardy referred to the Linux kernel in general, not to a specific version or one that was tied to the firmware in Geniatech's product. In its reply to his first letter, Geniatech raised questions about McHardy's contributions to the kernel, noting that his contributions were not integral to the kernel itself. The regional court in Cologne did not understand, however, and granted an interim injunction on August 23 that stopped Geniatech from distributing (or even hyperlinking to) any version of the kernel.

Arguments

That injunction was appealed and there was an oral hearing at the higher regional court of Cologne on March 7, 2018, where McHardy eventually withdrew his application for an injunction. The arguments that Geniatech made at that hearing seemed to resonate with the judge, according to von Welser and Hemel.

McHardy claimed joint authorship in the kernel, which gave him the right to enforce the copyright of the whole, but Geniatech argued that McHardy's contributions were simply adaptations. The kernel was authored by Linus Torvalds and released under the GPLv2, which explicitly grants the right to modify the work. Modifying the work under that clause does not give rise to joint authorship, but were instead adaptations under German law. If it is not a work of joint authorship, McHardy could only enforce the copyrights of his changes, not those of the whole Linux kernel.

Beyond that, McHardy's arguments that his modifications are protected by copyright and were infringed by the defendant were not proven. He alleged that he contributed 50,000 lines of code over the years and provided a CD with the changelogs corresponding to those changes, but did not show that those modifications fulfilled the requirements for copyright protection. In addition, he did not show what parts of the modifications he made were actually used by the defendant.

Another argument that Geniatech used was that McHardy was abusing his rights by approaching multiple companies with the intent of gaining monetarily. This hurts more than just the companies involved; it harms the entire Linux ecosystem by making companies insecure in their use of the operating system, it argued. Hemel said that the judge had heard from colleagues about other companies that had suffered at the hands of McHardy, which helped this argument.

The final element of the abuse-of-rights argument concerned the revocation of the GPL terms for Geniatech by McHardy. This was done to raise pressure on the defendant. In his efforts to enforce the GPL, McHardy is not following community norms, specifically "The Principles of Community-Oriented GPL Enforcement", which prioritizes compliance and not financial gain. He is also not entitled to terminate his GPL grant to Geniatech as it does not comply with German law or the GPLv2 itself, von Welser said.

The final argument, which von Welser thought was "quite striking", was that that the scope of the interim injunction was too broad and not well defined. It enjoined Geniatech from distributing the Linux kernel, not some specific version or versions of the Geniatech firmware. There are more than 100 officially released versions of the kernel that do not include any contributions from McHardy, von Welser said. By ordering Geniatech not to distribute any version of the kernel, the court was covering kernels that were not even part of the dispute with McHardy.

After those arguments had been presented, McHardy and his lawyer discussed it over the telephone (McHardy was not present, according to Harald Welte's account that was linked above). At that point, the request to continue the injunction was withdrawn. McHardy is required to cover the costs of the proceedings, including court fees and lawyer costs for both sides.

Recommendations

The recommendations for companies contacted by McHardy "are quite obvious", von Welser said. They should refuse to sign the cease-and-desist declaration. They should then ensure GPL compliance for their products and prepare a legal defense strategy. That may be easier to do now, since there is more information available from this case. Those that sign the declaration generally do so in the hopes of avoiding a conflict altogether, but it is important to recognize that it likely will not succeed in doing that. Hemel added that if companies do end up signing something with McHardy, they should try to get rid of the confidentiality clause that he demands; shining more light on his tactics will help others defend against them.

In answer to a question from the audience, von Welser said that since there is no judgment, there is nothing to translate to English for others to study. He also noted that it sets no precedent since there was no ruling, but he thinks the arguments make logical sense and will be successful elsewhere.

James Bottomley noted that it was good to have fended off McHardy in this case, but wondered how legitimate (i.e. community-oriented) enforcement would be able to proceed in Germany. Von Welser said that McHardy was trying to make wide claims to make things simpler for himself. If he had showed which parts of the kernel he modified, demonstrated that these modifications are copyrightable (which is not difficult in Germany), and showed how those modifications were used by the defendant, he might have found a different reception by the court. Bottomley said that the goal should be to stop trolls like McHardy, but to still allow legitimate GPL enforcement for the kernel. Von Welser indicated that should still be possible.

[I would like to thank the LLW Platinum sponsors, Intel, the Linux Foundation, and Red Hat, for their travel assistance support to Barcelona for the conference.]

Comments (37 posted)

Finding Spectre vulnerabilities with smatch

By Jonathan Corbet
April 20, 2018

The furor over the Meltdown and Spectre vulnerabilities has calmed a bit — for now, at least — but that does not mean that developers have stopped worrying about them. Spectre variant 1 (the bounds-check bypass vulnerability) has been of particular concern because, while the kernel is thought to contain numerous vulnerable spots, nobody really knows how to find them all. As a result, the defenses that have been developed for variant 1 have only been deployed in a few places. Recently, though, Dan Carpenter has enhanced the smatch tool to enable it to find possibly vulnerable code in the kernel.

Spectre variant 1 is the result of the processor incorrectly predicting the results of a bounds check; it then speculatively executes code with a parameter (such as an array index) that falls outside of its allowed range. This problem can be mitigated by disabling speculative execution in situations where an array index is under the control of a potential attacker. In the kernel, that is done by replacing code like:

    value = array[index];

with:

    index = array_index_nospec(index, ARRAY_SIZE);
    value = array[index];

That's the easy part; the hard part is finding the places in the kernel where the array_index_nospec() macro should be used. Until now, the only tool available has been the proprietary Coverity checker, which is not accessible to everybody and produces a fair number of false positives. As a result, there are only a handful of array_index_nospec() calls in current kernels.

Carpenter's addition to smatch changes that situation by providing a free tool that can search for potential Spectre variant-1 vulnerabilities. The algorithm is simple enough in concept:

What the test does is it looks at array accesses where the user controls the offset. It asks "is this a read?" and have we used the array_index_nospec() macro? If the answers are yes, and no respectively then print a warning.

This test returns a list of about 800 places where array_index_nospec() should be used. Carpenter assumes that a large percentage of these are false positives, and has asked for suggestions on how the test could be made more accurate. Instead of offering suggestions, though, both Thomas Gleixner and Peter Zijlstra confirmed that a number of the reports were accurate; Zijlstra said "I fear that many are actually things we want to fix". He followed up with a patch series fixing seven of them — nearly doubling the number of array_index_nospec() calls in the kernel.

Once the low-hanging fruit has been tackled, there probably will be a focus on improving the tests in smatch to filter out the inevitable false positives and to be sure that vulnerable sites are not slipping through. But, now that there is a free tool to do this checking, progress in this area can be expected to accelerate. Perhaps it will be possible to find — and fix — many of the existing Spectre vulnerabilities before the attackers get there.

Comments (14 posted)

Zero-copy TCP receive

By Jonathan Corbet
April 19, 2018

In the performance-conscious world of high-speed networking, anything that can be done to avoid copying packet data is welcome. The MSG_ZEROCOPY feature added in 4.14 enables zero-copy transmission of data, but does not address the receive side of the equation. It now appears that the 4.18 kernel will include a zero-copy receive mechanism by Eric Dumazet to close that gap, at least for some relatively specialized applications.

Packet reception starts in the kernel with the allocation of a series of buffers to hold packets as they come out of the network interface. As a general rule, the kernel has no idea what will show up next from the interface, so it cannot know in advance who the intended recipient of the next packet to arrive in a given buffer will be. An implementation of zero-copy reception will thus have to map these packet buffers into user-space memory after the packets come in and are associated with an open socket.

That, in turn, implies a set of constraints that must be met. Mapping of memory into a process's address space is done on a per-page granularity; there is no way to map a fraction of a page. So inbound network data must be both page-aligned and page-sized when it ends up in the receive buffer, or it will not be possible to map it into user space. Alignment can be a bit tricky because the packets coming out of the interface start with the protocol headers, not the data the receiving process is interested in. It is the data that must be aligned, not the headers. Achieving this alignment is possible, but it requires cooperation from the network interface; in particular, it is necessary to use a network interface that is capable of splitting the packet headers into a different buffer as the packet comes in.

It is also necessary to ensure that the data arrives in chunks that are a multiple of the system's page size, or partial pages of data will result. That can be done by setting the maximum transfer unit (MTU) size properly on the interface. That, in turn, can require knowledge of exactly what the incoming packets will look like; in a test program posted with the patch set, Dumazet sets the MTU to 61,512. That turns out to be space for fifteen 4096-byte pages of data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP header.

The core of Dumazet's patch set is the implementation of mmap() for TCP sockets. Normally, using mmap() on something other than an ordinary file creates a range of address space that can be used for purposes like communicating with a device. When it is called on a TCP socket, though, the behavior is a bit different. If the conditions are met (the next incoming data chunk is page-sized and page-aligned), the buffer(s) containing that data will be mapped into the calling process's address space, where it can be accessed directly. This operation also has the effect of consuming the incoming data, much as if it had been obtained with recvmsg() instead. That is, needless to say, an unusual side effect from an mmap() call.

When the incoming data has been processed, the process should call munmap() to release the pages and free the buffer for another incoming packet.

If things are not just right (there is only a partial page of data available, for example, or that data is not page-aligned), the mmap() call will fail, returning EINVAL. That will also happen if there is urgent data in the pipeline. In such cases, the call does not consume the data, and the application must fall back to recvmsg() to obtain it.

It has long been conventional wisdom in the kernel community that zero-copy schemes dependent on memory-mapping tricks will struggle to outperform implementations that simply copy the data. There is quite a bit of overhead involved in setting up and tearing down these mappings. Indeed, Dumazet cautioned in the patch introduction that there may not be a benefit if the application uses a lot of threads, since the contention for the mmap_sem lock will become too expensive. But it is still natural to wonder if performing zero-copy packet reception in this way is worth the trouble.

One way of reducing the cost would be to not call mmap() until several pages of data are available to be consumed, so that they can all be mapped in a single batch. The network stack provides a way to request that the application not be notified until a certain amount of data is pending in the form of the SO_RCVLOWAT option. That said, the socket() man page cautions:

The select(2) and poll(2) system calls currently do not respect the SO_RCVLOWAT setting on Linux, and mark a socket readable when even a single byte of data is available. A subsequent read from the socket will block until SO_RCVLOWAT bytes are available.

That shortcoming would make SO_RCVLOWAT useless for this purpose. That problem appears to to have been fixed in 2008 for the 2.6.28 kernel, though, so the man page is a bit behind the times. Even so, there were still some shortcomings with SO_RCVLOWAT, including spurious wakeups, that Dumazet fixed as a part of this series.

In some benchmark results posted with the core patch, Dumazet shows some impressive improvements in packet-processing performance — from 129µs/MB to just 45µs/MB. Naturally, this is a tuned test running in a controlled setting, but it shows that there are indeed benefits to be had. Those benefits will be generally available before too long; networking maintainer Dave Miller has applied the series for the 4.18 merge window.

Comments (14 posted)

Rhashtables: under the hood

April 21, 2018

This article was contributed by Neil Brown

The first article in this series described the interface to the "rhashtable" resizable hash-table abstraction in Linux 4.15. While a knowledge of the interface can result in successful use of rhashtables, it often helps to understand what is going on "under the hood", particularly when those details leak out through the interface, as is occasionally the case with rhashtable. The centerpiece for understanding the implementation is knowing exactly how the table is resized. So this follow-on article will explain that operation; it will also present the configuration parameters that were skimmed over last time and discuss how they affect the implementation.

The chain gang — tables within tables

An rhashtable table (struct rhashtable) contains a list of bucket tables (struct bucket_table). Normally there is just one bucket table, or two when the table is being resized. In general there can be more, though making this happen in practice would require adding many objects more quickly than the table can be "rehashed", which is the processes of moving objects from the early bucket tables to the last.

Each bucket table has an array of struct rhash_head buckets for the various hash chains and a separate array of spinlocks, each of which protects two or more of the chains during modification. The different insert, lookup, and remove operations act on each bucket table in turn, stopping when they find what they are looking for or when they get to the last bucket table. Insertion will only insert into that last table, if anywhere.

There are a few different times when a new bucket table is added to the chain. When an insert function finds that the total number of objects in the table exceeds 70% of the number of buckets in the last bucket table in the chain, it schedules an asynchronous worker thread to look into this situation. Similarly, when object removal finds that, in a table where automatic shrinking is enabled, the number of objects is less than 30% of the number of buckets, the worker is again scheduled. This worker repeats the checks and possibly allocates a bucket table that is either double or half the previous size. If there is more than one bucket table at that point, the worker will proceed to move all objects to the last.

If insertion ever finds that the number of objects exceeds 100% of the bucket count it assumes that the thread is taking too long, so it takes matters into its own hands and allocates a new bucket table directly. Since this must happen without blocking (none of the rhashtable interfaces except rhashtable_init() and rhashtable_destroy() ever block), that allocation can fail. If it doesn't fail, the new object is inserted in the new table and, hopefully, the rehashing worker thread will eventually catch up. Insertion will also trigger an immediate allocation if it finds that the bucket it wanted to insert into already has 16 objects. Since the table is resized to keep the average bucket occupancy below one, this suggests that something has gone wrong with the hash function — possibly an attacker has found a way to generate lots of objects with the same hash.

One of the per-bucket-table values is a random hash seed (hash_rnd) that is passed to the hash function to be included in the calculated hash. When a new bucket table is allocated, it will have a new hash_rnd, so objects will be placed in different chains. This should mean that the 17 objects that all had the same hash will now be more evenly distributed. This is a significant difference from the original "relativistic hash table" implementation, which assumed that the hash of an given object remained unchanged; the current implementation deliberately changes it for each new bucket table.

Performing the rehash

Whenever a new bucket table is allocated, whether to change the size of the table or just to change the hash seed, every object in the old table must be moved to the new table. This process seems labor-intensive, but is pleasantly simple. Since it happens rarely and is performed asynchronously with respect to all other accesses, the cost is unlikely to be a concern.

At a simple level, the rehash process takes one object from the old table and inserts it in the new table, then repeats that until no objects are left. This can only be correct if an asynchronous lookup for that object, which proceeds with no locking, will always be certain of finding the object. To provide this correctness, rhashtable always moves the last object from a hash chain, and attaches it to the start of the destination hash chain. It attaches the object to the new chain before removing it from the old chain; see the diagram to the right. In pseudocode, the full rehash sequence is:

    for each bucket in the first bucket table
        take a lock on the current bucket
    	while the current bucket is not empty
            find the last object in the bucket
            determine the destination bucket in the last bucket table
            lock the destination bucket
            attach the object to the head of the new bucket
            remove the object from the tail of the old bucket
       	    unlock the new bucket
        unlock the current bucket
    detach the first bucket table

This insert-before-remove sequence means there is never a time when the object is not in the table, so a lookup will never fail to find the object. It also means that, for a short time, a chain in the old table is connected to a chain in the new table. Having joined chains will not hurt lookup as it already expects to see unwanted objects in the chain, so a few more cannot hurt. It also cannot hurt insert or removal as they take bucket locks that the rehash process also takes, so they will never see the joined chain. It can affect a rhashtable_walk_next() though.

Walking the hash table while the table is being rehashed

As mentioned in the previous article, it is not possible to guarantee that a walk of the hash table will see objects at most once if a rehash could happen while the walk is proceeding. One reason is that, as we have just seen, the "current" object might move from one bucket table to another and be attached to the start of a hash chain. If that hash chain was not empty, then the objects already there might have been moved from a hash chain that has already been walked along. Since hash chains are kept short this is quite unlikely, but it is a real possibility.

When rhashtable_walk_next() completes the last hash chain in a bucket table, it checks if there is another in the chain of tables. Normally, there is not and rhashtable_walk_next() will return NULL to indicate completion. If there is, rhashtable_walk_next() will return ‑EAGAIN and prepare to walk through the next table. When a caller receives ‑EAGAIN, it will know that a rehash is underway and, thus, it might have already seen a duplicate. It can choose to give up and return an error or incomplete data, or it can keep going and expect to see every object in the table again, including any that it might have missed the first time as they might have been rehashed before the walker saw them.

If a caller needs to be aware of rehashing, and is using rhashtable_walk_stop() and rhashtable_walk_start() to drop the RCU read lock, it should use rhashtable_walk_start_check() instead of rhashtable_walk_start(). This alternate interface will report ‑EAGAIN if the bucket table that was being walked through was discarded while the lock wasn't held. This has exactly the same implications as rhashtable_walk_next() returning ‑EAGAIN in that subsequent calls will start walking the next bucket table.

The only way to guarantee that a walk sees all objects exactly once would be to disable the rehash process. There is no documented interface for doing this, but in the current implementation it can be done by claiming the hash table mutex:

    mutex_lock(&my_objects.mutex);

This call will wait for any current rehash to complete, and will stop future rehashes from starting. If lots of objects are added while the mutex is held, the insert function will eventually allocate a new bucket table and will start inserting into that. If this memory allocation fails, the insertion will fail. This is one of a small number of ways that insertion can fail.

Failure during insertion

It must be expected that insertion into the hash table will fail if the key is already in use, unless of course you use rhashtable_insert_fast(), which doesn't check. There are a few other circumstances that can trigger errors; three error codes (beyond ‑EEXIST) are possible:

‑ENOMEM is returned when the insert function tries to allocate a bucket table and fails. If the required table is larger than one page, it will try to fall back on allocating a single page and use that to hold pointers to other pages to be allocated later. If this works we might not get ‑ENOMEM immediately but could still get it if a subsequent single-page allocation fails.
‑EBUSY is returned if the insert function finds that there is more than one bucket table and the last one has 16 objects already in the hash chain. This is almost impossibly unlikely, unless the rhashtable is using a custom hash function that is ignoring the random seed or doing something else strange.
‑E2BIG is returned if there are already 2³¹ objects in the hash table. This prevents the element counter from overflowing. ‑E2BIG can also be returned if a maximum table size was specified when the table was created (as described below), and the number of objects has reached twice that maximum size.

The full scoop on configuration parameters

Now that we know how rehashing works, and how it affects walking the table, it remains only to see how it can be configured through values in the rhashtable_params structure. Of the twelve fields in this structure we have already met six (key_len, key_offset, head_offset, hashfn, obj_cmpfn, and obj_hashfn) and five are related to configuring the size for rehashing. The final parameter is nulls_base. It is essentially unused and unusable in the current implementation, so it is best ignored.

The size-related parameters are:

u16 nelem_hint;
nelem_hint is documented as a hint regarding the number of elements that the table will have, but is never really used that way; in fact I cannot imagine how anyone would be able to make a meaningful guess. Some callers set it to two or three with the intention to "use a really small initial bucket table" — both of these values result in four buckets being allocated.
Other callers use values like 192, 384, or 768 which mean "use a bigger initial table, maybe 256, 512, or 1024 buckets". I can understand memory conscious coders asking for a small initial number of buckets if they don't think the table will be used much, though the default of 64 buckets won't often be a burden. It is harder to understand why people would use a resizable hash table, but not just give it complete control of the size.
unsigned int max_size;
The max_size is rounded down to a power of two, then used as a maximum size for the bucket table and half the maximum number of objects in the table. This maximum number of objects is a hard limit; insertion will fail when it is reached. The default maximum is 2³¹.
bool automatic_shrinking;
If this flag is set, the table will be resized downwards (halving the size) when the residency drops below 30% (objects per bucket). Resizing upward is always enabled, resizing downward must be explicitly requested as in some cases it might be wasteful effort.
unsigned int min_size;
min_size is rounded down to a power of two but kept to at least four (the default). This puts a lower bound on resizing the bucket table. When automatic_shrinking is not set, this value is ignored; it does not, for example, impose a minimum on the initial allocation.
u8 locks_mul;
As mentioned, each bucket table has a set of locks protecting the buckets: two or more buckets per lock. On machines with few CPUs, even having only half as many locks as buckets can be wasteful when there are a great many buckets. To avoid wasting memory on useless locks, rhashtable limits the number of locks using the following formula:
```
    num_locks = min(num_chains/2, locks_mul * min(num_possible_cpus, 64))
```
If locks_mul is not given, 32 is used, so by default my eight-CPU desktop would get at most 256 locks for each bucket table. The only examples of setting this field in the current kernel involve setting to it to one, which is likely to be plenty for all but the busiest tables.

That completes the set of parameters that affect resizing. My recommendation would be to not set any of them unless you feel that memory conservation is more important than raw speed, in which case set automatic_shrinking to true and locks_mul to 1.

A word of warning

It is nice to have rhashtables properly documented, but I wouldn't expect that situation to last. While working on this article, I have been discussing possible changes with the maintainers. None of these are major and should not invalidate this documentation, but they are unlikely to be the last changes ever suggested. Rhashtables are quite different from what they were four years ago, and may be different again in four more years — the Linux kernel is ever a dynamic place. So please don't trust this description implicitly — check it against the code, just in case.

Comments (5 posted)

Rewiring x86 system-call dispatch

By Jonathan Corbet
April 23, 2018

Each kernel development cycle includes a vast number of changes that are not intended to change visible behavior and which, as a result, go unnoticed by most users and developers. One such change in 4.17 is a rewiring of how system-call implementations are invoked within the kernel. The change is interesting, though, and provides an opportunity to look at the macro magic that handles system-call definitions.

In user space, system calls look like ordinary functions, albeit with the strange convention of returning error codes in the global errno variable. Those functions are indeed just that, in that they are generally wrapper functions defined in the C library (or some other language-specific implementation). Those wrappers are responsible for organizing the system-call arguments properly (placing them into a set of registers defined by the architecture ABI) and triggering a trap into the kernel, where the real work gets done.

Imagine that the application is calling read(). In the 4.16 kernel, the implementation of this system call is:

    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
	    loff_t pos = file_pos_read(f.file);
	    ret = vfs_read(f.file, buf, count, &pos);
	    if (ret >= 0)
		file_pos_write(f.file, pos);
	    fdput_pos(f);
	}
	return ret;
    }

The SYSCALL_DEFINE3() macro at the beginning declares an implementation for the read() system call with three arguments. This is clearly not a normal function definition; one of the first things that jumps out is that the actual declaration of those arguments is done a little strangely. The names of the arguments and their types are separated in this way so that SYSCALL_DEFINE3() can perform whatever impedance-matching is required to map those arguments from the system-call ABI into what a normal C function expects. On the x86 architecture, after quite a bit of macro substitution, the SYSCALL_DEFINE3() line turns into something like:

    asmlinkage long sys_read(int fd, char __user *buf, size_t count) \
        __attribute__((alias(__stringify(SyS_read))));
    asmlinkage long SyS_read(long fd, long buf, long count)
    {
        long ret = SYSC_read((int) fd, (char __user *) buf, (size_t) count);
	return ret;
    }
    static inline SYSC_read(int fd, char __user *buf, size_t count)
    /* SYSCALL_DEFINE3() expansion ends here, function body follows */

As can be seen, two different functions (and one alias) are declared here. SyS_read() is declared with the asmlinkage attribute (so that it expects all arguments on the stack rather than in registers), and with all arguments declared as having the long type, which is how they are passed from user space. This function casts the arguments into the expected types, then calls SYSC_read(), which is the name of the function that ends up containing the actual code implementing the system call. Note that it is declared static inline, so it will be substituted directly into SyS_read().

A pointer to the SyS_read() version is placed in the appropriate location in the sys_call_table array. Then, when the kernel handles a trap for an incoming system call, it comes down to this bit of code in do_syscall_64() (again, on x86):

    if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
	nr = array_index_nospec(nr & __SYSCALL_MASK, NR_syscalls);
	regs->ax = sys_call_table[nr](
		regs->di, regs->si, regs->dx,
		regs->r10, regs->r8, regs->r9);
    }

(The use of array_index_nospec() prevents the processor from executing this call speculatively, thus blocking any attempts to create a speculative call to an address outside of sys_call_table). Since all of the entries in sys_call_table are declared asmlinkage, the arguments will be copied from registers onto the stack before the call is made. Note that six registers are pushed onto the stack regardless of the number of arguments that the system call expects; the unneeded values will simply be ignored. This code reflects the maximum of six arguments that any system call may have.

In 4.17, this mechanism has changed on the x86-64 architecture, thanks to some work done by Dominik Brodowski. The new convention makes use of the fact that the pt_regs structure, created on any trap into the kernel, contains the register state of the user-space process, so it will contain the system-call arguments too. Rather than push all six registers onto the stack, the relevant line in do_syscall_64() now looks like:

    if (likely(nr < NR_syscalls)) {
	nr = array_index_nospec(nr, NR_syscalls);
	regs->ax = sys_call_table[nr](regs);
    }

The wrapper for the system-call implementation needs to change, since the calling convention has changed. In 4.17, the boilerplate for read(), after macro substitution, looks something like:

    asmlinkage long __x64_sys_read(const struct pg_regs *regs)
    {
        return __se_sys_read(regs->di, regs->si, regs->dx);
    }
    static long __se_sys_read(long fd, long buf, long count)
    {
        long ret = __do_sys_read((int) fd, (char __user *) buf, (size_t) count);
	return ret;
    }
    static inline long __do_sys_read(int fd, char __user *buf, size_t count)

The __x64_sys_read() version goes into sys_call_table; its job is to unpack the arguments from the pt_regs structure before calling __se_sys_read(), which will cast those arguments into the proper type and call the real implementation (now named __do_sys_read()).

Getting to this point required quite a bit of work, including passing over the entire kernel to find and fix every direct call into the old system-call entry points. One might well ask why this effort was made. The new implementation is somewhat cleaner in general, but it also keeps unused, caller-supplied data from ending up on the stack. In current kernels, an attacker could call read() with six arguments; the final three would just end up on the stack, unchecked in any way, where they could conceivably turn out to be useful for any of a variety of exploits. By controlling the environment in which system-call code runs a bit more strictly, the new calling convention makes the kernel just a bit harder to compromise.

Comments (3 posted)

The 2018 Linux Storage, Filesystem, and Memory-Management Summit

By Jonathan Corbet
April 24, 2018

Every year, the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) brings together 80 or 90 core kernel developers for a few days of highly technical discussions. The 2018 LSFMM was held April 23 to 25 in Park City, Utah. As usual, LWN was there and able to attend most of the sessions.

Plenary sessions

This year, a relatively small subset of the discussions involved the entire group of developers. Topics addressed by the whole group include:

Filesystem metadata memory management: looking for ways to improve the management of memory used to hold filesystem metadata caches.
Repurposing page->mapping: two separate sessions about freeing up the mapping field in the page structure for more productive uses.
The impact of page-table isolation on I/O performance: how much does PTI slow I/O down?
A mapping layer for filesystems: a way to handle space-management issues for filesystems on thin-provisioned devices that gives rise to a number of other uses.
The trouble with get_user_pages(): the kernel's mechanism for providing access to user-space pages can fail badly in some situations.
Willy's memory-management to-do list: a grab-bag of memory-management issues from Matthew Wilcox, plus a bonus session on reworking struct page.

Memory-management track sessions

The discussions held by the memory-management subgroup were:

A page-table isolation update: the current status of the kernel page-table isolation patches and how developers can avoid the worst performance impacts.
Supporting memory encryption: newer Intel and AMD processors can do memory encryption, but supporting that feature in the kernel presents some challenges.
Heterogeneous memory management and MMU notifiers: a pair of sessions on HMM and the notifier mechanism it depends on.
Exposing storage devices as memory: a proposal for a new character-device abstraction for fast access to storage devices.
Rethinking NUMA: is the NUMA abstraction sufficent to describe today's memory architectures?
The memory-management development process: a discussion on the health of the development community and what might be done to make things work better.
The LRU lock and mmap_sem: two sessions on difficult locking problems in the memory-management subsystem.
The slab and protected-memory allocators: two sessions on allocators for relatively small objects.
Three sessions on memory control groups: various issues, including background reclaim, swap accounting, and out-of-memory handling.
Improving support for large, contiguous allocations: sessions on the contiguous memory allocator and gigantic page allocations.
Toward better performance on large-memory systems: addressing some scalability issues on the largest systems.
Zone-lock and mmap_sem scalability: two more sessions on lock-contention issues in the memory-management subsystem.
Hotplugging and poisoning: problems with adding memory to and removing memory from running systems, whether it be a deliberate action or as a response to hardware trouble.
get_user_pages() continued: looking at possible solutions to the problem.
Reworking page-table traversal: a proposal for a new way of making changes to ranges of page-table entries.
Is it time to remove ZONE_DMA?: a holdover from the past may be reaching the end of its run.

Filesystem track

The filesystem developers participated in the following sessions:

Removing the kthread freezer?: more discussion on how to make progress in the longtime effort to get rid of the kthread freezer.
Fixing error reporting—again: looking at ways to better report errors to user space and to hopefully fix problems that recently bit PostgreSQL.
File-level integrity: a feature akin to dm-verity that allows detection of on-disk alteration of individual files.
A kernel integrity subsystem update: an overview of the subsystem that measures and appraises file contents.
Controlling block-I/O latency: providing a means to specify a latency target for the block-I/O controller.
A new API for mounting filesystems: mount() has some downsides; a proposed new API would try to provide something better.
XFS parent pointers: a new XFS feature that will make it possible to reconstruct paths from inodes.
Shared memory mappings for devices: supporting GPUs and other devices that need to share device memory among multiple processes.
XFS online filesystem scrubbing and repair: a discussion of filesystem scrubbing and whether common interfaces and tools could be developed.
The NOVA filesystem: a look at a proposed Linux filesystem that exclusively targets non-volatile memory.
PostgreSQL visits LSFMM: a discussion led by PostgreSQL developer Andres Freund on the problems the database system has recently encountered.
Network filesystem topics: a look at some needed features to better support SMB, NFS, and others.
SMB/CIFS compounding support: some discussion of support for compounding—combining multiple network filesystem operations into a single server message.
Case-insensitive filesystem lookups: a proposal for a way to add case-insensitive filesystems to the kernel.
A filesystem "change journal" and other topics: a way to get reliable notifications of filesystem changes, even after a crash, was discussed along with performance-regression testing for filesystems.
The ZUFS zero-copy filesystem: an introduction to a user-space filesystem framework targeted at zero-copy operation for persistent memory.
Flash storage topics: a variety of problem areas for flash storage and filesystems on Android.
Messiness in removing directories: Al Viro reported on some problems he found in the rmdir() implementation.
Filesystem test suites: a discussion of how to make filesystem test suites more accessible to other developers.
XArray and the mainline: the status of the XArray radix-tree replacement API.

Filesystem and Storage track sessions

There were a few sessions shared by participants in both the Filesystem and Storage tracks:

Supporting multi-actuator drives: a discussion about the design of the interface to devices with multiple actuators that provide multiple sets of read/write heads for parallel data access.
An update on bcachefs: a discussion on bcachefs and whether it is ready for mainline inclusion.
Handling I/O errors in the kernel: what to do about I/O errors, particularly during writeback?

Group photo

Acknowledgments

Thanks to the LSFMM 2018 program committee for inviting LWN to the event, and to the Linux Foundation, LWN's travel sponsor, for supporting our travel to Park City.

Comments (none posted)

Filesystem metadata memory management

By Jonathan Corbet
April 24, 2018

LSFMM

It is a good thing that strong coffee was served at the 2018 Linux Storage, Filesystem, and Memory-Management Summit; full awareness was required from the first session, in which Josef Bacik discussed some issues that have arisen in the interaction between filesystems and the memory-management subsystem. Filesystems cache a lot of data from files, but also a lot of metadata about those files. It turns out, though, that management of the cached metadata does not work as well as one might like.

In particular, Bacik said, he has been working on better tracking of dirty metadata in memory. This is a problem that every filesystem faces, and each one has come up with its own solution. Left unchecked, the amount of dirty metadata in the system is limited only by the amount of memory available — which can be problematic, given that the owner of the system usually wants to use at least some of that memory for other purposes. The Btrfs filesystem uses a special inode to track and limit dirty metadata, but that leads to "horrifying things" around the process of evicting that metadata when the time comes. Among other things, it doesn't support sub-page-sized blocks.

The Btrfs developers would like to take an approach closer to what is done in XFS, he said. That approach works well, but still suffers from the problem that the metadata cache can grow to fill all of memory. To address this problem, he's working on adding separate counters to track metadata pages, with the idea that filesystems will get a writeback callback from the memory-management subsystem indicating the number of these pages that should be freed. This callback will come from balance_dirty_pages(), which knows how much memory it needs to free and can pass that information down to filesystem code.

Dave Chinner noted that memory accounting at this level is all based on pages, which doesn't necessarily match the problem of tracking filesystem metadata, which exists in many sizes from a fraction of a page to many pages. So tracking dirty metadata in terms of pages doesn't really solve the problem. Bacik responded that he has changed all of the counters to units of bytes to address just this problem. He asked the memory-management developers in the room whether they objected to this change; nobody seemed to.

Moving on, he said that balance_dirty_pages() has a global limit of dirty memory that encompasses both data and metadata pages. But, he asserted, it is generally better to write out data pages than metadata pages when memory must be freed, so Btrfs focuses on the data pages first. Is that behavior good enough, he asked, or should there be separate limits for data and metadata?

Johannes Weiner said that one limit should apply to both types, since the system is committed to writing back all of the dirty data in the end. Jan Kara said, though, that there is a fundamental difference between writing back data, which just has to be pushed out of the page cache, and metadata, which must be written back via calls to shrinkers. Bacik said that the shrinker interface is a part of the problem, since it is not well designed. The Btrfs changes are going to put more pressure on the shrinker interface, since there will be unknown amounts of data attached to each object that might be written back. There is, he said, a lot of memory hidden in places where the memory-management subsystem has no idea how to find it.

Rik van Riel said that, when a filesystem allocates memory, it expects the memory-management code to be able to free up enough memory to satisfy the request. But, while allocations require pages, shrinkers work on smaller objects. A shrinker may free the amount of memory requested by the memory-management code, but those objects can be scattered across memory and the end result may be to free no pages at all. As more memory is taken up by filesystem caches, he admonished, the filesystem has to free those caches or it will find its allocation requests failing. Kara responded, though, that fragmentation tends not to be a huge problem in this context, since the objects involved are relatively large.

Chinner said that XFS does the majority of its metadata writeback by way of shrinkers, but also all of its page-reclaim work. The XFS cache structure is too complicated to express in the page cache, so the page cache is not used for this purpose. So, for XFS, there are no problems resulting from the shrinkers and the page cache not working together. Bacik said, though, that this results in XFS tending to write data randomly during reclaim, which can lead to excessive latencies, so Facebook has had to patch that feature out. There needs to be a way to do XFS metadata writeback in places where latency is expected.

Mike Snitzer suggested that "bufio", a shrinker-based mechanism used by the device mapper might be able to help here; it has recently been enhanced to handle objects with sizes that are not a power of two. Bacik said he'd looked at it, but he thinks that maybe the shrinker interface is not the best one for this problem. Chinner said the solution could be a new type of shrinker; the same mechanism could be used, but the accounting would be adapted to handle the metadata issues.

Chris Mason said that Btrfs has a set of throttling hooks that slow down allocations in places that are known to be safe; it is working well. The only caveat is that things fall apart if the kswapd thread starts performing I/O. In general, Weiner said, the entity that is issuing writes is not the one that is performing allocations, so throttling writes is the wrong way to address the problem. There should be two separate throttling points, one for writes and one for allocations; conflating the two has caused a lot of problems.

Bacik closed the session by saying that he will do the work to add a new shrinker interface with byte-based accounting for metadata.

Comments (none posted)

The impact of page-table isolation on I/O performance

By Jonathan Corbet
April 24, 2018

LSFMM

Ever since kernel page-table isolation (PTI) was introduced as a mitigation for the Meltdown CPU vulnerability, users have worried about how it affects the performance of their systems. Most of that concern has been directed toward its impact on computing performance, but I/O performance also matters. At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Ming Lei presented some preliminary work he has done to try to quantify how severely PTI affects block I/O operations.

This work was done by running the fio benchmark on current hardware. The initial tests, running in a virtual machine, showed a significant impact: a system that could execute just over 1 million I/O operations per second (IOPS) without PTI was reduced to 726,000 IOPS with PTI turned on. The situation changes significantly when the test is run on bare metal on the same machine; in that case, a system that could achieve 1,706,000 IOPS dropped to 1,568,000 IOPS when PTI is turned on. At a little under 10%, that is a smaller impact, but still a significant one.

It's not clear why performance regresses so severely when the test is run under virtualization. There was some theorizing that clock_gettime(), which is called frequently by fio, is not implemented properly on the guest system, but no real answers.

Further tests were done using an NVMe-attached drive. In this case, the IOPS rates were about the same regardless of whether PTI was being used, but the system's CPU utilization was significantly higher in the PTI case.

Lei concluded from his tests that enabling PTI adds about 0.2µs to the execution time of every system call. Normal synchronous I/O operations can be performed with a single system call, so they slow down slightly as a result. Asynchronous I/O operations, as used in the benchmark, require two system calls — one each to io_submit() and io_getevents(). As a result, asynchronous I/O feels the PTI penalty more severely. Interrupts add a similar penalty to each operation as well.

Dave Hansen (who did much of the work to bring PTI to Linux) noted that there was nothing new in these results. There has always been a cost to both interrupts and system calls; PTI just makes those costs worse. He did note that it was nice to see that the IOPS don't drop when there is adequate CPU time available, though.

Block maintainer Jens Axboe said that fio performs three clock_gettime() calls for every I/O operation by default. So, to a great extent, Lei's tests were measuring the impact of PTI on system-call execution time. Bart Van Assche suggested using the options that reduce the number of clock_gettime() calls, just as the session wound down.

Comments (9 posted)

A page-table isolation update

By Jonathan Corbet
April 25, 2018

LSFMM

Dave Hansen did much of the work to get kernel page-table isolation (PTI) into the kernel in response to the Meltdown CPU vulnerability. In the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, he ran a discussion on how PTI came about, what the costs are, and what can be done to minimize its performance impact.

Hansen started by saying that he was not going to talk about the Spectre vulnerabilities, which are not seen as being much of a problem at the memory-management level. This code is relatively far from user space, so forcing a particular speculative-execution path there is generally impractical. The PTI patches, instead, affect memory management directly.

PTI got its start in this paper by Daniel Gruss et al. [PDF], where it was called KAISER. The original purpose behind KAISER was to address threats to kernel address-space layout randomization; the Meltdown vulnerability was not known to the authors when the paper was written. Later on, when it became clear that there was something going on, the authors found Meltdown and disclosed it to Intel — they were one of four independent groups to do so. Hansen wondered how many others had also found the problem but kept it to themselves.

In any case, when Meltdown was discovered, Hansen realized that KAISER was a good way to address the problem. One of the first things he did with the code was to add support for the Intel processor context ID (PCID) feature. PCIDs, support for which was added by Andy Lutomirski around 4.14, allow page tables to be switched without flushing the translation lookaside buffer (TLB), thus mitigating some of the performance impact of PTI (which, by its nature forces a lot of extra page-table switches).

One interesting question is: how long should PTI be enabled in the kernel? The way things stand now, the kernel will turn it off if it detects that it is running on a processor that is not vulnerable to Meltdown. The "performance folks" want that, of course, since they want to get as much performance as possible out of their CPUs. People who are more concerned with security, though, are inclined to leave it enabled indefinitely. There are a number of useful security properties that come from completely separating the kernel and user-space address spaces.

One of the consequences of PTI is, of course, that system calls, page faults, and interrupts all get slower. The PTI patches also turned off the "global pages" feature, which kept the kernel's TLB entries resident at all times. He has a patch set to restore global pages conditionally in the 4.17-rc kernels; there are still some issues with the changes, he said. Another outcome of PTI is that TLB invalidation becomes much trickier. PCIDs allow a small number of address spaces to be active at one time; when pages are shared between address spaces, they must be separately flushed for each active process. There is also a slight increase in memory use when PTI is active.

Users who are concerned about the performance impacts have a few ways to minimize those impacts, most of which are good practice even without PTI. For example, he said, minimize the number of system calls that are made. Try to avoid extra TLB flushes; he mentioned the MALLOC_TRIM_THRESHOLD environment variable that, if misconfigured, can cause glibc to make excessive address-space changes. Huge pages should be used when possible; they can reduce the number of page faults, and also the number of system calls needed to manipulate them. Try to avoid page faults in general; mapping memory with MAP_POPULATE when it will all be accessed is one way to do that.

Michal Hocko said that it might make sense to add more multiplexing system calls for address-space operations. For example, database developers would appreciate a batched version of mprotect() that would make several changes with a single system call. Hansen agreed that this might make sense; he lamented that the API for memory protection keys (which he wrote) was designed around the notion that "system calls are fast". Now they aren't quite as fast and some of those decisions don't look as good as they did before. Rik van Riel said that a vectored version of madvise() would give big performance improvements for memory allocators, and Hansen added that a batched version of munmap() would also be helpful.

Another developer asked about ways to increase locality in general; Hansen replied that huge pages are the answer there. Huge pages were "an afterthought" in the processors where they first appeared, so they were not supported all that well. There were not many TLB entries available for huge pages, for example. But huge pages are a first-class citizen now, and they should be used when possible.

Alexei Starovoitov asked about the ability to turn off PTI on a per-process basis; patches to allow this were circulating in January, but haven't been discussed recently. Hansen replied that the most security-sensitive applications are the ones that need PTI, but they are also often quite performance-sensitive. As a result, the people who need PTI the most are often the ones who most want to turn it off. A per-process disable can thus be problematic; that said, it may now be time to revisit that issue.

Laura Abbott asked about the status of older hardware; there will be users who don't want to update. Hansen replied that older hardware will always need PTI; there are no microcode updates forthcoming that will address Meltdown on that hardware. Abbott also asked about PTI for 32-bit systems; Hansen said that the 32-bit patches will go in soon; they are "not as terrifying" as had been originally thought.

The final question came from Dan Williams, who wondered whether PTI was a useful mitigation for the newer "Spectre prime" variant. Hansen replied that any attack that takes advantage of a shared address space will be mitigated by PTI. Its costs notwithstanding, using PTI to separate the spaces brings some fundamental security benefits, he said.

Comments (none posted)

Supporting Intel/AMD memory encryption

By Jonathan Corbet
April 25, 2018

LSFMM

Once a niche feature, memory encryption is becoming mainstream with support in both Intel and AMD processors, Kirill Shutemov said at the beginning of his session during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit. Memory encryption can harden the system against attack, but it also presents some interesting challenges for the kernel.

Both Intel and AMD use the upper bits of the physical address to mark encrypted pages, he said. AMD processors currently only support a single encryption key, and so use a single bit to indicate that encryption is in use. Intel, instead, encrypts all of memory and uses up to six upper bits to indicate which encryption key is used for each page. The default key is marked by all zeroes in that field.

One interesting challenge is that the CPU's memory caches are based on the physical address — including the encryption bit(s). Encryption is handled by the memory controller, and the same page with two different keys will look like different addresses to the CPU. Unless due care is taken, the same page can thus appear multiple times in the cache under different encryption keys. Writing multiple competing cache lines to the same page will likely corrupt the data there, an outcome that is likely to increase the overall user disgruntlement level. Avoiding it requires carefully flushing the caches whenever the encryption status of a page changes to ensure that no entries remain under the old key.

Doing that flush turns out to be expensive. In an effort to minimize that cost, Shutemov is looking at adding encryption-key awareness to the page allocator. The key that was last used with each page would be remembered; if the page is allocated to a new use with the same key, there will be no need to flush any old cache entries. This can be implemented by putting a wrapper around the current page allocator. It is worth the effort, he said; otherwise allocation of encrypted pages can be up to three times as expensive. Since the intent is that all of memory will be encrypted, this extra cost could hurt overall performance significantly.

One question that arises, though, is: where should the key ID be stored? It needs to be associated with the page structure somehow, and it must be kept around after the page has been freed. Ross Zwisler suggested that perhaps pages could be kept in separate pools, one for each key ID. Shutemov agreed that could be done, but it would involve more significant surgery to the page allocator. There was a period of somewhat rambling exploration of ideas for solutions with no real conclusion reached.

Hugh Dickins asked how key IDs interact with the buddy allocator, which will want to join pages with different IDs into larger groupings. The buddy allocator ignores the IDs, Shutemov said. Cache flushing is done with a single-page granularity, though, so things work as expected.

For the time being, the key ID is being stored in the anon_vma structure; that means that memory encryption only works for anonymous pages. It also is not yet tracking the key ID after pages are freed. Dave Hansen said that the search for a permanent home for the key ID is likely to lead to a challenge all memory-management developers have faced at one time or another: poring over struct page in search of a few available bits that can be used for this purpose. For now, though, Shutemov is looking at storing it in the page_ext structure that is used for information that doesn't fit in struct page.

Michal Hocko worried that adding complexity to the page allocator may be a mistake, especially if memory encryption works better in future CPUs and this level of tracking is no longer needed. He also worried that encryption will add a degree of nondeterminism to the page allocator; the time required to satisfy a request will vary considerably depending on the previous encryption status of the allocated pages. The networking developers, who have been working to reduce allocation times, will complain. It would be better, he said, to ensure that the cost of using encrypted memory falls on those who choose to use it. That suggests just flushing caches when the memory is freed.

Shutemov raised another issue: the direct mapping (the portion of kernel space that, on 64-bit systems, maps directly to all of physical memory) uses the default key. But the kernel will often find itself having to access pages that are encrypted with other keys. One way to handle that would be to bring back something like the kmap() interface to create a temporary mapping to a specific page using the appropriate key. That would be somewhat less efficient than going through the direct mapping, though, and will hurt performance.

The alternative is to partition the direct mapping, creating one section for each possible key ID (of which there are up to 64 in current hardware). The promise of this approach is better, he said, but it's not working yet. The potential problem is that replicating the direct mapping in this way will use a lot of virtual address space, reducing the amount of physical memory that the machine can handle in the process. But, by grabbing a set of physical-address bits for the key ID, memory encryption already reduces the possible amount of physical memory anyway. The other possible disadvantage is that kernel address-space layout randomization would have fewer bits to play with.

The proper API for key management still needs to worked out. With 64 keys available, they can't just be handed out to any process that wants one — at least, not without complicating context switches in unpleasant ways. The user-space API is likely to be based on the existing kernel key-management facilities. A new mprotect() call would be used to apply a key to a range of memory; doing so will destroy the current contents of the indicated range. It would also be nice to have a control-group API at some point, he said.

The final challenge discussed was DMA, which also has to work with memory encryption. On systems with an I/O memory-management unit, encryption should just work. For other systems, it depends on whether the DMA mask for any specific device can handle the full range of physical addresses; encryption will just work if that is the case. Otherwise, bounce buffers will be needed so that the kernel can handle the encryption; that is easy to implement but slow to run.

Comments (11 posted)

Removing the kthread freezer?

By Jake Edge
April 25, 2018

LSFMM

Using the kernel thread (kthread) freezer has been a longtime problem for a variety of reasons. It is meant as a way to suspend kthreads on the way toward system suspend, but in practice has proved problematic to the point that it came up at both the 2015 and 2016 Kernel Summits (as well as on the mailing lists over the years); the intent is to try to remove the kthread freezer entirely. To that end, Luis Rodriguez led a discussion in the filesystem track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit on the problems and possible solutions.

Rodriguez has picked up the work that Jiri Kosina was doing to eliminate the kthread freezer, but is moving more cautiously than Kosina originally planned. One problem is that the kernel does not want to freeze kthreads in unexpected places, so there is a mechanism that allows the threads to block the freezing process. Part of the thinking there is that there should not be DMA in flight while the suspend is going on, Kent Overstreet said. He asked, wouldn't it be better if the drivers put themselves in a sane state for suspend?

Dave Chinner said that even if the devices are ready to suspend, the filesystems can still be making in-memory changes. A recurring problem is that suspend would sync a filesystem to make it stable, but the filesystem would still have threads and work on workqueues that were operating on the in-memory data. That led to an inconsistent state between what was on disk and what was in the memory image used by the suspend.

In general, Rodriguez said, the kernel should not be freezing kthreads. The threads want full control of where they can be frozen; it is hard to get it all right if it is imposed on them. But trying to address this problem in a generic form is "really hard"; phasing out the kthread freezer will be difficult, so he suggested a divide-and-conquer approach.

For filesystems that implement the freeze_fs() method, it should be straightforward, but there is still a problem in getting the order right. The current mechanism freezes the most recently mounted filesystems first and thaws them in the order in which they were mounted. That is simple to do using iterate_supers(), but does it work in all cases?

Al Viro said that it does not. There is a "nasty ioctl()", which he is sorry for implementing, that can break the ordering. It is quite possible that a filesystem that was mounted later shows up earlier in the list. The ordering described is also not sufficient for FUSE filesystems, Jan Kara said, though Chinner suggested those simply be skipped in the walk.

But there are filesystems that talk to several devices, such as those hosted on a RAID device or with their journal on a separate device, Viro said. These topologies can also change at run time, so he does not recommend relying on any kind of ordering.

In fact, a directed acyclic graph (DAG) could describe these relationships, Kara said. It would have nodes for filesystems and devices, with edges that describe the dependencies between them. It would be nice to build that DAG in the kernel, but it is not done today. Viro agreed that it is probably needed at some point.

Rodriguez wondered whether the DAG generation was required before making any progress on eliminating the kthread freezer. As long as the existence of the problem is kept in mind, Viro said, work can proceed. If these problem configurations can be detected, suspend could be prohibited for those systems, Rodriguez said. But that will be difficult to detect without the graph, Kara said.

There are a number of problem areas that came up in the discussion: freezing races with automounting, the control group (cgroup) freezer is "completely broken", freezing FUSE filesystems is problematic, and so on. It was noted that applications would like to know if the filesystem they are using is about to freeze so they can quiesce their own data to keep it consistent. Rodriguez was surprised to find out that there is no generic framework for the kernel to notify user space about an upcoming suspend: "That's insane!"

No real conclusions came out of the discussion. Rodriguez plans to post his notes to the mailing list for feedback. There was also talk about discussing it more later in the summit, though that has not been scheduled as of this writing.

Comments (7 posted)

Fixing error reporting—again

By Jake Edge
April 25, 2018

LSFMM

After a session at last year's Linux Storage, Filesystem, and Memory Management Summit (LSFMM), Jeff Layton was able to make some improvements to block-layer error handling. Those changes, which added a new errseq_t type to hold an error number and sequence number, seemed to help and were well received—except by the PostgreSQL developers. So Layton led a session at the 2018 LSFMM to discuss ways to improve things further; it would be followed later in the week with a session by one of the PostgreSQL developers to look at the specifics of the problem from their perspective.

Layton started by noting that Matthew Wilcox had come up with a patch to restore the behavior that PostgreSQL relies on, which is to receive I/O errors from periodic fsync() calls made by a checkpointing process. However, that change still could not guarantee that errors would always be reported; if the inode is evicted from memory, the error stored there will be lost. The lack of a guarantee made Layton leery of the patch, but Jan Kara noted that there never was a guarantee, even under the old behavior prior to errseq_t.

A writeback error may not actually be reported by fsync(), however; other calls, such as close(), could return it. There was some discussion about whether getting an error return from a close() call actually makes sense. Dave Chinner said that a write failure that is not noticed until well after the write has completed (during writeback, for example) could be reported when the file is closed.

But Layton mentioned Neil Brown's assertion that returning errors on close is not reasonable. For one thing, it is quite common for applications to ignore the return code from close(). But it is documented that the close() call can return errors, so some users will be dependent on that behavior, Chinner said.

Even with Wilcox's patch, there is still the problem of inodes that get evicted from memory. The bug has always been there but, since error reporting is being scrutinized and fixed, it is worth eliminating that problem. There was some talk of making the error persistent on disk, but Eric Sandeen suggested simply refusing to evict inodes that have errors associated with them. That should work well, and remove that lingering problem for PostgreSQL (and others) unless there is major memory pressure, Layton said. But Chinner does not think hanging on to a few extra inodes is going to affect the out-of-memory handling in any significant way.

Layton mentioned that there is another problem: syncfs() is "really broken" in its error reporting. He plans to fix that, probably by using another errseq_t in the superblock, since reporting from syncfs() requires a separate cursor on the error state.

As the session was wrapping up, Chinner asked about tests for the changes. Layton said that he had some to add to xfstests, but there is a need to create some specifically for the PostgreSQL fsync() problem. That will allow some assurance that the problem has really been solved and that it doesn't regress down the road.

Comments (11 posted)

Security quotes of the week

To address this type of attacker, we present Oblivious DNS (ODNS), which is a new design of the DNS ecosystem that allows current DNS servers to remain unchanged and increases privacy for data in motion and at rest. In the ODNS system, both the client is modified with a local resolver, and there is a new authoritative name server for .odns. To prevent an eavesdropper from learning information, the DNS query must be encrypted; the client generates a request for www.foo.com, generates a session key k, encrypts the requested domain, and appends the TLD domain .odns, resulting in {www.foo.com}k.odns. The client forwards this, with the session key encrypted under the .odns authoritative server’s public key ({k}PK) in the “Additional Information” record of the DNS query to the recursive resolver, which then forwards it to the authoritative name server for .odns. The authoritative server decrypts the session key with his private key, and then subsequently decrypts the requested domain with the session key. The authoritative server then forwards the DNS request to the appropriate name server, acting as a recursive resolver. While the name servers see incoming DNS requests, they do not know which clients they are coming from; additionally, an eavesdropper cannot connect a client with her corresponding DNS queries.

— Oblivious DNS project at Princeton

Up till now, writers of crypto and security software not only have to fight the bad guys. We also have to deal with compiler writers, who every so often dream up some new optimisation routine which spots the padding instructions that we put in to make our crypto algorithms run in constant time, or the tricks that we use to ensure that sensitive data will be zeroised when a function returns. All of a sudden some critical code is optimised away, your code is insecure, and you scramble to figure out how to outwit the compiler once more.

So while you’re fighting the enemy in front, the compiler writer is a subversive fifth column in your rear.

— Ross Anderson

Last year, the Defcon hackers' conference sponsored [PDF] a Voting Village. Organizers collected 25 pieces of voting equipment, including voting machines and electronic poll books. By the end of the weekend, conference attendees had found ways to compromise every piece of test equipment: to load malicious software, compromise vote tallies and audit logs, or cause equipment to fail.

It's important to understand that these were not well-funded nation-state attackers. These were not even academics who had been studying the problem for weeks. These were bored hackers, with no experience with voting machines, playing around between parties one weekend.

— Bruce Schneier

Comments (18 posted)

Kernel release status

The current development kernel is 4.17-rc2, released on April 22. Quoth Linus: "We've still got some known fallout from the merge window, but it shouldn't affect most normal configurations, so go out and test."

Stable updates: 4.16.3, 4.15.18, and 4.14.35 were released on April 19; 4.9.95 followed one day later. Then 4.16.4, 4.14.36, 4.9.96, 4.4.129, and 3.18.106 were released on April 24.

Note that 4.15.18 is the end of the line for the 4.15.x series.

The 4.16.5 and 4.14.37 updates are in the review process; they are due on April 27.

Comments (none posted)

CLOCK_MONOTONIC change reverted

The 4.17 merge window included a change in the behavior of the CLOCK_MONOTONIC system clock; in particular, it would advance after a resume to reflect the time that the system was suspended. At the time, the developers involved acknowledged that the change might have to be reverted if it caused regressions. It seems that some such regressions have indeed been reported, so a revert has been queued with this comment:

As reported by several folks systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume daemons time out and other timeout related issues are observed.

It's sad, that these problems were neither catched in -next nor by those folks who expressed interest in this change.

Comments (6 posted)

Vetter: Linux Kernel Maintainer Statistics

Daniel Vetter looks at some kernel-development statistics, with a focus on patches written by the maintainers who commit them. "Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged and not contributing anything of their own. The kernel community imploding under its own bureaucratic weight being the likely outcome of that. This is a huge contrast to the 'everything is getting better, bigger, and the kernel community is very healthy' fanfare touted at keynotes and the yearly kernel report. In my opinion, the kernel community is very much not looking like it is coping with its growth well and an overall healthy community."

Comments (10 posted)

Quote of the week

As the Linux networking maintainer, I feel the need to reiterate my statement of a month ago, and to provide some supporting facts on my end so that there is no confusion in the community on these issues.

I am no longer associated with either the Netdev Society or their event, the NetDev conference.

Neither is endorsed nor supported by me.

— David Miller

Comments (21 posted)

Distribution quote of the week

Debian itself will always face challenges but I sincerely believe that the Project remains as healthy as ever. We are uniquely cherished and remain remarkably poised to improve the free software ecosystem as a whole. Moreover, our stellar reputation for technical excellence, stability and software freedom remains highly respected and without peer. It is truly an achievement to be proud of.

— Chris Lamb

Comments (none posted)

FFmpeg 4.0 released

Version 4.0 of the FFmpeg multimedia toolkit is out. There is a long list of new filters, formats, and more; see the announcement for details.

Comments (1 posted)

Development quotes of the week

Who cares about performance of a terminal emulator, if it can't draw emojis? I mean, please, let's focus on the important things, and I take emojis a thousand times over frickin' speed, thank you very much...

😬 😬 😬

— Lennart Poettering

BTW, building a 400 line library and writing reams of control code for a fridge that has not been installed yet is what we Haskell programmers call "laziness".

— Joey Hess

Of course, I didn't only remove stuff; I also added things like more efficient sorting of strings, added a new microbenchmark framework, sped up the new Unicode 9.0 collations by 20x, and probably added some legacy on my own. :-) But sometimes, it's good to remove. Remove some code today!

— Steinar H. Gunderson (Thanks to Paul Wise)

Comments (none posted)

DistroWatch Weekly (April 23)

This Week in Lubuntu Development (April 23)

Lunar Linux Weekly News (April 20)

Mageia Weekly Roundup (April 23)

openSUSE Tumbleweed Review of the Week (April 21)

Reproducible Builds Weekly Report (April 24)

Ubuntu Weekly Newsletter (April 21)

Emacs News (April 23)

GCC 8.0.1 Status Report (April 25)

GCC 9.0.0 Status Report (April 25)

What's cooking in git.git (April 25)

LLVM Weekly (April 23)

LXC/LXD/LXCFS Weekly Status (April 23)

OCaml Weekly News (April 24)

OpenStack Technical Committee Report (April 24)

OpenStack Technical Committee Status Update (April 20)

Perl Weekly (April 23)

Weekly changes in and around Perl 6 (April 23)

PostgreSQL Weekly News (April 22)

Python Weekly Newsletter (April 19)

Ruby Weekly News (April 19)

This Week in Rust (April 24)

Wikimedia Tech News (April 23)

Fedora Council minutes (April 25)

Fedora FESCO meeting minutes (April 20)

CFP Deadlines: April 26, 2018 to June 25, 2018

The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.

Deadline	Event Dates	Event	Location
April 29	August 29 August 31	Open Source Summit	Vancouver, Canada
April 29	July 6 July 11	GUADEC	Almería, Spain
April 30	August 7 August 11	JuliaCon 2018	London, UK
May 4	August 4 August 5	DevConf India	Bangalore, India
May 15	October 29 October 30	OpenWrt Summit	Lisbon, Portugal
May 16	October 18 October 19	Osmocom Conference 2018	Berlin, Germany
May 20	August 11 August 12	GNOME.Asia Summit	Taipei, Taiwan
May 20	July 11 July 13	Netdev 0x12	Montréal, Québec, Canada
May 20	July 23 July 29	EuroPython 2018	Edinburgh, Scotland, UK
May 23	August 25 August 26	Free and Open Source Software Conference	Sankt Augustin, Germany
May 24	October 29 October 31	LISA18	Nashville, TN, USA
May 30	July 18 July 19	RISC-V Workshop in Chennai	Chennai, India
May 31	August 28 September 1	EuroSciPy 2018	Trento, Italy
June 1	April 1 April 4	‹Programming› 2019	Genova, Italy
June 1	November 12	RabbitMQ Summit 2018	London, UK
June 3	August 27 August 28	Linux Security Summit	Vancouver, Canada
June 3	October 24 October 26	PyCon.DE 2018	Karlsruhe, Germany
June 7	August 25 August 26	Linux Developers Conference Brazil	Campinas, Brazil
June 9	June 11 June 12	Czech and Slovak Network Operators Group	Brno, Czech Republic
June 14	September 26	Open Source Backup Conference 2018	Cologne , Germany
June 14	October 24 October 26	KVM Forum 2018	Edinburgh, UK
June 15	August 11 August 12	openSUSE.Asia Summit	Taipei, Taiwan
June 17	July 29 August 5	DebConf18	Hsinchu, Taiwan
June 17	September 20 September 23	EuroBSDcon 2018	Bucharest, Romania
June 23	June 23 June 24	Linuxwochen Linz	Linz, Austria

If the CFP deadline for your event does not appear here, please tell us about it.

Netdev 0x12 Conference update

Netdev will take place July 11-13 in Montréal, Québec, Canada. Two talks have been announced; Alexander Krizhanosky on Kernel TLS handshakes for HTTPS DDoS mitigation, and Vikram Siwach on Subscriber Billing System Over Commodity. The conference location will be at the Hyatt Regency Montreal. The call for papers remains open until May 1.

Full Story (comments: none)

Events: April 26, 2018 to June 25, 2018

The following event listing is taken from the LWN.net Calendar.

Date(s)	Event	Location
April 26 April 27	FLOSSUK – Spring Conference	Edinburgh, Scotland
April 27 April 28	Grazer Linuxtage 2018	Graz, Austria
April 28 April 29	LinuxFest Northwest	Bellingham, WA, USA
May 2 May 4	KubeCon+CloudNativeCon	Copenhagen, Denmark
May 3 May 5	Linuxwochen Wien 2018	Vienna, Austria
May 4 May 5	PyDays Vienna	Vienna, Austria
May 7 May 10	8th RISC-V Workshop	Barcelona, Spain
May 7 May 13	Wireless-Meshup (Wireless Battle Mesh)	Berlin, Germany
May 9 May 13	PyCon	Cleveland, OH, USA
May 16 May 20	Mini-DebConf	Hamburg, Germany
May 17 May 18	Devopsdays Silicon Valley	Mountain View, CA, USA
May 18 May 19	French Perl Workshop 2018	Paris, France
May 21 May 22	OpenMP UK Users' Conference	Oxford, UK
May 21 May 24	OpenStack Summit	Vancouver, Canada
May 22 May 23	Infoshare 2018	Gdańsk, Poland
May 23 May 27	DjangoCon Europe 2018	Heidelberg, Germany
May 25 May 27	openSUSE Conference 2018	Prague, Czech Republic
May 29 June 1	PGCon 2018	Ottawa, Canada
May 31 June 1	Netconf	Boston, MA, USA
June 1 June 2	PyCon Taiwan	Taipei, Taiwan
June 4 June 6	PyCon Israel 2018	Israel
June 5 June 7	SambaXP 2018 Conference	Göttingen, Germany
June 7 June 10	Linux Audio Conference	Berlin, Germany
June 8 June 9	Texas Linux Fest	Austin, TX, USA
June 8 June 10	South East Linux Fest	Charlotte, NC, USA
June 11 June 12	Czech and Slovak Network Operators Group	Brno, Czech Republic
June 11 June 14	O'Reilly Velocity Conf	San Jose, CA, USA
June 12 June 14	O'Reilly Fluent Conference	San Jose, CA, USA
June 12 June 13	Open Source Data Center Conference	Berlin, Germany
June 13 June 14	FOSS Backstage	Berlin, Germany
June 16 June 17	Hong Kong Open Source Conference 2018	Hong Kong, Hong Kong
June 20 June 22	Open Source Summit Japan	Tokyo, Japan
June 20 June 22	Automotive Linux Summit	Tokyo, Japan
June 22 June 24	XenProject Developer and Design Summit	Nanjing, China
June 23 June 24	Linuxwochen Linz	Linz, Austria
June 23 June 24	Prague Maker Faire	Prague, Czech Republic

If your event does not appear here, please tell us about it.

Summary of the Linux IPsec workshop 2018

A summary of the Linux IPsec workshop that was held March 26-28 in Dresden, Germany is available. "The page was created from the etherpad we used during the workshop, so don't expect anything fancy. It still does not cover all session notes, but it gets updated whenever new information comes in."

Full Story (comments: none)

Alert summary April 19, 2018 to April 25, 2018

Dist.	ID	Release	Package	Date
Arch Linux	ASA-201804-8		roundcubemail	2018-04-23
Arch Linux	ASA-201804-9		xfig	2018-04-23
Arch Linux	ASA-201804-7		zsh	2018-04-23
Debian	DLA-1357-1	LTS	gunicorn	2018-04-22
Debian	DLA-1356-1	LTS	libreoffice	2018-04-19
Debian	DSA-4178-1	stable	libreoffice	2018-04-20
Debian	DSA-4177-1	stable	libsdl2-image	2018-04-20
Debian	DSA-4179-1	stable	linux-tools	2018-04-24
Debian	DLA-1360-1	LTS	lucene-solr	2018-04-24
Debian	DLA-1355-1	LTS	mysql-5.5	2018-04-19
Debian	DSA-4176-1	stable	mysql-5.5	2018-04-20
Debian	DLA-1354-1	LTS	opencv	2018-04-18
Debian	DLA-1361-1	LTS	psensor	2018-04-24
Debian	DLA-1359-1	LTS	ruby1.8	2018-04-23
Debian	DLA-1358-1	LTS	ruby1.9.1	2018-04-23
Debian	DLA-1353-1	LTS	wireshark	2018-04-18
Fedora	FEDORA-2018-d87e29047d	F26	corosync	2018-04-19
Fedora	FEDORA-2018-b0253649be	F27	corosync	2018-04-19
Fedora	FEDORA-2018-9aa8064e12	F26	java-1.8.0-openjdk	2018-04-22
Fedora	FEDORA-2018-77533e644b	F27	java-1.8.0-openjdk	2018-04-23
Fedora	FEDORA-2018-b3f8bee2e0	F26	jgraphx	2018-04-21
Fedora	FEDORA-2018-b268b5bbb5	F27	jgraphx	2018-04-21
Fedora	FEDORA-2018-122ea355a7	F26	memcached	2018-04-22
Fedora	FEDORA-2018-4e26c06aef	F27	mingw-libid3tag	2018-04-23
Fedora	FEDORA-2018-cec96a9c41	F27	nghttp2	2018-04-21
Fedora	FEDORA-2018-ce5d7106d8	F26	pcs	2018-04-19
Fedora	FEDORA-2018-1c8b49fbc7	F27	perl	2018-04-21
Fedora	FEDORA-2018-1c8b49fbc7	F27	perl-Module-CoreList	2018-04-21
Fedora	FEDORA-2018-f6dc921a19	F26	roundcubemail	2018-04-21
Fedora	FEDORA-2018-57fbdb1cb5	F27	roundcubemail	2018-04-21
Gentoo	201804-22		chromium	2018-04-23
Gentoo	201804-16		clamav	2018-04-22
Gentoo	201804-21		librelp	2018-04-22
Gentoo	201804-19		mbedtls	2018-04-22
Gentoo	201804-17		quagga	2018-04-22
Gentoo	201804-18		tenshi	2018-04-22
Gentoo	201804-20		unadf	2018-04-22
Mageia	MGASA-2018-0210	6	freeplane	2018-04-22
Mageia	MGASA-2018-0209	6	libcdio	2018-04-22
Mageia	MGASA-2018-0208	5, 6	libtiff	2018-04-20
Mageia	MGASA-2018-0207	5	thunderbird	2018-04-20
Mageia	MGASA-2018-0206	6	zsh	2018-04-20
openSUSE	openSUSE-SU-2018:1049-1	42.3	PackageKit	2018-04-24
openSUSE	openSUSE-SU-2018:1038-1	42.3	cfitsio	2018-04-21
openSUSE	openSUSE-SU-2018:1042-1	42.3	chromium	2018-04-21
openSUSE	openSUSE-SU-2018:1051-1		hdf5	2018-04-24
openSUSE	openSUSE-SU-2018:1056-1	42.3	hdf5	2018-04-24
openSUSE	openSUSE-SU-2018:1041-1		mbedtls	2018-04-21
openSUSE	openSUSE-SU-2018:1039-1	42.3	mbedtls	2018-04-21
openSUSE	openSUSE-SU-2018:1040-1	42.3	nextcloud	2018-04-21
openSUSE	openSUSE-SU-2018:1062-1	42.3	ocaml	2018-04-24
openSUSE	openSUSE-SU-2018:1059-1		phpMyAdmin	2018-04-24
openSUSE	openSUSE-SU-2018:1058-1	42.3	phpMyAdmin	2018-04-24
openSUSE	openSUSE-SU-2018:1053-1	42.3	salt	2018-04-24
openSUSE	openSUSE-SU-2018:1057-1	42.3	virtualbox	2018-04-24
Oracle	ELSA-2018-1224	OL7	PackageKit	2018-04-24
Oracle	ELSA-2018-1099	OL7	firefox	2018-04-18
Oracle	ELSA-2018-1188	OL6	java-1.8.0-openjdk	2018-04-19
Oracle	ELSA-2018-1191	OL7	java-1.8.0-openjdk	2018-04-19
Oracle	ELSA-2018-4071	OL6	kernel	2018-04-18
Oracle	ELSA-2018-4071	OL7	kernel	2018-04-18
Oracle	ELSA-2018-1225	OL6	librelp	2018-04-24
Oracle	ELSA-2018-1223	OL7	librelp	2018-04-24
Oracle	ELSA-2018-1055	OL7	libvncserver	2018-04-18
Oracle	ELSA-2018-1058	OL7	libvorbis	2018-04-18
Oracle	ELSA-2018-1199	OL6	patch	2018-04-23
Oracle	ELSA-2018-1200	OL7	patch	2018-04-23
Red Hat	RHSA-2018:1224-01	EL7	PackageKit	2018-04-24
Red Hat	RHSA-2018:1195-01	EL6	chromium-browser	2018-04-23
Red Hat	RHSA-2018:1203-01	EL6	java-1.6.0-sun	2018-04-23
Red Hat	RHSA-2018:1205-01	EL7	java-1.6.0-sun	2018-04-23
Red Hat	RHSA-2018:1206-01	EL6	java-1.7.0-oracle	2018-04-23
Red Hat	RHSA-2018:1201-01	EL7	java-1.7.0-oracle	2018-04-23
Red Hat	RHSA-2018:1188-01	EL6	java-1.8.0-openjdk	2018-04-19
Red Hat	RHSA-2018:1191-01	EL7	java-1.8.0-openjdk	2018-04-19
Red Hat	RHSA-2018:1202-01	EL6	java-1.8.0-oracle	2018-04-23
Red Hat	RHSA-2018:1204-01	EL7	java-1.8.0-oracle	2018-04-23
Red Hat	RHSA-2018:1196-01	EL5	kernel	2018-04-23
Red Hat	RHSA-2018:1216-01	EL7.2	kernel	2018-04-24
Red Hat	RHSA-2018:1225-01	EL6	librelp	2018-04-24
Red Hat	RHSA-2018:1223-01	EL7	librelp	2018-04-24
Red Hat	RHSA-2018:1199-01	EL6	patch	2018-04-23
Red Hat	RHSA-2018:1200-01	EL7	patch	2018-04-23
Red Hat	RHSA-2018:1213-02	EL7	python-paramiko	2018-04-24
Red Hat	RHSA-2018:1192-01	SCL	rh-perl524-perl	2018-04-23
Scientific Linux	SLSA-2018:1188-1	SL6	java-1.8.0-openjdk	2018-04-19
Scientific Linux	SLSA-2018:1225-1	SL6	librelp	2018-04-24
Scientific Linux	SLSA-2018:1199-1	SL6	patch	2018-04-23
Slackware	SSA:2018-108-01		gd	2018-04-18
SUSE	SUSE-SU-2018:1047-1	SLE12	PackageKit	2018-04-23
SUSE	SUSE-SU-2018:0986-1	SLE12	kernel	2018-04-19
SUSE	SUSE-SU-2018:1048-1	SLE12	kernel	2018-04-23
Ubuntu	USN-3627-1	14.04 16.04 17.10	apache2	2018-04-19
Ubuntu	USN-3631-1	16.04	linux, linux-aws, linux-kvm, linux-raspi2, linux-snapdragon	2018-04-23
Ubuntu	USN-3630-1	17.10	linux, linux-raspi2	2018-04-23
Ubuntu	USN-3632-1	16.04	linux-azure	2018-04-23
Ubuntu	USN-3633-1	16.04	linux-euclid	2018-04-23
Ubuntu	USN-3630-2	16.04	linux-hwe, linux-gcp, linux-oem	2018-04-23
Ubuntu	USN-3631-2	14.04	linux-lts-xenial, linux-aws	2018-04-23
Ubuntu	USN-3629-1	14.04 16.04 17.10	mysql-5.5, mysql-5.7	2018-04-23
Ubuntu	USN-3629-2	12.04	mysql-5.5	2018-04-25
Ubuntu	USN-3628-2	12.04	openssl	2018-04-19
Ubuntu	USN-3628-1	14.04 16.04 17.10	openssl	2018-04-19
Ubuntu	USN-3634-1	17.10	packagekit	2018-04-24

Full Story (comments: none)

Linus Torvalds Linux 4.17-rc2 Apr 22

Greg KH Linux 4.16.4 Apr 24

Greg KH Linux 4.16.3 Apr 19

Greg KH Linux 4.15.18 Apr 19

Greg KH Linux 4.14.36 Apr 24

Greg KH Linux 4.14.35 Apr 19

Greg KH Linux 4.9.96 Apr 24

Greg KH Linux 4.9.95 Apr 20

Greg KH Linux 4.4.129 Apr 24

Greg KH Linux 3.18.106 Apr 24

Tom Zanussi 3.18.104-rt101 Apr 23

AKASHI Takahiro arm64: kexec: add kexec_file_load() support Apr 25

Jagan Teki arm64: allwinner: Add A64 DE2 pipeline support Apr 24

Alan Kao perf: riscv: Preliminary Perf Event Support on RISC-V Apr 20

Vikas Shivappa Memory bandwidth allocation software controller(mba_sc) Apr 20

Joerg Roedel PTI support for x86-32 Apr 23

Reinette Chatre Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Apr 25

Arnd Bergmann y2038: Convert IPC syscalls Apr 19

Paul E. McKenney [PATCH tip/core/rcu 0/3] Parallelize expedited grace periods for v4.18 Apr 22

Frederic Weisbecker softirq: Consolidate and optimize softirq mask v3 Apr 24

Stanislav Kinsburskii Introduce Xen fault injection facility Apr 20

Andrey Konovalov khwasan: kernel hardware assisted address sanitizer Apr 20

Dan Carpenter Smatch check for Spectre stuff Apr 19

Amelie Delaunay Introduce STM32MP1 RTC Apr 19

Pierre-Yves MORDRET Add DMAv2 and DMAMUX Support on STM32MP157C Apr 20

Pierre-Yves MORDRET Add I2C support for STM32H743 Apr 20

Pierre-Yves MORDRET Add MDMA Support on STM32MP157C Apr 20

Yixun Lan clk: meson-axg: Add AO Cloclk and Reset driver Apr 19

Jerome Brunet clk: meson: axg: add audio clock controller support Apr 25

Måns Andersson net: phy: TLK10X initial driver submission Apr 19

Rui Miguel Silva media: Introduce Omnivision OV2680 driver Apr 19

Rui Miguel Silva media: staging/imx7: add i.MX7 media driver Apr 19

Paul Kocialkowski Sunxi-Cedrus driver for the Allwinner Video Engine, using media requests Apr 19

Eric Anholt Broadcom V3D driver (7268/7278) Apr 19

Tom Hromatka sparc64: Add privileged ADI driver Apr 20

Javier Arteaga UP Squared board drivers Apr 21

Enric Balletbo i Serra DRM Rockchip rk3399 (Kevin) Apr 23

Rui Miguel Silva media: staging/imx7: add i.MX7 media driver Apr 23

Icenowy Zheng SY8106 regulator support and enable it on Orange Pi PC Apr 23

Peter Rosin iio: add unit-converter Apr 23

Rishabh Bhatnagar SDM845 System Cache Driver Apr 23

Eric Anholt V3D DRM driver Apr 23

Chunfeng Yun Add MediaTek XS-PHY driver Apr 25

Ganapatrao Kulkarni Add ThunderX2 SoC Performance Monitoring Unit driver Apr 25

Taniya Das clk: qcom: clk-rpmh: Add QCOM RPMh clock driver Apr 24

sean.wang@mediatek.com Add support to MT6323 RTC and its power device Apr 25

Todor Tomov Add support for ov7251 camera sensor Apr 24

Johan Hovold gnss: add new GNSS subsystem Apr 24

Jacopo Mondi drm: bridge: Add support for static image formats Apr 19

Marc Zyngier Level-triggered MSI support Apr 23

Maciej Purski Add coupled regulators mechanism Apr 23

Logan Gunthorpe Copy Offload in NVMe Fabrics with P2P PCI Memory Apr 23

Marc Zyngier Partitioned percpu IRQ rework Apr 24

Bartosz Golaszewski platform: device tree support for early platform drivers Apr 24

Logan Gunthorpe Copy Offload in NVMe Fabrics with P2P PCI Memory Apr 23

David Howells VFS: Introduce filesystem context [ver #7] Apr 19

Allison Henderson XFS Parent Pointers v5 Apr 18

David Sterba Btrfs: Add ioctl to clear unused space Apr 20

Christoph Hellwig simplify procfs code for seq_file instances V2 Apr 25

Eric Dumazet mm,tcp: provide mmap_hook to solve lockdep issue Apr 20

Roman Gushchin mm: introduce memory.min Apr 20

Björn Töpel Introducing AF_XDP support Apr 23

Eric Dumazet tcp: mmap: rework zerocopy receive Apr 24

Toke Høiland-Jørgensen Add Common Applications Kept Enhanced (cake) qdisc Apr 24

Igor Stoppa mm: security: write protection for dynamic data Apr 23

David Herrmann Introduce LSM-hook for socketpair(2) Apr 23

Sargun Dhillon Safe LSM (un)loading, and immutable hooks Apr 25

Tiwei Bie virtio: support packed ring Apr 25

LWN.net Weekly Edition for April 26, 2018

Copyright trolling

Arguments

Recommendations

The chain gang — tables within tables

Performing the rehash

Walking the hash table while the table is being rehashed

Failure during insertion

The full scoop on configuration parameters

A word of warning

Plenary sessions

Memory-management track sessions

Filesystem track

Filesystem and Storage track sessions

Group photo

Acknowledgments

Brief items

Security

Kernel development

Distributions

Development

Announcements

Newsletters

Distributions and system administration

Development

Meeting minutes

Calls for Presentations

CFP Deadlines: April 26, 2018 to June 25, 2018

Upcoming Events

Events: April 26, 2018 to June 25, 2018

Event Reports

Security updates

Kernel patches of interest

Kernel releases

Architecture-specific

Core kernel

Development tools

Device drivers

Device-driver infrastructure

Filesystems and block layer

Memory management

Networking

Security-related

Virtualization and containers