LWN.net Weekly Edition for July 11, 2019
Welcome to the LWN.net Weekly Edition for July 11, 2019
This edition contains the following feature content:
- Mucking about with microframeworks: a look at the Bottle and Flask web frameworks.
- Soft CPU affinity: a proposal for a more flexible CPU-binding policy.
- clone3(), fchmodat4(), and fsinfo(): three new system calls for Linux.
- Destaging ION: another piece of Android kernel infrastructure heads toward the mainline.
- The third Operating-System-Directed Power-Management summit: a first set of contributed reports from the 2019 OSPM conference, including:
- Rock and a hard place: How hard it is to be a CPU idle-time governor.
- Virtual-machine scheduling and scheduling in virtual machines.
- I-MECH: realtime virtualization for industrial automation.
- CFS wakeup path and Arm big.LITTLE/DynamIQ.
- Scheduler behavioral testing.
- The future of SCHED_DEADLINE and SCHED_RT for capacity-constrained and asymmetric-capacity systems.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Mucking about with microframeworks
Python does not lack for web frameworks, from all-encompassing frameworks like Django to "nanoframeworks" such as WebCore. A recent "spare time" project caused me to look into options in the middle of this range of choices, which is where the Python "microframeworks" live. In particular, I tried out the Bottle and Flask microframeworks—and learned a lot in the process.
I have some experience working with Python for the web, starting with the Quixote framework that we use here at LWN. I have also done some playing with Django along the way. Neither of those seemed quite right for this latest toy web application. Plus I had heard some good things about Bottle and Flask at various PyCons over the last few years, so it seemed worth an investigation.
Web applications have lots of different parts: form handling, HTML template processing, session management, database access, authentication, internationalization, and so on. Frameworks provide solutions for some or all of those parts. The nano-to-micro-to-full-blown spectrum is defined (loosely, at least) based on how much of this functionality a given framework provides or has opinions about. Most frameworks at any level will allow plugging in different parts, based on the needs of the application and its developers, but nanoframeworks provide little beyond request and response handling, while full-blown frameworks provide an entire stack by default. That stack handles most or all of what a web application requires.
The list of web frameworks on the Python wiki is rather eye-opening. It gives a good idea of the diversity of frameworks, what they provide, what other packages they connect to or use, as well as some idea of how full-blown (or "full-stack" on the wiki page) they are. It seems clear that there is something for everyone out there—and that's just for Python. Other languages undoubtedly have their own sets of frameworks (e.g. Ruby on Rails).
Drinking the WSGI
Modern Python web applications are typically invoked using the Web Server Gateway Interface (WSGI, pronounced "whiskey"). It came out of an effort to have a common web interface instead of the many choices that faced users in the early days (e.g. Common Gateway Interface (CGI) and friends, mod_python). WSGI was first specified in PEP 333 ("Python Web Server Gateway Interface v1.0") in 2003 and was updated in 2010 to version 1.0.1 in PEP 3333, which added various improvements including better Python 3 support. At this point, it seems safe to say that WSGI has caught on; both Bottle and Flask use it (as does Django and it is supported by Quixote as well).
At its most basic, a WSGI application simply provides a way for the web server to call a function with two parameters every time it gets a request from a client:
    result = application(environ, start_response)

environ is a dictionary containing the CGI-style environment variables (e.g. HTTP_USER_AGENT, REMOTE_ADDR, REQUEST_METHOD) along with some wsgi.* values. The start_response() parameter is a function to be called by the application to pass the HTTP status (e.g. "200 OK", "404 Not Found") and a list of tuples with the HTTP response headers (e.g. "Content-type") and values. The application() function returns an iterable yielding zero or more strings of type bytes, which is generally the HTML response to the client.
The Python standard library has the wsgiref module that provides various utilities and a reference implementation of a WSGI server. In just a few lines of code, with no dependencies other than Python itself, one can run a simple WSGI server locally.
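For instance, the following self-contained sketch (the names are mine, not from any framework) serves a trivial WSGI application on localhost:

    from wsgiref.simple_server import make_server

    def application(environ, start_response):
        # Use one of the CGI-style variables from the environ dictionary.
        body = 'Hello, %s' % environ.get('REMOTE_ADDR', 'stranger')
        start_response('200 OK',
                       [('Content-Type', 'text/plain; charset=utf-8')])
        # A WSGI application returns an iterable of bytes objects.
        return [body.encode('utf-8')]

    httpd = make_server('localhost', 8000, application)
    httpd.serve_forever()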
Similarly, both Bottle and Flask have the ability to simply run a development web server locally, which uses the application code in much the same way as it will be used on a "real" server. That feature is not uncommon in the web-framework world and it is quite useful. There are various easy ways to debug the code before deploying it using those local servers. The application can then be deployed, using Apache and mod_wsgi, say, to an internet-facing server.
Bottle
One of the nice things about Bottle is that it lacks any dependencies outside of the Python standard library. It can be installed from the Python Package Index (PyPI) using pip or by way of your Linux distribution's package repositories (e.g. dnf install python3-bottle as I did on Fedora). As might be expected, a simple "hello world" example is just that, simple:
    from bottle import route, run, template

    @route('/hello/<name>')
    def index(name):
        return template('<b>Hello {{name}}</b>!', name=name)

    run(host='localhost', port=8080)

Running that sets up a local server that can be accessed via URLs like "http://localhost:8080/hello/world". The route() decorator will send any requests that look like "/hello/XXX" to the index() function, passing anything after the second slash as name.
Bottle uses the SimpleTemplate engine for template handling. As its name implies, it is rather straightforward to use. Double curly braces ("{{" and "}}") enclose substitutions to be made in the text. Those substitutions can be Python expressions that evaluate to something with a string representation:
    {{ name or "amigo" }}

That will substitute name if it has a value (i.e. not None or "") or "amigo" if not. Those substitutions will be HTML escaped in order to avoid cross-site scripting problems, unless the expression starts with a "!", which disables that transformation. Obviously, that feature should be used with great care.
Beyond that, Python code can be embedded in the templates either as a single line that starts with "%" or in a block surrounded by "<%" and "%>". The template() function can be used to render a template as above, or it can be passed a file name:
    return template('hola_template', sub1='foo', sub2='bar', ...)

That will look for a file called views/hola_template.tpl to render; any substitutions should be passed as keyword arguments. The view() decorator can be used instead to render the indicated template based on the dictionary returned from the function:
    @route('/hola')
    @view('hola_template')
    def hola(name='amigo'):
        ...
        return dict(name=name, sub1='foo', ...)
There is support for using the HTTP request data via a global request object, which provides access mechanisms for various parts of the request such as the request method, form data, cookies, and so on. Likewise, a global response object is used to handle responses sent to the browser.
That covers the bulk of Bottle in a nutshell. Other functionality is available through the plugin interface. There is a list of plugins available, covering things like authentication, Redis integration, using various database systems, session handling, and so forth.
It was quite easy to get started with Bottle and to get quite a ways down the path of implementing my toy application. As I considered further features and directions for it, though, I started to encounter some of the limitations of Bottle. The form handling was fairly rudimentary, though the WTForms form rendering and validation library is mentioned as an option in the Bottle documentation. Beyond that, the largely blank page for the Bottle plugin for the SQLAlchemy database interface and object-relational mapping (ORM) library did not exactly inspire confidence. The latest bottle-sqlalchemy release was from 2015, which is also potentially worrisome.
Many of the limitations of Bottle are intentional, and perfectly reasonable, but as I cast about a bit more, I encountered Miguel Grinberg's Flask Mega-Tutorial, which caused me to look at Flask more closely. Part of my intention with this "project" was to investigate and learn; Grinberg's excellent tutorial makes using Flask even easier than Bottle (which was not particularly hard). I found no equivalent document for Bottle, which may have made all the difference.
Flask
Once I had poked around in the tutorial and the Flask documentation a bit, I decided to see how hard it would be to take the existing Bottle application and move it to Flask. The answer to that was surprising, at least to me. The alterations required were minimal, really: some changes to the templates (by default, Flask looks for .html files in the templates directory), calling render_template() rather than template() or @view(), and a little bit of change to the application set-up boilerplate. A Flask "hello world" might look like the following:
    from flask import Flask, render_template_string

    app = Flask(__name__)

    @app.route('/hello/<name>')
    def hello(name):
        return render_template_string('Hello {{name}}', name=name)
While the Bottle "hello world" program could be run directly from the command line to start its development web server, Flask takes a different approach. If the above code were in a file called hw.py, the following command would start the development server:
    $ FLASK_APP="hw" flask run

Note that on Fedora, the Python 3 version of flask is run as flask-3. A .flaskenv file can be populated with the FLASK_APP environment variable (along with the helpful "FLASK_ENV=development" setting for debugging features) so that it does not need to be specified on every run. In development mode, the server notices changes to the application and reloads it, which is rather helpful.
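A .flaskenv file implementing that might look like this (the flask command only reads it if the python-dotenv package is installed):

    FLASK_APP=hw
    FLASK_ENV=development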
Flask uses the Jinja2 templating language, which shares many features with Bottle's template system, though with some syntax differences. The biggest difference, at least for the fairly simple templates I have been working with, is that statements are enclosed in "{%" and "%}", rather than Bottle's angle-bracket scheme. In truth, I have yet to run into things I couldn't do with either templating system. There are extensions for both frameworks to switch to a different templating language if that is needed or desired.
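As a small, hypothetical illustration of the syntax difference, here is the same loop in each system:

    % # Bottle SimpleTemplate: statements are lines starting with "%"
    % for item in items:
        <li>{{item}}</li>
    % end

    {# Flask/Jinja2: the same loop with "{%" and "%}" statements #}
    {% for item in items %}
        <li>{{ item }}</li>
    {% endfor %}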
One nice feature is that templates can inherit from a base template in Flask. That can also be done in Bottle using @view() but it is less convenient—or so it seemed to me. Flask also has direct support for sessions, so values can be stored in and retrieved from the session object. Flask serializes the session data into a cookie that gets cryptographically signed. That means the session's contents are visible to users, but cannot be modified; it also means that session data needs to fit inside the 4K limit imposed for cookies by most browsers.
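A minimal sketch of using the session object might look like the following; the route and the secret key (which Flask uses to sign the cookie) are placeholders of my own:

    from flask import Flask, session

    app = Flask(__name__)
    app.secret_key = 'change-me'      # used to sign the session cookie

    @app.route('/visits')
    def visits():
        # Values stored here round-trip through the signed cookie.
        session['count'] = session.get('count', 0) + 1
        return 'Visit number %d' % session['count']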
The difference between the core functionality of Flask and Bottle is not huge by any means. Either makes a good basis for a simple web application. The main difference between them seems to be embodied in the momentum of the project and its community. Perhaps Bottle is simply mature and has the majority of the features its users and developers are looking for, much like Quixote.
Bottle has fairly frequent releases, but otherwise seems to be standing still. The Twitter feed, blog, mailing list, and GitHub repository have not been updated much recently, for example. Bottle also lacks the "killer tutorial" that Grinberg has put together for Flask. But part of what makes that Flask tutorial so useful is all of the plugins from the Flask community that Grinberg uses along the way.
In some sense, the tutorial takes Flask from a microframework to a full-stack framework (or a long way down that path anyway). It is an opinionated tutorial that picks and chooses various Flask plugins that help implement each chapter's feature for the "microblog" application that he describes. For example, it uses Flask-WTF to interface to WTForms, Flask-SQLAlchemy for an ORM, Flask-Login for user-session management, Flask-Mail for sending email, and so on.
While I haven't (yet, perhaps) needed some of those features, I did confirm that most or all of the packages are available for Fedora, which is convenient for me. In many ways, Grinberg's tutorial "tied the room together" in terms of seeing a simple Flask application growing into something "real". It showed how to add some functionality I wanted to Flask (form handling in particular) and to see how other possible features could also be added easily down the road.
One could perhaps argue that simply starting with a full-stack framework, rather than adding bits piecemeal to get there, might make more sense—and perhaps it does. But those larger frameworks are rather more daunting to get started with and are, obviously, opinionated as well. If I disagree with Grinberg about the need for a particular piece, I can just leave it out or choose something different; that's more difficult to do with, say, Django.
Lots of liquor references
Apparently working with web applications (and frameworks) leads developers to start thinking about whiskey bottles and flasks, or so it would seem based on some of the names. Web programming is a fiddly business in several dimensions. Web frameworks help with some of the server-side fiddly bits, but there are still plenty more available to be tackled.
HTML and CSS are sometimes irritatingly painful to work with and web frameworks can only provide so much help there. At one level, HTML/CSS is a truly amazing technology that is supported by so many different kinds of programs and devices. On another, though, it is an ugly, hard-to-use format with an inadequate set of user-interface controls and mechanisms so that it often seems much harder than it should be to accomplish what you are trying to do.</rant>
But, of course, web programming is fun and you can easily show your friends what silly thing you have been working on, no matter how far away they live. For that, Pythonistas (and perhaps others) should look at the huge diversity of web frameworks available for the language and, if the mood to create that silly thing strikes, give one of them a try. Bottle or Flask might be a great place to start.
Soft CPU affinity
On NUMA systems with a lot of CPUs, it is common to assign parts of the workload to different subsets of the available processors. This partitioning can improve performance while reducing the ability of jobs to interfere with each other. The partitioning mechanisms available on current kernels might just do too good a job in some situations, though, leaving some CPUs idle while others are overutilized. The soft affinity patch set from Subhra Mazumdar is an attempt to improve performance by making that partitioning more porous.

In current kernels, a process can be restricted to a specific set of CPUs with either the sched_setaffinity() system call or the cpuset mechanism. Either way, any process so restricted will only be able to run on the specified CPUs regardless of the state of the system as a whole. Even if the other CPUs in the system are idle, they will be unavailable to any process that has been restricted not to run on them. That is normally the behavior that is wanted; a system administrator who has partitioned a system in this way probably has some other use in mind for those CPUs.
But what if the administrator would rather relax the partitioning in cases where the fenced-off CPUs are idle and going to waste? The only alternative currently is to not partition the system at all and let processes roam across all CPUs. One problem with that approach, beyond losing the isolation between jobs, is that NUMA locality can be lost, resulting in reduced performance even with more CPUs available. In theory the AutoNUMA balancing code in the kernel should address that problem by migrating processes and their memory to the same node, but Mazumdar notes that it doesn't seem to work properly when memory is spread out across the system. Its reaction time is also said to be too slow, and the cost of the page scanning required is high.
So Mazumdar has taken a different approach with a patch set that tries to resolve the issue by creating a concept of "soft affinity". It starts by adding a new system call:
int sched_setaffinity2(pid_t pid, size_t cpusetsize, cpu_set_t *mask, int flags);
The first three arguments mirror sched_setaffinity(): they identify the process to be modified and provide a mask of CPUs on which that process can run. The flags argument is new, though. If it is set to SCHED_HARD_AFFINITY, then this call will behave just like sched_setaffinity(), absolutely restricting the process to the CPUs in the given mask. SCHED_SOFT_AFFINITY, instead, sets a new "soft affinity" mask (which must be a subset of the hard mask) and thereby requests the new behavior.
Said behavior is to treat the soft-affinity CPU mask the same as old-style "hard" affinity most of the time: the process will run only on the CPUs listed in that mask. If, however, those CPUs are busy and other CPUs in the process's hard-affinity mask are close to idle, the process will be allowed to run on the idle CPUs as well. That allows the workload to spread out across the system, but only when CPUs are underutilized.
In other words, this patch creates two levels of CPU affinity masks, where in current kernels there is only one. Both masks default to containing all CPUs in the system (as the hard mask does in current kernels). The behavior of the hard mask is unchanged, but the new soft mask can be used to further restrict processes to a smaller group of CPUs; that further restriction can be relaxed by the kernel at times when CPUs found only in the hard mask are idle.
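As an illustration, requesting a soft affinity from user space might look like the sketch below. Since the patch set is not merged, there is no glibc wrapper; the __NR_sched_setaffinity2 syscall number and the SCHED_SOFT_AFFINITY value used here are assumptions based on the patches, not a stable API:

    /* Sketch only: sched_setaffinity2() is not in mainline, so the
       syscall number and flag value below are assumptions. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SCHED_SOFT_AFFINITY 1   /* from the patch set; assumed */

    static int set_soft_affinity(pid_t pid)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);   /* prefer CPUs 0 and 1; spill into the */
        CPU_SET(1, &mask);   /* hard mask only when they are busy   */
        return syscall(__NR_sched_setaffinity2, pid, sizeof(mask),
                       &mask, SCHED_SOFT_AFFINITY);
    }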
The decision on whether to allow a constrained process to "break out" of its soft-affinity mask is based on two new sysctl knobs, called sched_allowed and sched_preferred. If the ratio of sched_allowed to sched_preferred is greater than the ratio of the CPU utilization of the soft-affinity CPUs to that of another CPU, then that other CPU will be considered for placing a task. The default is to set sched_allowed to 100 and sched_preferred to one, meaning that a CPU outside of the soft-affinity set must be only 1% as loaded as the CPUs inside the set before a soft-affinity process will be moved there. That ratio is a pretty high bar; the target CPU would have to be idle indeed to pass this test. In sites where this mechanism is used, the administrator would probably want to tune those parameters differently.
One question not addressed within the patch set is what happens when a process that has been moved out of the soft-affinity CPUs inevitably raises the utilization of the CPU it is moved to. The soft-affinity decision is made whenever a process wakes up, so the expected behavior would seem to be that the process would run on the outside CPU until it sleeps again. If that sleep is relatively short, it seems likely that the process would be moved again on its next wakeup.
Benchmarks provided with the patch set show performance increases of up to about 7% for some workloads, and regressions in some others. Most of the improvements are relatively small, though, and the data seems noisy. Reviewers might also look closely at Mazumdar's claim that the AutoNUMA balancing does not do a good enough job and, in particular, whether it would be better to improve that code rather than adding a new mechanism and a new set of scheduler tunables.
So it is not clear that the case for this work has yet been made convincingly, something that would need to happen before this work would be considered for merging. Work along these lines seems destined to continue, though. The pressure to get as much work as possible out of every CPU seems unlikely to decrease, and even a relatively small performance improvement is worth a fair amount when it is replicated across a large data center. Soft affinity may or may not be an answer to this problem, but it is indicative of the needs that are driving kernel development for those environments.
clone3(), fchmodat4(), and fsinfo()
The kernel development community continues to propose new system calls at a high rate. Three ideas that are currently in circulation on the mailing lists are clone3(), fchmodat4(), and fsinfo(). In some cases, developers are just trying to make more flag bits available, but there is also some significant new functionality being discussed.
clone3()
The clone() system call creates a new process or thread; it is the actual machinery behind fork(). Unlike fork(), clone() accepts a flags argument to modify how it operates. Over time, quite a few flags have been added; most of these control what resources and namespaces are to be shared with the new child process. In fact, so many flags have been added that, when CLONE_PIDFD was merged for 5.2, the last available flag bit was taken. That puts an end to the extensibility of clone().
The natural solution is to clone the clone() system call into a new one that would be able to accept more flags. Christian Brauner, perhaps feeling guilty for having snagged the last flag for CLONE_PIDFD, set out to do this work. His first attempt was called clone6() but, after some discussion, it was downgraded to clone3(). (For the curious, there is a clone2() that appears to only be of interest on the ia64 architecture). The prototype for this system call looks something like this:
    struct clone_args {
        u64 flags;
        int *pidfd;
        int *child_tid;
        int *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
    };

    int clone3(struct clone_args *args, size_t size);
The clone_args structure contains much of the information that was previously passed directly to clone() or crammed into the flags field. The new flags is wider (64 bits on all architectures) and regains some space due to the relocation of information like the exit signal number. That should provide enough flags to last, as they say, "for a while".
The size argument is the size of the clone_args structure itself. Should there ever be a need to expand that structure in the future, the kernel will be able to tell whether any given user-space caller is using the new or the old version of the structure by examining size and do the right thing either way. So, with luck, there should be no need to create a clone4() anytime soon.
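As a sketch of user-space usage (glibc provides no clone3() wrapper, so the raw system call is used; this assumes the clone_args layout shown above is available and that the kernel defines __NR_clone3), a fork()-like call would be:

    /* Minimal sketch: behaves like fork().  All unused fields are
       zeroed, which requests no resource sharing. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static pid_t fork_via_clone3(void)
    {
        struct clone_args args;

        memset(&args, 0, sizeof(args));
        args.exit_signal = SIGCHLD;   /* deliver SIGCHLD on exit, as fork() does */
        return syscall(__NR_clone3, &args, sizeof(args));
    }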
This interface seems to be satisfactory to everybody involved, though Jann Horn did point out one significant problem: the seccomp mechanism is unable to examine system-call arguments that are passed in separate structures, so it will be unable to make decisions based on the flags given to clone3(). That, he said, means that code meant to be sandboxed with seccomp may not use clone3() at all. Kees Cook has suggested a new mechanism for fetching user-space data for system calls that could be used by seccomp, but nobody appears to be working on that idea currently.
Meanwhile, clone3() is in linux-next, and so can be expected to appear in 5.3.
fchmodat4()
A look at the man page for fchmodat() reveals the following prototype:
int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
The flags argument is documented to have one possible value:
AT_SYMLINK_NOFOLLOW, which would cause fchmodat() to
operate directly on a symbolic link rather than its target. There's only
one little problem: fchmodat() as implemented in the kernel does
not actually accept a flags argument. That is why the man page
concludes with: "This flag is not currently implemented
".
Palmer Dabbelt was motivated to action by a seemingly unpleasant experience: "I spent half of dinner last night being complained to by one of our hardware engineers about Linux's lack of support for the flags argument to fchmodat()". The result was a patch set implementing support for fchmodat4(), which has the same prototype as fchmodat() but which actually implements the flags argument.
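If it is merged, using the new flag might look like this hypothetical snippet (no wrapper exists yet, so the function name assumes one will be provided):

    /* Hypothetical usage of fchmodat4(): change the mode of the
       symbolic link itself rather than its target. */
    #include <fcntl.h>     /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */

    static int set_link_mode(const char *path)
    {
        return fchmodat4(AT_FDCWD, path, 0644, AT_SYMLINK_NOFOLLOW);
    }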
This patch set seems uncontroversial, so there should be no real barrier to its merging, though it has not yet found its way into linux-next.
fsinfo()
The statfs() system call can be used to get certain types of information about a filesystem, including its format, block size, available free blocks, maximum file-name length, and so on. But it turns out that there is a lot more to know about a filesystem than that, and statfs() is unable to provide that information. It seems like a situation just begging for somebody to come along and implement statfs2(), but instead we get fsinfo() from David Howells.
The prototype for fsinfo() looks like this:
    struct fsinfo_params {
        __u32 at_flags;
        __u32 request;
        __u32 Nth;
        __u32 Mth;
        __u64 __reserved[3];
    };

    int fsinfo(int dfd, const char *filename,
               const struct fsinfo_params *params,
               void *buffer, size_t buf_size);
The dfd and filename arguments identify the filesystem about which information is needed. params is an optional structure describing the requested information, while buffer and buf_size define the output buffer.
If params is null, the returned information will be essentially the same as what statfs() would provide. But it is possible to get more, including limits on the filesystem's capabilities, timestamp resolution, mount-time parameters, remote server information, and more. Once this patch set is applied, fsinfo() will also be able to return information about the system's mount topology.
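As a hedged sketch of the simplest possible use, passing a null params pointer to get the statfs()-like defaults (the __NR_fsinfo syscall number is hypothetical, since the patches are not merged, and the layout of the returned data is defined by the patch set):

    /* Sketch only: fsinfo() is not in the mainline kernel. */
    #include <fcntl.h>         /* AT_FDCWD */
    #include <sys/syscall.h>
    #include <unistd.h>

    static long query_fs(const char *path, void *buf, size_t size)
    {
        /* A null params pointer requests the default, statfs()-like data. */
        return syscall(__NR_fsinfo, AT_FDCWD, path, NULL, buf, size);
    }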
This system call is complex, to say the least; there is not space here to try to describe how it all works. Fortunately, there is some good documentation provided with it. This patch provides a fair amount of information about what fsinfo() can do, liberally intermixed with API information for filesystem developers. But see also this patch for information on how the mount-topology queries work, and this one for the somewhat baroque mechanism used to format parameter values passed back to user space.
While there is clear value in the creation of an interface for extracting arbitrary filesystem-related information from the kernel, the complexity of the fsinfo() patch set has proved daunting to reviewers, who have asked for it to be broken up in the past. Filesystem developers have, in recent years, become more insistent that new features come with additions to the xfstests suite as well; those have not yet been provided in this case. fsinfo() has been circulating for a while — Howells posted a version nearly one year ago — but chances are good that it will need to circulate for a bit longer still before it's ready for the mainline.
Destaging ION
The Android system has shipped a couple of allocators for DMA buffers over the years; first came PMEM, then its replacement ION. The ION allocator has been in use since around 2012, but it remains stuck in the kernel's staging tree. The work to add ION to the mainline started in 2013; at that time, the allocator had multiple issues that made inclusion impossible. Recently, John Stultz posted a patch set introducing DMA-BUF heaps, an evolution of ION, that is designed to do exactly that — get the Android DMA-buffer allocator to the mainline Linux kernel.
Applications interacting with devices often require a memory buffer that is shared with the device driver. Ideally, it would be memory mapped and physically contiguous, allowing direct DMA access and minimal overhead when accessing the data from both sides at the same time. ION's main goal is to support that use case; it implements a unified way of defining and sharing such memory buffers, while taking into account the constraints imposed by the devices and the platform.
ION provides a number of memory pools, called "heaps", that have different properties, like whether they are physically contiguous. These heaps include the "system" heap, containing memory allocated via vmalloc_user(), the "system_contig" heap, which uses kzalloc(), and the "carveout" heap, managing a physically contiguous region set aside at boot. The user-space API allows applications to allocate, free, and share buffers from any of these heaps.
ION was developed, out of tree, in parallel with in-tree kernel APIs like DMA buffer sharing (DMA-BUF) and the contiguous memory allocator (CMA). It naturally duplicates parts of their functionality. In addition, as ION's first platform was Android on 32-bit ARM processors, it used ARM-specific kernel APIs when there were no generic ones available. This obviously did not help the upstreaming process. The new DMA-BUF heaps patch set is a complete rework of the ION internals: it uses CMA to implement a physically contiguous heap from a special memory zone and it does not make use of any architecture-specific functions. A self-test included with the patch set presents the API.
Heaps and allocations
Each heap is represented by a special file in the /dev/dma_heap directory; an application will open a specific heap file to be able to allocate from that heap. The allocations are done using the DMA_HEAP_IOC_ALLOC ioctl() on the resulting file descriptor. This command takes one parameter, a pointer to a dma_heap_allocation_data structure:
    struct dma_heap_allocation_data {
        __u64 len;
        __u32 fd;
        __u32 fd_flags;
        __u64 heap_flags;
        __u32 reserved0;
        __u32 reserved1;
    };
len is the size of the desired allocation in bytes. fd should be set to zero when setting up the structure; it will be filled in by the DMA_HEAP_IOC_ALLOC operation with a file descriptor representing the allocated DMA-BUF. fd_flags describes how the file descriptor will be set up (the valid flags are O_CLOEXEC, O_RDONLY, O_WRONLY, and O_RDWR) and heap_flags stores the flags passed to the heap allocator; it should be set to zero. Finally, there are two reserved fields that should also be set to zero. The ioctl() returns zero when the allocation is successful.
To access the allocated memory, the application needs to call mmap() on the returned buffer file descriptor. When the buffer is no longer needed, the user code should close its file descriptor, which will free the allocated memory.
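Putting those steps together, user-space code following this version of the patch set might look like the sketch below; the header name, the device path, and the ioctl() name come from the patches and could change before merging:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/dma-heap.h>   /* provided by the patch set */

    static void *alloc_from_system_heap(size_t len, int *buf_fd)
    {
        struct dma_heap_allocation_data data = {
            .len = len,
            .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd = open("/dev/dma_heap/system", O_RDONLY);

        if (heap_fd < 0)
            return NULL;
        if (ioctl(heap_fd, DMA_HEAP_IOC_ALLOC, &data) < 0) {
            close(heap_fd);
            return NULL;
        }
        close(heap_fd);   /* the buffer fd keeps the allocation alive */
        *buf_fd = data.fd;
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    data.fd, 0);
    }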
To summarize, each heap used by an application has an associated file descriptor, as does every allocated buffer. The buffer handles in DMA-BUF heaps are generic DMA-BUF handles that can be passed to drivers that understand such buffers. This API differs from the original ION approach, where there was one centralized special file for the allocator itself, and separate non-standard handles for the buffers. There was also a specific ioctl() to share the memory that does not exist in DMA-BUF heaps.
Memory access synchronization
One of the complex issues when handling a buffer that is shared between devices and CPUs is deciding who can access it at any given time. This is because of the caches: a processor's accesses typically involve the cache, while device accesses may not. Concurrent access may cause a mismatch between the cache and memory, leading to data corruption. To handle this issue, the drivers and applications must declare when they need to access shared memory for reading or writing; this allows the kernel to manage the caches correctly.
The original ION did not handle synchronization; DMA-BUF heaps uses the DMA-BUF API directly for this purpose. Synchronization is controlled by the DMA_BUF_IOCTL_SYNC ioctl(), which takes a structure with flags to describe the synchronization type. Before accessing the shared buffer, user code should use the DMA_BUF_SYNC_START flag with the required access mode (DMA_BUF_SYNC_READ, DMA_BUF_SYNC_WRITE, or DMA_BUF_SYNC_RW, which is a combination of the two). When the access is finished, it should use DMA_BUF_SYNC_END with the same access-mode flags.
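For example, bracketing a CPU write to a mapped buffer would look something like this sketch, using the existing DMA-BUF ioctl() and structure from <linux/dma-buf.h>:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    static void fill_buffer(int buf_fd, unsigned char *map, size_t len)
    {
        struct dma_buf_sync sync;

        /* Tell the kernel a CPU write access is starting... */
        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
        ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);

        memset(map, 0, len);   /* the actual CPU access */

        /* ...and that it has finished. */
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    }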
Available heaps and adding new ones
The implementation uses a modular approach to the heaps; it defines a general framework that is used by each specific heap implementation. The patch set offers two heap types: the system heap using alloc_page() and the "cma" heap using the CMA allocator (if available in the system).
As in the original ION, it is up to the application developer to choose the right heap, which must correspond to the requirements of all the devices involved. This is a limitation, but the problem is complicated and currently no mainline solution exists. In embedded systems, where DMA heaps will most likely be used, the hardware configuration is fixed, so that the memory constraints of the devices are known in advance.
Kernel developers are given a framework to add more heaps, which currently can only be done at boot time. Each heap needs to fill the operations structure and the export structure. The operations structure, struct dma_heap_ops, is currently simple, containing a single function:
    struct dma_heap_ops {
        int (*allocate)(struct dma_heap *heap,
                        unsigned long len,
                        unsigned long fd_flags,
                        unsigned long heap_flags);
    };
The export structure has the following format:
    struct dma_heap_export_info {
        const char *name;
        struct dma_heap_ops *ops;
        void *priv;
    };
name is the name of the heap, ops is a pointer to the operations structure shown above, and priv is a place for heap-specific data. The allocate() callback's parameters exactly match the fields of the structure passed to the DMA_HEAP_IOC_ALLOC ioctl(); the allocator should return the file descriptor of the allocated DMA-BUF.
After filling out the two structures, the heap implementation needs to add the heap using:
struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
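A kernel-side sketch of the whole sequence might look like the following; the "example" heap name and the empty allocate() body are placeholders of mine, not code from the patch set:

    #include <linux/dma-heap.h>   /* provided by the patch set */
    #include <linux/err.h>
    #include <linux/module.h>

    static int example_allocate(struct dma_heap *heap, unsigned long len,
                                unsigned long fd_flags,
                                unsigned long heap_flags)
    {
        /* ... allocate backing memory, wrap it in a DMA-BUF, and
           return the new buffer's file descriptor ... */
        return -ENOMEM;
    }

    static struct dma_heap_ops example_ops = {
        .allocate = example_allocate,
    };

    static int __init example_heap_init(void)
    {
        struct dma_heap_export_info exp_info = {
            .name = "example",
            .ops = &example_ops,
            .priv = NULL,
        };

        return PTR_ERR_OR_ZERO(dma_heap_add(&exp_info));
    }
    module_init(example_heap_init);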
Next steps
Since the time ION was first introduced, the Linux kernel has gained generic functionality that can now be used in the new implementation. The DMA-BUF heaps interface is simple, and the patch set intentionally leaves out certain functionalities (including a few heap types) and performance optimizations. Its goal is to define the API; the optimizations and new functionalities are expected to come later. A logical next step is the optimization of allocation performance. Stultz has the patches ready, but decided not to include them in the submission to simplify the review process. Allocation performance is expected to be similar to or even better than the original ION.
The patch set is currently in its sixth version and the simple, step-by-step approach seems to be working; the discussions show no major controversies. We can expect that the new API may appear in the kernel soon.
The third Operating-System-Directed Power-Management summit
The third edition of the Operating-System-Directed Power-Management (OSPM) summit was held May 20-22 at the ReTiS Lab of the Scuola Superiore Sant'Anna in Pisa, Italy. The summit is organized to collaborate on ways to reduce the energy consumption of Linux systems, while still meeting performance and other goals. It is attended by scheduler, power-management, and other kernel developers, as well as academics, industry representatives, and others interested in the topics.
As with previous years (2018 and 2017), LWN is happy to be able to bring our readers some extensive writeups of the talks and discussions that went on at OSPM. The OSPM home page has links for slides and video of the talks. These summaries were written by Juri Lelli, Rafael Wysocki, Dario Faggioli, Marco Solieri, Claudio Scordino, Dietmar Eggemann, Valentin Schneider, Morten Rasmussen, Patrick Bellasi, Giovanni Gherdovich, Douglas Raillard, Subhra Mazumdar, Luca Abeni, Parth Shah, Volker Eckert, and Dhaval Giani. They were edited by Jake Edge.
All of the available articles have been edited and added in below.
Day one
- Rock and a hard place: How hard it is to be a CPU idle-time governor: A talk by Rafael Wysocki about the difficulties of CPU idle-time management.
- Virtual-machine scheduling and scheduling in virtual machines: A talk by Dario Faggioli about scheduling for virtual machines versus scheduling inside virtual machines.
- I-MECH: realtime virtualization for industrial automation: A talk by Marco Solieri and Claudio Scordino about using virtualization in automation settings.
- Reworking CFS load balancing: A talk by Vincent Guittot on what is needed to rework load balancing for asymmetries between groups of CPUs.
- CFS wakeup path and Arm big.LITTLE/DynamIQ: A talk by Dietmar Eggemann on some problems with Geekbench on Arm DynamIQ systems.
- Scheduler behavioral testing: A talk by Valentin Schneider on the scheduler testing that Arm does.
Day two
- The future of SCHED_DEADLINE and SCHED_RT for capacity-constrained and asymmetric-capacity systems: A talk by Morten Rasmussen and Patrick Bellasi on problems for the deadline and realtime scheduling classes with certain kinds of processors.
- Frequency scale invariance on x86_64: A talk from Giovanni Gherdovich on handling frequency differences for per-entity load tracking (PELT) on x86_64 systems.
- How can we make schedutil even more effective?: A talk from Douglas Raillard on heuristics for skipping inefficient low-frequency operating power points on mobile platforms.
- Scheduler soft affinity: A talk from Subhra Mazumdar on a proposed soft CPU affinity feature for the kernel.
- SCHED_DEADLINE on heterogeneous multicores: A talk from Luca Abeni on changes to the deadline scheduler to support systems like big.LITTLE.
Day three
- TurboSched: A talk by Parth Shah about ways to deal with the "turbo" mode of some newer processors.
- New approaches to thermal management: A talk by Volker Eckert about using the CFS bandwidth controller for thermal management.
- Proxy execution: A talk by Juri Lelli on the proxy execution feature that is meant to be a better priority-inheritance mechanism.
Rock and a hard place: How hard it is to be a CPU idle-time governor
In the opening session of OSPM 2019, Rafael Wysocki from Intel gave a talk about potential problems faced by the designers of CPU idle-time-management governors, which was inspired by his own experience from the timer-events oriented (TEO) governor work done last year.
In the first place, he said, it should be noted that "CPU idleness" is defined at the level of logical CPUs, which may be CPU cores or simultaneous multithreading (SMT) threads, depending on the hardware configuration of the processor. In Linux, a logical CPU is idle when there are no runnable tasks in its queue, so it falls back to executing the idle task associated with it (there is one idle task for each logical CPU in the system, but they all share the same code, which is the idle loop). Therefore "CPU idleness" is an OS (not hardware) concept and if the idle loop is entered by a CPU, there is an opportunity to save some energy with a relatively small impact on performance (or even without any impact on performance at all) — if the hardware supports that.
The idle loop runs on each idle CPU and it only takes this particular CPU into consideration. As a rule, two code modules are invoked in every iteration of it. The first one, referred to as the CPU idle-time-management governor, is responsible for deciding whether or not to stop the scheduler tick and what to tell the hardware to do; the second one, called the CPU idle-time-management driver, passes the governor's decisions down to the hardware, usually in an architecture- or platform-specific way. Then, presumably, the processor enters a special state in which the CPU in question stops fetching instructions (that is, it does literally nothing at all); that may allow the processor's power draw to be reduced and some energy to be saved as a result. If that happens, the processor needs to be woken up from that state by a hardware event after spending some time, referred to as the idle duration, in it. At that point, the governor is called again so it can save the idle-duration value for future use.
Driver and governor
Since the role of the driver is to pass the decisions made by the governor to the hardware, the governor and driver must be able to communicate with each other. For this purpose they use a common data structure which represents a list of low-power (idle) states that the processor can be asked to enter. It is assumed that all idle states supported by the processor can be arranged in a "ladder" such that if one idle state is "located above" another one (in other words, it is shallower than the other one), power drawn by the processor in that idle state will be greater than power drawn in the other (deeper) idle state. In addition, the time needed to exit that idle state (that is, the time needed by the CPU to resume executing instructions) will be shorter than for a deeper idle state.
This representation is based on a model that is mostly applicable to systems with one CPU. In that model, some time is needed to get into an idle state and some time is needed to get out of it. They are referred to as the entry and exit times of the idle state, respectively. During the entry time, power drawn by the processor initially goes up (as the processor prepares itself to enter the target idle state) and then it falls down (as individual functional units of the processor are put into lower-power states) until it stabilizes at the target "idle state" level. Analogously, during the exit time, power drawn by the processor goes up (as its functional units are brought back to their full-power states) and then it falls down to stabilize at a new operational level at which the CPU can execute instructions again. Hence, there is some cost of entering and exiting an idle state in terms of energy, so there is a minimum time the processor must spend in it, in order to save more energy than could be saved by using a shallower idle state, or in order to save any energy at all.
Each idle state in the list used by the CPU idle-time governor and driver is characterized by two parameters derived from the model outlined above: the target residency and the exit latency. The former is the sum of the entry time and the minimum time to spend in the idle state in order to save energy, and the latter is the sum of its entry and exit times. Both of these values are assumed to be worst-case and the list of idle states used by the CPU idle-time governor and driver is sorted by the target residency and exit latency in this order.
Unfortunately, accurate worst-case values are quite difficult to come by. Namely, in order to measure the worst-case value, one has to experience the worst case in the first place, but it is not known whether or not the worst case has been experienced unless the worst-case value is known beforehand. Worst-case values can be estimated theoretically (in which case they are likely to be overestimated) or they can be approximated by the results of a series of measurements (in which case they are likely to be underestimated). In any case, there is some uncertainty associated with them, which potentially is the source of the first problem for a CPU idle-time governor to face: input data that may be inaccurate.
The role of the governor, of course, is to select an idle state for the processor in every iteration of the idle loop (the driver will ask the processor to enter that state subsequently). For best results, that needs to be done accurately every time. If the selected idle state is too shallow, this means that more energy could be saved by selecting a deeper one instead, so energy-efficiency is hurt. In turn, if the selected idle state is too deep, this means that more energy could be saved by selecting a shallower one instead, so energy-efficiency is hurt in this case too; but, in addition, the exit time of the selected state may be excessive which may also hurt performance (through excessive latency). Thus there is a narrow path for the governor to walk, so to speak, kind of between a rock and a hard place. Staying on that path may be difficult to achieve in practice, because the governor's choices need to be based on prediction.
Event timing
Generally speaking, there are two types of events that can wake up the processor from idle states: predictable events (timers) and random events (device interrupts and inter-CPU wakeups). While the first category is straightforward to deal with directly (the kernel sets up the timers, so it knows when they will trigger), the second one can only be taken into account through statistics. That, in turn, requires time intervals between events to be measured by the kernel, which is susceptible to inaccuracies.
The kernel takes time stamps by running some code to read a clock source, so as a rule, time intervals measured by it are not between events themselves but between the times when some kernel code can run after these events. This does not matter (too much) if the CPU taking the time stamp has not been idle during the corresponding event, but if the CPU has been idle and the processor has been woken up from an idle state by that event, the measured time interval will include the exit time of the idle state. If that idle state is known, its exit latency can be used for "sanitizing" the measurement result in principle. But the exit latency is assumed to be the worst-case value and it is reasonable to expect that the worst case will not occur too often. The real idle-state exit time is likely to be much shorter than the worst-case exit latency in the majority of cases, so using it to estimate the possible inaccuracy may be misleading. In any case, this leads to the problem that the measurements of time intervals the governor can use to build its statistics are not entirely reliable.
On top of that, modern processors may be multicore and may support SMT as well. That complicates the picture considerably, as it leads to the presence of hierarchies of idle states and the different cores or SMT threads within one processor generally interact with each other.
Consider, for example, a processor with four cores in one package and two SMT threads in each core. Suppose that each core may be put into an idle state on its own (depending on what the SMT threads in it do). When the last core goes into an idle state at the "core" level, the package as a whole can be put into an idle state, which covers all of the cores and allows extra energy to be saved. Say that there is a list of core-level idle states for each core and there is a list of package-level idle states for the entire package, such that if all cores are in idle state Cx, the package can go into idle state Px (where x is a positive integer) or into a shallower package-level idle state (it cannot go into a deeper package-level idle state though).
Recall now that the list of idle states to be used by the CPU idle-time governor and driver needs to be provided for each logical CPU and logical CPUs are SMT threads in this example. That requires the idle-state hierarchy described above to be represented as a "flattened" sequence of idle states for each CPU and the target residency and exit latency parameters of these idle states need to be derived from the properties of the "real" idle states in the hierarchy.
One way to do that is to make each idle state in the "flattened" list ("derived" idle state in what follows) represent the combination of core-level idle state Cx and package-level idle state Px, so that if both SMT threads in one core ask for idle state x (or deeper), the core will be allowed to enter Cx (or a shallower core-level idle state) and if all SMT threads in the package ask for idle state x (or deeper), the package will be allowed to enter Px (or a shallower package-level idle state). Accordingly, it appears reasonable to use the target residency and exit latency of Px (at the package level) as the target residency and exit latency of the derived idle state x, respectively, because they are the worst-case values of these parameters for it.
However, that is not sufficient in general, because of the possible interactions between the different cores and SMT threads in the processor. The problem is that the CPU idle-time governor running on a particular SMT thread does not know about the other SMT threads, or cores, and so they cannot be taken into account by it. Suppose, for example, that the governor runs on SMT thread 0 in core 0 and it finds that there is a timer event scheduled for that SMT thread and the time until that event is equal to the target residency of (derived) idle state 3 (which is also the target residency of package-level idle state P3). The governor will then select (derived) idle state 3 for SMT thread 0 and that choice will be passed to the processor by the CPU idle-time management driver.
Next, say one-half of the P3 target residency later, the governor runs on SMT thread 1 in the same core (core 0) and it does not see any events to wake up that SMT thread any time soon, so it selects the deepest available (derived) idle state for it (idle state 6, say). That choice may turn out to be suboptimal if all of the other cores in the package are in idle state C3 or deeper, because core 0 will now be allowed to enter C3 (as both SMT threads in it asked for derived idle state 3 or deeper) and the package will be allowed to enter P3, but if it does so, the processor will be woken up from that idle state prematurely by the timer event on SMT thread 0 (it will only stay in that state for half of its target residency). In this case the problem is that the governor running on SMT thread 1 does not take the timer event scheduled for SMT thread 0 into account because it does not know about it.
The only way to mitigate that issue in the kernel would be to look at all of the logical CPUs that may interact with the given one and take all of the wakeup events on all of them into account while selecting an idle state for that CPU. But that would not scale beyond a relatively small number of CPUs. For a processor with, say, 48 cores and 96 SMT threads in one package, that is not a practical approach. In practice, for sufficiently complicated processors, this issue needs to be taken care of by the hardware, which needs to figure out when each of the SMT threads or cores in question may be woken up. But if the hardware does that anyway, the choices made by the governor may not matter much after all. It may be sufficient to ask for the deepest (derived) idle state every time (at least for part of the list of idle states) and let the hardware decide what to do.
An experiment
That was illustrated by an experiment carried out by Wysocki on a Dell XPS13 9360 laptop, in which two different lists of (derived) idle states were compared with each other. One of them was a "vanilla" one constructed in accordance with the rules outlined above (the target residency of each derived idle state in it was the target residency of the corresponding package-level idle state) and in the other one the target residencies of all idle states deeper than idle state 6 were set to the same value (equal to the idle state 6 target residency). In each of the two configurations, he measured idle power of the processor, ran a few web-browser-based performance benchmarks, and measured "active" power (that is, power drawn while the performance benchmarks were running), over several cycles. The result was that the variation between the cycles in each case was greater than the difference between the two cases, which was not really significant anyway. The interpretation of that was that whatever happened was dominated by the processor's own prediction of wakeup events that overrode the governor's choices. However, Paul Turner commented that this was probably related to the choice of the governor (TEO) and using a different governor in the experiment might lead to a different outcome.
In summary, Wysocki said, the life of a CPU idle-time-management governor is really hard. It needs to walk a narrow path between "too shallow" and "too deep" idle-state choices, but it needs to rely on input data that may be inaccurate (idle-state parameters). It also needs to use statistics based on measured time intervals that may be "fuzzy". Moreover, regardless of how good it is, its choices may be suboptimal if the processor is complicated enough, because it only knows about the particular logical CPU it runs on at the moment. Finally, the hardware (that arguably needs to track CPU wakeups by itself at least in some cases) may override the choices made by the governor anyway. With that in mind, at least for sufficiently complicated hardware, the CPU idle-time-management governor design principles should be simplicity and minimum overhead, leaving the hardest part to the hardware. Moreover, the decision on whether or not to stop the scheduler tick may be more important than the idle-state selection itself.
Virtual-machine scheduling and scheduling in virtual machines
As is probably well known, a scheduler is the component of an operating system that decides which CPU the various tasks should run on and for how long they are allowed to do so. This happens when an OS runs on the bare hardware of a physical host and it is also the case when the OS runs inside a virtual machine. The only difference is that, in the latter case, the OS scheduler marshals tasks among virtual CPUs.
And what are virtual CPUs? Well, in most platforms they are also a kind of special task and they want to run on some CPUs ... therefore we need a scheduler for that! This is usually called the "double-scheduling" property of systems employing virtualization because, well, there literally are two schedulers: one — let us call it the host scheduler, or the hypervisor scheduler — that schedules the virtual CPUs on the host physical CPUs; and another one — let us call it the guest scheduler — that schedules the guest OS's tasks on the guest's virtual CPUs.
Now what are these two schedulers? That depends on the virtualization platform. They are always different, in the sense that it will never happen that, at runtime, a scheduler has to deal with scheduling virtual CPUs and also scheduling tasks that want to run on those same virtual CPUs (well, it can happen, but then you are not doing virtualization). They can be the same, in terms of code, or they can be completely different in that respect as well.
Trying to clarify things a little with an example, we have KVM, where Linux runs on the host and acts as the hypervisor, and hence it is the Linux scheduler that schedules the virtual CPUs of the KVM virtual machines onto the host's physical CPUs. At the same time, if Linux runs in the VM as well, it is still the Linux scheduler that schedules the tasks onto the VM's virtual CPUs. The huge benefit of an approach like this is that work (like features, bug fixes, and performance enhancements) done on the Linux scheduler automatically benefits both the hypervisor scheduler and the guest scheduler (it's the same code!). It also has the drawback, though, that if one wants to add a feature that greatly improves the Linux scheduler's behavior as a hypervisor scheduler, but at the same time has a negative impact on other workloads and use cases, that "might be difficult" (which is a euphemism for saying that comments will be of the "not even when hell freezes!" kind).
On the other hand, we have the Xen project approach, where it is Xen that runs on the host, and hence it is Xen's scheduler that is in charge of the "virtual CPUs on physical CPUs" side. Inside the VM, we still have Linux, and it is that scheduler that deals with running tasks on the virtual CPUs like in the KVM case. This means that the Xen project development community can add whatever virtualization-specific tweaks they like to the Xen scheduler, which is good. But the downside now is that there are no features, testing, or profiling shared and coming for free from the much wider Linux world.
Faggioli has been working in virtualization for quite a few years, mostly on Xen (and, in fact, on the Xen scheduler) in the past. Now, at SUSE, he is playing with both Xen and KVM. He, therefore, focused on hypervisor scheduling and wanted to give an overview of the Linux and Xen schedulers. The conference audience, however, seemed very well aware that the Linux scheduler includes multiple scheduling policies (such as OTHER, FIFO, RR, and DEADLINE), that we can set tasks' affinity to CPUs, that we can do "pooling" with control groups (cgroups), that we can control resource allocation, also with cgroups, and that we deal with NUMA systems in multiple ways, for instance with automatic NUMA balancing.
He therefore skipped quickly through all of that and gave an overview of the Xen scheduler only, in the hope of having said something that sounded new to most conference attendees. Xen also implements multiple scheduling algorithms (CREDIT2, CREDIT, RTDS, and NULL), although they are not exactly scheduling policies in the Linux sense. Similarly, Xen has pooling capabilities via a feature called cpupools, although that really partitions the host, rather than grouping tasks as cgroups do in Linux. Xen also supports affinity of virtual CPUs to physical CPUs, including a soft variant of affinity that Linux does not have (though there was a talk at this very conference about a proposal to implement it).
Xen has recently switched to CREDIT2 as its default scheduler, which is a more advanced and much more maintainable scheduler than its predecessor (CREDIT). Xen also has a scheduler called NULL, a scheduler that does not schedule and, in fact, does — literally — nothing; Faggioli implemented it himself a few years ago and it has become popular. It might mean something if, after years of research and work on OS and hypervisor scheduling, what people like most from your work is the scheduler that does nothing.
Now that we know a little bit more about this problem of double-scheduling, and about the schedulers used by the major open-source virtualization platforms, we can start thinking about whether the hypervisor scheduler and the guest scheduler interact with each other and, if they do not, whether they should. Having the two schedulers interact goes under the name of scheduler paravirtualization. This is nothing new, as it has been discussed many times. Faggioli did not talk about it much, mostly because, if he ever proposes scheduler paravirtualization for Linux, that will be done via email, where (very) angry replies are all he will get back, not when he is in a room with actual Linux scheduler maintainers who can throw frozen sharks at him!
Instead, he talked about system topology, both physical and virtual. How CPUs are arranged and how much they share has an impact on scheduler behavior — quite an impact, actually. In fact, things like whether it is better to wake up a task (or even a virtual CPU) on one CPU or another, and how frequently and how thoroughly the load among the CPUs should be balanced, all depend on the system topology and, when it comes to virtual machines, on the virtual topology. VMs run on hosts that have a physical topology, determined by how the various CPU and memory chips are arranged on the actual silicon, but they also have a virtual topology that the hypervisor and the virtualization tools can define (almost) at will.
Topology is particularly interesting because, while for physical hosts it is determined by the wiring of chips on silicon boards, and hence cannot change (if we ignore CPU and memory hotplug, at least), for virtual machines it can. Strictly speaking, it is not the virtual topology of the VM that varies; that is defined when the VM is created and never changes (again ignoring virtual CPU and memory hotplug). What varies is how the virtual CPUs are mapped onto the physical ones.
Two virtual CPUs may run on two physical CPUs of the same hardware core at a given point in time but, unless appropriate measures are taken, the host scheduler can move them and have them running on two different NUMA nodes whenever it wants. This is one of the tricky aspects of the relationship and interaction between host and guest schedulers that follows from there being both a physical and a virtual topology.
In fact, if one does not define a virtual topology for their VMs, all of the topology-related optimizations that the guest scheduler contains are useless or, even worse, pure overhead. That is why it is usually necessary to define a virtual topology for the VM itself in order to achieve close-to-host performance from inside a VM. It is, however, better to do that with some care, or the guest scheduler will, for example during load balancing, move a task to a virtual CPU that is supposed to be on the same core as the one where the task is currently running, but that is in reality running two NUMA hops away.
The talk included some examples, with graphs and data coming from experiments done on some large AMD servers. They show that, to achieve the best performance inside a VM, the VM absolutely needs to have a virtual topology, and one that is and remains consistent with the topology of the host. That basically means that we need both the host and the guest scheduler, and we also need them to "play well" with each other.
And here we are again, stating not only that both the host and the guest scheduler matter, but also that they seem to need to interact somehow for the best outcome to be achieved. In fact, it would be good to know, from inside the guest, whether or not the virtual CPUs are pinned to physical CPUs at the host level. Once we know that, we can decide whether the guest scheduler should honor or ignore the virtual topology.
Currently, in the Linux scheduler, the link between topology (whether physical or virtual) and the actual scheduler behavior is represented by the scheduling domains, which are built to match that topology. Scheduling domains also come with flags, which directly determine what the Linux scheduler does when making decisions about tasks running on CPUs within a given domain. The structure of the scheduling-domain hierarchy and the flags of the domains in it have a direct impact on the behavior of the scheduler, and hence on performance.
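One way to get a feel for this hierarchy is to read it back from a running kernel. On kernels of this era built with CONFIG_SCHED_DEBUG, the domains and their flags appear under /proc/sys/kernel/sched_domain/; the exact location has moved around between kernel versions, so treat the path in this sketch as an assumption:

    #include <stdio.h>

    /* Walk CPU 0's scheduling-domain hierarchy and print each level's
     * name and flags. Assumes a kernel built with CONFIG_SCHED_DEBUG
     * that exposes the hierarchy under /proc/sys/kernel/sched_domain/
     * (newer kernels moved this information to debugfs). */
    int main(void)
    {
        char path[128], buf[256];

        for (int d = 0; ; d++) {
            snprintf(path, sizeof(path),
                     "/proc/sys/kernel/sched_domain/cpu0/domain%d/name", d);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                  /* no more domain levels */
            if (fgets(buf, sizeof(buf), f))
                printf("domain%d name:  %s", d, buf);
            fclose(f);

            snprintf(path, sizeof(path),
                     "/proc/sys/kernel/sched_domain/cpu0/domain%d/flags", d);
            if ((f = fopen(path, "r")) != NULL) {
                if (fgets(buf, sizeof(buf), f))
                    printf("domain%d flags: %s", d, buf);
                fclose(f);
            }
        }
        return 0;
    }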
And that was the bulk of Faggioli's proposal: let's try to figure out what the best scheduling-domain hierarchy and set of scheduling-domain flags are for a virtual machine, depending on the physical host topology, the virtual-machine topology, and other factors (such as whether or not the vCPUs are pinned), and use that to "drive" the behavior of the guest scheduler. Basically, instead of running full throttle toward some form of scheduler paravirtualization, we use something that is already there and see how far we can get. We "only" need a way to figure out (e.g. at VM boot time) whether or not this special "virtual scheduling-domain hierarchy" should be built and used. On this front, Faggioli was happy to hear that the Linux scheduler people in the audience thought it should be possible.
This is all still in the proposal state, but more data from benchmarks is already being collected, and work toward a prototype will start "soon" (for some definition of soon).
I-MECH: realtime virtualization for industrial automation
The typical systems used in industrial automation (e.g. for axis control) consist of a "black box" executing a commercial realtime operating system (RTOS) plus a set of control design tools meant to be run on a different desktop machine. This approach, besides imposing expensive royalties on the system integrator, often does not offer the desired degree of flexibility for testing/implementing novel solutions (e.g., running both control code and design tools on the same platform).
In the talk "I-MECH: realtime virtualization for industrial automation", Marco Solieri (from the University of Modena and Reggio Emilia), illustrated the work done in the context of the I-MECH project, aiming at effectively using virtualization techniques in the domain of industrial automation. In particular, an improved version of the Xen hypervisor has been used for running both the realtime control code (on the ERIKA Enterprise open-source RTOS) and the design tools on the same hardware, reducing the overall hardware costs.
Besides illustrating the overall software architecture and describing the various open-source components, the talk addressed the problem of interference on hardware resources shared by multiple OSes (which undermines the performance of the realtime part), showing the techniques implemented on Xen for preventing or reducing these problems.
After the talk, Evidence Srl showed a demo consisting of an EtherCAT master device based on the ERIKA Enterprise RTOS. The code, automatically generated from Simulink running on a different Xen guest, drove a Beckhoff EtherCAT slave connected to a DC motor.
CFS wakeup path and Arm big.LITTLE/DynamIQ
"One task per CPU" workloads, as emulated by multi-core Geekbench, can suffer on traditional two-cluster big.LITTLE systems due to the fact that tasks finish earlier on the big CPUs. Arm has introduced a more flexible DynamIQ architecture that can combine big and LITTLE CPUs into a single cluster; in this case, early products apply what's known as phantom scheduler domains (PDs). The concept of PDs is needed for DynamIQ so that the task scheduler can use the existing big.LITTLE extensions in the Completely Fair Scheduler (CFS) scheduler class.
Multi-core Geekbench consists of several tests during which N CFS tasks perform an equal amount of work. The synchronization mechanism pthread_barrier_wait() (i.e. a futex) is used to wait for all tasks to finish their work in test T before starting the tasks again for test T+1.
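That structure is easy to reproduce outside of Geekbench; here is a minimal sketch of the work-then-rendezvous pattern (the task count and the amount of work are arbitrary; build with -pthread):

    #include <pthread.h>

    #define NTASKS 8
    static pthread_barrier_t barrier;

    /* Each task performs its share of test T, then sleeps in
     * pthread_barrier_wait() until the slowest task arrives; only then
     * does everybody start test T+1 together. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int test = 0; test < 3; test++) {
            volatile double x = 0.0;
            for (long i = 0; i < 10000000; i++)   /* the equal amount of work */
                x += i * 0.5;
            pthread_barrier_wait(&barrier);       /* futex-based rendezvous */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTASKS];

        pthread_barrier_init(&barrier, NULL, NTASKS);
        for (int i = 0; i < NTASKS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTASKS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }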
The problem for Geekbench on big.LITTLE is related to the grouping of big and LITTLE CPUs into separate scheduler (or CPU) groups of the so-called die-level scheduling domain. The two groups exist because the big CPUs share a last-level cache (LLC), as do the LITTLE CPUs. This isn't true any more for DynamIQ, hence the use of the "phantom" notion here.
The tasks of test T finish earlier on big CPUs and go to sleep at the barrier B. Load balancing then makes sure that the tasks on the LITTLE CPUs migrate to the big CPUs where they continue to run the rest of their work in T before they also go to sleep at B. At this moment, all the tasks in the wake queue have a big CPU as their previous CPU (p->prev_cpu). After the last task has entered pthread_barrier_wait() on a big CPU, all tasks on the wake queue are woken up.
They pass through the CFS wakeup path, where they are all routed through the fast path (select_idle_sibling()). The CPU search space is the big CPUs (the so-called LLC domain). The fast path is chosen since the current CPU and the previous CPU are both big CPUs, and neither the wakee_flips mechanism in wake_wide() nor the CPU-capacity check in wake_cap() can force the tasks into the slow path (i.e. into find_idlest_cpu()) by setting want_affine to 0. The result is that all tasks wake up on a big CPU and it takes the load balancer some time to spread some of them to LITTLE CPUs during test T+1.
This temporary co-scheduling on the big CPUs (with simultaneous idling of the LITTLE CPUs) could be avoided by propagating the length of the wake queue (related to the futex) with the task to the CFS wakeup path and comparing it with the LLC size (i.e. the number of CPUs in that LLC domain). If the wake-queue length is greater than the LLC size, the task would be forced into the slow path by setting want_affine to 0. Although this particular example uses a futex, other mechanisms in the kernel use wake queues as well.
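As a model of what that check would amount to, here is an illustrative user-space sketch of the proposed heuristic; it is not an actual kernel patch, and the plumbing of the wake-queue length into the wakeup path is assumed:

    #include <stdbool.h>
    #include <stdio.h>

    /* User-space model of the proposed check: force the slow path
     * (want_affine = 0) whenever more tasks are being woken up than
     * there are CPUs sharing the waker's last-level cache. */
    static bool use_fast_path(unsigned int wake_q_len,
                              unsigned int llc_size, bool want_affine)
    {
        if (wake_q_len > llc_size)
            want_affine = false;    /* fall through to find_idlest_cpu() */
        return want_affine;
    }

    int main(void)
    {
        /* Four big CPUs share an LLC; eight tasks sleep at the barrier. */
        printf("8 wakees, LLC of 4: fast path = %d\n",
               use_fast_path(8, 4, true));      /* 0: use the slow path */
        printf("2 wakees, LLC of 4: fast path = %d\n",
               use_fast_path(2, 4, true));      /* 1: the fast path is fine */
        return 0;
    }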
The feedback was that the idea might be a bit too pattern-specific, and a more generic proposal came up during the discussion: getting rid of PDs on DynamIQ should solve this problem better. That requires an audit of the existing big.LITTLE-specific wakeup path, specifically of the function wake_cap(). Of course, backward compatibility for traditional two-cluster big.LITTLE systems has to be preserved.
Scheduler behavioral testing
Validating scheduler behavior is a tricky affair, as multiple subsystems both compete and cooperate with each other to produce the task placement we observe. Valentin Schneider from Arm described the approach taken by his team (the folks behind energy-aware scheduling — EAS) to tackle this problem.
Energy-aware scheduling relies on several building blocks, and any bug or issue present in those could affect the final task placement. In no particular order, they are:
- per-entity load tracking (PELT), to have an idea of how much CPU bandwidth tasks actually use
- cpufreq (schedutil), to get just the right performance level for a given workload
- misfit, to migrate CPU-bound tasks away from LITTLE CPUs
To validate these building blocks, the team relies on two main activities:
- Fortnightly mainline integration: This consists of taking the latest tip/sched/core and adding all in-flight (on the list or soon to be) patches from the team. It is then tested on different Arm boards with several hundred test iterations.
Rafael J. Wysocki pointed out that his patches don't land in tip/sched/core, but Arm folks would want to test them nonetheless, since they rely on a properly functioning cpufreq implementation. It was agreed that Wysocki's linux-pm branch should be part of the testing base for future mainline integrations.
- Patch validation: Anyone can easily validate their patches (or patches they're reviewing) using LISA and some board on their desk or in the continuous integration (CI) system.
In short, the tests run by LISA consist of synthetic workloads generated via rt-app; their execution is traced, and the resulting traces are post-processed. The point of using rt-app is to carefully craft synthetic workloads that target specific scheduler behaviors and minimize the number of functionalities involved. It shouldn't be too difficult to see why hackbench doesn't fit that bill.
An example of those tests is the EAS behavior test suite. Since LISA uses rt-app to craft its test workloads, it is possible to read the generated rt-app workload profile to estimate the utilization of the generated tasks before even running them. Furthermore, the energy model used by EAS is available to user space (via debugfs), so it can be fetched by the test tool.
With these two pieces of data, we can estimate an energy-optimal task placement and compute an optimal "energy budget". We can then run the workload and record the task placement via the sched_switch and sched_wakeup trace events. With the same energy model, we can estimate how much energy this placement cost, and compare the optimal versus estimated costs.
Energy is not everything, however, as we must make sure we maintain a sufficient level of performance. Rt-app provides some post-execution statistics that we monitor, letting us validate both energy efficiency and performance on a single workload execution.
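To make the comparison concrete, the energy estimate reduces to simple arithmetic over the energy model. The following toy model is not LISA's actual code, and its power and capacity numbers are invented, but it shows the shape of the calculation:

    #include <stdio.h>

    /* Toy model of the energy estimate: each task is charged the
     * energy-model cost of the CPU type it was placed on, scaled by
     * its utilization. Power and capacity numbers are invented. */
    struct cpu_type { const char *name; double power; double capacity; };

    static const struct cpu_type em[] = {
        { "LITTLE", 100.0,  462.0 },
        { "big",    520.0, 1024.0 },
    };

    static double placement_energy(const int cpu[], const double util[], int n)
    {
        double energy = 0.0;
        for (int i = 0; i < n; i++)
            /* Busy time scales with util/capacity; energy = power * time. */
            energy += em[cpu[i]].power * (util[i] / em[cpu[i]].capacity);
        return energy;
    }

    int main(void)
    {
        double util[2]    = { 150.0, 800.0 }; /* one light task, one heavy */
        int    optimal[2] = { 0, 1 };         /* light on LITTLE, heavy on big */
        int    actual[2]  = { 1, 1 };         /* everything packed on big */

        printf("optimal budget: %.1f\n", placement_energy(optimal, util, 2));
        printf("observed cost:  %.1f\n", placement_energy(actual, util, 2));
        return 0;
    }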
Load-tracking signal (PELT) tests also rely on trace events. However, there are no PELT events in the upstream kernel and the addition of new trace events is frowned upon by the scheduler maintainers. The concern is that this would create an ABI that could be leveraged by some user-space tool and would then have to be maintained. For now, this testing is done with out-of-tree trace events.
Giovanni Gherdovich interjected that these trace events are very useful for debugging purposes and that he's been backporting them to his kernels for a few years now. Thankfully, Qais Yousef has been working on an upstream-friendly solution involving the separation of the trace points and the definition of their associated trace events, which hasn't met any severe objection. See this thread for more information.
Schneider also pointed out that he used this framework to write a test case for his first-ever mailing-list patch review. According to him, it was fairly straightforward to create a synthetic workload and verify that the values obtained from the trace events behaved as described by the patch set, even though the actual implementation might have eluded him. This is a nice ramping-up activity that benefits both reviewer and developer.
Now, as mentioned earlier, these tests target specific scheduler bits and thus rely heavily on having little interference from undesired tasks. Buildroot is used to obtain a minimal-yet-functioning system. When the user space cannot be changed (e.g. for testing on Android devices), the freezer control group is used for comparable (though inferior) results.
Still, all of this careful "white-rooming" cannot prevent system tasks from executing: sshd, adbd, NFS exchanges, and so on. That is why the tests also monitor non-test-related ("noisy") tasks and assert that their runtime was not too significant. Should that not be the case, the test result is changed to "undecided". This extra result type (beyond "pass" and "fail") was added to prevent real failures from being ignored: if it is known that a test is easily impacted by background activity and always has a ~20% failure rate because of that, those failures would never be investigated (akin to the boy who cried wolf). Properly discarding test results when the initial conditions and expected environment are not met makes it easier to detect bugs and errors.
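The verdict logic itself is simple; a toy rendition (with an invented noise threshold, not LISA's actual code) might be:

    #include <stdio.h>

    enum verdict { PASS, FAIL, UNDECIDED };

    /* Toy version of the verdict logic: a run only yields a real
     * pass/fail when background noise stayed below a threshold
     * (the 1% figure is invented for the example). */
    static enum verdict classify(int assertion_held, double noise_pct)
    {
        const double noise_threshold = 1.0;

        if (noise_pct > noise_threshold)
            return UNDECIDED;       /* environment invalid; don't cry wolf */
        return assertion_held ? PASS : FAIL;
    }

    int main(void)
    {
        printf("%d\n", classify(0, 5.0));   /* noisy run -> 2 (UNDECIDED) */
        printf("%d\n", classify(0, 0.3));   /* quiet run -> 1 (FAIL) */
        printf("%d\n", classify(1, 0.3));   /* quiet run -> 0 (PASS) */
        return 0;
    }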
The list of usual noisy tasks was given for the HiKey 960:
- irq/63-tsensor (thermal alarm IRQ)
- sshd
- rcu_preempt
- sugov (scheduler governor kthread)
These tasks mostly run for less than 1% of the entire test duration, which is acceptable. Still, it is interesting to know what else is being run on the system.
The future of SCHED_DEADLINE and SCHED_RT for capacity-constrained and asymmetric-capacity systems
The kernel's deadline scheduling class (SCHED_DEADLINE) enables realtime scheduling where every task is guaranteed to meet its deadlines. Unfortunately, SCHED_DEADLINE's current view of CPU capacity is far too simple; it doesn't take dynamic voltage and frequency scaling (DVFS), simultaneous multithreading (SMT), asymmetric CPU capacity, or any kind of performance capping (e.g. due to thermal constraints) into consideration.
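For context, a deadline reservation is expressed as a runtime/deadline/period triple passed to the sched_setattr() system call. The minimal example below mirrors the sched_setattr(2) man page; since glibc provides no wrapper, the structure and call are spelled out by hand:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6            /* from the kernel UAPI headers */
    #endif

    /* glibc provides no sched_setattr() wrapper, so the structure and
     * the system call are done by hand, as in sched_setattr(2). */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* these three are in nanoseconds */
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  10 * 1000 * 1000,  /* 10ms of CPU time... */
            .sched_deadline =  30 * 1000 * 1000,  /* ...within 30ms... */
            .sched_period   = 100 * 1000 * 1000,  /* ...every 100ms */
        };

        /* Needs privilege; fails if admission control rejects the set. */
        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");
            return 1;
        }
        pause();        /* now running as a deadline task */
        return 0;
    }

A failure with EBUSY indicates that the admission-control test rejected the reservation.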
Consider, in particular, running deadline tasks in a system with performance capping: the question is what level of guarantee SCHED_DEADLINE should provide. An interesting discussion about the pros and cons of different approaches (weak, hard, or mixed guarantees) developed during this presentation. There were many different views, but the discussion didn't really conclude and will have to be continued at the Linux Plumbers Conference later this year.
The topic of guaranteed performance will become more important for mobile systems in the future, as performance capping is likely to become more common. Defining hard guarantees is almost impossible on real systems, since silicon behavior depends heavily on environmental conditions. The main pushback on the existing scheme is that the guaranteed bandwidth budget might be too conservative; SCHED_DEADLINE might thus not allow enough bandwidth to be reserved for use cases that have higher bandwidth requirements but could tolerate reservations occasionally not being honored.
The discussion then evolved toward the concept of a "tunable granted capacity", which is always available under "reasonable working conditions", and a maximum "best-effort capacity". Such a model would allow SCHED_DEADLINE to admit tasks under two different service-level objectives and thus to know which tasks will be more likely to suffer deadline misses should CPU capacity be reduced. However, this approach would make the kernel-space admission-control policy more complicated. A different strategy could be to keep kernel-space admission control simple and implement more advanced admission-control policies in a privileged entity in user space.
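To illustrate the idea, a two-level admission test might look like the toy sketch below; this is purely illustrative, with invented names and numbers, and is not a proposed kernel interface:

    #include <stdbool.h>
    #include <stdio.h>

    struct reservation { double bandwidth; bool granted; };

    /* Toy two-level admission test: "granted" reservations must fit
     * within the capacity available under reasonable working
     * conditions; all reservations together must fit within the
     * best-effort maximum. */
    static bool admit(const struct reservation *r, int n,
                      double granted_cap, double best_effort_cap)
    {
        double granted = 0.0, total = 0.0;

        for (int i = 0; i < n; i++) {
            total += r[i].bandwidth;
            if (r[i].granted)
                granted += r[i].bandwidth;
        }
        return granted <= granted_cap && total <= best_effort_cap;
    }

    int main(void)
    {
        struct reservation set[] = {
            { 0.30, true  },    /* must meet deadlines even when capped */
            { 0.40, false },    /* may miss deadlines under capping */
        };

        printf("%s\n", admit(set, 2, 0.5, 0.9) ? "admitted" : "rejected");
        return 0;
    }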
There are various improvements that would make SCHED_DEADLINE more applicable to modern mobile platforms. The "Capacity awareness for SCHED_DEADLINE" RFC patch set by Luca Abeni provides an essential feature for systems with asymmetric CPU capacity. Longer-term tasks involve exploring ways to make the admission control aware of guaranteed performance levels from the hardware or firmware, as well as introducing energy-aware task placement into the deadline scheduling class.
The realtime scheduling class (SCHED_RT) also has issues when it comes to use in mobile systems. Running RT tasks at the highest CPU capacity, the way mainline Linux does, is too expensive and not always required. SCHED_RT also assumes symmetric CPU capacities and is unaware of running or runnable CFS tasks.
As with SCHED_DEADLINE, there are improvements that would make SCHED_RT more applicable to modern mobile platforms. The utilization-clamping patch set (which has been merged for 5.3) allows for more fine-grained CPU-frequency decisions via per-task, per-task-group, or system-wide performance constraints. Other possible improvements, like making SCHED_RT CPU-capacity aware to get more predictable performance on asymmetric-CPU-capacity systems, or making SCHED_RT aware of CFS tasks (and vice versa), were also briefly mentioned.
Brief items
Security
GnuPG 2.2.17 released
GnuPG 2.2.17 has been released to mitigate attacks on keyservers. In particular, GPG will now ignore all key-signatures received from keyservers by default.
Kernel development
Kernel release status
The 5.2 kernel was released on July 7. Linus Torvalds said: "So despite a fairly late core revert, I don't see any real reason for another week of rc, and so we have a v5.2 with the normal release timing."
Some of the more significant changes in 5.2 are a new CLONE_PIDFD flag to clone() to obtain a pidfd for the new process, a significant BPF verifier performance improvement that allows the maximum size of a BPF program to be raised to 1 million instructions, a BPF hook to manage sysctl knobs, a new set of system calls for filesystem mounting, case-insensitive lookups for the ext4 filesystem, a process freezer for version-2 control groups, pressure-stall monitors, and, of course, a vast number of fixes. See the KernelNewbies 5.2 page for a lot more details.
Stable updates: 5.1.17, 4.19.58, 4.14.133, 4.9.185, and 4.4.185 were released on July 10.
Ryabitsev: Patches carved into developer sigchains
Konstantin Ryabitsev has posted a lengthy blog entry describing his vision for moving away from email for kernel development. "I think it's way past due time for us to come up with a solution that would offer decentralized, self-archiving, fully attestable, 'cradle-to-grave' development platform that covers all aspects of project development and not just the code. It must move us away from mailing lists, but avoid introducing single points of trust, authority, and failure."
Quote of the week
On the other hand, my expectation of a "stable kernel" is a kernel without known bugs. I associate the word "stable" with stable runtime rather than a stable codebase.
Distributions
Debian 10 ("Buster") has been released
Debian version 10, code named "Buster", has been released. It has lots of new features, including: "In this release, GNOME defaults to using the Wayland display server instead of Xorg. Wayland has a simpler and more modern design, which has advantages for security. However, the Xorg display server is still installed by default and the default display manager allows users to choose Xorg as the display server for their next session. Thanks to the Reproducible Builds project, over 91% of the source packages included in Debian 10 will build bit-for-bit identical binary packages. This is an important verification feature which protects users against malicious attempts to tamper with compilers and build networks. Future Debian releases will include tools and metadata so that end-users can validate the provenance of packages within the archive. For those in security-sensitive environments AppArmor, a mandatory access control framework for restricting programs' capabilities, is installed and enabled by default. Furthermore, all methods provided by APT (except cdrom, gpgv, and rsh) can optionally make use of seccomp-BPF sandboxing. The https method for APT is included in the apt package and does not need to be installed separately." More information can be found in the release notes.
Miller: Red Hat, IBM, and Fedora
Fedora project leader Matthew Miller reassures the community that IBM's acquisition of Red Hat, which just closed, will not affect Fedora. "In Fedora, our mission, governance, and objectives remain the same. Red Hat associates will continue to contribute to the upstream in the same ways they have been."
Distribution quote of the week
But Saturday night was a whole different game. Imagine taking a rucksack out of the cupboard under the stairs, and thinking it a bit too heavy for an empty bag. You open the top and it’s full of small packages tied up with brown paper and string. As you take each one out and set it aside you realise, with mounting horror, that these are all packages missing from buster and which should have been in the release. But it’s too late to do anything about that now; you know the press release went out already because you signed it off yourself, so you can’t do anything else but get all the packages out of the bag and see how many were forgotten. And you dig, and count, and dig, and it’s like Mary Poppins’ carpet bag, and still they keep on coming…
Development
Firefox 68.0 released
Firefox 68.0 has been released, with an Extended Support Release (ESR) version available, in addition to the usual rapid release version. The rapid release version features a dark mode in reader view, improved extension security and discovery, and more. See the release notes for details. The ESR release notes list some additional policies and other improvements.
Release of the Open Build Service, Version 2.10
The Open Build Service (OBS) project has announced the release of version 2.10 of OBS, which is a system to build and distribute binary packages built from source code. The new version has revamped the web user interface and upgraded the container delivery mechanisms. Beyond that, it has fixed plenty of bugs (of course), added a bunch of smaller features, and now provides integration with other online tools: "Another trend in the professional software world is to plug various tools together into grand continuous integration/deployment cycles (CI/CD). You, of course, also want to throw the OBS into the mix and we traditionally supported you to do that on GitHub with webhooks. The 2.10 release now brings the same kind of support to other tools like Gitlab and Pagure. You can trigger all kinds of actions on OBS for every git commit or other events that happen on those tools."
Miscellaneous
Software in the Public Interest board elections
Software in the Public Interest (SPI) has announced that nominations are open until July 15 for 3 seats on the SPI board. "The ideal candidate will have an existing involvement in the Free and Open Source community, though this need not be with a project affiliated with SPI."
Page editor: Jake Edge
Announcements
Newsletters
Distributions and system administration
Development
- Emacs News (July 8)
- What's cooking in git.git (July 9)
- LLVM Weekly (July 8)
- LXC/LXD/LXCFS Weekly Status (July 8)
- OCaml Weekly News (July 9)
- Perl Weekly (July 8)
- Weekly changes in and around Perl 6 (July 8)
- PostgreSQL Weekly News (July 7)
- Python Weekly Newsletter (July 4)
- Ruby Weekly News (July 4)
- This Week in Rust (July 9)
- Wikimedia Tech News (July 8)
Meeting minutes
- Fedora Council minutes (July 10)
- GNOME Foundation Board Minutes (June 17)
- GNOME Foundation Board Minutes (June 24)
- Python Steering Council Update (July 8)
Calls for Presentations
CFP Deadlines: July 11, 2019 to September 9, 2019
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
July 12 | November 19–21 | KubeCon + CloudNativeCon | San Diego, CA, USA |
July 12 | October 8–11 | Zeek Week (formerly BroCon) | Seattle, WA, USA |
July 15 | October 15–18 | PostgreSQL Conference Europe | Milan, Italy |
July 20 | September 20–22 | All Systems Go! 2019 | Berlin, Germany |
July 26 | September 21 | Central Pennsylvania Open Source Conference | Lancaster, PA, USA |
July 31 | September 12–15 | GNU Tools Cauldron | Montreal, Canada |
July 31 | October 31–November 1 | Linux Security Summit | Lyon, France |
July 31 | November 4–7 | Open Source Monitoring Conference | Nuremberg, Germany |
August 2 | September 9–11 | Linux Plumbers Conference | Lisbon, Portugal |
August 3 | September 23–24 | Lustre Administrator and Developer Workshop 2019 | Paris, France |
August 11 | January 13–17 | linux.conf.au | Gold Coast, Australia |
August 17 | November 1–2 | Ohio LinuxFest 2019 | Columbus, OH, USA |
August 18 | November 19–20 | DevOpsDays Galway 2019 | Galway, Ireland |
August 30 | October 10–11 | PyCon ZA 2019 | Johannesburg, South Africa |
August 31 | November 12–15 | Linux Application Summit | Barcelona, Spain |
August 31 | November 7 | Open Source Camp #4 Foreman | Nuremberg, Germany |
September 2 | October 3 | PostgresqlConf South Africa | Cape Town, South Africa |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Microconferences Accepted into 2019 Linux Plumbers Conference
The Android Microconference: "Android has a long history at Linux Plumbers and has continually made progress as a direct result of these meetings. This year's focus will be a fairly ambitious goal to create a Generic Kernel Image (GKI) (or one kernel to rule them all!). Having a GKI will allow silicon vendors to be independent of the Linux kernel running on the device. As such, kernels could be easily upgraded without requiring any rework of the initial hardware porting efforts. This microconference will also address areas that have been discussed in the past."
The RDMA Microconference: "RDMA has
been a microconference at Plumbers for the last three years and will be
continuing its productive work for a fourth year. The RDMA meetings at
the previous Plumbers have been critical in getting improvements to the
RDMA subsystem merged into mainline. These include a new user API,
container support, testability/syzkaller, system bootup, Soft iWarp,
and more. There are still difficult open issues that need to be
resolved, and this year’s Plumbers RDMA Microconfernence is sure to
come up with answers to these tough problems.
"
The Scheduler Microconference: "The scheduler determines what runs on the CPU at any given time. The lag of your desktop is affected by the scheduler, for example. There are a few different scheduling classes for a user to choose from, such as the default class (SCHED_OTHER) or a real-time class (SCHED_FIFO, SCHED_RT and SCHED_DEADLINE). The deadline scheduler is the newest and allows the user to control the amount of bandwidth received by a task or group of tasks. With cloud computing becoming popular these days, controlling bandwidth of containers or virtual machines is becoming more important. The Real-Time patch is also destined to become mainline, which will add more strain on the scheduling of tasks to make sure that real-time tasks make their deadlines (although, this Microconference will focus on non real-time aspects of the scheduler. Please defer real-time topics to the Real-time Microconference). This requires verification techniques to ensure the scheduler is properly designed."
The VFIO/IOMMU/PCI Microconference: "The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high performance systems. This requires the kernel to coordinate the PCI devices, the IOMMUs they are connected to and the VFIO layer used to manage them (for user space access and device pass-through) so that users (and virtual machines) can use them effectively. The kernel interfaces to control PCI devices have to be designed in-sync for all three subsystems, which implies that there are lots of intersections in the design of kernel control paths for VFIO/IOMMU/PCI requiring kernel code design discussions involving the three subsystems at once."
LPC will be held September 9-11 in Lisbon, Portugal.
Events: July 11, 2019 to September 9, 2019
The following event listing is taken from the LWN.net Calendar.
Date(s) | Event | Location |
---|---|---|
July 8–14 | EuroPython 2019 | Basel, Switzerland |
July 9–11 | Xen Project Developer and Design Summit | Chicago, IL, USA |
July 10–12 | 2019 USENIX Annual Technical Conference | Renton, WA, USA |
July 15–18 | O'Reilly Open Source Software Conference | Portland, OR, USA |
July 17–19 | Automotive Linux Summit | Tokyo, Japan |
July 17–19 | Open Source Summit | Tokyo, Japan |
July 21–28 | DebConf 2019 | Curitiba, Brazil |
August 2–3 | DEVCONF.in | Bengaluru, India |
August 2–4 | Linux Developer Conference Brazil | São Paulo, Brazil |
August 6–10 | PyCon Africa | Accra, Ghana |
August 8–11 | Flock To Fedora | Budapest, Hungary |
August 10–11 | FrOSCon 2019 | St. Augustin (near Cologne), Germany |
August 17–18 | Conference for Open Source Coders, Users & Promoters | Taipei, Taiwan |
August 17 | FOSScon 2019 | Philadelphia, PA, USA |
August 19–21 | Linux Security Summit | San Diego, CA, USA |
August 20 | Tracing Summit 2019 | San Diego, CA, USA |
August 21–23 | Open Source Summit North America | San Diego, CA, USA |
August 21–23 | Embedded Linux Conference NA | San Diego, CA, USA |
August 21–25 | Chaos Communication Camp | Zehdenick, Germany |
August 22–23 | Rust Conf | Portland, Oregon, USA |
August 23–28 | GNOME User and Developer Conference | Thessaloniki, Greece |
August 26–30 | FOSS4G 2019 | Bucharest, Romania |
September 2–6 | EuroSciPy 2019 | Bilbao, Spain |
September 3–6 | Open Source Firmware Conference 2019 | Sunnyvale & Menlo Park, CA, USA |
September 4–6 | GNU Hackers Meet | Madrid, Spain |
September 6–9 | PyColorado | Denver, CO, USA |
September 7–13 | Akademy 2019 | Milan, Italy |
If your event does not appear here, please tell us about it.
Security updates
Alert summary July 4, 2019 to July 10, 2019
Dist. | ID | Release | Package | Date |
---|---|---|---|---|
Arch Linux | ASA-201907-1 | | irssi | 2019-07-09 |
Arch Linux | ASA-201907-2 | | python-django | 2019-07-09 |
Arch Linux | ASA-201907-3 | | python2-django | 2019-07-09 |
CentOS | CESA-2019:1652 | C6 | libssh2 | 2019-07-03 |
CentOS | CESA-2019:1650 | C6 | qemu-kvm | 2019-07-03 |
Debian | DLA-1845-1 | LTS | dosbox | 2019-07-07 |
Debian | DLA-1844-1 | LTS | lemonldap-ng | 2019-07-04 |
Debian | DLA-1848-1 | LTS | libspring-security-2.0-java | 2019-07-09 |
Debian | DSA-4476-1 | stable | python-django | 2019-07-05 |
Debian | DLA-1850-1 | LTS | redis | 2019-07-10 |
Debian | DLA-1847-1 | LTS | squid3 | 2019-07-07 |
Debian | DLA-1846-1 | LTS | unzip | 2019-07-07 |
Debian | DLA-1849-1 | LTS | zeromq3 | 2019-07-08 |
Debian | DSA-4477-1 | stable | zeromq3 | 2019-07-08 |
Fedora | FEDORA-2019-18868e1715 | F30 | expat | 2019-07-10 |
Fedora | FEDORA-2019-6e77507660 | F29 | filezilla | 2019-07-06 |
Fedora | FEDORA-2019-7b9af09b17 | F30 | filezilla | 2019-07-06 |
Fedora | FEDORA-2019-6e77507660 | F29 | libfilezilla | 2019-07-06 |
Fedora | FEDORA-2019-7b9af09b17 | F30 | libfilezilla | 2019-07-06 |
Fedora | FEDORA-2019-8015e5dc40 | F30 | samba | 2019-07-06 |
Fedora | FEDORA-2019-d66febb5df | F29 | tomcat | 2019-07-04 |
Mageia | MGASA-2019-0205 | 6, 7 | dosbox | 2019-07-10 |
Mageia | MGASA-2019-0206 | 6, 7 | irssi | 2019-07-10 |
Mageia | MGASA-2019-0207 | 6, 7 | microcode | 2019-07-10 |
Mageia | MGASA-2019-0204 | 7 | postgresql11 | 2019-07-10 |
openSUSE | openSUSE-SU-2019:1699-1 | 15.0 | gvfs | 2019-07-08 |
openSUSE | openSUSE-SU-2019:1697-1 | 15.1 | gvfs | 2019-07-08 |
Oracle | ELSA-2019-4703 | OL6 | kernel | 2019-07-03 |
Oracle | ELSA-2019-4703 | OL7 | kernel | 2019-07-03 |
Oracle | ELSA-2019-4708 | OL7 | kernel | 2019-07-08 |
Red Hat | RHSA-2019:1714-01 | EL8 | bind | 2019-07-10 |
Red Hat | RHSA-2019:1726-01 | EL6 | dbus | 2019-07-10 |
Red Hat | RHSA-2019:1696-01 | EL8 | firefox | 2019-07-08 |
Red Hat | RHSA-2019:1722-01 | OSP10.0 | openstack-ironic-inspector | 2019-07-10 |
Red Hat | RHSA-2019:1734-01 | OSP13.0 | openstack-ironic-inspector | 2019-07-10 |
Red Hat | RHSA-2019:1742-01 | OSP13.0 | openstack-tripleo-common | 2019-07-10 |
Red Hat | RHSA-2019:1728-01 | OSP13.0 | python-novajoin | 2019-07-10 |
Red Hat | RHSA-2019:1700-01 | SCL | python27-python | 2019-07-08 |
Red Hat | RHSA-2019:1723-01 | OSP10.0 | qemu-kvm-rhev | 2019-07-10 |
Red Hat | RHSA-2019:1743-01 | OSP13.0 | qemu-kvm-rhev | 2019-07-10 |
Red Hat | RHSA-2019:1699-01 | EL7 | redhat-virtualization-host | 2019-07-08 |
Scientific Linux | SLBA-2019:1651-1 | SL6 | kernel | 2019-07-09 |
SUSE | SUSE-SU-2019:1773-1 | SLE15 | ImageMagick | 2019-07-08 |
SUSE | SUSE-SU-2019:0838-2 | OS7 SLE12 SES4 | bash | 2019-07-06 |
SUSE | SUSE-SU-2019:1733-1 | SLE12 | elfutils | 2019-07-03 |
SUSE | SUSE-SU-2019:14114-1 | SLE11 | firefox, mozilla-nss, mozilla-nspr | 2019-07-04 |
SUSE | SUSE-SU-2019:0048-2 | SLE15 | helm-mirror | 2019-07-04 |
SUSE | SUSE-SU-2019:1744-1 | SLE15 | kernel | 2019-07-04 |
SUSE | SUSE-SU-2019:1802-1 | SLE12 | kernel-firmware | 2019-07-10 |
SUSE | SUSE-SU-2019:1806-1 | SLE12 | libdlm, libqb | 2019-07-10 |
SUSE | SUSE-SU-2019:1398-2 | SLE15 | libpng16 | 2019-07-05 |
SUSE | SUSE-SU-2019:1791-1 | SLE15 | libqb | 2019-07-09 |
SUSE | SUSE-SU-2019:1750-1 | SLE15 | libu2f-host, pam_u2f | 2019-07-04 |
SUSE | SUSE-SU-2019:1749-1 | SLE12 | libu2f-host | 2019-07-04 |
SUSE | SUSE-SU-2019:1746-1 | SLE12 | php5 | 2019-07-04 |
SUSE | SUSE-SU-2019:1783-1 | OS7 OS8 SLE12 SES4 SES5 | postgresql10 | 2019-07-09 |
SUSE | SUSE-SU-2019:1772-1 | OS7 SES4 | python-Pillow | 2019-07-09 |
SUSE | SUSE-SU-2019:1785-1 | MP3.2 SLE12 SES4 SES5 | zeromq | 2019-07-09 |
SUSE | SUSE-SU-2019:14117-1 | SLE11 | zeromq | 2019-07-09 |
SUSE | SUSE-SU-2019:1776-1 | SLE15 | zeromq | 2019-07-09 |
Ubuntu | USN-4048-1 | 16.04 18.04 18.10 19.04 | Docker | 2019-07-08 |
Ubuntu | USN-4051-2 | 14.04 | apport | 2019-07-09 |
Ubuntu | USN-4051-1 | 16.04 18.04 18.10 19.04 | apport | 2019-07-09 |
Ubuntu | USN-4038-4 | 12.04 14.04 | bzip2 | 2019-07-04 |
Ubuntu | USN-4038-3 | 16.04 18.04 18.10 19.04 | bzip2 | 2019-07-04 |
Ubuntu | USN-4049-2 | 12.04 14.04 | glib2.0 | 2019-07-08 |
Ubuntu | USN-4049-1 | 16.04 18.04 18.10 | glib2.0 | 2019-07-08 |
Ubuntu | USN-4053-1 | 16.04 18.04 18.10 19.04 | gvfs | 2019-07-09 |
Ubuntu | USN-4046-1 | 16.04 18.04 18.10 19.04 | irssi | 2019-07-04 |
Ubuntu | USN-4047-1 | 16.04 18.04 18.10 19.04 | libvirt | 2019-07-08 |
Ubuntu | USN-4052-1 | 16.04 18.04 18.10 19.04 | whoopsie | 2019-07-09 |
Ubuntu | USN-4050-1 | 16.04 18.04 18.10 19.04 | zeromq3 | 2019-07-08 |
Kernel patches of interest
Kernel releases
Architecture-specific
Core kernel
Development tools
Device drivers
Device-driver infrastructure
Filesystems and block layer
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Rebecca Sobol