LWN.net Weekly Edition for April 14, 2016
What's new in MythTV 0.28
The MythTV project released its latest stable version, 0.28, on April 11. While there are a few entirely new features worthy of users' attention, most of the changes are incremental improvements. But the improved components include services like Universal Plug and Play (UPnP) support, where MythTV has lagged behind other open-source media centers like Kodi, and the external API, which will hopefully make MythTV more developer-friendly. MythTV remains the most widely used open-source digital video recorder (DVR) but, as cord-cutting trends increase, it will need to offer more functionality to continue attracting users.
It has been just over two and a half years since the last major MythTV release, 0.27. From a functional standpoint, the most significant change to MythTV's core DVR and video-playback features is that the new release has migrated to FFmpeg 3.0 (released in February 2016), thus enabling hardware-accelerated video playback of Google's VP9 codec and providing a much better AAC encoder. VP9 is still not terribly widespread, but the new MythTV release also enables hardware acceleration for H.265 which, like VP9, targets ultra-high-definition video (e.g., 4K resolution).
UPnP and media browsing
UPnP is a specification meant to allow "smart" media-playback devices to discover compatible servers on the local network and automatically find the servers' audio, video, and photo collections. The MythTV back-end server has included UPnP support since version 0.20, but that support has never been particularly strong. The 0.28 release brings it up to speed, providing full compatibility with the current (2014) version of the standard and adding quite a few improvements to the experience of browsing the available media from a UPnP device.
For instance, UPnP supports an "album art" metadata feature; MythTV will now pick up plausibly named album-art images in music folders (e.g., album.png), and it will generate thumbnail images of videos and recorded TV programs (in multiple sizes, which is beneficial for those who may use both a smartphone and a smart TV at varying times). It also allows users to search through a media collection by length, by user-applied rating, and by several general metadata fields (e.g., director, studio, year, season, maturity rating, and so on). When you put those features together, it makes for a considerably more pleasant experience than the old UPnP support offered, which tended to present the user with a flat list of filenames, each accompanied by the same generic icon (designating "music" or "video," for example). Finally, rewind and fast-forward were broken for a number of UPnP client devices in earlier releases; these should all function correctly now.
While more and more devices ship with UPnP support these days (there are at least four devices in my home that can act as a UPnP client, all of them purchased for other functionality), MythTV also acts as a media-center front-end in its own right, providing the "couch ready" playback interface. Among other features, the MythTV front-end provides the same functionality as a UPnP client: browsing video, audio, and image collections. The image-browsing feature, MythGallery, was rewritten for the 0.28 release, bringing the user interface in line with the latest menu and theme updates and allowing multiple front-end devices to access the same image collection simultaneously.
The audio player component, MythMusic, received updates as well, including a lyrics-display option, a significantly refreshed collection of streaming audio services, some new visualizations, better support for metadata fields in FLAC files, and initial support for retrieving track metadata through the MusicBrainz service. Both MythMusic and MythGallery have also been updated to support MythTV's Storage Groups feature. Storage Groups allow MythTV back-ends to transparently use a variety of underlying file servers and disks in a single pool, and allow the user to share the pool between several back-ends.
Linux users accustomed to Logical Volume Management and network-attached storage may not find the feature particularly novel, since that type of functionality is provided at the operating-system level, but it can be useful at the application level, too. One could set up a separate Storage Group for recordings that resides on a machine attached to a TV, while keeping music in another Storage Group closer to a different part of the house. Or one might prefer to keep different types of media in different filesystems or configured for different back-up options. In prior releases, such choices were limited by the fact that MythMusic and MythGallery did not use the Storage Groups feature at all.
Alternative interfaces
The new release includes some less visible work that is likely to be good for MythTV in the long run. The primary example is the continued development of the MythTV Services API, an API framework introduced in version 0.25 that allows external applications to access and even configure a MythTV installation. The Services API is currently used by just a handful of applications (such as remote-control apps for Android phones), but it is a big improvement over the old, XML-based API.
Version 0.28 of the Services API introduces an entirely new Image service (for working with still-image collections, like MythGallery) and includes updates to several others. The DVR service received the most attention; it is now possible to manage almost every facet of a set of recording rules through the API. That means that external applications can provide the proper hooks to set up and manage scheduled recordings, not just display the schedule and play already-recorded items. The Guide service, which hooks into the electronic program guide, was also enhanced; applications can now filter and group channels (which is particularly useful for showing guide data on small screens). And the Frontend service, which controls playback, gained one important new feature: a generic SendKey command, which enables developers to fully customize the playback commands sent. Since MythTV key bindings are configurable, providing only a fixed set of key commands in the Frontend service was a serious limitation.
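To give a rough sense of what driving the Services API looks like from an external application, here is a hedged sketch in C using libcurl that pushes a key press to a front-end through the new SendKey command. The host name, the front-end services port (6547), and the exact parameter spelling are assumptions based on the Services API's general conventions rather than details verified against 0.28, so check the API documentation before relying on them.

    #include <stdio.h>
    #include <curl/curl.h>

    /* Hypothetical example: ask a MythTV front-end to act on a single
     * key press via the Services API's Frontend/SendKey command.  The
     * host, the port (6547), and the "Key" parameter name are
     * assumptions; consult the documentation for the release in use. */
    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);

        CURL *curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://mythfrontend.local:6547/Frontend/SendKey");
        /* Send the key name as a POST parameter. */
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "Key=ESCAPE");

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }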
The last new feature worth pointing out is a still-in-development experiment: a complete rewrite of the MythTV web interface. The new interface is called WebFrontend, and it will eventually replace the existing interface, MythWeb. WebFrontend runs from a built-in web server, while MythWeb required configuring a separate Apache server. Although the design is still in flux, it aims to simplify the setup and configuration tasks presented in the interface, and to provide a better online-video-playback service.
For those who have used MythWeb, this is good news. Although it is functional enough to get by, the old interface could hardly be described as smooth. Furthermore, it had only rudimentary configuration and monitoring options, most implemented by providing direct access to the MythTV back-end's database tables. If something went wrong with a recording, one might be just as likely to make it worse as to fix it by poking around the database in MythWeb. Hopefully, the project will also take this opportunity to make WebFrontend more secure than MythWeb; strong authentication and TLS (neither of which was implemented for MythWeb) would be a welcome start.
Back in March, I expressed some criticism of MythTV for its complexity and awkward management features. It is hard to say at the outset whether or not version 0.28 improves that situation any. MythWeb is a legacy feature that is ripe for removal (as is the old XML-based API now supplanted by the Services API). At the same time, large features like MythMusic and MythGallery seem to still be undergoing periodic rewrites that do not make them any simpler. But perhaps improved UPnP support offers some hope. After all, if a MythTV back-end is perfectly usable through some other, UPnP-based client application, then there is less for the user to worry about. On the whole, though, each new MythTV release still makes progress. It might be slower than some users would like, but it is moving in the right direction.
This is why we can't have safe cancellation points
Signals have been described as an "unfixable design" aspect of Unix. A recent discussion on the linux-kernel mailing list served to highlight some of the difficulties yet again. There were two sides to the discussion, one that focused on solving a problem by working with the known challenges and the existing semantics, and one that sought to fix the purportedly unfixable.
The context for this debate is the pthread_cancel(3) interface in the Pthreads POSIX threading API. Canceling a thread is conceptually similar to killing a process, though with significantly different implications for resource management. When a process is killed, the resources it holds, like open file descriptors, file locks, or memory allocations, will automatically be released.
In contrast, when a single thread in a multi-threaded process is terminated, the resources it was using cannot automatically be cleaned up, since other threads might be using them. If a multi-threaded process needs to be able to terminate individual threads (if, for example, the work they are doing is no longer needed), it must keep track of which resources have been allocated and where they are used. These resources can then be cleaned up, if a thread is canceled, by a cleanup handler registered with pthread_cleanup_push(3). For this to be achievable, there must be provision for a thread to record the allocation and deallocation of resources atomically with respect to the actual allocation or deallocation. To support this, Pthreads introduces the concept of "cancellation points".
These cancellation points are optional and can be disabled with a call to pthread_setcanceltype(3). If the cancel type is set to PTHREAD_CANCEL_ASYNCHRONOUS then a cancellation can happen at any time, which is useful if the thread does not allocate any resources or, indeed, does not make any system calls at all. In this article, though, we'll be talking about the case where cancellation points are enabled.
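To make the resource-tracking requirement concrete, here is a minimal sketch (not taken from any particular library) of a thread that holds a heap allocation while blocked in read(), a cancellation point. The cleanup handler ensures the buffer is freed whether the thread finishes normally or is canceled mid-read.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Cleanup handler: runs if the thread is canceled while the handler
     * is registered, or when it is popped with a non-zero argument. */
    static void free_buffer(void *arg)
    {
        free(arg);
    }

    static void *worker(void *arg)
    {
        (void)arg;

        char *buf = malloc(4096);
        if (buf == NULL)
            return NULL;

        /* Record the allocation before entering any cancellation point. */
        pthread_cleanup_push(free_buffer, buf);

        /* read() is a cancellation point: a pending cancel is acted on
         * here, after which free_buffer() runs and the thread exits. */
        ssize_t n = read(STDIN_FILENO, buf, 4096);
        if (n > 0)
            printf("read %zd bytes\n", n);

        /* Pop and run the handler on the normal exit path as well. */
        pthread_cleanup_pop(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        sleep(1);              /* let the thread block in read() */
        pthread_cancel(t);     /* request cancellation */
        pthread_join(t, NULL);
        return 0;
    }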
On cancellation points and their implementation
From the perspective of an application, a "cancellation point" is any one of a number of POSIX function calls such as open(), read(), and many others. If a cancellation request arrives at a time when none of these functions is running, it must take effect when the next cancellation-point function is called. Rather than performing the normal function of the call, it must call all cleanup handlers and cause the thread to exit. If the cancellation occurs while one of these function calls is waiting for an event, the function must stop waiting. If it can still complete successfully, such as a read() call for which some data has been received but a larger amount was requested, then it may complete and the cancellation will be delayed until the next cancellation point. If the call cannot complete successfully, the cancellation must happen within that call. The thread must clean up and exit and the interrupted function will not return.
From the perspective of a library implementing the POSIX Pthreads API, such as the musl C library (which was the focus of the discussions), the main area of interest is the handling of system calls that can block waiting for an event, and how this interacts with resource allocation. Assuming that pthread_cancel() is implemented by sending a signal (and there aren't really any alternatives), the exact timing of the arrival of the cancellation signal can be significant.
- If the signal arrives after the function has checked for any pending cancellation, but before actually making a system call that might block, then it is critical that the system call is not made at all. The signal handler must not simply return but must arrange to perform the required cleanup and exit, possibly using a mechanism like longjmp().
- If the signal arrives during or immediately after a system call that performs some sort of resource allocation or de-allocation, then the signal handler must behave differently. It must let the normal flow of code continue so that the results can be recorded to guide future cleanup. That code should notice if the system call was aborted by a cancellation signal and start cancellation processing. The signal handler cannot safely do that directly; it must simply set a flag for other code to deal with.
There are quite a number of system calls that can both wait for an event and allocate resources; accept() is a good example as it waits for an incoming network connection and then allocates and returns a file descriptor describing that connection. For this class of system calls, both requirements must be met: a signal arriving immediately before the system call must be handled differently than a signal arriving during or immediately after the system call.
There are precisely three Linux system calls for which the distinction between "before" and "after" is straightforward to manage: pselect(), ppoll(), and epoll_pwait(). Each of these takes a sigset_t argument that lists some signals that are normally blocked before the system call is entered. These system calls will unblock the listed signals, perform the required action, then block them again before returning to the calling thread. This behavior allows a caller to block the cancellation signal, check if a signal has already arrived, and then proceed to make the system call without any risk of the signal being delivered just before the system call actually starts. Rich Felker, the primary author of musl, did lament that if all system calls took a sigset_t and used it this way, then implementing cancellation points correctly would be trivial. Of course, as he acknowledged, "this is obviously not a practical change to make."
Without this ability to unblock signals as part of every system call, many implementations of Pthread cancellation are racy. The ewontfix.com web site goes into quite some detail on this race and its history and reports that the approach taken in glibc is:
    ENABLE_ASYNC_CANCEL();
    ret = DO_SYSCALL(...);
    RESTORE_OLD_ASYNC_CANCEL();
    return ret;
where ENABLE_ASYNC_CANCEL() directs the signal handler to terminate the thread immediately and RESTORE_OLD_ASYNC_CANCEL() directs it to restore the behavior appropriate for the pthread_setcanceltype() setting.
If the signal is delivered before or during the system call this works correctly. If, however, the signal is delivered after the system call completes but before RESTORE_OLD_ASYNC_CANCEL() is called, then any resource allocation or deallocation performed by the system call will go unrecorded. The ewontfix.com site provides a simple test case that reportedly can demonstrate this race.
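For contrast with that racy sequence, here is a minimal sketch of the race-free pattern that pselect(), ppoll(), and epoll_pwait() make possible, as described earlier. The names SIGCANCEL and cancel_pending are placeholders for whatever internal signal and flag a C library would actually use, so treat this as an illustration of the mechanism rather than real libc code.

    #define _GNU_SOURCE        /* for ppoll() */
    #include <poll.h>
    #include <pthread.h>
    #include <signal.h>

    #define SIGCANCEL SIGRTMIN  /* placeholder cancellation signal */

    /* Set by the cancellation signal handler (not shown). */
    static volatile sig_atomic_t cancel_pending;

    int cancellable_wait(struct pollfd *fds, nfds_t nfds)
    {
        sigset_t block_cancel, old_mask, wait_mask;

        sigemptyset(&block_cancel);
        sigaddset(&block_cancel, SIGCANCEL);

        /* Block the cancellation signal and remember the previous mask. */
        pthread_sigmask(SIG_BLOCK, &block_cancel, &old_mask);

        /* The mask used while waiting leaves SIGCANCEL unblocked. */
        wait_mask = old_mask;
        sigdelset(&wait_mask, SIGCANCEL);

        /* Check for a cancel that has already been recorded... */
        if (cancel_pending) {
            pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
            return -1;        /* act on the cancellation here */
        }

        /* ...then wait.  ppoll() installs wait_mask, sleeps, and restores
         * the previous mask atomically, so the signal cannot be delivered
         * in the window between the check above and the start of the wait. */
        int ret = ppoll(fds, nfds, NULL, &wait_mask);

        pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
        return ret;
    }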
A clever hack
The last piece of background before we can understand the debate about signal handling is that musl has a solution for this difficulty that is "clever" if you ask Andy Lutomirski and "a hack" if you ask Linus Torvalds. The solution is almost trivially obvious once the problem is described as above, so it should be no surprise that the description was developed with the solution firmly in mind.
The signal handler's behavior must differ depending on whether the signal arrives just before or just after a system call. The handler can make this determination by looking at the code address (i.e., the instruction pointer) that control will return to when the handler completes. The details of getting this address may require poking around on the stack and will differ between architectures, but the information is reliably available.
As Lutomirski explained when starting the thread, musl uses a single code fragment (a thunk) like:
    cancellable_syscall:
            test whether a cancel is queued
            jnz cancel_me
            int $0x80
    end_cancellable_syscall:
to make cancellable system calls. ("int $0x80" is the traditional way to enter the kernel for a system call; it behaves like a software interrupt.) If the signal handler finds the return address to be at or beyond cancellable_syscall but before end_cancellable_syscall, then it must arrange for termination to happen without ever returning to that code or letting the system call be performed. If the return address has any other value, then the handler must record that a cancel has been requested so that the next cancellable system call can detect that and jump to cancel_me.
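A hedged sketch of such a handler for Linux/x86-64 follows. The symbols cancellable_syscall, end_cancellable_syscall, cancel_me, and cancel_pending are placeholders matching the thunk above (musl's real implementation uses its own internal names and covers every supported architecture), and diverting the saved instruction pointer is just one way to keep the thunk from being resumed.

    #define _GNU_SOURCE        /* for REG_RIP */
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <ucontext.h>

    #define SIGCANCEL SIGRTMIN  /* placeholder cancellation signal */

    /* Labels assumed to be exported by the assembly thunk shown above. */
    extern const char cancellable_syscall[], end_cancellable_syscall[];
    extern const char cancel_me[];

    volatile sig_atomic_t cancel_pending;

    static void cancel_handler(int sig, siginfo_t *info, void *ctx)
    {
        ucontext_t *uc = ctx;
        uintptr_t ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];

        (void)sig; (void)info;

        if (ip >= (uintptr_t)cancellable_syscall &&
            ip < (uintptr_t)end_cancellable_syscall) {
            /* The signal arrived at or beyond cancellable_syscall but
             * before end_cancellable_syscall: divert the return address
             * so that code is never resumed and the system call is not
             * allowed to proceed; cancellation processing runs instead. */
            uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)cancel_me;
        } else {
            /* Anywhere else: just record the request; the next cancellable
             * system call will notice the flag and jump to cancel_me. */
            cancel_pending = 1;
        }
    }

    void install_cancel_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = cancel_handler;
        sa.sa_flags = SA_SIGINFO;     /* no SA_RESTART: let syscalls abort */
        sigfillset(&sa.sa_mask);
        sigaction(SIGCANCEL, &sa, NULL);
    }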
This "clever hack" works correctly and is race free, but it is not perfect. Different architectures have different ways to enter a system call, including syscall on x86_64 and svc (supervisor call) on ARM. For 32-bit x86 code there are three possibilities, depending on the particular hardware: int $0x80 always works but is not always the fastest, while the syscall and sysenter instructions may be available and are significantly faster. The preferred way to make system calls on a 32-bit x86 CPU is therefore to make an indirect call through the __kernel_vsyscall() entry point in the "vDSO" virtual system call area; that function will use whichever instruction is best for the current platform. If musl tried to use this for cancellable system calls it would run into difficulties, though, as it has no way to know where the system call instruction is, or to be certain that any other instructions that run before the system call are all located before that instruction in memory. So musl currently uses int $0x80 on 32-bit x86 systems and suffers the performance cost.
Cancellation for faster system calls
Now, at last, we come to Lutomirski's simple patch that started the thread of discussion. This patch adds a couple of new entry points to the vDSO; the important one here is pending_syscall_return_address, which determines whether the current signal arrived during __kernel_vsyscall() handling and reports the address of the system call instruction. The caller can then determine whether the signal happened before, during, or after that system call.
Neither Torvalds nor Ingo Molnar liked this approach, though their exact reasons weren't made clear. Part of the reason may have been that the semantics of cancellation appear clumsy, so it is hard to justify much effort to support them. According to Molnar, "it's a really bad interface to rely on". Even Lutomirski expressed surprise that musl "didn't take the approach of 'pthread cancellation is not such a great idea -- let's just not support it'." Szabolcs Nagy's succinct response, "because of standards", seemed to settle that issue.
One clear complaint from Molnar was that there was "so much complexity", and it is true that the code would require some deep knowledge to fully understand. This concern is borne out by the fact that Lutomirski, who has that knowledge, hastily withdrew his first and second attempts. While complexity is best avoided where possible, it should not, by itself, be a justification for keeping something out of Linux.
Torvalds and Molnar contributed both by exploring the issues to flesh out the shared understanding and by proposing extra semantics that could be added to the Linux signal facility so that a more direct approach could be used.
Molnar proposed "sticky signals" that could be enabled with an extra flag when setting up a signal handler. The idea was that if the signal is handled other than while a system call is active, then the signal remains pending but is blocked in a special new way. When the next system call is attempted, it is aborted with EINTR and the signal is only then cleared. This change would remove the requirement that the signal handler must not allow the system call to be entered at all if the signal arrives just before the call, since the system call would now immediately exit.
Torvalds's proposal was similar but involved "synchronous" signals. He saw the root problem as being that signals can happen at any time, which is what leads to races. If a signal were marked as "synchronous" then it would only be delivered during a system call. This is exactly the effect achieved with pselect() and friends and so could result in a race-free implementation.
The problem with both of these approaches is that they are not selective in the correct way. POSIX does not declare all system calls to be cancellation points and, in fact, does not refer to system calls at all; only certain API functions are defined as cancellation points. Torvalds clearly agreed that being able to use the faster system-call entry made available in the vDSO was important, but neither he nor Molnar managed to provide a workable alternative to the solution proposed by Lutomirski.
Felker made his feelings on the progress of the discussion quite clear.
It is certainly important to get the best design, and exploring alternatives to understand why they were rejected is a valid part of the oversight provided by a maintainer. When that leads to the design being improved, we can all rejoice. When it leads to an understanding that the original design, while not as elegant as might be hoped, is the best we can have, it shouldn't prevent that design from being accepted. Once Lutomirski is convinced that he has all the problems resolved, it is to be hoped that a re-submission results in further progress towards efficient race-free cancellation points. Maybe that would even provide the incentive to get race-free cancellation points in other libraries like glibc.
LXD 2.0 is released
LXD is a relatively new entrant in the container-management arena; the project started roughly a year and a half ago. It provides a REST-based interface to Linux containers as implemented by the LXC project. LXD made its 2.0 release on April 11, which is the first production-ready version.
At its heart, LXD is a daemon that provides a REST API to manage LXC containers. It is called a "hypervisor" for containers and seeks to replicate the experience of using virtual machines but without the overhead of hardware virtualization. LXC containers are typically "system containers" that look similar to an OS running on bare metal or a virtual machine, unlike Docker (and other) container systems that focus on "application containers". The intent is to build a more user-friendly approach to containers than what is provided by LXC.
The REST API is the only way to talk to the LXD daemon. Unless it is configured to listen for remote connections, it simply opens a Unix socket for local communication. Then the lxc command-line tool can be used to configure both the daemon and any containers that will be run on the system. For remote connections, TLS 1.2 with a "very limited set of allowed ciphers" is used.
The easiest ways to get started with LXD are all based on Ubuntu systems, which is not surprising given that Canonical is the main sponsor of the project. There are provisions for other distributions (Gentoo, presently) and for building the Go code from source, though. There is also an online demo that can be used to try out LXD from a web browser.
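For a sense of what talking to the daemon looks like, here is a hedged sketch in C using libcurl that lists containers over the local Unix socket mentioned above. The socket path (/var/lib/lxd/unix.socket) and the /1.0/containers endpoint reflect a typical LXD 2.0 install on Ubuntu; both should be treated as assumptions on other setups.

    #include <stdio.h>
    #include <curl/curl.h>

    /* Minimal sketch: list containers through LXD's local REST API.
     * The socket path and endpoint are assumptions for a typical
     * LXD 2.0 installation. */
    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);

        CURL *curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        /* Talk to the daemon over its Unix socket rather than TCP. */
        curl_easy_setopt(curl, CURLOPT_UNIX_SOCKET_PATH,
                         "/var/lib/lxd/unix.socket");
        /* The host part of the URL is ignored when a Unix socket is set. */
        curl_easy_setopt(curl, CURLOPT_URL, "http://lxd/1.0/containers");

        CURLcode res = curl_easy_perform(curl);   /* JSON goes to stdout */
        if (res != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }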
The LXD daemon uses a number of kernel technologies to make the containers it runs more secure. For example, it uses namespaces and, in particular, user namespaces to separate the container users from those of the system at large. As outlined in lead developer Stéphane Graber's Introduction to LXD (which is part of his still-in-progress series on LXD), one of the core design principles "was to make it as safe as possible while allowing modern Linux distributions to run inside it unmodified".
Beyond namespaces, AppArmor is used to add restrictions on mounts, files, sockets, and ptrace() access to prevent containers from interfering with each other. Seccomp is used to restrict certain system calls. In addition, Linux capabilities are used to prevent containers from loading kernel modules and other potentially harmful activities.
While control groups (cgroups) are used to prevent denial-of-service attacks from within containers, they can also be used to parcel up the resources (e.g. CPU, memory) of the system among multiple containers. Another entry in Graber's series shows the kinds of limits that can be imposed on containers for disk, CPU, memory, network I/O, and block I/O.
A container in LXD consists of a handful of different pieces. It has a root filesystem, some profiles that contain configuration information (e.g. resource limits), devices (e.g. disks, network interfaces), and some properties (e.g. name, architecture). The root filesystems are all image-based, which is something of a departure from the template-based filesystems that LXC uses. The difference is that instead of building the filesystem from a template when the container is launched (and possibly storing the result), LXD uses a pre-built filesystem image that typically comes from a remote image server (and is then cached locally).
These images are generally similar to fresh distribution images like those used for VMs. LXD is pre-configured with three remote image servers (for Ubuntu stable, Ubuntu daily builds, and a community-run server that has other Linux distributions). The images themselves are identified with an SHA-256 hash, so a specific image or simply the latest Ubuntu stable or daily build can be requested. Users can add their own remote LXD image servers (either public or private) as well.
Profiles provide a way to customize the container configuration and devices. A container can use multiple profiles, which are applied in order, with later profiles potentially overwriting earlier configuration entries. In addition, local container configuration is applied last for configuration entries that only apply to a single container, so they do not belong in the profiles. By default, LXD comes with two profiles, one that simply defines an "eth0" network device and a second that is suitable for running Docker images.
LXD uses Checkpoint/Restore In Userspace (CRIU) to allow snapshotting containers, either to restore them later on the same host or to migrate them elsewhere to be restored. These container snapshots look much the same as regular containers, but they are immutable and contain some extra state information that CRIU needs to restore the running state of the container.
LXD needs its own storage back-end for containers and images. Given the announcement that Canonical will be shipping ZFS with Ubuntu 16.04, it will not come as a surprise that the recommended filesystem for LXD is ZFS. But other options are possible, as described in another post in the series. In particular, Btrfs and the logical volume manager (LVM) can be used.
LXD can scale beyond just a single system running multiple containers; it can also be used to handle multiple systems each running LXD. But for huge deployments, with many systems and thousands of containers, there is an OpenStack plugin (nova-lxd) that provides a way for the OpenStack Nova compute-resource manager to treat LXD containers like VMs. That way, LXD can be integrated into OpenStack deployments.
As befits the "production-ready" nature of the release, LXD 2.0 has a stable API. Until June 2021, all of the existing interfaces will be maintained; any additions will be done using extensions that clients can discover. In addition, there will be frequent bug-fix releases, with backports from the current development tree.
There is a fair amount of competition in the container-management (or orchestration) world these days. Kubernetes, Docker Swarm, Apache Mesos, and others are all solving similar problems. LXD looks like it could have a place in that ecosystem, especially given the strong support it is receiving from Canonical and Ubuntu. For those looking for a container manager, taking a peek at LXD 2.0 may be well worth the time.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: A new stable security tree; New vulnerabilities in kernel, samba, xen, ...
- Kernel: Tracepoints with BPF; Background writeback; Static code checks for the kernel.
- Distributions: OpenBMC, a distribution for baseboard management controllers, CoreOS "Ignition", OpenStack Mitaka, ...
- Development: Python looks at paths; LXD 2.0; WordPress 4.5; Libinput configuration; ...
- Announcements: Let's Encrypt is no longer "beta", FSFE: Joint Statement on the Radio Lockdown Directive, articles by Moglen and Stallman, ...