
LWN.net Weekly Edition for September 11, 2015

Automating architecture bootstrapping in Debian

By Nathan Willis
September 10, 2015

DebConf

Debian supports a lengthy list of hardware architectures—twelve on the official list, plus twelve unofficial ports and a variety of other "port-like" projects such as distributions based on non-Linux kernels. Nevertheless, starting a new architecture-support effort involves a lot of repetitive work that Helmut Grohne, among others, thinks could be automated. Grohne presented the topic at DebConf 2015 in Heidelberg, discussing the issues involved when bootstrapping a new architecture and what needs to be improved. The good news is that progress is being made and that the work benefits the rest of the project, even developers with no interest in architecture bootstrapping.

In fact, Grohne started the session by discussing why everyone in Debian should care about automating the architecture-bootstrap process. "Bootstrapping," he said, just means the process of getting the initial, core suite of Debian packages up and running on the new platform. Roughly speaking, that means getting the new architecture to the point where the build-essential metapackage can be used; at that point most other Debian packages can be built on the target system.

The project averages about one new bootstrap per year, he said; arm64 and ppc64el are the most recently added architectures, while mips64el, RISC-V, and OpenRISC are on the horizon. Improving the bootstrapping process would make Debian a more inviting distribution in areas like embedded development, he said, where Debian is often not the OS of choice. It also forces the project to re-examine much of its build-from-source tool set, which might otherwise languish, and could encourage new projects like bootstrapping sub-architectures (for example, creating an x32-optimized port of Debian, or a port that uses the musl C library).

Grohne is the author of rebootstrap, a QA tool for bootstrapping a new architecture. It currently runs on Debian's Jenkins server, testing 20 different architectures about once each week. Each test tries to cross-build about 100 packages, which is only a subset of the packages build-essential pulls in or depends on. Nevertheless, rebootstrap has caught 190 bugs so far (120 of which have been fixed). Grohne plans to expand the package set covered by rebootstrap, but said that one of the lasting benefits of the process is catching and fixing bugs in the core package set.

Cross toolchains and cross-building

He then turned his attention to outlining the steps involved in bootstrapping an architecture, beginning with a description of the cross toolchains used in Debian. Two options are in common usage; both include a version of GCC that can cross-compile for the target architecture, plus target-architecture versions of binutils, glibc, glibc headers, and gcc-defaults. The two toolchains differ in how dependencies are handled: one expects multi-architecture builds to be available on the build system for all dependencies, while the other expects target-architecture versions of all dependencies.

[Helmut Grohne at DebConf]

Both of the approaches work, Grohne said. The toolchain packages are now in Debian unstable (which was not true as recently as two years ago). Today, though, most bootstrapping projects can begin with the back-and-forth GCC/glibc "dance." First the user cross-compiles a minimalist version of GCC for the new architecture, which is then used to build the glibc-header package. Then a bit more of GCC can be built, which in turn allows more of glibc to be built, and so forth.

There are, however, a few architectures where cross toolchain support is still problematic. Alpha and HPPA have glibc conflicts, while OpenRISC, RISC-V, armel, armhf, and SuperH have GCC bugs. Patches are available to fix each of these problems, but they have not yet been merged. Thus, anyone needing to bootstrap or cross-compile on those architectures will need to get the patches from the bug-tracking system and apply them before proceeding. Grohne encouraged anyone who saw their "favorite architecture" on the problematic list to get in touch after the talk.

He then described the process for cross-building an individual package. Thanks to the Emdebian team, some packages have supported cross-building for close to ten years. For the rest, most Debian packages can be cross-built using sbuild or dpkg-buildpackage, so long as the appropriate flags are set to build for the target architecture. What does cause problems, though, is satisfying a package's Build-Depends dependencies when cross-building.
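
As a rough sketch, cross-building a single package for a hypothetical arm64 target could look like this ("cross" and "nocheck" are standard build-profile names, but whether a particular package honors them varies):

    # build binaries for arm64 on an x86 build machine, skipping the test suite
    dpkg-buildpackage -us -uc -a arm64 -Pcross,nocheck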

Problems and solutions

A lot of packages in the Debian archive are multi-architecture, which should allow the build system's version to satisfy Build-Depends for a cross-build. But, in reality, the long chains of transitive dependencies can break down if just one package without multi-architecture support is involved. Grohne said that out of Debian's 20,000 packages, Build-Depends problems mean that only about 3,000 can be automatically cross-built. There is a web page available that monitors the status of the dependency issues; interested developers can check there for packages that need attention.

In many cases, he said, the fixes required to unstick a problematic Build-Depends chain are simple enough—such as rewriting dependency rules that inadvertently assume that the build architecture and host architecture are the same. For example, he said, the dependency rule:

    Build-Depends: g++ (>= 4:5)

is probably meant to specify that the package should be built with a recent version of G++, but the rule is interpreted as a package that needs to be present on the target system. For now, bootstrappers usually solve these problems through a lot of manual effort. Better solutions have been proposed, such as special "compiler for host" packages, which could be specified in dependency rules:

    Build-Depends: g++-for-host (>= 4:5)

A proof-of-concept package for this idea is in Debian experimental.

Interested Debian contributors can also make a significant difference by adding multi-architecture support to more and more packages in the archive. Most of the work required involves straightforward fixes, such as changing compiler references to use target triplets (which allow different build and host architectures).
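
In practice, such a fix often amounts to asking dpkg-architecture for the host triplet rather than hard-coding the compiler name. A minimal debian/rules sketch (the variables are the standard dpkg ones; everything else here is illustrative):

    # Use the triplet-prefixed compiler so that build and host
    # architectures may differ; e.g. aarch64-linux-gnu-gcc for arm64.
    DEB_HOST_GNU_TYPE ?= $(shell dpkg-architecture -qDEB_HOST_GNU_TYPE)
    CC := $(DEB_HOST_GNU_TYPE)-gcc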

There are a few "funky issues" that arise when working on multi-architecture support, however. The most common is encountered in interpreted languages. For example, an "Architecture: any" Perl application may depend on an "Architecture: all" Perl module, which in turn depends on an "Architecture: any" Perl extension. But "all" and "any" are not the same to the dependency resolver. Whereas "all" usually designates a package that will work, unaltered, on any processor (such as a collection of Perl scripts), "any" means that the package can be built for any architecture.

Unfortunately, due to that minor distinction, passing through the "all" architecture rule in the middle of the chain breaks the chain, since the build system's version of the package satisfies that dependency. At that point, the dependency resolver stops looking for packages in the target architecture. The bootstrapping team has not yet decided on a solution to this problem, he said, although there is a workaround: manually changing the "all" to an "any" and adding another rule (Multi-Arch: same) to every dependency in the chain.
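
Concretely, that workaround means a debian/control stanza like the following for each package in the chain (the package name is hypothetical):

    Package: libfoo-perl
    Architecture: any
    Multi-Arch: same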

There are, of course, quite a few other problems encountered when cross-building a large set of packages. Grohne gave multiple examples, some of which raise difficult-to-answer questions. For example, there are some packages that are their own build dependency (he noted cracklib2 and nss in this group) because they expect to access certain data files during the build process, and those files are shipped in the same package as the source code. Fixing that circular dependency without breaking native builds requires careful thought, he said.

Grohne closed the session with a brief status report and some ideas for future development. Bootstrapping a new architecture currently involves about 500 source packages. His rebootstrap tool only tests 100 of those, which means it would require a lot of additional work to be comprehensive. Instead, he has proposed implementing the Build Profiles specification, which would essentially allow developers to define a separate set of build dependencies and compilation targets to be used for cross-builds. If widely implemented, it can reduce the amount of manual tweaking required. The architecture-bootstrapping team has added Build Profile support to a number of core packages already, but more remains to be done.
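
The Build Profiles syntax annotates individual build dependencies with the profiles under which they apply. A sketch, with a hypothetical dependency (the stage1 profile name comes from the specification):

    Build-Depends: debhelper (>= 9), libdb-dev <!stage1>

A profile is then selected at build time, for example with dpkg-buildpackage -Pstage1 or through the DEB_BUILD_PROFILES environment variable, causing the annotated dependency to be skipped.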

At the conclusion of the talk, the audience had quite a few questions for Grohne, most of which focused on the particulars of cross-compilation or of specifying build dependencies. On the whole, it seems as though the Debian community is interested in doing what it can to make cross-building packages more reliable. For developers interested in bringing Debian up from scratch on a new processor architecture, the long-term outlook may be good, but there is considerable work to be done in the days ahead.

[The author would like to thank the Debian project for travel assistance to attend DebConf 2015.]


Realtime KVM

September 10, 2015

This article was contributed by Paolo Bonzini


KVM Forum

Realtime virtualization may sound like an oxymoron to some, but (with some caveats) it actually works and is yet another proof of the flexibility of the Linux kernel. The first two presentations at KVM Forum 2015 looked at realtime KVM from the ground up. The speakers were Rik van Riel, who covered the kernel side of the work (YouTube video and slides [PDF]), and Jan Kiszka, who explained how to configure the hosts and how to manage realtime virtual machines (YouTube video and slides [PDF]). This article recaps both talks, beginning with Van Riel's.

The PREEMPT_RT kernel

Realtime is about determinism, not speed. Realtime workloads are those where missing deadlines is bad: it results in voice breaking up in telecommunications equipment, missed opportunities in stock trading, and exploding rockets in vehicle control and avionics. These applications can have thousands of deadlines a second; the maximum allowed response time can be as low as a few dozen microseconds, and it has to be met 99.999% of the time, if not ... just always. Speed is useful, but guaranteeing this kind of latency bound almost always results in lower throughput.

Nearly every latency source in a system comes from the kernel. For example, a driver could disable interrupts and prevent high-priority programs from being scheduled. Spinlocks are another cause of latency in a non-realtime kernel, because Linux cannot schedule() while holding a spinlock. These issues can be controlled by running a kernel built with PREEMPT_RT, the realtime kernel patch set. A PREEMPT_RT kernel tries hard to make every part of the Linux kernel preemptible, except for short sections of code.

Most of the required changes have been merged into Linus's kernel tree: kernel preemption support, priority inheritance, high-resolution timers, support for interrupt handling in threads, annotation of "raw" spinlocks, and NO_HZ_FULL mode. The PREEMPT_RT patch, while still large, has to do much less than it used to. The three main things it does are: turning non-raw spinlocks into mutexes with priority inheritance, actually running all interrupt handlers in threads so that realtime tasks can preempt them, and providing an RCU implementation that supports preemption.

The main remaining problem is in firmware. System management interrupts (SMIs) for x86 take care of things such as fan speed, even on servers. SMIs cannot be blocked by the operating system and can take up to milliseconds to run in extreme cases. During this time, the operating system is completely blocked from running. There is no solution other than buying hardware that behaves well. A kernel module, hwlatdetect, can help detect the problem; it blocks interrupts on a CPU, looks for unexpected latency spikes, and uses model-specific registers (MSRs) to correlate the spikes to SMIs.
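
A minimal sketch of such a measurement, assuming the hwlatdetect front-end shipped in the rt-tests package (it loads the detector module and reports any gaps above a threshold):

    # watch for SMI-induced gaps for two minutes, reporting anything over 10us
    hwlatdetect --duration=120 --threshold=10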

Realtime virtualization, really?

Now, realtime virtualization may sound implausible, but it can be done. Of course, there are problems: for example, the priority of the tasks in the virtual machine (VM) is not visible to the host and neither are lock holders inside a guest. This limits the scheduler's flexibility and prevents priority inheritance, so all of the virtual CPUs (VCPUs) have to be placed at a very high priority. Only ksoftirqd has a higher priority, since it delivers interrupts to the virtual CPUs. In order to avoid starving the host, systems have to be partitioned between CPUs running system tasks and isolated CPUs (marked with the isolcpus and nohz_full kernel command-line arguments) running realtime guests. The guest has to be partitioned in the same way between realtime VCPUs and those that run generic tasks. The latter could occasionally cause exits to the host user space, which are potentially long and—much like SMIs on bare metal—prevent the guest scheduler from running.
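
As a sketch, reserving CPUs 2 and 3 of a host for realtime VCPU threads could use kernel command-line parameters along these lines (rcu_nocbs, which moves RCU callback processing off the isolated CPUs, was not named in the talk but is commonly added):

    isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3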

Thus, a virtualized realtime guest uses more resources than the same workload running on bare metal, and those resources have to be dedicated to a particular guest. But this can be an acceptable price to pay for the improved isolation, manageability, and hardware compatibility that virtualization provides. In addition, lately each generation of processors has made more and more cores available within one CPU socket; Moore's Law seems to be compensating for this problem, at least for now.

Once the design of realtime KVM had been worked out as above, the remaining piece was to fix the bugs. A lot of the fixes were either not specific to KVM, or not specific to PREEMPT_RT, so they will benefit all realtime users and all virtualization users. For example, RCU was changed to have an extended quiescent state while the guest runs. NOHZ_FULL support was extended to disable the timer tick altogether when running a SCHED_FIFO (realtime) task. In this case, that task will not be rescheduled, because anything with a higher priority would have already preempted it, so the timer tick is not needed. A few knobs were added to disable unnecessary KVM features that can introduce latency, such as synchronization of time from the host to the guest; this can take several microseconds and the solution is simply to run ntpd in the guest.

Virtualization overhead can be limited by using PREEMPT_RT's "simple wait queues" instead of the full-blown Linux wait queues. These only take locks for a bounded time, so that the length of the operations is also bounded (wakeups often happen from interrupt handlers, so their cost directly affects latency). Merging simple wait queues into the mainline kernel is being discussed.

Another trick is to schedule KVM's timers a little in advance to compensate for the overhead of injecting virtual interrupts. It takes a few microseconds for the hypervisor to pass an interrupt down to the guest, and a parameter in the kvm kernel module allows for tuning the adjustment based on the guest's benchmarked latency.
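
The knob in question is presumably the kvm module's lapic_timer_advance_ns parameter, which was added to the kernel around this time; a sketch of setting it via a modprobe configuration file (the value must be benchmarked per guest):

    # /etc/modprobe.d/kvm-rt.conf: fire the virtual APIC timer ~1000ns
    # early to compensate for interrupt-injection overhead
    options kvm lapic_timer_advance_ns=1000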

And finally, new processor technology can help too. This is the case for Intel's "Cache Allocation Technology" (CAT), available on some Haswell CPUs. The combined cost of loads from DRAM and TLB misses can push the cost of a single uncached context switch to over 50 microseconds. CAT allows reserving parts of the cache for specific applications, preventing one workload from evicting another from the cache, and it is controlled nicely through a control-groups-based interface. The patches, however, have not yet been included in Linux.

The results, measured with cyclictest, are surprisingly good. Bare-metal latencies are less than 2 microseconds, but KVM's measurement of 6-microsecond latencies is also a very good result. To achieve these numbers, of course, the system needs to be carefully set up to avoid all kinds of high-latency system operations: no CPU frequency changes, no CPU hotplug, no loading or unloading of kernel modules, and no swapping. The applications also have to be tuned to avoid slow devices (e.g. disks or sound devices) except in non-realtime helper programs. So deploying realtime KVM requires deep knowledge of the system (for example, to ensure the time stamp counter is stable and the system will never fall back to another clock source) and the workload. Some new bottlenecks will be found as people use realtime KVM more, but the work on the kernel side is, in general, proceeding well.
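
A typical cyclictest invocation for this kind of measurement, run inside the guest on an isolated VCPU (the CPU number and duration are illustrative):

    # one mlocked, SCHED_FIFO priority-99 measurement thread pinned to CPU 2,
    # using clock_nanosleep() and running for one hour
    cyclictest -m -n -p99 -t1 -a2 -D 1h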

"Can I have this in my cloud?"

At this point, Van Riel left the stage to Kiszka, who talked more about the host configuration, how to automate it, and how to manage the systems with libvirt and OpenStack.

Kiszka is a long-time KVM contributor who works for Siemens. He started using KVM many years ago to tackle hardware-compatibility problems with legacy software [PDF]. He has been toying with realtime KVM [YouTube] for several years, and people are now asking: "Can I have this in my cloud?".

The answer is "yes", but there are some restrictions. This is not something for a public cloud, of course. Doing realtime control for an industrial plant will not go well if you need to do I/O from some data center far away. "The cloud" here is a private cloud with a fast Ethernet link between the industrial process and the virtual machine. Many features of a cloud environment will also be left behind, because they do not provide deterministic latencies. For example, the realtime path must not use disks or live migration, but this is generally not a problem.

In going beyond the basic configuration that Van Riel had explained, the first thing to look at is networking. Most of QEMU is still protected by a "big QEMU lock", and device passthrough has latency problems too. While progress is being made on these fronts, it's already possible to use a paravirtualized device (virtio-net) together with a non-QEMU backend.

KVM supports two such virtio-net backends, namely vhost-net and vhost-user. vhost-net lives in the kernel; it connects a TAP device from the Linux network stack to a virtio-net device in a virtual machine. However, it does not yet have acceptable latency either. vhost-user, instead, lets any user-space process provide networking, and can be used together with specialized network libraries.

Examples of realtime-capable network libraries include the Data Plane Development Kit (DPDK) and SnabbSwitch. These alternative stacks opt for an aggressive polling strategy; this reduces the amount of event signaling and, as a consequence, latency as well. Kiszka's setup uses DPDK as a vhost-user client; of course, it runs at a realtime priority too. For the client to deliver interrupts to VCPUs in a timely fashion, it has to be placed at a higher priority than the VCPU threads.

Kiszka's application does not have high packet rates, so a single physical CPU is enough to run the switch for all the network interfaces in the systems; more demanding applications might require one physical CPU for each interface.

After prototyping realtime virtualization in the lab, moving it to the data center requires a lot more work. There are hundreds of VMs and many different networks, some of them realtime and some not; all of that needs to be managed and accounted for flexibly. This requires a cloud-management stack, so OpenStack was chosen and extended with realtime capabilities. The reference architecture then includes (from the bottom up): the PREEMPT_RT kernel, QEMU (which has to be there for the guest's non-realtime tasks and to set up the vhost-user switch), the DPDK-based switch, libvirt, and OpenStack. Each host, or "compute node", is set up with isolated physical CPUs as explained in the first half of the talk. IRQ affinities also have to be set explicitly (or through the irqbalance daemon) because, by default, they do not respect the kernel's isolcpus setting. But, depending on the workload, little tuning may be needed and, in any case, the setup is easily replicated if there are many similar hosts. There is also a tool called partrt that helps to set up isolation.
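
Steering interrupts away from the isolated CPUs can be done per IRQ through procfs; a sketch (the IRQ number is hypothetical; the mask 3 allows CPUs 0 and 1 only):

    # route IRQ 42 to the housekeeping CPUs
    echo 3 > /proc/irq/42/smp_affinity
    # make newly registered IRQs default to the housekeeping CPUs too
    echo 3 > /proc/irq/default_smp_affinity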

Libvirt and OpenStack

Higher up comes libvirt, which doesn't require much policy, as it only executes commands from the higher layers. All required tunables are available in libvirt 1.2.13: setting the scheduling parameters (policy, priority, pinning to physical CPUs), asking QEMU to mlock() all guest RAM, and starting VMs connected to vhost-user processes. The consumer for these parameters is OpenStack's compute-node-handling Nova component.
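
In domain XML, those tunables look roughly like this fragment (libvirt 1.2.13 or later; the CPU numbers and priority are illustrative):

    <memoryBacking>
      <locked/>                                  <!-- mlock() all guest RAM -->
    </memoryBacking>
    <cputune>
      <vcpupin vcpu='0' cpuset='2'/>             <!-- pin VCPU 0 to host CPU 2 -->
      <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    </cputune>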

Nova can already be configured to enable VCPU pinning and dedicated physical CPUs. Other settings, though, are missing in OpenStack, and are being discussed in a blueprint. While it is not yet complete (for example, it doesn't support associating non-realtime physical CPUs with non-realtime QEMU threads), the blueprint will enable the usage of the remaining libvirt knobs. Patches for it are being discussed and the target is OpenStack's "Mitaka" release, due in the first half of 2016. Kiszka's team is integrating the patches into its deployment; the team will come up with extensions to the patches and to the blueprint.

OpenStack also controls networking through the Neutron component. However, realtime networks tend to be special: they might not use TCP/IP at all, and Neutron really wants to manage its networks in its own way. Siemens is thus introducing "unmanaged" networks (which provide no DHCP and possibly not even IP) into Neutron.

All in all, work in the higher layers of the stack is mostly about standardizing the basic setup of realtime-capable compute nodes, and a lot of the work will be about improving the tuning process in tools such as partrt. As mentioned during the Q&A session, tuned is also being extended to support a realtime tuning profile. However, Kiszka also plans to take another look lower in the stack; the newest chipsets can route interrupts for directly assigned devices straight into the guest without involving the hypervisor, eliminating the latency that device assignment used to introduce. In addition, Kiszka's older work [PDF] to let QEMU emulate realtime devices could be brought back sometime in the future.


Tor's .onion domain approved by IETF/IANA

By Nathan Willis
September 10, 2015

The Tor project gained an important piece of official recognition this week when two key Internet oversight bodies gave their stamp of approval to Tor's .onion top-level domain (TLD). While .onion has been in use on the Tor network for several years, it was always as a "pseudo-domain" in the past. Its official recognition should make wider interoperability possible (as well as shield the domain from being claimed by a domain registrar).

To recap, Tor first introduced .onion in a 2004 white paper that described how hidden services on the Tor network could be accessed. An application designed for Internet usage (such as a web browser) needs the hostnames of servers to be looked up through a DNS-like mechanism that returns an IP address. The .onion TLD serves the corresponding purpose for a server running on the Tor network rather than on the Internet, but .onion hostnames are substantially different.

The server has a foo.onion hostname, where "foo" is the hash of the server's public encryption key. When the browser sends an HTTPS request to foo.onion, rather than performing a DNS lookup, the Tor proxy looks up the hash in Tor's distributed hash table and, assuming the server is online, gets the address of a Tor "rendezvous" node in return. Tor then contacts the rendezvous node and establishes the connection. The end result is functionally the same as the DNS case—the client gets a working connection to the server—but the .onion protocol makes the connection happen without either endpoint learning about the other's location.
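
Because the proxy, not the client, must perform the lookup, applications reach .onion services by handing the unresolved hostname to Tor's SOCKS port. A sketch using curl, assuming a local Tor daemon listening on the default port 9050:

    # --socks5-hostname passes the name to the proxy instead of doing
    # a local DNS lookup, which would fail (or leak) for .onion
    curl --socks5-hostname 127.0.0.1:9050 https://facebookcorewwwi.onion/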

Informalities

The .onion mechanism works reliably enough that recent years have seen several high-profile service providers add Tor hidden-service entry points. Facebook famously crunched through a massive set of hash calculations before it stumbled onto its easily remembered Tor address, facebookcorewwwi.onion [Tor link]. Search engine DuckDuckGo, news outlet The Intercept, and several other well-known web sites have followed suit (albeit without Facebook's easy-to-memorize hash).

Nevertheless, as long as .onion remained an unofficial TLD, nothing would formally prevent a new registrar from applying to the Internet Corporation for Assigned Names and Numbers (ICANN) to register and manage a .onion TLD on the public Internet. ICANN opened the doors to applications for new TLDs in 2012, and has received several thousand.

There have been other well-known pseudo-domains in years past—readers with long memories may recall .uucp or .bitnet—but those pseudo-domains were never formally specified. ICANN's new policy for accepting open submissions for new TLDs means that such informal conventions are a risky proposition. For example, RFC 6762 lists several TLDs "recommended" for private usage on internal networks, including .home, .lan, .corp, and .internal. Of those, .lan and .internal still seem to be unclaimed, but the ICANN site lists six registrar applications to manage .corp and eleven for the .home domain.

Consequently, Tor's Jacob Appelbaum (along with Facebook engineer Alec Muffett) submitted an Internet Draft proposal to the IETF to have .onion officially recognized as a "special-use domain name." The proposal specifies the expected behavior for application software and domain-name resolvers, and it forbids DNS registrars and DNS servers from interfering with Tor's usage of .onion. Specifically, it requires registrars to refuse any registrations for .onion domain names and it requires DNS servers to respond to all lookup requests for .onion domains with the "non-existent domain" response code, NXDOMAIN. Application software and caching DNS resolvers need to either resolve .onion domains through Tor or generate the appropriate error indicating that the domain cannot be resolved.
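
That behavior is easy to check from a shell; according to the proposal, an ordinary (non-Tor-aware) resolver must answer a query like the following with the NXDOMAIN status:

    dig example.onion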

On September 9, the IETF approved Appelbaum and Muffett's proposal as a Draft RFC, and ICANN's Internet Assigned Numbers Authority (IANA) added .onion to the official list of special-use domain names. That list, unlike RFC 6762, is a formal one; apart from the reverse lookups for the reserved IP-address blocks, only a few domains are included (such as .test, .localhost, .local, .invalid, and several variations of "example").

What's next

The most immediate effect of the approval will likely be that general-purpose software can implement support for .onion, since there is now no concern that the TLD could be "overloaded" in the future by being adopted in a non-Tor setting. Appelbaum, of course, has lobbied the free-software community in recent years to start building in support for Tor as a generic network-transport layer. He proposed the idea at GUADEC 2012, and raised it again at DebConf 2015. Implementing system-wide Tor support would not be trivial, but it is perhaps now a more reasonable request.

In the longer term, though, the official recognition of .onion may have other ripple effects. Facebook's Tor team posted an announcement about the change, and noted that it raises the possibility of getting SSL certificates for .onion domains:

Jointly, these actions enable ".onion" as special-use, top-level domain name for which SSL certificates may be issued in accordance with the Certificate-Authority & Browser Forum "Ballot 144" - which was passed in February this year.

Together, this assures the validity and future availability of SSL certificates in order to assert and protect the ownership of Onion sites throughout the whole of the Tor network....

The CAB Forum ballot linked to by the announcement proposed a set of validation rules for issuing certificates for .onion domains and for certificate authorities (CAs) to sign those certificates. It makes straightforward arguments—namely, that users benefit if site owners can publicly prove their ownership of a .onion address. Apart from Facebook, after all, most .onion URLs are quite difficult to remember.

That said, the forum ballot passed with six "yes" votes from CAs, two "no" votes, and 13 abstentions, plus "yes" votes from three browser vendors. That result can hardly be interpreted as a strong mandate among CAs. In addition, the CAB Forum is not a governing body, so its approval does not necessarily dictate that any particular CA will issue .onion certificates in the future.

Nevertheless, approval for the .onion TLD is undoubtedly a positive sign for Tor and for hidden services in particular. The project can point to it as acceptance that the technology has grown in popularity among Internet users and is a far cry from the "dark web" so often alluded to in the general press. Just as importantly, developers can count on .onion as a stable service-naming scheme, which may lead to interesting new developments down the line.


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds