
LWN.net Weekly Edition for May 17, 2018

Welcome to the LWN.net Weekly Edition for May 17, 2018

This edition contains the following feature content:

  • The 2018 Python Language Summit: an overview of this year's gathering of Python developers.
  • Subinterpreter support for Python: seeking a better multicore story by way of subinterpreters.
  • Modifying the Python object model: Instagram's experiments with changing CPython's data structures for performance.
  • A Gilectomy update: a status report on the effort to remove the global interpreter lock.
  • An introduction to MQTT: a lightweight publish/subscribe protocol for sensors and the Internet of Things.
  • Using user-space tracepoints with BPF: instrumenting applications with USDT probes and the BCC tools.
  • Supporting multi-actuator drives: how should drives with multiple sets of read/write heads present themselves to Linux?
  • XFS online filesystem scrubbing and repair: finding and fixing corruption on mounted filesystems.
  • Autoscaling for Kubernetes workloads: a new API for scaling workloads based on arbitrary metrics.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The 2018 Python Language Summit

By Jake Edge
May 15, 2018

PyCon

Over the past three years, LWN and its readers have gotten a yearly treat in the form of coverage of the Python Language Summit; this year is no exception. The summit is a yearly gathering of around 40 or 50 developers from CPython, other Python implementations, and related projects. It is held on the first day of PyCon, which is two days before the main PyCon talk tracks begin. This year, the summit was held on May 9 in Cleveland, Ohio.

The summit consists of a dozen or so main "talks", which are usually more open-ended and discussion-oriented, rather than simply straight presentations, and a handful of lightning talks, all of which is meant to be crammed into five hours or so. As might be guessed, spillover is inevitable; this year it went three hours beyond its appointed slot. Topics ranged all over the Python landscape: development process issues, performance ideas, deprecations of various sorts, diversity in the development community, static typing, and more.

[Larry Hastings & Barry Warsaw]

After four years of fez-enabled leadership for the summit, Larry Hastings and Barry Warsaw are handing that responsibility off to two new core developers for next year. Łukasz Langa and Mariatta Wijaya will be putting together the next summit. Hopefully LWN will be in Cleveland next year to report on the summit again. PyCon 2019 will be held May 1-9 at the Huntington Convention Center in downtown Cleveland, which is the same spiffy new venue as was used this year.

Here are the sessions:

The group photo was taken by me using Kushal Das's camera:

[Group photo]

[I would like to thank LWN's travel sponsor, the Linux Foundation, for supporting my travel to PyCon and the Python Language Summit.]

Comments (none posted)

Subinterpreter support for Python

By Jake Edge
May 15, 2018

Python Language Summit

Eric Snow kicked off the 2018 edition of the Python Language Summit with a look at getting a better story for multicore Python by way of subinterpreters. We looked at his efforts in this area back in 2015; things have progressed since then. There is more to do, of course, so he is hoping to attract more developers to work on the project.

Snow has been a core developer since 2012 and has "seen some interesting stuff" over that time. He has been working on the subinterpreters scheme for four years or so.

[Eric Snow]

The problem is that programmers expect to be able to take advantage of multiple cores, whether they really need to or not. The Python multicore story is murky, at best, which leads to a perception problem. If you start talking about threads, the global interpreter lock (GIL) rears its head. He got involved in trying to change things after a coworker expressed frustration with Python's multicore story; the coworker indicated that their company was moving away from Python because of its multicore limitations. That got him motivated to try to do something, he said.

So he suggested looking around at other languages' multicore support; JavaScript's web workers are one example of a successful solution in that space. The key attributes of that mechanism are that the workers are independent, isolated from each other, and have efficient means to cooperate.

CPython already has most of a solution along those lines, but it is hidden away in the largely unused subinterpreter feature. Subinterpreters allow multiple Python interpreters to run in a single process and there is the potential for zero-copy data sharing between them. But subinterpreters currently share the GIL, so that needs to change in order to make them multicore friendly. In his opinion, subinterpreters are the best avenue to address the multicore problem for Python. They can do so without breaking backward compatibility with extensions written in C, which is not true of some of the other approaches to better multicore scalability (e.g. PyPy).

There are some missing pieces, most of which are addressed in PEP 554, which is his "Multiple Interpreters in the Stdlib" proposal that is targeting Python 3.8 (which is ostensibly planned for October 2019, though there was discussion of releasing it earlier, later in the summit). There is already a C API for subinterpreters, but it needs to be exposed to Python programs from a module in the standard library. There also needs to be a way to pass data between the interpreters. Both of those are addressed in PEP 554. Another piece is to stop sharing the GIL, which is something that looks "totally doable", Snow said.

Maintaining the isolation between the interpreters and managing the shared resources will be two of the challenges. The sys module contains a lot of state that will need to be compartmentalized. There is a separate effort aimed at cleaning up some of the cruft that has accumulated in CPython over the decades. PEP 432 proposes a restructuring of the CPython startup process, but many of the ideas there would be helpful to the subinterpreter effort. In particular, it consolidates the interpreter's runtime state; all of the static global variables are moved to a single structure. At a minimum that is helpful to get an idea of what all of the global state is.

The only real current user of subinterpreters that Snow is aware of is mod_wsgi, which implements the Python web services gateway interface (WSGI) for the Apache web server. There is also a list of subinterpreter bugs that he showed, which need to be addressed; many of those were reported by the mod_wsgi developers. There are some testing gaps too. A subinterpreters test has been merged for 3.7; Snow hopes that PEP 554 is approved and lands for Python 3.8 with even more tests.

The PEP provides for a shared-nothing concurrency model. It has a minimal Python API in an interpreters module. It also adds channels to pass immutable objects between interpreters. A subinterpreter will retain its state, so the interpreter can be "primed" with modules and other setup in advance of its use. He suggested that those interested should read the PEP, which includes several of the examples that he quickly ran through.
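
As a rough illustration of the proposal, the Python-level API sketched in PEP 554 looks something like the following. This cannot be run on a stock interpreter, since the module is not yet in the standard library, and the names and semantics were still under discussion at the time, so treat it as illustrative only:

    # Hypothetical sketch based on the draft API in PEP 554; names may change.
    import interpreters

    interp = interpreters.create()            # a new subinterpreter in this process
    interp.run("print('hello from a subinterpreter')")

    # The PEP also proposes channels (interpreters.create_channel()) as the
    # mechanism for passing immutable objects between interpreters.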

There are a few blockers for PEP 554, he said. He would like to put an interpreters module out on the Cheese Shop (i.e. the Python Package Index or PyPI) so that he can get more feedback on the implementation. There are some open questions to be addressed and the PEP needs to be updated and reposted—something he hoped to get to before PyCon is over on May 17.

The ultimate goal is to improve and clarify the multicore support in Python as he described in a September 2016 blog post. That was written as something of a post-mortem on the project, when he thought he was ready to give up. He went away and came back; "I'm OK now", Snow said with a chuckle.

His high-level plan is broken up into two phases. The first is to implement PEP 554, expose subinterpreters in Python, and support passing some objects over channels. Part of that would be to improve the isolation enough to make the feature usable, but the key piece of the puzzle is to stop sharing the GIL. Phase two would build on that base; it would allow C extension modules for subinterpreters by getting rid of all of the static globals in the interpreter, turning them into per-interpreter state.

One kind of global state that does need to move in phase one is the allocators, which need to change into per-interpreter allocators. Once that happens, the GIL can follow, so that there will be a GIL per interpreter. There are lots of things that can be done in phase two and beyond, but he is hoping to get some others to help out to reach the goal of subinterpreter support in 3.8. This is not his area of expertise, Snow said, but he recently started working for Microsoft, which generously allows him to work on Python one day a week.

Thomas Wouters noted that the examples passed code to be run in the subinterpreter as a string, which is rather painful to do in a language with significant white space; will there be support for passing functions to be run instead? Snow agreed that it is needed and could be added.

Wouters also wanted to know how many existing users of subinterpreters would be broken once they no longer share the GIL. Programs are known to use the serialization of the GIL to protect their own data structures from concurrent updates, so changing the GIL is likely to lead to unexpected race conditions and the like. Snow acknowledged that and agreed with Larry Hastings that keeping the shared GIL as an option might be the solution for that.

Comments (15 posted)

Modifying the Python object model

By Jake Edge
May 16, 2018

Python Language Summit

At the 2018 Python Language Summit, Carl Shapiro described some of the experiments that he and others at Instagram did to look at ways to improve the performance of the CPython interpreter. The talk was somewhat academic in tone and built on what has been learned in other dynamic languages over the years. By modifying the Python object model fairly substantially, they were able to roughly double the performance of the "classic" Richards benchmark.

Shapiro said that Instagram is a big user of Python and has been looking for ways to improve the performance of the CPython interpreter for its workloads. So the company started looking at the representation of data in the interpreter to see if there were gains to be made there. It wanted to stick with CPython in order to preserve the existing API, ecosystem, and developer experience.

[Carl Shapiro]

He is just a "casual user" of Python, but has a background in language implementations. He has worked on the Java virtual machine (JVM) and on garbage collection for Go, among other language work. He said he is "not afraid of dynamic languages". He started by looking at the Python runtime. His initial impression was that Python is relatively inefficient compared to other high-level languages. But there is nothing so different about Python from those other languages; it shares a lot of common operations with other language runtimes. However, Python has never evolved into a faster implementation as has happened with other languages (e.g. V8 for JavaScript, HotSpot for Java).

He did some data collection with a heavily instrumented CPython interpreter. He compared different kinds of workloads and other languages on those workloads. He also ran the modified interpreter against real traffic data gathered from Instagram. The tests gathered counts of bytecodes used as well as CPU instructions for handling the bytecodes.

To a first approximation, the breakdown of which bytecodes were used the most tracks closely with other languages, which is what would be expected. Roughly 20% of the bytecodes executed are either CALL_FUNCTION or LOAD_ATTR, which are used to call functions and retrieve object attributes, as the names would imply. But handling those bytecodes required nearly 50% of the CPU instructions that were executed. Those two opcodes turn into hundreds of CPU instructions each (the average was 498 instructions for CALL_FUNCTION and 240 for LOAD_ATTR). Those counts are much higher than for the highly optimized versions in other languages.

In addition, when Python is dispatching to a method call, 85% of the time there is only one type involved at a given call site. The interpreter is set up to handle generic method dispatch, but a cache of the most recently used method will work in the vast majority of cases. That percentage rises to 97% if you cache four type/method pairs for the call site. It is not uncommon for high-level languages to have different strategies for a single type versus either four or eight types; call sites that fall outside of those constraints are handled a third way, he said.
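
To illustrate the idea, a per-call-site cache that remembers the last few type/method pairs might look like the toy sketch below. This is pure-Python pseudocode of the general technique; it is not how the Instagram interpreter (or any real interpreter) implements its caches, which live inside the interpreter itself:

    class CallSiteCache:
        """Toy per-call-site cache of (type, function) pairs; descriptor
        corner cases are ignored for simplicity."""

        def __init__(self, size=4):
            self.size = size
            self.entries = []                   # most recently used pairs first

        def lookup(self, obj, name):
            tp = type(obj)
            for cached_type, func in self.entries:
                if cached_type is tp:           # hit: skip the generic lookup
                    return func.__get__(obj, tp)
            func = getattr(tp, name)            # miss: fall back to the slow path
            self.entries.insert(0, (tp, func))
            del self.entries[self.size:]        # keep only the most recent entries
            return func.__get__(obj, tp)

    # With only one type at the call site (the 85% case), every call after
    # the first is a cache hit:
    site = CallSiteCache()
    items = []
    site.lookup(items, "append")(1)
    site.lookup(items, "append")(2)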

Beyond that, the comparison and binary operation implementations were more general than needed by the vast majority of the calls in the Instagram workload. Comparisons and binary operations are normally done on the built-in types (int, dict, str), rather than on user-defined classes. But the code to handle those operations does a lot of extra work to accommodate the dynamic types that are rarely used.

Some of what Shapiro presented did not sit well with Guido van Rossum, who loudly objected to Shapiro's tone, which was condescending, he said. Van Rossum thought that Shapiro did not really mean to be condescending, but that was how he came across and it was not appreciated. The presentation made it sound like Shapiro and his colleagues were the first to think about these issues and to recognize the inefficiencies, but that is not the case. Shapiro was momentarily flustered by the outburst and its vehemence, but got back on track fairly quickly.

Shapiro's overall point was that he felt Python sacrificed its performance for flexibility and generality, but the dynamic features are typically not used heavily in performance-sensitive production workloads. So he believes it makes sense to optimize for the common case at the expense of the less-common cases. But Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.

Continuing, Shapiro said that, for him, the most interesting statistic in the gathered data was on object construction and "monkey patching" (e.g. adding new attributes to an object after its creation). The instrumented interpreter found that 70% of objects have all of their attributes set in the object's __init__() method. Another percent or so have attributes added from the function where the object is initialized. Most of the rest were instances of a single class that is frequently used at Instagram. The implication is that adding object attributes post-initialization is actually not frequent.

In keeping with what has been found in other high-level languages, Python code does not use the dynamic features of the language all that much. But the data structures used in the interpreter overemphasize these dynamic features that are not used that widely, he said. Better native-code generation, which is a common place to look for better performance, does not directly address the performance of data structures that are not optimized for the 90% case. What would the object model look like if it were optimized for the most frequent operations and how much performance would be gained?

A method lookup and call takes two orders of magnitude more instructions than he would like to see. So instead of optimizing the call frames for environment introspection (for stack traces and the like), as it is today, he proposed optimizing for call speed. Also, the garbage collector is reference-count based, which has poor locality. It is part of the C API, though, so reference counting must be maintained for C extensions; in the experiment, the Instagram developers moved to a tracing garbage collector everywhere else.

The representation of an instance in CPython has a hash table for attributes that is optimized for adding and removing attributes. Since that is not done that often in practice, flattening the representation to an array (with an overflow hash table for additional attributes) was tried. The experiments also featured aggressive caching for attributes.
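
CPython already offers an opt-in form of this flattened layout through __slots__, which trades the per-instance attribute dictionary for fixed storage; unlike the experiment described here, though, there is no overflow table, so new attributes cannot be added at all. A small example of the standard feature:

    class PointDict:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    class PointSlots:
        __slots__ = ("x", "y")        # attributes live in fixed slots, no per-instance dict
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p, q = PointDict(1, 2), PointSlots(1, 2)
    print(hasattr(p, "__dict__"), hasattr(q, "__dict__"))   # True False
    # q.z = 3 would raise AttributeError: a slots-only instance cannot grow
    # new attributes after creation.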

The changes made to the interpreter were all at the data structure level; they were conservative changes, overall, Shapiro said. The developers did not go out of their way to optimize any of the changes they made, so there is still a "lot of headroom" in the implementation. They raced the code against CPython 3.6. The caching for attributes and for methods were the most powerful optimizations; each cut the run time roughly in half. The experimental interpreter started at 3.62s versus CPython at 1.32s; that dropped to 1.8s with the attribute caching and 0.72s by adding method caching as well. Other optimizations had a smaller effect, but did get the run time down to 0.65s.

He concluded that the current object data model seems to benefit the uncommon cases more than the common cases—at least for the Instagram workload. He wondered if others had seen that also. There has been a lot of effort on compilation to native code over the years, but just changing the object model provided some big wins; perhaps that is where the efforts should be focused.

Eric Snow asked whether the dictionary versioning was being used to invalidate caches. Shapiro said that the experimental interpreter relies less on dictionaries than CPython does. The instances are now arrays instead of dictionaries, but there is an equivalent signal so that the cache entries can be invalidated. Thomas Wouters asked if he had looked at PyPy. Shapiro said the company had, but there was only a modest bump in performance for its workload. He was not the one who did the work, however. Wouters noted that PyPy is more than "Python with a JIT" because it has its own data model as well.

Mark Shannon said that Python 3.7 has added a feature that should provide a similar boost as the method-lookup caching used in the experiment. Shapiro said he had looked at those changes but still believed his proposed mechanism would provide more benefit. Attribute lookup still requires five lookups in CPython, while it is only one lookup in the experimental version. Shannon did not sound entirely convinced of that, however.

These changes are not "all or nothing", Shapiro said. The experiment showed that there is a lot of headroom in the data structures themselves. Some parts of the changes could be adopted if they proved compelling. One attendee asked how serious Facebook/Instagram is about getting some or all of the changes into CPython. Shapiro said that the company has not released the code for the experiment (yet, though it was a bit unclear when that might change) but that it did try to feed its changes back upstream. It does not want to have a fork of CPython that it runs in production.

Comments (15 posted)

A Gilectomy update

By Jake Edge
May 16, 2018

Python Language Summit

In a rather short session at the 2018 Python Language Summit, Larry Hastings updated attendees on the status of his Gilectomy project. The aim of that effort is to remove the global interpreter lock (GIL) from CPython. Since his status report at last year's summit, little has happened, which is part of why the session was so short. He hasn't given up on the overall idea, but it needs a new approach.

Gilectomy has been "untouched for a year", Hastings said. He worked on it at the PyCon sprints after last year's summit, but got tired of it at that point. He is "out of bullets" at least with that approach. With his complicated buffered-reference-count approach he was able to get his "gilectomized" interpreter to reach performance parity with CPython—except that his interpreter was running on around seven cores to keep up with CPython on one.

[Larry Hastings]

The old adage of "fast, cheap, good; pick any two" has a parallel in the world of the Gilectomy (and other similar projects). In that case the three items are: "high performance", "doesn't break the C API", and "uses multiple cores". CPython doesn't use multiple cores and Gilectomy 1.0 is not high performance, which leads him to consider breaking the C API.

For "Gilectomy 2.0", Hastings will be looking at using a tracing garbage collector (GC), rather than the CPython GC that is based on reference counts. Tracing GCs are more multicore friendly, but he doesn't know anything about them. He also would rather not write his own GC.

This new version of Gilectomy would also have "cheap local object locking". It would distinguish between objects that are only visible in one thread versus those that are visible to two or more. Objects can transition from local to non-local in various ways, some of which are not particularly obvious. It will be difficult to identify all of those, so it is not a change he makes lightly.

His next step will be to get rid of the existing code. He hopes to be able to gain access to the Instagram runtime discussed in the previous summit session and to gilectomize that. Though the code has not been released, he may be able to arrange an agreement with Facebook/Instagram to gain access to the code, he said. Since that version of the interpreter did not break the C API (at least for extensions that Instagram uses), it may mean that Gilectomy 2.0 will actually not have to break that API.

But Thomas Wouters pointed out that there are things that the Instagram experiment is doing that will immediately break big and important Python extensions like NumPy and SciPy. Hastings is a bit sanguine about that: if Gilectomy ever ships in CPython, the core developers can go on vacation for a month and the extension developers will have it all fixed by the time they return. In truth, it would be a great problem to have, but he is so far away from being successful with his Gilectomy efforts that he can't even see it from where he is, he said.

One audience member said that it would be nice to be able to see the Instagram experimental code to determine just what the C API compatibility issues are. Hastings agreed but noted that there was no way to make that happen until the company is ready to do so. There was also a question of how this work fits with the subinterpreter effort. Hastings said that he saw no reason that the Gilectomy and subinterpreters would not play well together.

Comments (12 posted)

An introduction to MQTT

May 10, 2018

This article was contributed by Tom Yates


FLOSS UK
A few years ago, I was asked to put temperature monitoring in a customer's server room and to integrate it with their existing monitoring and notification software. We ended up buying a rack-mountable temperature monitor, for nearly £200, that ran its own web server for propagating temperature data. Although the device ostensibly published data in XML, that turned out to be so painful to parse that we ended up screen-scraping the human-readable web pages to get the data. Temperature sensors are fairly cheap, but by the time you've wrapped them in a case with a power supply, an Ethernet port, a web server, enough of an OS to drive the above, and volatile and non-volatile storage for the same, they get expensive. I was sure that somewhere there must be physically-lightweight sensors with simple power, simple networking, and a lightweight protocol that allowed them to squirt their data down the network with a minimum of overhead. So my interest was piqued when Jan-Piet Mens spoke at FLOSS UK's Spring Conference on "Small Things for Monitoring". Once he started passing working demonstration systems around the room without interrupting the demonstration, it was clear that this was what I'd been looking for.

The lightweight protocol that he'd chosen is MQTT. This is an OASIS-standard (though not yet RFC) protocol for pub/sub (publish/subscribe) messaging and its transport on unreliable networks, the name of which, Mens says, no longer actually stands for anything. The server normally runs on TCP port 1883; the protocol has some idea of security, supporting TLS, authentication, ACLs, and the encryption of payloads, which can be up to 256MB in size. Mens said, though, that the use of payloads approaching that limit is "a little bit theoretical". A simple quality of service (QoS) flag is supported; setting this to 0 gives "fire and forget", or "delivery at most once"; 1 gives assured delivery, or "delivery at least once"; and 2 gives "delivery once only". Higher values, of course, impose higher overheads.

[Jan-Piet Mens]

The protocol uses topic names to distinguish one message from another, though the arrangement of the namespace is entirely up to each implementer. These UTF-8 strings are hierarchical, looking a little like Unix filenames, and both single depth (+) and arbitrary depth (#) wildcards are supported. Examples he gave were temperature/room/living, devices/# (likely meaning "all devices" and including both devices/usb/scanner/0 and devices/scsi/disc/12) and finance/+/eur/rate (perhaps meaning "the euro rate from all providers"). Publishers publish data on the topic of their choice, and subscribers who have subscribed to that part of the namespace are informed of it.
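
For those who want to experiment with the matching rules, the Paho Python client (which appears in the examples below) exposes its wildcard matching as a helper function; here is a quick sketch using the topics from the talk (the "ecb" provider name is just an invented placeholder):

    from paho.mqtt.client import topic_matches_sub

    # '+' matches exactly one topic level, '#' matches any number of remaining levels.
    print(topic_matches_sub("devices/#", "devices/usb/scanner/0"))           # True
    print(topic_matches_sub("devices/#", "devices/scsi/disc/12"))            # True
    print(topic_matches_sub("finance/+/eur/rate", "finance/ecb/eur/rate"))   # True
    print(topic_matches_sub("finance/+/eur/rate", "finance/eur/rate"))       # False: '+' needs one level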

The server software that enables all this is properly called the broker. There are a number of broker implementations, though the one Mens focused on is Mosquitto (the double-t is deliberate, and reflects the protocol's name). Well-behaved brokers, which include Mosquitto, support the concept of bridging; when two brokers are bridged, messages received by one are passed on to the other, so that the other broker's clients may subscribe to them if they choose.

Client-side, things are lightweight. Under Mosquitto, client-side software is limited to mosquitto_sub, for subscribing, and mosquitto_pub, for publishing. There are client implementations in Lua, Python, C, JavaScript, Perl, Ruby, Java, and many others (Mens thinks COBOL doesn't have one, but that's a rare exception). He gave some small (but functional) Python examples, using the Paho MQTT library. The publishing code was one substantive line long, while the subscription example was made eight lines longer by the asynchronous nature of subscription: you know when you're going to publish, but you don't know when somebody else is going to publish to a topic you're subscribed to, so callbacks to handle incoming data must be defined before entering the event loop. So the subscription example looked like:

    import paho.mqtt.client as paho

    # Called once the connection to the broker is established; subscribe to
    # everything one level below "conf/" with QoS 0.
    def on_connect(mosq, userdata, rc):
         mqttc.subscribe("conf/+", 0)

    # Called for each message received on a subscribed topic.
    def on_message(mosq, userdata, msg):
         print "%s %s" % (msg.topic, str(msg.payload))

    mqttc = paho.Client(userdata=None)
    mqttc.on_connect = on_connect
    mqttc.on_message = on_message
    mqttc.connect("localhost", 1883, 60)
    mqttc.loop_forever()
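
For comparison, the publishing side that Mens described as a single substantive line might look roughly like this with the same Paho library (the topic and payload here are just placeholders):

    import paho.mqtt.publish as publish

    publish.single("conf/demo", "hello from a publisher", hostname="localhost", port=1883)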

There's nothing implicitly small or low-volume about MQTT — Mens mentioned that IBM's broker software can handle 15 million messages per second. However, it's seeing a lot of use in the Internet of Things (IoT). Mens here briefly but trenchantly expressed his opinion that IoT devices are best kept off the public internet, encouraging us to deploy in "walled garden" intranets unless we were sure we knew what we were doing.

At this point the demonstration began. Mens first passed around a GL-Inet AR150, a WiFi-enabled "pocket" router costing around €25, powered by an attached USB battery, and running OpenWrt. On top of that, it was running a Mosquitto broker. Mens explained that this demonstration required a broker, so he chose a small and lightweight platform to show that it could be done. But his real interest was in even smaller, cheaper devices that could act as MQTT clients.

To this end he introduced the ESP8266, a low-cost microcontroller with a full TCP stack. He described it as "like an Arduino" — you can use the same IDE for it, and it can be programmed in Lua, MicroPython, and C, among other languages — but it has integrated WiFi and it's really cheap. He showed a slide of several ESP8266-based boards, many of which came from Wemos, and all of which were significantly less than €5. He also showed a slide of an Electrodragon, an ESP8266-based board with a couple of 220V relays inside the package. The applications for home automation, or remote rebooting, are obvious, though he said the quality was on par with its €6 price tag - "the German TÜV would get heart attacks if they saw this". The Sonoff is another ESP8266-based appliance switch he's seen.

The Wemos D1 Mini is an ESP8266-based board costing around €5, that can take a variety of comparably cheap shields to add physical functionality, including a temperature/humidity sensor, and a button. He placed on the podium a Lego enclosure containing a D1 Mini running a counter and driving an Adafruit quadruple seven-segment display, and passed around a D1 Mini with a button shield and a USB battery pack. The latter D1 Mini was running about ten substantive lines of C code from the Homie-ESP8266 project; each button press and each release was published to the broker as a separate MQTT message, on a specific button-related topic. In turn, the ESP8266 driving the counter was subscribing to that topic, and on receiving notification of each new button press, incremented the counter. There was a little sleight-of-hand involving an injunction not to press the button more than once, on pain of buying Mens a gin-and-tonic, plus some code that double-counted each fourth button press, but this generated little protest from the crowd and I am optimistic that Mens later received several drinks.

The ESP8266 also supports over-the-air updates, and Mens has written a lightweight web application to help Homie users organize and, where necessary, update their sensors.

Returning to MQTT, Mens noted that if you have a sensor which reports on an MQTT topic every half-hour, a client that subscribes to that topic may not wish to wait up to 30 minutes to get its first data. MQTT addresses this through "retained" messages; when a client publishes a message to a topic, it can specify that the message is to be retained. The broker will send the most recent retained message on any topic to any subscribing client as soon as the client subscribes to that topic. This can be used to inform new clients both of most-recent readings, and of persistent data such as a sensor's location or name.

Similarly, the protocol supports a "last will and testament" message, which a publishing client can give to a broker to be published in a given topic at some future time, if the publishing client goes silent. If the broker doesn't hear from that client within the specified time, and the last communication with it didn't include a proper disconnection message, then the broker will publish the "last will and testament" message. The protocol also includes a keepalive message that the client can use to reassure the broker that all remains well, even if the client has no actual data payload to be published.
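
Both retained messages and the last will and testament are exposed as options in the Paho client used above; a brief sketch (the topic names are invented for the example):

    import paho.mqtt.client as paho

    mqttc = paho.Client()
    # Last will and testament: the broker publishes this if the client
    # disappears without sending a clean disconnect.
    mqttc.will_set("sensors/livingroom/status", payload="offline", qos=1, retain=True)
    mqttc.connect("localhost", 1883, 60)     # last argument is the keepalive, in seconds

    # A retained reading: new subscribers to this topic get the latest value
    # immediately, rather than waiting for the next report.
    mqttc.publish("sensors/livingroom/temperature", payload="21.5", qos=1, retain=True)
    mqttc.loop(2)                            # let the network loop flush the message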

Mens found that, when processing data collected via MQTT, he was continually writing little utilities to pass the data into other systems. He finally decided to save time and write mqttwarn, a general-purpose tool for redistributing MQTT data into other systems; currently it supports more than 60 such systems, including Facebook, Twitter, and Slack, as well as via more-traditional channels such as ssh, SMTP, and XMPP. There is also check-mqtt, which allows a NAGIOS/ICINGA system to subscribe to a topic and extract data therefrom.

Although Mens has done a lot of work in this space, MQTT is by no means his pet project. It is used widely in the field by other free-software projects (for example, GitHub can publish to MQTT whenever your project receives a pull request or a new issue is opened) and in commercial endeavors (the electricity metering companies Flukso and RemakeElectric use MQTT). It seems to be gaining a lot of traction in spaces including alerting, metering, logging, location awareness, tracking, automation and control, and host monitoring. And probably some time in the next few weeks, it will also be used to monitor the temperature of the air going in and out of my co-located server.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

Comments (17 posted)

Using user-space tracepoints with BPF

May 11, 2018

This article was contributed by Matt Fleming


BPF in the kernel

Much has been written on LWN about dynamically instrumenting kernel code. These features are also available to user-space code with a special kind of probe known as a User Statically-Defined Tracing (USDT) probe. These probes provide a low-overhead way of instrumenting user-space code and provide a convenient way to debug applications running in production. In this final article of the BPF and BCC series we'll look at where USDT probes come from and how you can use them to understand the behavior of your own applications.

The origins of USDT probes can be found in Sun's DTrace utility. While DTrace can't claim to have invented static tracepoints (various implementations are described in the "related work" section of the original DTrace paper), it certainly made them much more popular. With the emergence of DTrace, many applications began adding USDT probes to important functions to aid with tracing and diagnosing run-time behavior. Given that, it's perhaps not surprising that these probes are usually enabled (as part of configuring the build) with the --enable-dtrace switch.

For example, MySQL provides a number of probes to aid database administrators and to help them understand who is connecting to the database, which SQL commands they're running, and low-level details on data transferred between clients and MySQL servers. Other popular tools such as Java, PostgreSQL, Node.js, and even the GNU C Library also come with the option of enabling probes. These probes cover a wide range of activities, from memory allocation to garbage-collection events.

There is a variety of tools on Linux to work with USDT probes. SystemTap is a popular choice and an alternative to DTrace since it's only recently that DTrace has been supported on Linux. Support for USDT probes (termed "statically defined traces" inside the kernel) for perf was merged in v4.8-rc1, and even LTTng has been able to emit USDT-compatible probes since 2012. The most recent additions to the developer's USDT tool chest — and arguably the most user-friendly — are the tools and scripts in the BPF Compiler Collection (BCC).

BCC tools for working with USDT probes

BCC has had support for USDT probes since March 2016, when Sasha Goldshtein sent a pull request to the GitHub project adding support to the existing tplist and trace tools.

The tplist tool allows you to see which probes (if any) are available for the kernel, an application, or a library, and it can be used to discover names of probes to enable with trace. Running it on a version of the C library compiled with SDT support shows:

    # tplist.py -l /lib64/libc-2.27.so
    /lib64/libc-2.27.so libc:setjmp
    /lib64/libc-2.27.so libc:longjmp
    /lib64/libc-2.27.so libc:longjmp_target
    /lib64/libc-2.27.so libc:memory_mallopt_arena_max
    /lib64/libc-2.27.so libc:memory_mallopt_arena_test
    /lib64/libc-2.27.so libc:memory_tunable_tcache_max_bytes
    /lib64/libc-2.27.so libc:memory_tunable_tcache_count
    /lib64/libc-2.27.so libc:memory_tunable_tcache_unsorted_limit
    /lib64/libc-2.27.so libc:memory_mallopt_trim_threshold
    /lib64/libc-2.27.so libc:memory_mallopt_top_pad
    [ ... ]

The -l parameter tells tplist that the file argument is a library or executable. Omitting -l instructs tplist to print the list of kernel tracepoints.

A filter can be applied to the list of tracepoints and probes, which helps to shorten the (potentially very long) default output of tplist. For example, using the filter sbrk prints only those probes with the string "sbrk" in their name. And using the -vv parameter prints the arguments available at the probe site. For example:

    ./tplist.py -vv -l /lib64/libc-2.27.so sbrk
    /lib64/libc-2.27.so libc:memory_sbrk_less [sema 0x0]
    location #1 0x816dd
    argument #1 8 unsigned bytes @ ax
    argument #2 8 signed   bytes @ bp
    /lib64/libc-2.27.so libc:memory_sbrk_more [sema 0x0]
    location #1 0x826af
    argument #1 8 unsigned bytes @ ax
    argument #2 8 signed   bytes @ r12

The argument details are necessary to understand which registers contain function parameters. Knowing the location of arguments allows us to print their contents with the BCC trace tool with a command like:

    # trace.py 'u:/lib64/libc-2.27.so:memory_sbrk_more "%u", arg1' -T
    TIME     PID     TID     COMM            FUNC             -
    21:46:51 12781   12781   ls              memory_sbrk_more 114974720

The trace utility takes a number of arguments and accepts a probe specifier, an example of which was used above. Probe specifiers allow users to describe exactly what they want to be printed when the probe fires. A list of examples (and a more thorough explanation of the format) is provided in the trace_example.txt file in the BCC repository. The output above shows one hit when a process running ls hit the memory_sbrk_more probe.

Additional tools in BCC enable USDT probes for popular high-level languages like Java, Python, Ruby, and PHP — lib/ucalls.py summarizes method calls, lib/uflow.py traces function entry and exit and prints a visual flow graph, lib/ugc.py traces garbage-collection events, lib/uobjnew.py prints summary statistics for new object allocations, and lib/uthreads.py prints details on thread creation. lib/ustat.py is a monitoring tool that pulls all of these together and displays their events with a top-like interface:

    # ustat.py
    Tracing... Output every 10 secs. Hit Ctrl-C to end
    12:17:17 loadavg: 0.33 0.08 0.02 5/211 26284

    PID    CMDLINE      METHOD/s   GC/s   OBJNEW/s   CLOAD/s  EXC/s  THR/s
    3018   node/node    0          3      0          0        0      0

The output above shows that pid 3018, a node process, generated three garbage collection events within a ten-second period. Like most of these scripts, ustat.py runs until interrupted by the user.

In addition to the language-specific tools, BCC also includes specialized scripts for specific applications. For example, bashreadline.py prints commands from all running bash shells:

    # bashreadline.py
    TIME      PID    COMMAND
    05:28:25  21176  ls -l
    05:28:28  21176  date
    05:28:35  21176  echo hello world
    05:28:43  21176  foo this command failed
    05:28:45  21176  df -h
    05:29:04  3059   echo another shell
    05:29:13  21176  echo first shell again

dbslower.py prints database (MySQL or PostgreSQL) operations with a latency that exceeds a specified threshold:

    # dbslower.py mysql -m 1000
    Tracing database queries for pids 25776 slower than 1000 ms...
    TIME(s)        PID          MS QUERY
    1.421264       25776  2002.183 call getproduct(97)
    3.572617       25776  2001.381 call getproduct(97)
    5.661411       25776  2001.867 call getproduct(97)
    7.748296       25776  2001.329 call getproduct(97)

Adding USDT probes to your application

SystemTap provides an API for adding static probes to an application. To create them, you'll need the systemtap-sdt-devel package, which provides the sys/sdt.h header file. The documentation for the SystemTap project provides an example of adding a probe, but we'll add one to a simple C program and use the BCC tools to list and enable the probe:

    #include <sys/sdt.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct timeval tv;

        while(1) {
            gettimeofday(&tv, NULL);
            DTRACE_PROBE1(test-app, test-probe, tv.tv_sec);
            sleep(1);
        }
        return 0;
    }

This simple program runs until interrupted. It fires a probe and then calls sleep() to wait for one second until the loop starts again. The DTRACE_PROBE() macro is used to create probe points at desired locations, in this case, immediately before sleeping. This macro takes a provider name, probe name, and arguments as parameters. There's a separate DTRACE_PROBEn() macro for each argument count. For example, if your probe has three arguments, you need to use DTRACE_PROBE3().

The DTRACE_PROBEn() macros are implemented by placing a no-op assembly instruction at the probe site and writing an ELF note in the application image that includes things like the probe address and name. Since the runtime overhead of an inactive probe is just the cost of executing a no-op instruction, and the ELF note isn't loaded into memory, the impact on performance is minimal.

The provider name allows you to create a namespace for your probe. The most common value (and the one suggested in the SystemTap example) is to use the name of your application or library. In the example above I've used test-app, and the probe name was imaginatively titled, test-probe. The one and only argument to the probe is the time in seconds.

Using tplist, we can see the probe and its argument:

    # tplist.py -vv -l ./simple-c
    simple-c test-app:test-probe [sema 0x0]
    location #1 0x40057b
    argument #1 8 signed   bytes @ ax

We can then construct a probe specifier to print the first argument with trace, assuming the simple-c program above is running:

    # trace.py 'u:./simple-c:test-probe "%u", arg1' -T -p $(pidof simple-c)
    TIME     PID     TID     COMM            FUNC             -
    21:55:44 13450   13450   simple-c        test-probe       1524430544
    21:55:45 13450   13450   simple-c        test-probe       1524430545
    21:55:46 13450   13450   simple-c        test-probe       1524430546

The final column in the output shows the current time, in seconds, when the probe fires. This is information passed as the first argument in the probe declaration. You do need to be aware of the data type of the probe arguments, since it is reflected in the printf()-style format string used in the probe specifier.

Conclusion

USDT probes bring the flexibility of kernel tracepoints to user-space applications. Thanks to the rise of DTrace, many popular applications and high-level programming languages grew support for USDT probes. BCC provides simple tools for working with probes that allow developers to list available probes in libraries and applications, and trace them to print diagnostic data. Adding probes to your own code is possible with SystemTap's API and the collection of DTRACE_PROBE() macros. USDT probes can help you troubleshoot your applications in production with minimal run-time overhead.

Comments (7 posted)

Supporting multi-actuator drives

By Jake Edge
May 15, 2018

LSFMM

In a combined filesystem and storage session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Tim Walker asked for help in designing the interface to some new storage hardware. He wanted some feedback on how a multi-actuator drive should present itself to the system. These drives have two (or, eventually, more) sets of read/write heads and other hardware that can all operate in parallel.

He noted that his employer, Seagate, had invested in a few different technologies, including host-aware shingled magnetic recording (SMR) devices, that did not pan out. Instead of repeating those missteps, Seagate wants to get early feedback before the interfaces are set in stone. He was not necessarily looking for immediate feedback in the session (though he got plenty), but wanted to introduce the topic before discussing it on the mailing lists. Basically, Seagate would like to ensure that what it does with these devices works well for its customers, who mostly use Linux.

[Tim Walker]

The device is a single-port serial-attached SCSI (SAS) drive, with the I/O going to two separate actuators that share a cache, he said. Both actuators can operate at full speed and seek independently; each is usable on a subset of the platters in the device. This is all based on technology that has already been mastered; it is meant to bring parallelism to a single drive. The device would present itself as two separate logical unit numbers (LUNs) and each of the two actuator channels would map to its own LUN. Potential customers have discouraged Seagate from making the device be a single LUN and opaquely splitting the data between the actuators behind the scenes, Walker said.

One problem Walker foresees is that management commands, in particular those that affect the LUN as a whole, such as start and stop commands, could come addressed to either LUN but would affect the entire drive, thus the other LUN. Hannes Reinecke said that it would be better to have a separate LUN that was only for management commands rather than accepting management commands on the data LUNs. If not, though, making the stop commands do what is expected (park the heads if it is just for one LUN or spin down the drive if it is for both) would be an alternative.

Fred Knight said that storage arrays have been handling this situation for years. They have hundreds of LUNs and have just figured it out and made it all work. He noted that, even though it may not be what customers expect, most storage arrays will simply ignore stop commands. The kernel does not really distinguish between drives and arrays, Martin Petersen said; there really is no condition where the kernel would want to stop one LUN and not the other. Knight said that other operating systems will spin down a LUN for power-management reasons, but that the standards provide ways to identify LUNs that are tied together, so there should not be a real problem here.

Ted Ts'o said that a gathering like LSFMM (or the mailing lists) will not provide the full picture. Customers may have their own ideas about how to use this technology; the enterprise kernel developers may be able to guess what their customers might want to do, but that is only a guess. For the cloud, there is an advisory group that will give some input, he said, but it may be harder to get that for enterprises. Ric Wheeler said that he works for an enterprise vendor (Red Hat), which has internal users of disk drives (Ceph and others) that have opinions and thoughts that the company would be willing to share.

From the perspective of a filesystem developer, all of what is being discussed is immaterial; the filesystem developers "don't care about any of this", Dave Chinner said. The storage folks will figure out how and when drives spin up and down (and other things of that nature), but the filesystems will just treat the device as if it were two entirely separate devices. Knight pointed out that there are some different failure modes that could impact filesystems; if the spindle motor goes, both drives are lost, while a head loss will lead to inaccessible data, but that may just be handled with RAID-5, for example.

Ts'o noted that previously there had been "dumb drives and smart arrays", but that now we are seeing things that are between the two. Multi-actuator drives as envisioned by Seagate are just the first entrants; others will undoubtedly come along. It would be nice to standardize some way to discover the topology (spindles, heads, etc.) of these devices. Wheeler added that information about the cache would also be useful.

This device has a shared cache, but devices with split caches might be good, Reinecke said. Kent Overstreet worried that there could be starvation problems if there are different I/O schedulers interacting in the same cache. As time wound down, Walker said that the session provided him with exactly the kind of feedback he was looking for.

Comments (22 posted)

XFS online filesystem scrubbing and repair

By Jake Edge
May 16, 2018

LSFMM

In a filesystem track session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Darrick Wong talked about the online scrubbing and repair features he has been working on. His target has mostly been XFS, but he has concurrently been working on scrubbing for ext4. Part of what he wanted to discuss was the possibility of standardizing some of these interfaces across different filesystem types.

Filesystem scrubbing is typically an ongoing activity to try to find corrupted data by periodically reading the data on the disk. Online repair attempts to fix the problems found by using redundant information (or metadata that can be calculated from other information) stored elsewhere in the filesystem. As described in Wong's patch series, both scrubbing and repair are largely concerned with filesystem metadata, though scrubbing data extents (and repairing them if possible) is also supported. Wong said that XFS now has online scrubbing support, but does not quite have the online repair piece yet.

[Darrick Wong]

Btrfs has support for online scrubbing and ext4 will eventually as well. Wong wondered if there was an opportunity to create a common wrapper for user space. Ted Ts'o said that it would help if there was some clarity about the goals and requirements of a scrubber tool. He asked, is it a cron job that scrubs all the filesystems or might there be individual crontab entries for ext4 and XFS? Clearly the goal should be to make the system administrator's life better.

Chris Mason brought up the CRC checks that the filesystems currently do. When those CRC checks fail, each filesystem logs its own message to dmesg. There is no consistency between the filesystems for that message. Wong recommended that Btrfs return a "filesystem corrupt" error status to user space as ext4 and XFS do, but Mason pointed out that CRC errors are not only found during a filesystem scrubbing.

Kent Overstreet said that he had a framework that could be used for long-running jobs in the kernel. It returns a file descriptor that can be used to monitor the job. Wong said that the XFS scrubbing consists of many ioctl() commands that are called from user space. Overstreet said that sounded harder to deal with. Josef Bacik said that Btrfs is similar to XFS, but that having a single file descriptor might be better.

Dave Chinner wondered if there was a way to have a single scrubbing command that handled any kind of filesystem, so that users do not have to remember how to do it for each type. No one seemed opposed to the idea but getting there may take some time.

When data errors are found, some users may not really want to have the filesystem try to repair things, Ric Wheeler said. Instead they will just want the name of the file containing the error so that they can simply get a copy from another server. That requires mapping the blocks back to a path. He also said that a recent paper showed that, while SSDs will last a lot longer than rotating storage, they will generate many more errors (on the order of 10-15 times more) than rotating storage over that time. So these kinds of problems will become more prevalent.

Another thing that needs to be standardized is the I/O priority that these scanners will run with, Mason said.

Wong suggested starting with a simple common scrubbing wrapper that would do the right thing for each filesystem type. It would just report whether the metadata had errors and whether the data had errors. From that, administrators could then decide how to fix the errors. Chinner said that there needs to be some standard on what errors get returned, but Wong suggested starting with something simple: 0 for OK, 1 to indicate a problem and that the administrator should check the logs for more information. It was generally agreed that would be a reasonable place to start, though Ts'o cautioned there would be a need to eventually standardize more pieces at multiple levels.

Comments (8 posted)

Autoscaling for Kubernetes workloads

May 14, 2018

This article was contributed by Antoine Beaupré


KubeCon EU

Technologies like containers, clusters, and Kubernetes offer the prospect of rapidly scaling the available computing resources to match variable demands placed on the system. Actually implementing that scaling can be a challenge, though. During KubeCon + CloudNativeCon Europe 2018, Frederic Branczyk from CoreOS (now part of Red Hat) held a packed session to introduce a standard and officially recommended way to scale workloads automatically in Kubernetes clusters.

Kubernetes has had an autoscaler since the early days, but only recently did the community implement a more flexible and extensible mechanism for deciding when to add more resources to fulfill workload requirements. The new API integrates not only with the Prometheus project, which is popular in Kubernetes deployments, but also with any arbitrary monitoring system that implements the standardized APIs.

The old and new autoscalers

Branczyk first covered the history of the autoscaler architecture and how it has evolved through time. Kubernetes, since version 1.2, features a horizontal pod autoscaler (HPA), which dynamically allocates resources depending on the detected workload. When the load becomes too high, the HPA increases the number of pod replicas and, when the load goes down again, it removes superfluous copies. In the old HPA, a component called Heapster would pull usage metrics from the internal cAdvisor monitoring daemon and the HPA controller would then scale workloads up or down based on those metrics.

[Frederic Branczyk]

Unfortunately, the controller would only make decisions based on CPU utilization, even though Heapster provides other metrics like disk, memory, or network usage. According to Branczyk, while in theory any workload can be converted to a CPU-bound problem, this is an inconvenient limitation, especially when implementing higher-level service level agreements. For example, an arbitrary agreement like "process 95% of requests within 100 milliseconds" would be difficult to represent as a CPU-usage problem. Another limitation is that the Heapster API was only loosely defined and never officially adopted as part of the larger Kubernetes API. Heapster also required the help of a storage backend like InfluxDB or Google's Stackdriver to store samples, which made deploying an HPA challenging.

In late 2016, the "autoscaling special interest group" (SIG autoscaling) decided that the pipeline needed a redesign that would allow scaling based on arbitrary metrics from external monitoring systems. The result is that Kubernetes 1.6 shipped with a new API specification defining how the autoscaler integrates with those systems. Having learned from the Heapster experience, the developers specified the new API, but did not implement it for any specific system. This shifts responsibility of maintenance to the monitoring vendors: instead of "dumping" their glue code in Heapster, vendors now have to maintain their own adapter conforming to a well-defined API to get certified.

The new specification defines core metrics like CPU, memory, and disk usage. Kubernetes provides a canonical implementation of those metrics through the metrics server, a stripped down version of Heapster. The metrics server provides the core metrics required by Kubernetes so that scheduling, autoscaling, and things like kubectl top work out of the box. This means that any Kubernetes 1.8 cluster now supports autoscaling using those metrics out of the box: for example minikube or Google's Kubernetes Engine both offer a native metrics server without an external database or monitoring system.

In terms of configuration syntax, the change is minimal. Here is an example of how to configure the autoscaler in earlier Kubernetes releases, taken from the OpenShift Container Platform documentation:

    apiVersion: extensions/v1beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: frontend 
    spec:
      scaleRef:
        kind: DeploymentConfig 
        name: frontend 
        apiVersion: v1 
        subresource: scale
      minReplicas: 1 
      maxReplicas: 10 
      cpuUtilization:
        targetPercentage: 80

The new API configuration is more flexible:

    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: hpa-resource-metrics-cpu 
    spec:
      scaleTargetRef:
        apiVersion: apps/v1beta1 
        kind: ReplicationController 
        name: hello-hpa-cpu 
      minReplicas: 1 
      maxReplicas: 10 
      metrics:
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 50

Notice how the cpuUtilization field is replaced by a more flexible metrics field that targets CPU utilization, but can support other core metrics like memory usage.

The ultimate goal of the new API, however, is to support arbitrary metrics, through the custom metrics API. This behaves like the core metrics, except that Kubernetes does not ship or define a set of custom metrics directly, which is where systems like Prometheus come in. Branczyk demonstrated the k8s-prometheus-adapter, which connects any Prometheus metric to the Kubernetes HPA, allowing the autoscaler to add new pods to reduce request latency, for example. Those metrics are bound to Kubernetes objects (e.g. pod, node, etc.) but an "external metrics API" was also introduced in the last two months to allow arbitrary metrics to influence autoscaling. This could allow Kubernetes to scale up a workload to deal with a larger load on an external message broker service, for example.

Here is an example of the custom metrics API pulling metrics from Prometheus to make sure that each pod handles around 200 requests per second:

      metrics:
      - type: Pods
        pods:
          metricName: http_requests
          targetAverageValue: 200

Here http_requests is a metric exposed by the Prometheus server that tracks how many requests each pod is processing. To avoid putting too much load on any single pod, the HPA will spawn or kill pods as needed to keep that number close to the target value.
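The external metrics API mentioned earlier uses a similar stanza. The sketch below is hypothetical: the metric name and selector depend entirely on what the external adapter exposes, in this case the depth of a queue on an external message broker:

      metrics:
      - type: External
        external:
          # metric published by an external monitoring adapter (hypothetical)
          metricName: queue_messages_ready
          metricSelector:
            matchLabels:
              queue: worker_tasks
          # add pods until each one handles about 30 queued messages
          targetAverageValue: 30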

Upcoming features

The SIG seems to have wrapped everything up quite neatly. The next step is to deprecate Heapster: as of 1.10, all critical parts of Kubernetes use the new API, so a discussion is under way in another group (SIG instrumentation) to finish moving away from the older design.

Another thing the community is looking into is vertical scaling. Horizontal scaling is fine for certain workloads, like caching servers or application frontends, but database servers, most notably, are harder to scale by just adding more replicas; in this case what an autoscaler should do is increase the size of the replicas instead of their number. Kubernetes supports this through the vertical pod autoscaler (VPA). It is less practical than the HPA because there is a physical limit to the size of an individual server that the autoscaler cannot exceed, while the HPA can keep scaling out as long as new servers are added. According to Branczyk, the VPA is also more "complicated and fragile, so a lot more thought needs to go into that." As a result, the VPA is currently in alpha. It is not fully compatible with the HPA and is relevant only in cases where the HPA cannot do the job: for example, workloads where there is only a single pod or a fixed number of pods like StatefulSets.
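For the curious, a VPA object in the current alpha API looks roughly like the sketch below; since the project is still alpha, the API group and the field names may well change:

    apiVersion: poc.autoscaling.k8s.io/v1alpha1
    kind: VerticalPodAutoscaler
    metadata:
      name: frontend-vpa
    spec:
      selector:
        matchLabels:
          app: frontend          # pods to resize (hypothetical label)
      updatePolicy:
        updateMode: "Auto"       # apply resource recommendations automatically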

Branczyk gave a set of predictions for other improvements that could come down the pipeline. One issue he identified is that, while the HPA and VPA can scale pods, there is a separate Cluster Autoscaler (CA) that manages nodes, which are the actual machines running the pods. The CA allows a cluster to move pods between the nodes to remove underutilized nodes or create new nodes to respond to demand. It's similar to the HPA, except that the HPA cannot provision new hardware resources like physical machines on its own: it only creates new pods on existing nodes. The idea here is to combine the two projects into a single one to keep a uniform interface for what is really the same functionality: scaling a workload by giving it more resources.

Another hope is that OpenMetrics will emerge as a standard for metrics across vendors. This process seems to be well under way with Kubernetes already using the Prometheus library, which serves as a basis for the standard, and with commercial vendors like Datadog supporting the Prometheus API as well. Another area of possible standardization is the gRPC protocol used in some Kubernetes clusters to communicate between microservices. Those endpoints can now expose metrics through "interceptors" that get executed before the request is passed to the application. One of those interceptors is the go-grpc-prometheus adapter, which enables Prometheus to scrape metrics from any gRPC-enabled service. The ultimate goal is to have standard metrics deployed across an entire cluster, allowing the creation of reusable dashboards, alerts, and autoscaling mechanisms in a uniform system.

Conclusion

This session was one of the most popular of the conference, which shows a deep interest in this key feature of Kubernetes deployments. It was also great to see Branczyk, who is involved with the Prometheus project as well, working on standardization so that other systems can interoperate with Kubernetes.

The speed at which APIs change is impressive; in only a few months, the community upended a fundamental component of Kubernetes and replaced it with a new API that users will need to become familiar with. Given the flexibility and clarity of the new API, however, learning it is a small cost to pay for the ability to express business logic inside such a complex system. Any simplification will surely be welcome in the maelstrom of APIs and subsystems that Kubernetes has become.

A video of the talk and slides [PDF] are available. SIG autoscaling members Marcin Wielgus and Solly Ross presented an introduction (video) and deep dive (video) talks that might be interesting to our readers who want all the gory details about Kubernetes autoscaling.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

Comments (none posted)

Updates in container isolation

May 16, 2018

This article was contributed by Antoine Beaupré


KubeCon EU

At KubeCon + CloudNativeCon Europe 2018, several talks explored the topic of container isolation and security. The last year saw the release of Kata Containers which, combined with the CRI-O project, provided strong isolation guarantees for containers using a hypervisor. During the conference, Google released its own hypervisor called gVisor, adding yet another possible solution for this problem. Those new developments prompted the community to work on integrating the concept of "secure containers" (or "sandboxed containers") deeper into Kubernetes. This work is now coming to fruition; it prompts us to look again at how Kubernetes tries to keep the bad guys from wreaking havoc once they break into a container.

Attacking and defending the container boundaries

Tim Allclair's talk (slides [PDF], video) was all about explaining the possible attacks on secure containers. To simplify, Allclair said that "secure is isolation, even if that's a little imprecise" and explained that isolation is directional across boundaries: for example, a host might be isolated from a guest container, but the container might be fully visible from the host. So there are two distinct problems here: threats from the outside (attackers trying to get into a container) and threats from the inside (attackers trying to get out of a [Tim Allclair] compromised container). Allclair's talk focused on the latter. In this context, sandboxed containers are concerned with threats from the inside; once the attacker is inside the sandbox, they should not be able to compromise the system any further.

Attacks can take multiple forms: untrusted code provided by users in multi-tenant clusters, un-audited code fetched from random sites by trusted users, or trusted code compromised through an unknown vulnerability. According to Allclair, defending a system from a compromised container is harder than defending a container from external threats, because there is a larger attack surface. While outside attackers only have access to a single port, attackers on the inside often have access to the kernel's extensive system-call interface, a multitude of storage backends, the internal network, daemons providing services to the cluster, hardware interfaces, and so on.

Taking those vectors one by one, Allclair first looked at the kernel and said that there were 169 code execution vulnerabilities in the Linux kernel in 2017. He admitted this was a bit of fear mongering; it indeed was a rather unusual year and "most of those were in mobile device drivers". These vulnerabilities are not really a problem for Kubernetes unless you run it on your phone. Allclair said that at least one attendee at the conference was probably doing exactly that; as it turns out, some people have managed to run Kubernetes on a vacuum cleaner. Container runtimes implement all sorts of mechanisms to reduce the kernel's attack surface: Docker has seccomp profiles, but Kubernetes turns those off by default. Runtimes will use AppArmor or SELinux rule sets. There are also ways to run containers as non-root, which was the topic of a pun-filled separate talk as well. Unfortunately, those mechanisms do not fundamentally solve the problem of kernel vulnerabilities. Allclair cited the Dirty COW vulnerability as a classic example of a container escape through race conditions on system calls that are allowed by security profiles.
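To make the earlier point about runtime hardening concrete, here is a rough sketch of a pod specification that turns the default seccomp profile back on and drops root privileges; the image name is hypothetical and the seccomp annotation is still alpha in current releases:

    apiVersion: v1
    kind: Pod
    metadata:
      name: hardened-app
      annotations:
        # Kubernetes disables the runtime's default seccomp profile;
        # this (still alpha) annotation turns it back on for the pod.
        seccomp.security.alpha.kubernetes.io/pod: docker/default
    spec:
      containers:
      - name: app
        image: example/app:1.0           # hypothetical image
        securityContext:
          runAsNonRoot: true             # refuse to start as UID 0
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]                # drop all Linux capabilities

Settings like these narrow the attack surface, but the kernel itself remains a single boundary.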

The proposed solution to this problem is to add a second security boundary. This is apparently an overarching principle at Google, according to Allclair: "At Google, we have this security principle that between any untrusted code and user data there have to be at least two distinct security boundaries; that means two independent security mechanisms need to fail in order for that untrusted code to get at that user data."

Adding another boundary makes attacks harder to accomplish. One such solution is to use a hypervisor like Kata Containers or gVisor. Those new runtimes depend on a sandboxed setting that is still in the proposal stage in the Kubernetes API.

gVisor as an extra boundary

Let's look at gVisor as an example hypervisor. Google spent five years developing the project in the dark before sharing it with the world. At KubeCon, it was introduced in a keynote and a more in-depth talk (slides [PDF], video) by Dawn Chen and Zhengyu He. gVisor is a user-space kernel that implements a subset of the Linux kernel API, but which was written from scratch in Go. The idea is to have an independent kernel that reduces [gVisor] the attack surface; while the Linux kernel has 20 million lines of code, at the time of writing gVisor only has 185,000, which should make it easier to review and audit. It provides a cleaner and simpler interface: no hardware drivers, interrupts, or I/O port support to implement, as the host operating system takes care of all that mess.

As we can see in the diagram to the right (taken from the talk slides), gVisor has a component called the "Sentry" that implements the core of the system-call logic. It uses ptrace() out of the box for portability reasons, but can also work with KVM for better security and performance, as ptrace() is slow and racy. The Sentry can use KVM to map processes to CPUs and provide lower-level support like privilege separation and memory management. He suggested thinking of gVisor as a "layered solution" to provide isolation, as it also uses seccomp filters and namespaces. He explained how it differed from user-mode Linux (UML): while UML is a port of Linux to user space, gVisor actually reimplements the Linux system calls (211 of the 319 x86-64 system calls) using only 64 system calls in the host system. Another key difference from other systems, like unikernels or Google's Native Client (NaCl), is that it can run unmodified binaries. To fix classes of attacks relying on the open() system call, gVisor also forbids any direct filesystem access; all filesystem operations go through a second process, called the Gofer, that enforces access permissions, in another example of a double security boundary.

According to He, gVisor has a 150ms startup time and a 15MB memory overhead, close to Kata Containers in startup time but smaller in memory footprint. He said the approach is good for small containers in high-density workloads. It is not so useful for trusted images (because [Dawn Chen] it's not required), workloads that make heavy use of system calls (because of the performance overhead), or workloads that require hardware access (because that's not available at all). Even though gVisor implements a large number of system calls, some functionality is missing. There is no System V shared memory, for example, which means PostgreSQL does not work under gVisor. A simple ping might not work either, as gVisor lacks SOCK_RAW support. Linux has been in use for decades now and is more than just a set of system calls: interfaces like /proc and sysfs also make Linux what it is. Of those, gVisor currently implements only a subset of /proc, with the result that some containers will not work with gVisor without modification, for now.

As an aside, the new hypervisor does allow for experimentation and development of new system calls directly in user space. The speakers confirmed this was another motivation for the project; the hope is that having a user-space kernel will allow faster iteration than working directly in the Linux kernel.

Escape from the hypervisor

Of course, hypervisors like gVisor are only a part of the solution to pod security. In his talk, Allclair warned that even with a hypervisor, there are still ways to escape a container. He cited the CVE-2017-1002101 vulnerability, which allows hostile container images to take over a host through specially crafted symbolic links. [Zhengyu He] Like native containers, hypervisors like Kata Containers also allow the guest to mount filesystems across the container boundary, so they are vulnerable to such an attack.

Kubernetes fixed that specific bug, but a general solution is still in the design phase. Allclair said that ephemeral storage should be treated as opaque to the host, making sure that the host never interacts directly with image files and just passes them down to the guest untouched. Similarly, runtimes should "mount block volumes directly into the sandbox, not onto the host". Network filesystems are trickier; while it's possible to mount (say) a Ceph filesystem in the guest, that means the access credentials now reside within the guest, which moves the security boundary into the untrusted container.

Allclair outlined networking as another attack vector: Kubernetes exposes a lot of unauthenticated services on the network by default. In particular, the API server is a gold mine of information about the cluster. Another attack vector is untrusted data flows from containers to the user. For example, container logs travel through various Kubernetes components, and some components, like Fluentd, will end up parsing those logs directly. Allclair said that many different programs are "looking at untrusted data; if there's a vulnerability there, it could lead to remote code execution". When he looked at the history of vulnerabilities in that area, he could find no direct code execution, but "one of the dependencies in Fluentd for parsing JSON has seven different bugs with segfault issues so we can see that could lead to a memory vulnerability". As a possible solution to such issues, Allclair proposed isolating components in their own (native, as opposed to sandboxed) containers, which might be sufficient because Fluentd acts as a first trusted boundary.

Conclusion

A lot of work is happening to improve what is widely perceived as defective container isolation in the Linux kernel. Some take the approach of trying to run containers as regular users ("root-less containers") and rely on the Linux kernel's user-isolation properties. Others found that this relies too much on the security of the kernel and use separate hypervisors, like Kata Containers and gVisor. The latter seems especially interesting because it is lightweight and doesn't add much attack surface. In comparison, Kata Containers relies on a kernel running inside the container, which actually expands the attack surface instead of reducing it. The proposed API for sandboxed containers is currently experimental in the containerd and CRI-O projects; Allclair expects the API to ship in alpha as part of the Kubernetes 1.12 release.

It's important to keep in mind that hypervisors are not a panacea: they do not support all workloads because of compatibility and performance issues. A hypervisor is only a partial solution; Allclair said the next step is to provide hardened interfaces for storage, logging, and networking and encouraged people to get involved in the node special interest group and the proposal [Google Docs] on the topic.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

Comments (11 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Firefox sandboxing; Email encryption vulnerabilities; Snap Store security; Rust 1.26; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds