
LWN.net Weekly Edition for March 27, 2014

Kicking the tires of Ford's OpenXC

By Nathan Willis
March 26, 2014

OpenXC is a software platform created by the US auto maker Ford. The goal is to enable developers to write applications that integrate with the data buses in Ford vehicles. To that end, the company has rolled out a set of open APIs and has defined a family of hardware interfaces to bridge the vehicle-to-computer gap. I recently got the chance to work with one of the hardware interfaces, and, while the capabilities of OpenXC are limited in scope, the project certainly hits the right notes in terms of software freedom and usability. The long-term challenge will be seeing how OpenXC cooperates with the larger open source automotive software ecosystem.

Hardware

Ford launched OpenXC in January of 2012 in a collaborative effort with Bug Labs, who designed the hardware interface modules. Officially called Vehicle Interfaces (VIs), these modules plug into a car's On-Board Diagnostic (OBD-II) port. On the inside, the VI runs a car-specific firmware that reads from the data bus and translates raw OBD-II messages into JSON objects (in OpenXC's own message format), which it then sends out over Bluetooth.
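For the curious, a single measurement in the OpenXC message format is a small JSON object along these lines (the values here are made up for illustration):

    { "name": "vehicle_speed", "value": 42.0 }
    { "name": "headlamp_status", "value": false }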

That setup all sounds good, but naturally one cannot get started with it without a compatible VI (a generic OBD-II dongle is not a substitute, since it does not output OpenXC messages). In mid-2013, when I first started looking at OpenXC, the official VIs were out of production and there was no expected delivery date for more. Nevertheless, I signed up to be notified if new reference VIs were released, and in early March received an email announcing that a fresh batch was available for purchase. When the ordered VI arrived, I took it for a test drive.

[OpenXC VI attached in car]

The good news is that it is entirely possible to build one's own VI, should the current batch run out again, using off-the-shelf components and the open hardware chipKIT Max32, an Arduino-compatible microcontroller. That option is not an inexpensive one, however, and requires some skill with assembling electronics. Whichever hardware route one takes, the VI must be flashed with firmware tailored to the vehicle on which it will be used. The list of supported vehicles is not long—currently 35 Ford models, the oldest being from 2005. But the firmware is BSD-licensed and the source is available on GitHub.

With the hardware in hand, one can attach it to the car's OBD-II port, pair with it over Bluetooth, and run a quick smoke test to make sure that data is indeed getting sent from the VI.

Development

OpenXC supports both Android and Python development. The Android API is accessible from an open source library that supports Android 2.3 through 4.4. The API provides a high-level VehicleManager service that returns vehicle data objects relayed from the VI. There is also an auxiliary app called OpenXC Enabler that starts the vehicle service and pairs with the VI whenever the Android device is booted.

The Python interface is a bit lower-level; a module is available (just for Python 2.6 and 2.7 at the moment) that provides classes for the supported OpenXC data types as well as callbacks, basic logging functionality, and a facility to relay vehicle data to a remote HTTP server. The Python module also supports connecting to the VI over a serial or USB connection (as opposed to Bluetooth alone), and there is a set of command-line tools that can be used to monitor, dump, and trace through messages.

Both the Android and Python interfaces provide access to the same set of vehicle data, which is limited to a set of measurements (e.g., vehicle_speed, accelerator_pedal_position, headlamp_status, etc.) and a set of events (at the moment, just steering-wheel button presses). Exactly which measurements and events are available varies with the vehicle model; my own test car is a Mustang, which does not support the full set (omitting such measurements as steering_wheel_angle and fuel_level).
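To give a sense of how little code is involved, here is a minimal sketch (standard-library Python only) that scans a recorded vehicle trace, assumed to be one JSON message per line as in the published trace files, and reports the highest vehicle_speed reading; the file name is hypothetical:

    import json

    max_speed = 0.0
    with open('drive-trace.json') as trace:   # hypothetical trace file name
        for line in trace:
            line = line.strip()
            if not line:
                continue
            message = json.loads(line)
            # Each line is expected to be one OpenXC message, e.g.
            # {"name": "vehicle_speed", "value": 42.0, "timestamp": ...}
            if message.get('name') == 'vehicle_speed':
                max_speed = max(max_speed, float(message['value']))

    print('Maximum observed speed: %.1f km/h' % max_speed)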

But Ford's documentation is detailed enough that the same set of measurements could also be read by a simpler device that plugged into the same OBD-II port; there are in fact other devices that expose much the same functionality to other automotive Linux projects. Tizen IVI, for example, can utilize the OBDLink MX to monitor the same data. There are other devices that can not only read messages from the vehicle's Controller Area Network (CAN) bus (which, for the record, is the bus protocol used by Ford behind the OBD-II port), but can write to it as well.

Merging with the automotive ecosystem

Given that there are other options, the question for developers is whether or not OpenXC offers advantages over interfacing directly with the vehicle bus. It certainly does simplify Android app development—and, if there were any lingering doubt about its usability, the project maintains a list of known OpenXC apps, which cover uses ranging from indicating the optimal shift moment (a relatively straightforward calculation) to parking collision warnings (which necessitates significant hardware modifications, including the addition of a camera). Hooks are also in place to tie in to Android devices' other useful hardware features (such as location data).

The app-development story, however, is limited to Android, at least for now. Apple iOS support appears to be out of the question, since (as it was explained on the mailing list) Apple requires the presence of an Apple authentication chip in order to make low-level Bluetooth connections. As one might expect, there has so far been no word about any of the alternative mobile platforms (Firefox OS, Ubuntu, or even Windows Phone).

The case for Python development with OpenXC is not quite as clear-cut. The Python interface is designed to run on a more full-featured computer, which puts it into competition with in-vehicle infotainment (IVI) projects like GENIVI and Tizen IVI. An application-level API for accessing vehicle data is certainly appealing, but it could be a hard sell for Ford to convince other automakers to adopt OpenXC's message format and semantics rather than to draft something that is vendor-neutral from day one. Several pages on the project web site suggest that Ford is interested in pursuing this goal, encouraging users to (for example) tell other car makers that they want to see OpenXC support.

And there is competition. The most plausible candidate for such a vendor-neutral API is probably the World Wide Web Consortium (W3C) Automotive Business Group's Vehicle API, which recently became a draft specification. Ford does not have any official representatives sitting in with the group, although there are a few individuals representing themselves who are Ford employees. Most of the IVI development projects are also designed to support sending messages over the automotive data bus, which OpenXC does not yet address. There are hints in the project wiki and mailing list that further expansion is a possibility, but competing projects like Tizen IVI's Automotive Message Broker have already implemented message-writing functionality.

[OpenXC VI size comparison]

OpenXC also has a hurdle to overcome on the hardware side. The reference VI is quite large, especially given the fact that it must fit into a socket which, by informal industry standard, juts down below the steering column toward the driver's legs. OBD-II ports are not designed for permanent hardware installation, both due to their location and the fact that most of them do not latch into place; an innocent knee can easily dislodge the VI several times over the course of a drive.

That said, OpenXC is a commendable developer program for several reasons. The entire stack, including the VI firmware, is available as free software, which is refreshing and—judging from the list traffic—also seems to attract quite a few developers. In contrast, for example, support for the OBDLink dongle mentioned earlier had to be reverse-engineered by interested open source developers. Ford has also released quite a bit of information about its vehicles and their CAN bus usage to the public, as well as other useful resources like vehicle data traces. In an industry as competitive as the automotive business, this level of openness is not the norm; the developer programs of other car makers generally require some type of non-disclosure agreement and maintain a tighter grip on their specifications.

The main goal of OpenXC seems to be making it possible to write third-party applications that utilize Ford vehicle data. That is, OpenXC is not aimed at system integrators or equipment manufacturers, but at individual, do-it-yourself hackers. On that front, it is easy to get started with, and worth exploring if one has a compatible car. Whether OpenXC can grow to encompass broader uses and more ambitious goals remains to be seen.


PostgreSQL pain points

By Jonathan Corbet
March 26, 2014

2014 LSFMM Summit

The kernel has to work for a wide range of workloads; it is arguably unsurprising that it does not always perform as well as some user communities would like. One community that has sometimes felt left out in the cold is the PostgreSQL relational database management system project. In response to an invitation from the organizers of the 2014 Linux Storage, Filesystem, and Memory Management Summit, PostgreSQL developers Robert Haas, Andres Freund, and Josh Berkus came to discuss their worst pain points and possible solutions.

PostgreSQL is an old system, dating back to 1996; it has a lot of users running on a wide variety of operating systems. So the PostgreSQL developers are limited in the amount of Linux-specific code they can add. It is based on cooperating processes; threads are not used. System V shared memory is used for interprocess communication. Importantly, PostgreSQL maintains its own internal buffer cache, but also uses buffered I/O to move data to and from disk. This combination of buffering leads to a number of problems experienced by PostgreSQL users.

Slow sync

[Robert Haas]

The first problem described is related to how data gets to disk from the buffer cache. PostgreSQL uses a form of journaling that they call "write-ahead logging". Changes are first written to the log; once the log is safely on disk, the main database blocks can be written back. Much of this work is done in a "checkpoint" process; it writes log entries, then flushes a bunch of data back to various files on disk. The logging writes are relatively small and contiguous; they work fairly well, and, according to Andres, the PostgreSQL developers are happy enough with how that part of the system works on Linux.

The data writes are another story. The checkpoint process paces those writes to avoid overwhelming the I/O subsystem. But, when it gets around to calling fsync() to ensure that the data is safely written, all of those carefully paced writes are flushed into the request queue at once and an I/O storm results. The problem, they said, is not that fsync() is too slow; instead, it is too fast. It dumps so much data into the I/O subsystem that everything else, including read requests from applications, is blocked. That creates pain for users and, thus, for PostgreSQL developers.
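The pattern, reduced to a toy sketch in Python rather than PostgreSQL's actual C code, looks something like this; the paced writes trickle out over time, but every dirty page they created hits the request queue together when fsync() is finally called:

    import os
    import time

    def checkpoint(fd, dirty_blocks, pause=0.01):
        # Toy illustration of the checkpoint pattern described above,
        # not PostgreSQL's actual checkpointer code.
        for offset, block in dirty_blocks:
            os.pwrite(fd, block, offset)   # buffered write: lands in the page cache
            time.sleep(pause)              # pacing keeps the device from being swamped...
        # ...but this single call pushes everything written above into the
        # request queue at once, producing the I/O storm described above.
        os.fsync(fd)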

Ted Ts'o asked whether the ability to limit the checkpoint process to a specific percentage of the available I/O bandwidth would help. But Robert responded that I/O priorities would be better; the checkpoint process should be able to use 100% of the bandwidth if nothing else wants it. Use of the ionice mechanism (which controls I/O priorities in the CFQ scheduler) was suggested, but there is a problem: it does not work for I/O initiated from an fsync() call. Even if the data was written from the checkpoint process — which is not always the case — priorities are not applied when the actual I/O is started by fsync().

Ric Wheeler suggested that the PostgreSQL developers need to better control the speed with which they write data; Chris Mason added that the O_DATASYNC option could be used to give better control over when the I/O requests are generated. The problem here is that such approaches require PostgreSQL to know about the speed of the storage device.

The discussion turned back to I/O priorities, at which point it was pointed out that those priorities can only be enforced in the request queue maintained by the I/O scheduler. Some schedulers, including those favored by PostgreSQL users (who tend to avoid the CFQ scheduler), do not implement I/O priorities at all. But, even those that do support I/O priorities place limits on the length of the request queue. A big flush of data will quickly fill the queue, at which point I/O priorities lose most of their effectiveness; a high-priority request will still languish if there is no room for it in the request queue. So, it seems, I/O priorities are not the solution to the problem.

It's not clear what the right solution is. Ted asked if the PostgreSQL developers could provide a small program that would generate the kind of I/O patterns created by a running database. Given a way to reproduce the problem easily, kernel developers could experiment with different approaches to a solution. This "program" may take the form of a configuration script for PostgreSQL initially, but a separate (small) program is what the kernel community would really like to see.

Double buffering

PostgreSQL needs to do its own buffering; it also needs to use buffered I/O for a number of reasons. That leads to a problem, though: database data tends to be stored in memory twice, once in the PostgreSQL buffer, and once in the page cache. That increases the amount of memory used by PostgreSQL considerably, to the detriment of the system as a whole.

[Andres Freund]

Much of that memory waste could conceivably be eliminated. Consider, for example, a dirty buffer in the PostgreSQL cache. It is more current than any version of that data that the kernel might have in the page cache; the only thing that will ever happen to the page cache copy is that it will be overwritten when PostgreSQL flushes the dirty buffer. So there is no value to keeping that data in the page cache. In this case, it would be nice if PostgreSQL could tell the kernel to remove the pages of interest from the page cache, but there is currently no good API for that. Calling fadvise() with FADV_DONTNEED can, according to Andres, actually cause pages to be read in; nobody quite understood this behavior, but all agreed it shouldn't work that way. They can't use madvise() without mapping the files; doing that in possibly hundreds of processes tends to be very slow.
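What the developers would like to do, conceptually, is something like the following sketch (Python 3.3 and later expose the call as os.posix_fadvise()); as described above, the real-world behavior of this hint is exactly the problem:

    import os

    def drop_from_page_cache(path):
        # Sketch of the desired "evict this file's pages" operation; as noted
        # above, POSIX_FADV_DONTNEED did not behave as hoped in practice.
        fd = os.open(path, os.O_RDONLY)
        try:
            # An offset and length of zero mean "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)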

It would also be nice to be able to move pages in the opposite direction: PostgreSQL might want to remove a clean page from its own cache, but leave a copy in the page cache. That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache. There was some talk of the form such an interface might take, but this part of the discussion eventually wound down without any firm conclusions.

Regressions

Another problem frequently experienced by PostgreSQL users is that recent kernel features tend to create performance problems. For example, the transparent huge pages feature tends not to bring much benefit to PostgreSQL workloads, but it slows them down significantly. Evidently a lot of time goes into the compaction code, which is working hard without actually producing a lot of free huge pages. In many systems, terrible performance problems simply vanish when transparent huge pages are turned off.

Mel Gorman answered that, if compaction is hurting performance, it's a bug. That said, he hasn't seen any transparent huge page bugs for quite some time. There is, he said, a patch out there which puts a limit on the number of processes that can be performing compaction at any given time. It has not been merged, though, because nobody has ever seen a workload where too many processes running compaction was a problem. It might, he suggested, be time to revisit that particular patch.

Another source of pain is the "zone reclaim" feature, whereby the kernel reclaims pages from some zones even if the system as a whole is not short of memory. Zone reclaim can slow down PostgreSQL workloads; usually the best thing to do on a PostgreSQL server is to simply disable the feature altogether. Andres noted that he has been called in as a consultant many times to deal with performance problems related to zone reclaim; it has, he said, been a good money-maker for him. Still, it would be good if the problem were to be fixed.
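Disabling zone reclaim is a one-line change on a running system; a minimal sketch, written in Python here only to keep the examples in one language:

    # Equivalent to "sysctl vm.zone_reclaim_mode=0"; requires root.
    with open('/proc/sys/vm/zone_reclaim_mode', 'w') as f:
        f.write('0\n')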

Mel noted that the zone reclaim mode was written under the assumption that all processes in the system would fit into a single NUMA node. That assumption no longer makes sense; it's long past time, he said, that this option's default changed to "off." There seemed to be no opposition to that idea in the room, so a change may happen sometime in the relatively near future.

Finally, the PostgreSQL developers noted that, in general, kernel upgrades tend to be scary. The performance characteristics of Linux kernels tend to vary widely from one release to the next; that makes upgrades into an uncertain affair. There was some talk of finding ways to run PostgreSQL benchmarks on new kernels, but no definite conclusions were reached. As a whole, though, developers for both projects were happy with how the conversation came out; if nothing else, it represents a new level of communication between the two projects.

[Your editor would like to thank the Linux Foundation for supporting his travel to the LSFMM summit.]


Facebook and the kernel

By Jake Edge
March 26, 2014

2014 LSFMM Summit

As one of the plenary sessions on the first day of the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Btrfs developer Chris Mason presented on how his new employer, Facebook, uses the Linux kernel. He shared some of the eye-opening numbers that demonstrate just how much processing Facebook does using Linux, along with some of the "pain points" the company has with the kernel. Many of those pain points are shared with the user-space database woes that were presented in an earlier session (and will be discussed further at the Collaboration Summit that follows LSFMM), he said, so he mostly concentrated on other problem areas.

Architecture and statistics

In a brief overview of Facebook's architecture, he noted that there is a web tier that is CPU- and network-bound. It handles requests from users as well as sending replies. Behind that is a memcached tier that caches data, mostly from MySQL queries. Those queries are handled by a storage tier that is a collection of several different database systems: MySQL, Hadoop, RocksDB, and others.

[Chris Mason]

Within Facebook, anyone can look at and change the code in its source repositories. The facebook.com site has its code updated twice daily, he said, so the barrier to getting new code in the hands of users is low. Those changes can be fixes or new features.

As an example, he noted that the "Look Back" videos, which Facebook created for each user by reviewing all of their posts to the service, added a huge amount of data and required a lot more network bandwidth. The process of creating and serving all of those videos was the topic of a Facebook engineering blog post. In all, 720 million videos were created, which required an additional 11 petabytes of storage, as well as consuming 450 Gb/second of peak network bandwidth for people viewing the videos. The Look Back feature was conceived, provisioned, and deployed in only 30 days, he said.

The code changes quickly, so when performance or other problems crop up, he and other kernel developers can tell others in the company that "you're doing it wrong". In fact, he said, "they love that". It does mean that he has to come up with concrete suggestions on how to do it right, but Facebook is not unwilling to change its code.

Facebook runs a variety of kernel versions. The "most conservative" hosts run a 2.6.38-based kernel. Others run the 3.2 stable series with roughly 250 patches. Other servers run the 3.10 stable series with around 60 patches. Most of the patches are in the networking and tracing subsystems, with a few memory-management patches as well.

One thing that seemed to surprise Mason was the high failure tolerance that the Facebook production system has. He mentioned the 3.10 pipe race condition that Linus Torvalds fixed. It is a "tiny race", he said, but Facebook was hitting it (and recovering from it) 500 times per day. The architecture of the system is such that it could absorb that kind of failure rate without users noticing anything wrong.

Pain points

Mason asked around within Facebook to try to determine what the worst problem is that the company has with the kernel. In the end, two features were mentioned most frequently: stable pages and the completely fair queueing (CFQ) I/O scheduler. "I hope we never find those guys", he said with a laugh, since Btrfs implements stable pages. In addition, James Bottomley noted that Facebook also employs a CFQ developer (Jens Axboe).

Another area that was problematic for Facebook is surprises with buffered I/O latency, especially for append-only database files. Most of the time, those writes go fast, but sometimes they are quite slow. He would like to see the kernel avoid latency spikes like that.

He would like to see kernel-style spinlocks be available from user space. Rik van Riel suggested that perhaps POSIX locks could use adaptive locking, which would spin for a short time then switch to sleeping if the lock did not become available quickly. The memcached tier has a kind of user-space spinlock, Mason said, but it is "very primitive compared to the kernel".

Fine-grained I/O priorities are another wish-list item for Facebook (and for the PostgreSQL developers as well). There are always cleaners and compaction threads that need to do I/O, but they shouldn't hold off the higher-priority "foreground" I/O. Mason was asked about how the priorities would be specified, by I/O operation or by file range, for example, and about how fine-grained the priorities needed to be. Either way of specifying them would be reasonable, he said, and Facebook really only needs two (or a few) priority levels: low and high.

The subject of ionice was raised again. One of the problems with that as a solution is that it only works with the (disabled by Facebook) CFQ scheduler. Bottomley suggested making ionice work with all of the schedulers, which Mason said might help. In order to do that, though, Ted Ts'o noted that the writeback daemon will have to understand the ionice settings.

Another problem area is logging. Facebook logs a lot of data and the logging workloads have to use fadvise() and madvise() to tell the kernel that those pages should not be saved in the page cache. "We should do better than that." Van Riel suggested that the page replacement patches in recent kernels may make things better. Mason said that Facebook does not mind explicitly telling the kernel which processes are sequentially accessing the log files, but continually calling *advise() seems excessive.
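A sketch of the kind of hinting Mason was describing (again in Python for illustration; Python 3.3+ exposes posix_fadvise()); the complaint is not that this fails to work, but that the application has to keep issuing such hints by hand:

    import os

    def open_log(path):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        # Hint that this file will only be accessed sequentially.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return fd

    def append_chunk(fd, data):
        offset = os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, data)
        os.fsync(fd)   # make the freshly written pages clean...
        # ...so that this hint can actually drop them from the page cache;
        # the logger will never read them back.
        os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)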

Josef Bacik has also been working on a small change to Btrfs to allow rate-limiting buffered I/O. It was easy to do in Btrfs, Mason said, but, if the idea pans out, the functionality would move elsewhere for more general availability. Jan Kara was concerned that limiting only buffered I/O would be difficult since there are other kinds of I/O bound for the disk at any given time. Mason agreed, saying that the solution would not be perfect but might help.

Bottomley noted that ionice is an existing API that should be reused to help with these kinds of problems. Similar discussions of using other mechanisms in the past have run aground on "painful arguments about which API is right", he said. Just making balance_dirty_pages() aware of the ionice priority may solve 90% of the problem. Other solutions can be added later.

Mason explained that Facebook stores its logs in a large Hadoop database, but that the tools for finding problems in those logs are fairly primitive—grep essentially. He said that he would "channel Lennart [Poettering] and Kay [Sievers]" briefly to wish for a way to tag kernel messages. Bottomley's suggestion that Mason bring it up with Linus Torvalds at the next Kernel summit was met with widespread chuckling.

Danger tier

While 3.10 is fairly close to the latest kernels, Mason would like to run even more recent kernels. To that end, he is creating something he calls the "danger tier". He ported the 60 patches that Facebook currently adds to 3.10.x to the current mainline Git tree and is carving out roughly 1000 machines to test that kernel in the web tier. He will be able to gather lots of performance metrics from those systems.

As a simple example of the kinds of data he can gather, he put up a graph of request response times (without any units) that was gathered over 3 days. It showed a steady average response time line all the way at the bottom as well as the ten worst systems' response times. Those not only showed large spikes in the response times, but also that the baseline for those systems was roughly twice that of the average. He can determine which systems those are, ssh in, and try to diagnose what is happening with them.

He said that was just an example. Eventually he will be able to share more detailed information that can be used to try to diagnose problems in newer kernels and get them fixed more quickly. He asked for suggestions of metrics to gather for the future. With that, his session slot expired.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]


Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Security: S-CRIB password scrambler; New vulnerabilities in chromium, kernel, mozilla, nginx, ...
  • Kernel: Lots and lots of LSFMM 2014 coverage.
  • Distributions: Distributions wrestle with pip; openSUSE, ...
  • Development: Java 8; musl libc 1.0.0; GNOME 3.12; SpinachCon; ...
  • Announcements: FSF awards; new members for AllSeen Alliance, LF, and OIN; ...

Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds