LWN.net Weekly Edition for March 19, 2015
Debian adds technical committee members and considers their role
Debian's technical committee (TC) has been the subject of considerable discussion—even turmoil—over the course of the past year. 2014 saw high-profile debates within the TC, several resignations of long-serving TC members, and a General Resolution to introduce term limits for the first time. So it may come as welcome news that the TC recently added three new members to its ranks with little to no overt drama—although, in the process, thought-provoking questions were raised about how the TC itself would like to see its role evolve in the coming years.
In January 2015, the term-limit resolution was approved, but it will not take effect until 2016. In the meantime, TC business has more or less returned to normal. On March 5, the existing TC recommended the addition of three new members: Sam Hartman, Tollef Fog Heen, and Didier Raboud.
These recommendations were given to Debian Project Leader (DPL) Lucas Nussbaum. The DPL is not obligated to act on them, but only the DPL can formally appoint new TC members. Nussbaum made all three appointments on March 8. Since the recently adopted term-limit rules take the order of appointments into account, Nussbaum made the appointments in a specific order (first Raboud, then Fog Heen, then Hartman).
Nussbaum explained that his use of this order, which is the reverse of the order of the recommendations, was done so that "the most preferred appointee could serve one more year due to the expiration algorithm" (the "most preferred" in this case being the name put forward first by the existing TC). That outcome would happen because the new term-limit rules only cause two seats to expire in any normal year ("normal," in this case, meaning a year without numerous early resignations). In other words, by appointing Hartman third, Nussbaum ensured that Hartman will have his term expire later than Raboud and Fog Heen.
The turnover process, then, would appear to be working smoothly. Before Nussbaum made his appointments, though, Hartman added one wrinkle to the story by suggesting to the TC that he might not be a good fit for the job, on the grounds that he has some specific thoughts on how the TC should operate that others may not agree with.
Hartman posed his suitability question in a March 4 message to the Debian TC mailing list, saying that he hoped the TC could play a role that helped project members do their technical work "with less project-wide pain." Specifically, he said:
I hope that we'll work with people to see other sides of an issue and to help them make decisions more than we work as an appeal board.
In essence, Hartman wants to see the TC shift its role from that of final authority in otherwise-intractable disputes to one that can offer guidance and assistance on a regular basis—hopefully reducing the number of disagreements that reach "intractable dispute" status. As he explained later in the email:
For example, he continued, the TC might get involved to address communications issues, even if they do not impact the technical quality of the project.
Hartman's first example was vague enough to be open to interpretation, but it seemed to suggest at least the possibility that the TC might resolve some disagreement on the basis of which camp contributes more to the Debian project's functioning as a smooth, happy community, rather than solely on the technical merits of each camp's software:
Continuing, he provided a more specific example, pointing to the early-2014 dispute over whether applications should stick to Debian's traditional menu-file format or adopt the newer format used by GNOME, KDE, and other desktop environments. In that case, Hartman said, the TC seemed to make its decision on the basis of technical correctness, even though that decision sided with a minority viewpoint and upset a majority of the developers involved (who had already gone in a different direction with rough consensus). Ignoring consensus in such cases has the effect of devaluing the effort that people put into their work.
Then again, Hartman also made it clear that he would prefer to see fewer formal resolutions emanate from the TC in the first place. He noted that in the init-system debate, Don Armstrong had contended that every TC decision ought to result in a resolution. Hartman disagreed:
However, there's a big difference between actively not acting and dropping an item through inaction.
Forcing everything to have a formal resolution (even if that is a formal resolution to take no action) really gets in the way of helping people out, building consensus, fostering communication. There are many times when it's really important to be able to say something like "I don't think we need anything more here; if folks disagree, speak up."
Hartman concluded by acknowledging that Armstrong's position (requiring a resolution from the TC on all disputes) was consistent with the constitution, but suggested that a more liberal reading, in which the TC had flexibility to get involved without issuing resolutions, might better serve the project.
Hartman posed the question to the TC in order to ask if his thoughts about the TC's role in Debian were peculiar enough that he should turn down an appointment. Surprisingly, perhaps, no one took issue with his sentiments. Bdale Garbee and Keith Packard both expressed broad support. Armstrong, for his part, suggested that he believes the TC can act informally as well as formally, and that there are a number of ways in which a dispute can be resolved without a formal TC resolution. He also expressed support for Hartman's nomination.
Fog Heen added that "consensus" can have more than one meaning, from complete agreement to "is anybody unable to live with this?" It is important to establish what meaning the TC has in mind, he said, but, regardless of what the answer is, he looked forward to seeing Hartman on the TC.
Ultimately, then, all participants in the discussion felt that Hartman's thoughts about the TC were compatible with their own. Nussbaum made the appointments, after which the new TC proceeded to take up its first official task. Garbee, the current TC chairman, announced that he would be stepping down from the role (since his term will expire at the end of 2015), and called for a vote to select a new chairman. By transitioning to a new chairman now, the TC ensures that Garbee's departure next year will have less of an impact on its continuity. Armstrong won the vote, and the TC moved on to new business.
Nevertheless, the incident may have repercussions for quite some time. The preceding year was a rough one for Debian and the TC in particular; Hartman's questions about the role of the TC going forward may have crystallized similar thoughts that had been just beneath the surface for many TC members and Debian developers. It is unlikely that Debian has seen its last argument, but the project may have learned a few things from its recent encounters with conflict.
Conflict over a code
A new "code of conflict" was merged into the mainline kernel on March 8. Since then, your editor has endured news articles, direct emails, and being cornered at conferences; it seems that just about everybody has an opinion to share on this little document. Much of what has been said shows, in your editor's opinion, a misunderstanding of what the code is and what it is trying to do. So here are some thoughts on the matter.
It should be emphasized that these are your editor's thoughts; they are not representative of anybody else. The Linux Foundation's Technical Advisory Board (TAB), which is named as the resolving body in the code, and of which your editor is a member, does not even know the article is being written. Even your editor's dog has requested an explicit disclaimer stating that these are not her opinions. But, then, the dog's opinions are mostly concerned with whether something is edible or not.
Perhaps the most surprising thing from your editor's point of view is the articles portraying the code as directly aimed at Linus Torvalds; expressions like "slap down" or "rein in" are not hard to come by. One particularly amusing piece saw it as a deliberate attempt to undercut an "over the hill" Linus and, presumably, take his place. In fact, Linus was given the opportunity to comment on the code and was, of course, the person who merged it into the mainline. It would be unusual for somebody to cooperate in his own reining-in in this way.
If the behavioral problems on the kernel mailing lists could be solved by muzzling one person, the issue as a whole would be much more easily dealt with. But that is not the case here; opinions differ on how large the problem in the kernel community really is, but it is a rare participant indeed who thinks it comes down to a single individual. Attempts to change the way a community behaves generally require addressing the community as a whole; that is what the code of conflict is attempting to do.
There is another line of thought that sees the "code of conflict" as a no-op statement with no teeth. What your editor has heard from a number of sources is that the code looks like an attempt to paper over the problem without actually doing anything about it.
This code does differ from the codes of conduct adopted by a number of other projects. It states from the outset that conflict over technical issues is a part of how the kernel community works. It lacks a list of specifically prohibited behaviors — something that a number of critics have pointed out. There is no list of specific sanctions that can be applied to developers who are deemed to violate the rules, whatever they turn out to be. The code places the onus on the target of abusive behavior to raise the issue with a distant group of ten developers who may or may not have this person's trust. All of this, to some, means that the code is not designed to actually bring about any real-world change.
From your editor's understanding, the list of unwanted behaviors was left out for a couple of reasons: to keep the code short and comprehensible, and to avoid attempts to play games around the edge of the rules. So the actual rule is short and simple: participants are not to be made to feel "personally abused, threatened, or otherwise uncomfortable". The assumption is that targets of abuse can tell whether they feel abused without having to check a list of specific behaviors. The concern that such people may be more reluctant to complain without a specific rule to point to may have more merit. The good news is that the community is full of sharp-eyed observers who are likely to blow the whistle on poor behavior even if the target feels too intimidated to do so.
With regard to specific sanctions, there are limits on what can be done. Those who abuse the kernel mailing lists can already be banned; that has happened, for example, to some of the persistent systemd trolls who recently tried (unsuccessfully) to stir up flame wars. There is no community process that could, say, remove an abusive subsystem maintainer from that role. One could imagine that the TAB might advise such a course in an extreme case, but it would be up to Linus to actually carry it out. Hopefully, though, the bulk of any problems raised under the code of conflict can be resolved without resorting to punishments or the threat thereof.
At a different extreme, there are those who see the code as the beginning of the end for the kernel community. In this view, the code will curtail debate over code submissions and lead to a lowering of quality standards overall. In talking to many people about this code, your editor has noted that even those who feel most strongly that it should be more explicit and have sharper teeth do not say that it should become easier to get code merged into the kernel. It seems relatively safe to predict that anybody who complains that their code was rejected on technical grounds will be advised to address the issues raised and try again. There is no reason to believe that the standards for kernel code will be lowered.
Years of writing for LWN have given your editor a (possibly twisted) standard for success: if people on all sides of an issue appear to be equally unhappy with an article, it was probably reasonably fair. By that metric, the code of conflict may well be deemed to have gotten things about right: the complaints from the "it's business as usual" and the "it's the end of the kernel" camps both seem loud. At this early stage, it would have been hard to do better than that.
That said, those who see this code as an exercise in being seen to be Doing Something are probably not entirely off the mark. Doing Something does not always turn into "having done something useful" over time. The real value of this code can only be seen going forward. If it empowers the targets of abusive behavior to speak out, and if it helps to bring a resolution of such situations and end that behavior, it will be a success. If so, the kernel community should become a kinder, gentler place, though, in truth, that has already been happening for some time.
A few years from now, we might just look back on the kernel's code of conflict as an inadequate, failed response. But it is far too soon to predict that with any degree of certainty. The thing to do for now is to give it a chance and see if problems arise that the code's process fails to address adequately. As Linus said when he merged it: "Let's see how this works". Even your editor's dog would agree, especially if it came with something to eat.
Security
Filesystem fuzzing
At the inaugural Vault conference, Sasha Levin gave a presentation on filesystem fuzzing—deliberately providing random bad input to the kernel to try to find bugs. He described different kinds of fuzzing, along with giving examples of some security bugs that were found. The conference itself focused on Linux storage and filesystems and was held March 11-12 in Boston. It attracted around 400 attendees, which has led the Linux Foundation to schedule another Vault for next year in Raleigh, North Carolina.
Levin started by saying that Linux has a problem with "shitty code". That's not because the developers are not skilled, nor is it that code review is going by the wayside. The biggest problem is that the code does not get all that much testing until after it is merged into the mainline. At that point, users get their hands on it and start to find bugs.
Kernel testing
Testing the kernel is done by multiple groups in the ecosystem. Developers will run some tests against their code; for filesystems those tests might include xfstests. Quality assurance (QA) groups will also run tests, but those are typically limited to existing test suites with a known set of tests. The kernel is a "big, scary machine", he said, and it needs more testing.
There are two different kinds of testing: manual and automated. Manual tests are typically run by developers based on the code they changed. If a developer changes the open() call, for example, they "poke it a little bit" to see if anything is broken. That kind of testing is slow and requires a human to create, run, and interpret the tests. It doesn't really scale so that multiple testers could get involved, either.
Automated tests essentially perform the manual tests automatically. Once a test suite covers the basics, though, people stop adding tests except to check for regressions. There is not much done with these test suites (such as the Linux Test Project, xfstests, Filebench, IOzone, and others) to find new bugs. In addition, there is no real effort to test new features.
Users test the code by doing their normal work. They may have a technical background, but they did not review the patches and are not working on the filesystem. They are just trying to get their work done and have not set out to test anything.
There are some things missing from today's testing. Test developers don't try to guess what users will or won't do so that tests cover the corner cases. Test suites generally just check for regressions. In addition, there is little imagination that goes into test development, since creating new features is much more interesting to developers than creating new tests.
For example, he mentioned the __GFP_NOFAIL issues that have been discussed in kernel forums (including the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit) recently. Dave Chinner added tests to xfstests to observe that problem, but only after the problems had been hit. That means that someone ran into those problems and ended up with a corrupted filesystem. It would be nice to find those kinds of problems before someone hits them and ends up complaining about a "shitty kernel", he said.
Fuzzing
Fuzzing is a technique that effectively creates new tests on the fly. Some of those tests are stupid, but others may find bugs. In addition, fuzzing frameworks tend to be heavily threaded, which puts a different kind of load on filesystems. The existing test suites do put a load on the filesystem, but it is basically the same load over and over again. So fuzzing can help test concurrency in the filesystem as well.
"Structure fuzzing" simply takes a filesystem image, makes some changes to it, and then tries to mount it. Some of those tests have found kernel crashes or panics at mount time. But not every corruption can or will be found at mount time because that is too expensive to check. Testing with other operations will show whether the corruption is handled appropriately post-mount.
But just flipping every bit in the filesystem image doesn't really make too much sense as a test. That's where "smart structure fuzzing" comes into play. This kind of testing is filesystem-specific as it must have some knowledge of the structure of the filesystem. Since that structure can't really change often (it resides on-disk), this kind of testing does not need to be done all of the time. It can be run occasionally, especially when there are changes that might affect the binary format.
"API fuzzing" is more popular, Levin said. It typically fuzzes the virtual filesystem (VFS) layer, so it is not necessarily filesystem-specific. Basically, API fuzzing tries passing lots of different values to the system calls to see if it can break something.
"Smart API fuzzing" takes that one step further by incorporating knowledge about the kinds of values that make sense as parameters to the system calls. For example, chmod() takes a path and a mode. The first check in chmod() is to see if the mode value is reasonable, so sending all of the 2^16 possibilities doesn't make sense all of the time. Doing that occasionally is useful, but it is overkill to test the same error path over and over.
As an example of what this kind of fuzzing can find, Levin pointed to CVE-2015-1420. It is an invalid memory access in open_by_handle_at() that was found because the fuzzer knew what the function expects. In a multithreaded test, it was able to change the size in a structure between the time it was used for allocating a buffer and the time it was used to actually read the data. Since the fuzzer had knowledge of the parameters and their types, it could change them in multiple threads.
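The general pattern described here, a size field read once to allocate a buffer and again to copy into it, can be modeled deterministically. The following is a Python model written for illustration, not the kernel code: `Handle`, `copy_racy`, `copy_safe`, and the `between_fetches` hook (which stands in for a second thread running at exactly the wrong moment) are all hypothetical, and the model raises an exception where C code would silently access memory out of bounds.

```python
class OutOfBoundsRead(Exception):
    """Raised by the model where real C code would overrun its buffer."""


class Handle:
    """Stand-in for a caller-owned structure that carries its own length field."""

    def __init__(self, payload: bytes):
        self.size = len(payload)
        self.payload = payload


def copy_racy(handle, between_fetches=None):
    """Fetch handle.size twice: once to allocate, once to copy."""
    buf = bytearray(handle.size)           # first fetch sizes the buffer
    if between_fetches:
        between_fetches(handle)            # models another thread running here
    n = handle.size                        # second fetch controls the copy
    if n > len(buf):
        raise OutOfBoundsRead(f"copy of {n} bytes into {len(buf)}-byte buffer")
    buf[:n] = handle.payload[:n]
    return bytes(buf)


def copy_safe(handle, between_fetches=None):
    """Fetch handle.size once and reuse the snapshot for both steps."""
    n = handle.size                        # single fetch
    buf = bytearray(n)
    if between_fetches:
        between_fetches(handle)            # later changes no longer matter
    buf[:n] = handle.payload[:n]
    return bytes(buf)


if __name__ == "__main__":
    def grow(handle):
        handle.size = 4096                 # the concurrent modification

    print(copy_safe(Handle(b"abcd"), grow))
    try:
        copy_racy(Handle(b"abcd"), grow)
    except OutOfBoundsRead as err:
        print("racy version:", err)
```

A fuzzer that knows which argument is a structure with a length field can run the equivalent of `grow` in a second thread, which is the kind of knowledge the paragraph above credits for finding the bug.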
Having many threads all accessing the filesystem is a place where fuzzers shine. For example, simulating 10,000 users is easy, which can help catch untested scenarios, he said. It makes it easier to catch problems where a lot of load is needed to hit them.
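As a rough illustration of that kind of load, the sketch below starts a pool of threads that perform random operations on a small, shared set of file names in a scratch directory. It is far simpler than a real fuzzer; the names `worker` and `run_load`, the operation mix, and the file count are assumptions made for the example. CPython typically releases its global interpreter lock around these system calls, so they can overlap inside the kernel.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

ACTIONS = ("write", "truncate", "rename", "unlink", "stat")


def worker(directory: str, names: list, ops: int, seed: int) -> int:
    """Perform `ops` random file operations; return how many were attempted."""
    rng = random.Random(seed)
    attempted = 0
    for _ in range(ops):
        path = os.path.join(directory, rng.choice(names))
        action = rng.choice(ACTIONS)
        try:
            if action == "write":
                with open(path, "ab") as f:
                    f.write(os.urandom(rng.randint(1, 4096)))
            elif action == "truncate":
                os.truncate(path, rng.randint(0, 8192))
            elif action == "rename":
                os.rename(path, os.path.join(directory, rng.choice(names)))
            elif action == "unlink":
                os.unlink(path)
            else:
                os.stat(path)
        except OSError:
            pass   # expected: other threads remove or rename files under us
        attempted += 1
    return attempted


def run_load(n_threads: int, ops_per_thread: int, seed: int = 0) -> int:
    """Run `n_threads` workers against one scratch directory; return total ops."""
    names = [f"f{i}" for i in range(8)]    # few names, so threads collide often
    with tempfile.TemporaryDirectory() as directory:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = [
                pool.submit(worker, directory, names, ops_per_thread, seed + i)
                for i in range(n_threads)
            ]
            return sum(f.result() for f in futures)


if __name__ == "__main__":
    print(run_load(n_threads=32, ops_per_thread=200))
```

Keeping the set of file names small is deliberate: the point is contention on the same files, not throughput.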
CVE-2014-4171 was an example of a bug that needed a high load to find. It is a local denial of service that can happen when accessing the region around a hole in a file using mmap() while that hole is being punched in another thread. It was easy to see in the code once it was discovered, but it was only found under heavy load from the fuzzer.
That is one of the benefits of fuzzing, he said, that it creates tests that no filesystem developer would ever think of. It will do things that are not reasonable and don't make any sense. For example, CVE-2014-8086 is a race condition that was discovered when switching between asynchronous I/O and direct I/O, which is something that "no one really does". But a malicious user can, of course.
It is nice to know that some set of tests cover most or all of the lines of code of interest, but it does not mean that the code is right. There are multiple paths through any code, so it is important to have lots of threads exercising different paths from different places. Executing rarely used paths is useful as well.
Disadvantages
There are some disadvantages to fuzzing, though. For one thing, there are no pass/fail criteria. Since it is random, you can't say that if it runs for an hour it is considered a "pass". It may miss completely obvious errors. As Peter Zijlstra put it, running for some length of time "doesn't mean that the behavior is right, just that it didn't explode". There may be plenty of bugs lurking that just don't cause a big enough problem to crash the test (or the kernel).
Fuzzing really needs to run continuously, Levin said. It can't just be run overnight and checked in the morning. Instead it should be run continuously and checked daily. Fuzzing is a resource hog too, but that actually helps testing the memory management code, especially for huge pages. The tests split lots of pages and make it hard to collapse them back into huge pages, he said.
Reproducing bugs found by the fuzzer can be quite difficult. Unfortunately, the right answer for causing the bug to happen again is often "run the fuzzer and wish for the best". It is difficult to output the results of tests because the amount of data slows the system down. Things like the last system call made aren't all that helpful, he said. Intel's Processor Trace (which Levin learned about at LSFMM) may help the situation eventually.
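One way to keep logging cheap, offered here as an assumption rather than as a description of any particular fuzzer, is to record only the random seed for each worker and regenerate the sequence of generated calls afterward. The operation names and the functions `generate_calls` and `replay` below are hypothetical. This recovers the inputs only; it does not recover how the threads interleaved, which is often what actually triggered the bug.

```python
import random

# Hypothetical operation names; a real fuzzer would map these to system calls.
OPERATIONS = ("open", "read", "write", "mmap", "fallocate", "chmod")


def generate_calls(seed: int, count: int) -> list:
    """Return the deterministic list of (operation, argument) pairs for a seed."""
    rng = random.Random(seed)
    return [(rng.choice(OPERATIONS), rng.getrandbits(16)) for _ in range(count)]


def replay(seed: int, count: int, last_n: int) -> list:
    """Regenerate a run from its seed and return its final `last_n` calls."""
    calls = generate_calls(seed, count)
    return calls[max(0, count - last_n):]


if __name__ == "__main__":
    # During the run, only the seed and the number of calls need to be logged.
    for operation, argument in replay(seed=99, count=10_000, last_n=5):
        print(operation, hex(argument))
```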
Levin suggested that the community should be doing more fuzzing. Developers should be doing some fuzzing before they send in patches and QA folks should be fuzzing continuously. A QA person in the audience asked about getting more information out of the kernel when it fails from fuzzing. Levin suggested setting up the kernel to do a memory dump when it gets a BUG_ON(). He will also be working on better BUG_ON() reporting.
He uses the Trinity fuzz tester for all of the API fuzzing and a different, unnamed tool for filesystem structure fuzzing. He runs Trinity in a virtual machine, while Trinity developer Dave Jones runs it on real hardware, so they find different kinds of bugs. Levin has not gotten to the point where he can run Trinity on linux-next for a week without hitting problems; so far he has not needed to look anywhere else for fuzzing tests.
[I would like to thank the Linux Foundation for travel support to Boston for Vault.]
Brief items
Security quotes of the week
To my mind, the real eyebrow raising moment was that the CIA is also allegedly targeting app developers through “whacking” Apple’s Xcode tool, presumably allowing all subsequent software shipped from the developer to the app store to contain some sort of malicious implant, which will then be distributed within that developer’s app. Nothing has been disclosed about how widespread these attacks are (if ever used at all), what developers might have been targeted, or how the implants might function.
New vulnerabilities
389-admin: multiple /tmp/ file vulnerabilities
| Package(s): | 389-admin | CVE #(s): | CVE-2015-0233 |
| Created: | March 16, 2015 | Updated: | March 18, 2015 |
| Description: | From the Red Hat bugzilla: Kurt Seifried of Red Hat Product Security reports: There are several temporary file creation vulnerabilities: In the file ./389-admin-1.1.36/admserv/newinst/src/AdminServer.pm.in my $secfile_backup_dir = "/tmp/adm-sec-files." . $$; and in the file: ./389-admin-1.1.36/lib/libadmin/httpcon.c char *dbd = "/tmp/http_trace.%d"; The perl code should use mkstemp() and the C code should use mkstemp(). These issues are only locally exploitable and require administrative action in order to exploit. |
checkpw: denial of service
| Package(s): | checkpw | CVE #(s): | CVE-2015-0885 |
| Created: | March 17, 2015 | Updated: | March 18, 2015 |
| Description: | From the Debian advisory: Hiroya Ito of GMO Pepabo, Inc. reported that checkpw, a password authentication program, has a flaw in processing account names which contain double dashes. A remote attacker can use this flaw to cause a denial of service (infinite loop). |
cups-filters: remote command execution
| Package(s): | cups-filters | CVE #(s): | CVE-2015-2265 |
| Created: | March 16, 2015 | Updated: | April 7, 2015 |
| Description: | From the Red Hat bugzilla: It was reported that cups-browsed fails to properly sanitize data from the network when creating IPP printer scripts. As a result, an attacker can remotely create a script containing arbitrary commands, which will be executed as the "lp" user when the associated printer is used. This is the same vulnerability reported as CVE-2014-2707 but the existing fixes rely on a string sanitization function remove_bad_chars() which is not effective. |
freexl: denial of service
| Package(s): | freexl | CVE #(s): | |
| Created: | March 18, 2015 | Updated: | March 18, 2015 |
| Description: | From the FreeXL advisory: Four potentially harmful bugs causing crash and stack corruption were detected in FreeXL by American Fuzzy Lop. The most recent version of FreeXL solves all four issues. |
gnupg: denial of service
| Package(s): | gnupg | CVE #(s): | CVE-2015-1606 |
| Created: | March 13, 2015 | Updated: | March 18, 2015 |
| Description: | From the Debian advisory: The keyring parsing code did not properly reject certain packet types not belonging in a keyring, which caused an access to memory already freed. This could allow remote attackers to cause a denial of service (crash) via crafted keyring files. |
gnutls26: two vulnerabilities
| Package(s): | gnutls26 | CVE #(s): | CVE-2015-0282 CVE-2015-0294 |
| Created: | March 16, 2015 | Updated: | July 30, 2015 |
| Description: | From the Debian advisory: CVE-2015-0282: GnuTLS does not verify the RSA PKCS #1 signature algorithm to match the signature algorithm in the certificate, leading to a potential downgrade to a disallowed algorithm without detecting it. CVE-2015-0294: It was reported that GnuTLS does not check whether the two signature algorithms match on certificate import. |
icu: regular expression flaws
| Package(s): | icu | CVE #(s): | CVE-2014-9654 |
| Created: | March 16, 2015 | Updated: | April 28, 2015 |
| Description: | From the Debian advisory: CVE-2014-9654: More regular expression flaws. |
ipa: multiple vulnerabilities
| Package(s): | ipa | CVE #(s): | CVE-2014-7850 CVE-2014-7828 |
| Created: | March 13, 2015 | Updated: | March 18, 2015 |
| Description: | From the Oracle advisory: CVE-2014-7850: XSS flaw can be used to escalate privileges. CVE-2014-7828: password not required when OTP in use. |
jBCrypt: integer overflow
| Package(s): | jBCrypt | CVE #(s): | CVE-2015-0886 |
| Created: | March 16, 2015 | Updated: | March 18, 2015 |
| Description: | From the CVE entry: Integer overflow in the crypt_raw method in the key-stretching implementation in jBCrypt before 0.4 makes it easier for remote attackers to determine cleartext values of password hashes via a brute-force attack against hashes associated with the maximum exponent. |
kernel: privilege escalation
| Package(s): | kernel | CVE #(s): | CVE-2014-8159 |
| Created: | March 12, 2015 | Updated: | May 1, 2015 |
| Description: | From the Red Hat advisory: It was found that the Linux kernel's Infiniband subsystem did not properly sanitize input parameters while registering memory regions from user space via the (u)verbs API. A local user with access to a /dev/infiniband/uverbsX device could use this flaw to crash the system or, potentially, escalate their privileges on the system. |
libav: denial of service
| Package(s): | libav | CVE #(s): | CVE-2014-9604 |
| Created: | March 16, 2015 | Updated: | May 19, 2015 |
| Description: | From the CVE entry: libavcodec/utvideodec.c in FFmpeg before 2.5.2 does not check for a zero value of a slice height, which allows remote attackers to cause a denial of service (out-of-bounds array access) or possibly have unspecified other impact via crafted Ut Video data, related to the (1) restore_median and (2) restore_median_il functions. |
libxfont: privilege escalation
| Package(s): | libxfont | CVE #(s): | CVE-2015-1802 CVE-2015-1803 CVE-2015-1804 |
| Created: | March 17, 2015 | Updated: | December 21, 2015 |
| Description: | From the X.org advisory: Ilja van Sprundel, a security researcher with IOActive, has discovered an issue in the parsing of BDF font files by libXfont. Additional testing by Alan Coopersmith and William Robinet with the American Fuzzy Lop (afl) tool uncovered two more issues in the parsing of BDF font files. As libXfont is used by the X server to read font files, and an unprivileged user with access to the X server can tell the X server to read a given font file from a path of their choosing, these vulnerabilities have the potential to allow unprivileged users to run code with the privileges of the X server (often root access). |
movabletype-opensource: multiple vulnerabilities
| Package(s): | movabletype-opensource | CVE #(s): | CVE-2013-2184 CVE-2014-9057 CVE-2015-1592 |
| Created: | March 13, 2015 | Updated: | March 18, 2015 |
| Description: | From the Debian advisory: CVE-2013-2184 - Unsafe use of Storable::thaw in the handling of comments to blog posts could allow remote attackers to include and execute arbitrary local Perl files or possibly remotely execute arbitrary code. CVE-2014-9057 - Netanel Rubin from Check Point Software Technologies discovered a SQL injection vulnerability in the XML-RPC interface allowing remote attackers to execute arbitrary SQL commands. CVE-2015-1592 - The Perl Storable::thaw function is not properly used, allowing remote attackers to include and execute arbitrary local Perl files and possibly remotely execute arbitrary code. |
osc: command injection
| Package(s): | osc | CVE #(s): | CVE-2015-0778 |
| Created: | March 13, 2015 | Updated: | March 7, 2016 |
| Description: | From the openSUSE bug report: Server and client side arbitrary command execution in source service handling of OBS. |
| Alerts: |
php5: code execution
| Package(s): | php5 | CVE #(s): | CVE-2015-2301 |
| Created: | March 18, 2015 | Updated: | March 23, 2015 |
| Description: | From the Ubuntu advisory:
It was discovered that PHP incorrectly handled memory in the phar extension. A remote attacker could use this issue to cause PHP to crash, resulting in a denial of service, or possibly execute arbitrary code. |
| Alerts: |
php5: two vulnerabilities
| Package(s): | php5 | CVE #(s): | CVE-2014-9705 CVE-2015-2305 |
| Created: | March 18, 2015 | Updated: | May 13, 2015 |
| Description: | From the Debian advisory:
CVE-2014-9705: Buffer overflow in the enchant extension.
CVE-2015-2305: Guido Vranken discovered a heap overflow in the ereg extension (only applicable to 32 bit systems). |
| Alerts: |
phpMyAdmin: information leak
| Package(s): | phpMyAdmin | CVE #(s): | CVE-2015-2206 |
| Created: | March 16, 2015 | Updated: | March 31, 2015 |
| Description: | From the CVE entry:
libraries/select_lang.lib.php in phpMyAdmin 4.0.x before 4.0.10.9, 4.2.x before 4.2.13.2, and 4.3.x before 4.3.11.1 includes invalid language values in unknown-language error responses that contain a CSRF token and may be sent with HTTP compression, which makes it easier for remote attackers to conduct a BREACH attack and determine this token via a series of crafted requests. |
| Alerts: |
postgresql: buffer overrun
| Package(s): | postgresql | CVE #(s): | CVE-2015-0242 |
| Created: | March 16, 2015 | Updated: | March 18, 2015 |
| Description: | From the openSUSE advisory:
Fix buffer overrun in replacement *printf() functions |
| Alerts: |
requests: cookie stealing attacks
| Package(s): | requests | CVE #(s): | CVE-2015-2296 |
| Created: | March 16, 2015 | Updated: | June 18, 2015 |
| Description: | From the Ubuntu advisory:
Matthew Daley discovered that Requests incorrectly handled cookies without host values when being redirected. A remote attacker could possibly use this issue to perform session fixation or cookie stealing attacks. |
| Alerts: |
suricata: multiple vulnerabilities
| Package(s): | suricata | CVE #(s): | CVE-2015-0928 |
| Created: | March 13, 2015 | Updated: | March 18, 2015 |
| Description: | From the Fedora advisory:
This release fixes a parsing issue in the DCERPC parser that can happen when Suricata runs out of memory. The exact scope of the problem isn’t clear, but it could certainly lead to crashes. CVE-2015-0928 is assigned for this. The second issue is certain characters in the URI could confuse the parsing of the HTTP request line, leading to possible detection bypass for ‘http_uri’ and to incomplete logging of the URI. |
| Alerts: |
tcllib: HTML injection
| Package(s): | tcllib | CVE #(s): | |
| Created: | March 16, 2015 | Updated: | May 7, 2015 |
| Description: | The following flaw was reported against tcllib:
User supplied input is directly inserted into the <textarea> as default value, e.g. a textarea named 'ta' with a parameter of ta=XXX results in `<textarea>XXX</textarea>` This can be used to break out of the <textarea>-context and insert arbitrary HTML content such as <script>-Tags. The attack is possible using HTTP GET requests as well as POST and multipart form encoded POST requests. |
| Alerts: |
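The flaw reported above comes down to placing user input into a textarea without escaping it. A minimal Python sketch of the vulnerable pattern and an escaped alternative follows; the function names are hypothetical, and this is an illustration of the general technique rather than tcllib's own code (tcllib is written in Tcl).

```python
import html


def render_textarea_unsafe(name, value):
    # Vulnerable pattern: the value is pasted into the markup verbatim,
    # so a value containing "</textarea>" ends the element early.
    return f'<textarea name="{name}">{value}</textarea>'


def render_textarea_escaped(name, value):
    # Escaping <, >, & and quotes keeps the value inside the textarea.
    return (f'<textarea name="{html.escape(name)}">'
            f'{html.escape(value)}</textarea>')


# Illustrative payload that breaks out of the textarea context.
payload = "</textarea><script>alert(1)</script>"
```

With the escaped version, the payload is rendered as inert text inside the textarea instead of being interpreted as markup.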
tcpdump: multiple vulnerabilities
| Package(s): | tcpdump | CVE #(s): | CVE-2015-0261 CVE-2015-2153 CVE-2015-2154 CVE-2015-2155 |
| Created: | March 17, 2015 | Updated: | April 27, 2015 |
| Description: | From the Debian advisory:
Several vulnerabilities have been discovered in tcpdump, a command-line network traffic analyzer. These vulnerabilities might result in denial of service (application crash) or, potentially, execution of arbitrary code. |
| Alerts: |
wireshark: multiple vulnerabilities
| Package(s): | wireshark | CVE #(s): | CVE-2015-2187 CVE-2015-2188 CVE-2015-2189 CVE-2015-2190 CVE-2015-2191 CVE-2015-2192 |
| Created: | March 13, 2015 | Updated: | April 1, 2015 |
| Description: | From the openSUSE bug reports:
CVE-2015-2187 - The ATN-CPDLC dissector could crash.
CVE-2015-2188 - The WCP dissector could crash while decompressing data.
CVE-2015-2189 - The pcapng file parser could crash.
CVE-2015-2190 - The LLDP dissector could crash.
CVE-2015-2191 - The TNEF dissector could go into an infinite loop.
CVE-2015-2192 - The SCSI OSD dissector could go into an infinite loop. |
| Alerts: |
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The current development kernel is 4.0-rc4, released on March 15. Linus said: "Nothing particularly stands out here. Shortlog appended, I think we're doing fine for where in the release cycle we are."
Stable updates: 3.19.2, 3.14.36, and 3.10.72 were released on March 18.
Kdbus on track for 4.1
Greg Kroah-Hartman has added the kdbus tree to linux-next with an eye toward merging it during the next merge window. "The code has been reworked and reviewed many times, and this last round seems to have no objections, so I'm queueing it up to be merged for 4.1-rc1."
Kernel development news
Virtual filesystem layer changes, past and future
While most of the 2015 Linux Storage, Filesystem and Memory Management summit was dedicated to subsystem-specific discussions, some subjects were of sufficiently wide interest that they called for plenary sessions. Al Viro's session about the evolution of the kernel's virtual filesystem (VFS) layer was one such session. There is little that happens in the system that does not involve the VFS in one way or another; in a rapidly changing kernel, that implies a need for the VFS to change quickly as well.
One of the things that has not yet happened, despite wishes to the contrary, is the provision of a better set of system calls to replace mount(). Al did some work in that area but the patches got bogged down before they were even posted for review. So there is no real progress to report in that area yet. On the other hand, there has been some limited progress toward the creation of a revoke() system call. The full implementation remains distant, but some of the infrastructure work is done.
An area that has seen more work is the transition to the iov_iter interface. Al's hope is that, by the time the 4.1 merge window closes, the reworking of aio_read() and aio_write() (part of the asynchronous I/O implementation) to use iov_iter will be complete. There are several instances that still need to be converted, but he is reasonably confident that there are no significant roadblocks.
In the last year the send and receive paths in the network stack have seen iov_iter conversions. The sendpages() path remains to be done, but there do not seem to be any obstacles to getting it done. The conversion of the splice() system call is a bit harder. The code on the write side has almost all been switched, with one exception: the filesystem in user space (FUSE) module. The problem with FUSE is that it wants to do zero-copy I/O, moving pages directly between a splice() buffer and the page cache.
When splice() was first added to the kernel, this sort of "page stealing" was part of the plan; it seemed like a useful optimization. But page stealing had a number of problems, including confusion in the filesystem code when an up-to-date page is stuffed directly into the page cache. So Nick Piggin removed that feature in 2007 and nobody has ever gotten around to putting it back. Al noted that Nick described some of the problems in his commit message, but there are others and, since Nick has proved hard to reach in recent years, they will have to remain a mystery until somebody else rediscovers them.
Meanwhile, zero-copy operation in splice() is disabled, with one exception: FUSE. The problems that affected page stealing with other filesystems do not come up with FUSE, so there was no reason to disable it there; beyond that, FUSE needs zero-copy operation or its performance will suffer. This has prevented the conversion of FUSE over to iov_iter for now. Al's preferred solution to this problem would be to restore the zero-copy mode for all cases, but that is going to take some exploration.
The read side (as represented by the splice_read() file_operations method) will probably be converted sometime this year.
In summary, Al said, he is surprised by how many iovec instances (the predecessor to iov_iter) remain in the kernel. It is not about to go extinct quite yet, but there are fewer and fewer places where it is used.
Another upcoming change that might be visible outside of the VFS is that the nameidata structure is about to become completely opaque. It will only be defined within the VFS code. Al would like to eventually get rid of even the practice of passing around pointers to this structure and switch to using a pointer out of the task structure. This change should not affect non-VFS code that much, but he wanted to mention it because there are patch sets out there that will be broken.
Work continues on the project of getting rid of the numerous variants of d_add(), the basic function that adds a directory entry (dentry) structure into the dentry cache. One of those variants — d_materialise_unique() — was removed in 3.19. Others, like d_splice_alias(), remain. The ideal situation would be to have a single primitive to associate dentries with inodes. Matthew Wilcox asked if the other variants might still have value for documentation purposes, but Al said such cases should be handled with assertions.
A couple of other recent changes include unmounting of filesystems on invalidation and better shutdown processing. The unmounting changes cause a filesystem to be automatically removed if its mount point is invalidated; it went in some months ago. The big change with filesystem shutdown processing is that it is now delayed and always run on a shallow stack. That should address concerns about stack overflows that might otherwise occur during shutdown processing.
Al's final topic had to do with BSD process accounting. What happens if you start accounting to a file, then unmount the underlying filesystem? On a BSD system, the unmount will fail with an EBUSY error. But, on Linux, "somebody decided to be helpful" and thought it would be a friendly gesture to automatically stop the accounting and allow the unmount to proceed. This policy seems useful, but there is a catch: it creates a situation where an open file on a filesystem does not actually make that filesystem busy. That has led to a lot of interesting races dating back to 2000 or so; it is, he said, a "massive headache."
This mechanism has now been ripped out of the kernel. In its place is a mechanism by which an object can be added to a vfsmount structure (which represents a mounted filesystem); that object supports only one method: kill(). These "pin" objects hang around until the final reference to the vfsmount goes away, at which point each one's kill() function is called. It thus is a clean mechanism for performing cleanup when a filesystem goes away.
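The life cycle of these pin objects can be illustrated with a toy model. The Python sketch below is not the kernel's code (the real implementation is C and differs in detail); the class names `Mount` and `AccountingPin` and the methods `get()`, `put()`, and `add_pin()` are hypothetical stand-ins chosen for illustration.

```python
class Mount:
    """Toy stand-in for a vfsmount: a reference count plus attached pins."""

    def __init__(self, name):
        self.name = name
        self.refs = 1
        self.pins = []

    def get(self):
        # Take an additional reference to the mount.
        self.refs += 1

    def add_pin(self, pin):
        # A pin is any object that offers a kill() method.
        self.pins.append(pin)

    def put(self):
        # Drop one reference; when the last one goes, kill every pin.
        self.refs -= 1
        if self.refs == 0:
            pins, self.pins = self.pins, []
            for pin in pins:
                pin.kill()


class AccountingPin:
    """Pretend process-accounting state tied to the lifetime of a mount."""

    def __init__(self, log):
        self.log = log
        self.active = True

    def kill(self):
        self.active = False
        self.log.append("accounting stopped")
```

In this model the accounting state stays active while any reference to the mount remains, and is shut down exactly once when the final reference is dropped.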
The first use of this mechanism is to handle shutdown of BSD process accounting. But it can also be put to good use when unmounting a large tree with multiple filesystems. If one filesystem depends on another, a pin object can be placed to ensure that the cleanup work is done in the right order. This facility, found in fs/fs_pin.c, looks to be useful but, as Ted Ts'o noted, it is also completely undocumented at the moment. Al finished the session with an acknowledgment that some comments in that file would be helpful for other users.
Filesystem/block interfaces
In his session at the 2015 LSFMM Summit, Steven Whitehouse wanted to try to pull together lots of individual projects that are affecting the interfaces between the filesystem and block layers. There may be certain commonalities between them, so it would be good if the projects know about each other. When looking at making interface changes, it is also important for the storage and filesystem maintainers to consider the needs of all of these related projects rather than to just look at them piecemeal.
These projects come under one of three broad headings: dynamic devices, innovative I/O, and snapshots. Dynamic devices refers to "intelligent storage" devices; normally, a block device has the same characteristics throughout its life, but dynamic devices change capacity or other attributes over time. Innovative I/O refers to working with devices like shingled magnetic recording (SMR) and persistent memory devices as well as supporting data integrity features like checksums. Snapshots could fit in either of the other two headings, but he thought it was best to pull them out on their own.
Dynamic devices are those that have changes made to the device post-mount. For example, thin provisioning changes the capacity in the underlying devices in response to less available disk space, up to the capacity the kernel believes that it has. But dynamic devices may require a different kind of interface for error reporting so that filesystems can distinguish between temporary and permanent errors. Topology changes for multipath devices are another dynamic change. If Btrfs experiences checksum failures while trying to read data, it may want to be able to ask for a different mirror or to change the path to the data. He asked, what information is needed from the block layer and how do the filesystems get that information?
There is a difference between informational reporting and error reporting, James Bottomley said. One contains hints that filesystems might want to use, while the other means the filesystem needs to do something about the event. Another question is how applications would want to get that kind of information, Ted Ts'o said, though it is clear that most applications won't change to take advantage of this kind of information.
Hannes Reinecke said that there have been some attempts to use udev notifications to provide information to user space. The problem with that is there is no device information available for udev to attach the information to. Even if the information is available, there needs to be a way to transport it, he said.
But it is the filesystems that really need to know about changes in the block layer, Ts'o said. Maybe there needs to be a callback added to struct super that the block layer can make use of to alert filesystems to changes. Even a simple "something changed" message would be helpful.
There are a variety of new features that require different ways to communicate between the filesystems and the block layer, Whitehouse said in transitioning to the innovative I/O topic. SMR devices need to provide ways for the filesystem to find out where the write pointer is and the layout of the zones in the device. Data integrity (e.g. DIF/DIX) requires ways for checksums and/or checksum failures to be communicated between the block and filesystem layers. If the filesystem wants to read from a specific disk in a mirror, to provide hints to the block layer, or to initiate a copy offload operation, there needs to be an interface available to do so. He wondered if the same sorts of mechanisms could be used to support all of these kinds of operations.
The short answer would seem to be "no". Ts'o said that there are too many differences for all of those to be able to share much. But too much specificity in the interfaces won't be good either, Ric Wheeler said. Sometimes the right thing to request is for the block layer to "do something different than you did last time" when there is a problem, he continued. Christoph Hellwig agreed that "try again" can be the right approach for both disk failures and transport failures, while Dave Chinner suggested that adding some kind of "retry as hard as you can" operation might be helpful.
The problem comes back to error reporting and distinguishing transient from permanent errors, which is a recurring topic in the storage and filesystems tracks at LSFMM. The kernel is currently limited to the POSIX-defined errors, Chinner said. What is really needed are more fine-grained errors that give more information than just ENOSPC. A proper error interface from the block layer is really needed, he said.
Getting consistency between the snapshot operations across various devices was Whitehouse's last topic. Trying to take a filesystem snapshot on a single device is much different than doing so on a thin-provisioned array that may involve multiple underlying block devices. There are different granularities for snapshots as well. It could be that a single-file snapshot or application snapshot (which might include files on multiple filesystems) is desired.
For this topic, though, there was little time for discussion. Whitehouse was able to at least introduce the problem a bit for consideration down the road.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Overlayfs issues and experiences
David Howells and Mike Snitzer led a discussion at the 2015 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit about the overlay filesystem (overlayfs), which is the union filesystem implementation that was adopted into the kernel in 3.18. There are a number of problems that need to be addressed for this new filesystem.
Howells was first up. He noted that overlayfs does not play nicely with security technologies that use object labels (e.g. SELinux). There are a couple of problems that he reported back in November. Overlay filesystems can have three different inodes for any given file, one in the overlayfs itself, one in the read-only lower layer, and another in the writable upper layer if the file has been written (and, thus, copied up to the upper layer). The problem for SELinux and others regards which of the three different possible versions of the inode (i.e. lower, upper, or overlay) is visible to them. That affects what security labels will be seen on the file. But those problems have largely been solved at this point.
There are two more problems, for file locking and fanotify, that still need to be addressed. The first is a Jeff Layton problem, while the other is an Eric Paris problem, Howells said with a chuckle. Layton was present, so the discussion turned to locking. What happens when an overlayfs file that has not been written to is locked (so the lock must be placed on the lower layer), then written to so that it must be copied up from the lower layer into the upper? Should the lock be copied up too? And if there are two overlays referring to the same underlying file, each of which has a copied-up version of the file, where should the lock go then?
As it turns out, the fanotify problems are similar. If an application requests notifications on an overlayfs file that has not been written to, the notification must get placed on the lower layer inode. If the notifications are not copied up when the file gets written, then applications won't get notified even if changes are being made to the file.
James Bottomley suggested that the semantics for file locking and fanotify need to be worked out before a mechanism to satisfy them can be proposed. Ted Ts'o was uncomfortable having different behavior based on whether the file was part of an overlayfs. Howells noted that things can get worse than he had described when you add in network filesystems (e.g. SMB or NFS) as the overlayfs layers. He noted that he had posted a message in January with all of the problems he could think of, but "there are probably more".
Layton suggested returning ENOLCK when trying to lock files in an overlayfs until the semantics could be worked out and implemented. Al Viro noted that with overlayfs, a file opened for reading may have a different inode number than one opened for writing. That could be a problem for a number of different applications. The classic example is a mail user agent, Viro said, but some editors also care.
Bottomley said that there is a need to avoid surprise semantics. To do that, the developers need to know what actually matters and what users care about. POSIX semantics were broken for overlayfs, but does that really harm real users? "There is a limit to how far we need to dig to find problems that people are not complaining about", he said.
One of the users of overlayfs is Docker, so Snitzer wanted to look at that use case. Docker tried Btrfs, but didn't like it, he said. The project can't use block-based solutions, such as those based on device mapper and thin provisioning (thinp) that most Linux distributions use. The reason behind that is "lame" in Snitzer's view. Essentially, the project wants its Go programs to be built once (on Ubuntu), then to be able to be run on any other distribution forever, which requires statically built binaries. But there is no static library available for udev, which means that the devicemapper graph driver cannot be used. That is a political, not a technical, issue, Snitzer said.
The big reason that Docker has switched to overlayfs is to gain the memory efficiency that comes from pages in the page cache being shared between the containers. That doesn't happen with thinp currently, but Snitzer said that Dave Chinner has some ideas for using XFS on top of thinp to achieve it.
Chinner spoke up to describe the problem, which is that there might be a hundred containers running on a system all based on a snapshot of a single root filesystem. That means there will be a hundred copies of glibc in the page cache because they come from different namespaces with different inodes, so there is no sharing of the data. Basically, he said, there needs to be a kind of page cache deduplication to fix the problem.
Bottomley noted that it was a similar problem to the one that KSM tries to solve. KSM basically uses hashes of the contents of various pages of memory to share memory better between virtual machines. For containers, the main need is to deduplicate the page cache specifically. Bottomley said that the company he works for, Parallels, has a solution to the deduplication problem that does not require hashing each page, but that it is, currently at least, proprietary. Sharing of memory between containers is something that many are looking for, though, so there was some discussion of how to do it without the overhead that KSM incurs. That is where things wound down.
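The hash-based sharing idea mentioned in this discussion can be sketched as a content-addressed page store. This Python model is an illustration only; it is neither KSM's implementation nor the proprietary Parallels approach, and the `DedupPageStore` class and its methods are hypothetical. Hashing every inserted page is exactly the overhead the participants hoped to avoid.

```python
import hashlib


class DedupPageStore:
    """Toy page cache that keeps one copy of each distinct page content."""

    def __init__(self):
        self.pages = {}    # content digest -> the single shared copy
        self.mapping = {}  # (inode, page index) -> content digest

    def insert(self, inode, index, data):
        # Hash the page contents; identical pages share one stored copy.
        digest = hashlib.sha256(data).digest()
        self.pages.setdefault(digest, data)
        self.mapping[(inode, index)] = digest

    def lookup(self, inode, index):
        return self.pages[self.mapping[(inode, index)]]

    def unique_pages(self):
        return len(self.pages)
```

In this model, a hundred containers that each map the same library page through a different inode end up referring to a single stored page.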
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Asynchronous buffered read operations
A problem that Milosz Tanski has run into throughout his career is part of what brought him to the 2015 Linux Storage, Filesystem, and Memory Management Summit. Some reads can be satisfied immediately from the page cache, while others require an expensive I/O. Distinguishing between the two can lead to more efficient programs. He has implemented a new mode for read() that does so, though it requires adding a new system call.
The problem typically occurs in low-level network applications, Tanski said. Not every application can use sendfile(). For example, applications using TLS modify the data to encrypt it before sending it, which means they can't use sendfile(). So they must do their own copies but, depending on whether the data is in the page cache, some will be "slow", while others are "fast". Programs that want to do asynchronous disk I/O often just use O_DIRECT and replicate the page cache concept in user space. That way they can track the contents of the cache to determine if an I/O can be satisfied quickly or not.
The normal workaround for these problems is to use thread pools for the I/O, but that pattern "kinda sucks". The latency added due to synchronization between the threads is not insubstantial. It is also often the case that requests that could be satisfied quickly get stuck behind slower requests.
So, with the help of Christoph Hellwig, he has implemented preadv2(), which is like preadv() except that there is a new flags argument (which, as was pointed out by several attendees, really should have been added with preadv()). There is only one flag available in his patches: RWF_NONBLOCK (which could also have been called RWF_NOWAIT, he said). That flag will cause reads to succeed only if the data is already in the page cache; otherwise it will return EAGAIN.
Basically, that flag allows reads from the network loop to skip the queue if the data needed is already available in the page cache. It essentially provides a fast path with minimal changes to the user-space application. He has been using it with an internal application and it works well.
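The fast-path pattern described here can be sketched in Python, whose os.preadv() wrapper accepts per-call flags on Linux. This is a sketch under assumptions: the patches discussed call the flag RWF_NONBLOCK, while Python exposes one named RWF_NOWAIT, and treating the two as equivalent is an assumption. The helper names read_cached_or_defer() and drain() are hypothetical.

```python
import os


def read_cached_or_defer(fd, length, offset, slow_queue):
    """Return cached bytes right away, or queue the request and return None."""
    nowait = getattr(os, "RWF_NOWAIT", None)  # absent on some platforms
    if nowait is not None and hasattr(os, "preadv"):
        buf = bytearray(length)
        try:
            n = os.preadv(fd, [buf], offset, nowait)
            return bytes(buf[:n])  # may be short if only part was cached
        except OSError:
            # EAGAIN (data not cached) or an unsupported flag: fall back to
            # the slow path, which will surface any genuine error itself.
            pass
    slow_queue.append((fd, length, offset))
    return None


def drain(slow_queue):
    """Slow path: ordinary blocking reads, e.g. run from a worker thread."""
    return [os.pread(fd, length, offset) for fd, length, offset in slow_queue]
```

A network loop would call read_cached_or_defer() inline and hand only the deferred requests to its thread pool, so that reads already in the page cache never wait behind slow ones.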
His patches drew one major comment, he said, which was about using functionality like that in fincore() to get a list of the pages of a file that are resident in the page cache. The problem with that is a race condition where a page that was present at the time of the check is no longer there when the read is performed, which puts that read back into the slow lane.
He has also tested the patches with Samba, where they reduce the latency significantly. For his internal application, which is a large, columnar data store using the Ceph filesystem, he got 23% lower response times. The average response times dropped by 200ms, he said.
There have been some objections to adding another system call, Tanski said. James Bottomley was not particularly concerned about that, since the new system call is just adding a flag argument that should have been there already. Hellwig added that it required a new system call just to get the flag in, which is not an unusual situation in recent times.
Hellwig has also implemented pwritev2() as part of the patch set to add a flag argument for the write() side. There are no write flags included in the patch, though some will be added as separate patches down the road. There are some potential user-space uses for flags for writes, including a "high priority" flag and a non-blocking flag that could be used for logging, Hellwig said.
No one in the room seemed opposed to the idea. It seems likely that the two new system calls could show up as early as the 4.1 kernel.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Handling 32KB-block drives
There have been requests from certain disk drive manufacturers for the kernel to support 32KB block (or sector) sizes, James Bottomley said to kick off the discussion at a combined storage and filesystem session at the 2015 LSFMM Summit. He noted that the page cache could only handle 4KB granularity, and he didn't see that changing any time soon, which means that 32KB block sizes cannot be directly supported. But he wondered if aligning and sizing requests for 32KB boundaries most of the time would work for the disk drives.
Dave Chinner said that XFS can already handle making requests that are aligned and sized correctly, but Bottomley asked if that included metadata reads and writes. Metadata is the biggest problem, Bottomley said. Shorter writes can be supported by doing a read-modify-write (RMW) underneath the covers, in the filesystem, block layer, or in the disk itself.
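The read-modify-write operation mentioned here can be illustrated with a toy device that accepts only whole 32KB blocks. The following Python sketch is a model, not kernel code; `BlockDevice` and `write_rmw()` are hypothetical names, and which layer would actually perform the RMW was left open in the discussion.

```python
BLOCK = 32 * 1024  # device block size in bytes


class BlockDevice:
    """Toy device that reads and writes only whole, aligned 32KB blocks."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK) for _ in range(nblocks)]
        self.reads = 0
        self.writes = 0

    def read_block(self, n):
        self.reads += 1
        return self.blocks[n]

    def write_block(self, n, data):
        if len(data) != BLOCK:
            raise ValueError("device accepts whole blocks only")
        self.writes += 1
        self.blocks[n] = bytes(data)


def write_rmw(dev, offset, data):
    """Write arbitrary bytes via read-modify-write of each touched block."""
    pos = 0
    while pos < len(data):
        blk, start = divmod(offset + pos, BLOCK)
        chunk = min(BLOCK - start, len(data) - pos)
        if chunk == BLOCK:
            # Whole block is being replaced: no read is needed.
            buf = bytearray(data[pos:pos + BLOCK])
        else:
            # Partial block: read it, patch the changed bytes, write it back.
            buf = bytearray(dev.read_block(blk))
            buf[start:start + chunk] = data[pos:pos + chunk]
        dev.write_block(blk, buf)
        pos += chunk
```

A single 4KB write costs one block read and one block write in this model, which is the hidden overhead that some attendees worried about.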
Support for 4KB disk sectors, instead of the traditional 512-byte sectors, was added to Linux long ago, Ric Wheeler said. There are disk drives with 4KB logical and physical sectors out there now, Bottomley added. But that change matched up with the 4KB Linux page size. As Ted Ts'o pointed out, the page cache will need to be able to evict 4KB pages, which means that something will need to do an RMW operation on disks with larger block sizes.
Chris Mason pointed out that even if all filesystems had changes made in their data paths to do all I/O in 32KB chunks, and those changes were ready for the 4.1 kernel (which is, of course, only a thought experiment), it would be years before the code is in the hands of users. It would take at least a year before the enterprise distributions pick up the changes and at least another year before users are comfortable switching. Given that the disk drive makers want support now, it would make sense for them to add emulation of 512-byte sectors, as they did with the 4KB drives, so no changes are required of the kernel.
Christoph Hellwig agreed, noting that virtual-memory eviction has various corner cases that will require page-sized writes. Chinner was also on board with that, saying that the "easy solution is to fix it in the drive". That is also true for supporting shingled magnetic recording (SMR) drives, he continued.
Bottomley asked about ext4 support for doing 32KB I/O. Ts'o said that it would require some work but that it could be done. The same is true for Btrfs, Mason said. "We're all wrong but in slightly different ways", he said of Linux filesystem support. Ts'o said that there would need to be support added to the virtual-memory subsystem to support 32KB I/O. The filesystems could do their own RMW to ensure the full 32KB was in the cache when doing writes.
Chinner asked about workloads that generate lots of small files. Bottomley said those would essentially waste an additional 28KB per file. Each would require an RMW operation as well, which might not perform all that well for some workloads.
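The 28KB figure follows from rounding a small file up to one block under each block size. A short Python check of that arithmetic, assuming a file small enough to fit in a single 4KB block (the helper name `allocated()` is hypothetical):

```python
def allocated(size, block):
    """Space consumed by a file of `size` bytes, rounded up to whole blocks."""
    return -(-size // block) * block  # ceiling division, then scale


# A file of up to 4KB occupies one block in either layout.
small_file = 1000
extra = allocated(small_file, 32 * 1024) - allocated(small_file, 4 * 1024)
# extra is 32768 - 4096 = 28672 bytes, i.e. 28KB per small file
```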
There was a suggestion that having 4KB emulation (rather than 512-byte emulation) would be better, but Chinner called it "immaterial". There are all kinds of "mapping tricks" already done by SSDs, so any emulation would essentially be the same. SSD makers won't even say what the sector size is for those devices, Bottomley said. But Chinner said that he didn't care and didn't really want to know. Some were concerned about the performance implications of hiding RMW operations in the drive, however.
One way to support larger block sizes in the page cache would be to move to larger pages throughout the kernel. The last time the idea of larger page sizes was raised with the memory management (MM) folks, they were not happy with the idea, Bottomley said. He wondered if it was worth raising the issue on day two of the summit in a plenary session. But Ric Wheeler said that the topic was raised in New Orleans (in 2013) and he didn't think the MM developers were "adamantly opposed" to the idea, just that no one was working on it.
But, as Chinner pointed out, 32KB is not likely to be the end of the line. Even if the page size were increased to 32KB, disk drive manufacturers will someday want 128KB or 256KB (or beyond) for the block size. So a solution that is not dependent on the page size of the system is needed. Using vmalloc() allocations rather than contiguous allocations might help. Compound pages might also be part of any eventual solution.
In the end, Bottomley summed up the discussion by saying that filesystems could "pull tricks" to make most I/O 32KB-friendly, but would need help from the MM subsystem to have it all be aligned correctly. Given the time frames, it would seem that drive makers need to do some kind of emulation for now.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Filesystem support for SMR devices
Two back-to-back sessions at the 2015 Linux Storage, Filesystem, and Memory Management Summit looked at different attempts to support Linux filesystems on shingled magnetic recording (SMR) devices. In the first, Hannes Reinecke gave a status report on some prototyping he has done to support SMR in Linux. The second was led by Adrian Palmer of Seagate about a project to port the ext4 filesystem to host-managed SMR devices.
Reinecke described some prototyping he has done in the block layer to support SMR. Those devices have a number of interesting attributes that require code in the kernel to support. For example, SMR devices have multiple zones, some of which are normal random-access disk zones, while others must be written to sequentially. He has been looking specifically at supporting host-managed SMR devices, which require that the host never violate the sequential-write restriction in those types of zones.
SMR drives disallow I/O that spans zones, Reinecke said, which means that I/O operations need to be split at those boundaries. The zone layout could have a different size for each of the different zones, though none of the drives currently does that. To support that possibility, though, he used a red-black tree to track all of the zones. The current SMR specification allows for deferred lookup of some of the zone information, so the tree could just be partially filled for devices with lots of irregular zones.
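Splitting a request at zone boundaries might look roughly like the following sketch. It is not Reinecke's code: a sorted list with binary search stands in for the red-black tree, and the function name and the sector-based layout are assumptions made for illustration.

```python
from bisect import bisect_right


def split_at_zones(zone_starts, device_end, start, length):
    """Split the request [start, start + length) into per-zone pieces.

    zone_starts is a sorted list of the first sector of each zone (the
    first entry must be 0); zones may have different sizes. Returns a
    list of (zone_index, piece_start, piece_length) tuples.
    """
    if start < 0 or length < 0 or start + length > device_end:
        raise ValueError("request lies outside the device")
    pieces = []
    pos, end = start, start + length
    while pos < end:
        # Find the zone containing pos: the last zone starting at or before it.
        idx = bisect_right(zone_starts, pos) - 1
        if idx + 1 < len(zone_starts):
            zone_end = zone_starts[idx + 1]
        else:
            zone_end = device_end
        piece_end = min(end, zone_end)
        pieces.append((idx, pos, piece_end - pos))
        pos = piece_end
    return pieces
```

With irregular zones starting at sectors 0, 100, 250, and 300, a 170-sector request starting at sector 90 is split into three pieces, one per zone touched.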
Ted Ts'o suggested that supporting "insane drives" that have a variety of zone sizes might use a different data structure. That way, the majority of drives that have a straightforward layout could have all of that information available in kernel memory. He was concerned that there might be I/O performance degradation when issuing the "report zones" command once the device has been mounted.
There is also a question about "open zones" and the maximum number of open zones. Reinecke said that it is a topic that is still under discussion among the drive makers. From the LSFMM discussion it seems clear that there is no agreement on what an open zone is. Some believe that any partially filled zone qualifies, while to others it means zones that are simultaneously available to write to. In addition, the maximum may range from the four to eight that Martin Petersen has heard to the 128 that the drive makers have proposed.
In fact, someone from one of the storage vendors asked what the kernel developers would like the maximum to be. The reply was, not surprisingly, "all of them". Reinecke said that he is lobbying that "zone control" (maximum number of open zones) be optional and that any I/O that violates the maximum open zones should be allowed, possibly with a performance penalty. Ts'o agreed with that, saying that writing to one more zone than is allowed must not cause an I/O error, though adding some extra latency would be acceptable. Reinecke said that he had hoped to avoid the whole topic of open zones "because it is horrible".
Reinecke then moved back to his prototype work. He noted that sequential writes must be guaranteed. Each sequential zone has its own write pointer, which is where the next write for that zone must be. That "sort of works" using the NOP I/O scheduler, since it just merges adjacent writes. If out-of-order writes from multiple tasks are encountered, they can be requeued at the tail of the queue. The queue size must be monitored, he said, since if it never gets smaller, the I/O is making no progress, which should cause an I/O error.
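The requeue-at-tail idea, along with the progress check, can be modeled in a few lines. This is a toy sketch for a single zone with hypothetical names; it ignores the preemption and ordering concerns raised in the session.

```python
from collections import deque


def dispatch_sequential(writes, write_pointer=0):
    """Dispatch (lba, nblocks) writes to one sequential zone in order.

    A write that does not start at the write pointer is requeued at the
    tail. If every queued write has been requeued without any dispatch
    in between, no progress is possible and an I/O error is raised.
    """
    queue = deque(writes)
    dispatched = []
    stalled = 0  # consecutive requeues since the last successful dispatch
    while queue:
        lba, nblocks = queue.popleft()
        if lba == write_pointer:
            dispatched.append((lba, nblocks))
            write_pointer += nblocks
            stalled = 0
        else:
            queue.append((lba, nblocks))
            stalled += 1
            if stalled >= len(queue):
                raise OSError("no write matches the write pointer")
    return dispatched
```

Writes arriving as blocks 8, 16, and 0 are dispatched as 0, 8, 16; if the write for block 0 never arrives, the queue stops shrinking and the error is raised.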
But Dave Chinner said that once a filesystem has allocated blocks to different tasks, it must then guarantee an ordering of those writes "all the way down". The only way to do that is to serialize the I/O to the zone once the allocation has been done. Reinecke said that requeueing at the tail can solve that problem, but Chinner said that in a preemptible kernel that won't work. "Sequential I/O is basically synchronous I/O", he said.
There is a philosophical question about whether it makes sense to try to put a regular filesystem on SMR devices, Ts'o said. Chinner said that SMR is really a firmware problem. Actually solving the problems of SMR at the filesystem level is not really possible, he said.
Reinecke wondered if the host-managed SMR drives would actually sell. Petersen piled on, noting that the flash-device makers had made lots of requests for extra code to support their devices, but that eventually all of those requests disappeared when those types of devices didn't sell. Reinecke's conclusion was that it may not make a lot of sense to try to make an existing filesystem work for host-managed SMR drives.
Ext4 on host-managed SMR
On the other hand, though, Palmer is quite interested in doing just that. He works on host-managed drives and is trying to get ext4 working on them.
He started by looking at block groups as a way to track the zones, but ran into a problem with that idea. Zones are 256MB in length, but a 4KB block only has enough bits to address 128MB worth of blocks, so he would need to use 8KB blocks, which is a sizable change. He also noted that O_DIRECT I/O was going to be a problem for host-managed SMR, without really going into any details.
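The arithmetic behind that limit is easy to check, assuming (as in ext4) that each block group's allocation bitmap occupies a single filesystem block with one bit per block. The function name is illustrative only.

```python
MIB = 1024 * 1024


def block_group_span(block_size: int) -> int:
    """Bytes addressable by a one-block bitmap with one bit per block."""
    bits_per_bitmap = block_size * 8
    return bits_per_bitmap * block_size


span_4k = block_group_span(4096)   # 128MB: too small for a 256MB zone
span_8k = block_group_span(8192)   # 512MB: large enough for a 256MB zone
```

Doubling the block size quadruples the span, since both the number of bits in the bitmap and the size of each block double.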
As Reinecke said earlier, the order of writes to the disk is critical for host-managed drives. Out-of-order writes may not be written at all. Palmer looked at putting the code to keep write operations sequential into either the I/O scheduler or the block device. For now, the block device seems to be the right place.
Ts'o said that he is mentoring a student who is working on making the ext4 journal writes more SMR-friendly. But Chinner is worried about fsck. A corrupt block in the middle of a sequential zone may need to be rewritten, but it can't be overwritten in place. Ts'o suggested a 256MB read-modify-write with a chuckle.
One attendee noted that the drive makers want to start with host-aware drives (which will perform better with mostly sequential writes to those zones, but will not fail out-of-order writes) to get them working. That will allow the companies to learn from the market how much conventional space (zones without the sequential requirement) and overprovisioning is required.
Chinner suggested that some of that conventional space might be used for metadata sections. Another attendee cautioned that SSD makers are also looking at zone block devices, so it may be more than just SMR drives that need this kind of support. But Chinner said that the kernel developers had "more than enough" on their hands rewriting filesystems for use on SMR.
Another way to approach the problem, Chinner said, might be to have a new kind of write command for disks (perhaps "write allocate") that would return the logical block address (LBA) where the data was written, rather than getting the LBA from the filesystem or block layers with the write. That way, the drive would decide where to place the data and return that to the operating system. One attendee said that the drive vendors would probably welcome a discussion about what the API to these drives would look like.
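No such command exists yet, so any illustration is speculative. The following toy model only shows the shape of the idea: the caller hands over data and the "drive" picks the location and reports the LBA back. The class, its first-fit placement policy, and the method name are all invented for this sketch.

```python
class ToyZonedDrive:
    """Toy drive in which every zone must be written sequentially."""

    def __init__(self, zone_count: int, zone_blocks: int):
        self.zone_blocks = zone_blocks
        self.write_pointers = [0] * zone_count  # blocks used in each zone
        self.blocks = {}                        # lba -> data

    def write_allocate(self, payload_blocks):
        """Store the blocks wherever the drive chooses; return the start LBA."""
        count = len(payload_blocks)
        for zone, used in enumerate(self.write_pointers):
            if used + count <= self.zone_blocks:
                lba = zone * self.zone_blocks + used
                for i, data in enumerate(payload_blocks):
                    self.blocks[lba + i] = data
                self.write_pointers[zone] = used + count
                return lba
        raise OSError("no zone has room for this write")
```

Because the drive always appends at a zone's write pointer, the host can never issue an out-of-order write; it simply records the returned LBA in its own metadata.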
There was some discussion on how to proceed with a new command, which would (eventually) need to be handled by the T10 committee (for SCSI interface standards). Petersen (who represents Linux on T10) noted that it is difficult to change the standard. An attendee from one of the drive makers thought it might be possible to prototype the idea to try it out completely separate from the standards process.
That is where the conversation trailed off, but the "write allocate" idea seemed to generate some interest. Whether that translates into action (or standards) remains to be seen.
After the summit, on March 16, Dave Chinner posted a pointer to a design document on supporting XFS on host-aware SMR drives.
Testing power failures
Trying to replicate failures that can happen in filesystems when the power suddenly fails was the topic of a discussion led by Josef Bacik at the 2015 LSFMM Summit. He has been working on a tool based on the device mapper to try to make power-failure scenarios more reproducible, but he was wondering if he should continue that work or shift to something else.
In Btrfs, he believes there are ways that the balancing operation can lead to a corrupted filesystem if there is a power failure at just the "right" moment. He has not caught it yet, but the problem has inspired the development of a new tool. It uses the device mapper and two disks, one of which is the normal filesystem and the other keeps a log of all the writes that go to the first disk. The log disk keeps a list of all the write operations that have completed, which is updated with each flush operation to the first disk.
The tool has been integrated into xfstests and works for ext4 and XFS as well as Btrfs. It does take a good bit longer on those other two filesystems, but it works. The idea is to be able to test "weird interactions", where the filesystem is fine at point A and at point B but, if the power fails in between those points, the filesystem gets corrupted. Bacik asked: does this log approach make sense?
Someone asked about using fault injection instead. But Bacik wants these tests to be generic for any filesystem without adding code to the kernel. Logging allows for replaying the problem. It is also finer-grained, as you can check the filesystem consistency at each flush.
He would like others to look at his assumptions to help ensure he isn't off base. He is only logging information for write operations that have completed. The tool drops all writes that have not completed at flush time.
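Those assumptions can be captured in a small model. This is not Bacik's device-mapper target; the class and method names are hypothetical, and "disks" are just dictionaries mapping block numbers to data.

```python
class WriteLogger:
    """Log completed writes, batched per flush, for later replay."""

    def __init__(self):
        self.in_flight = {}  # request id -> (block, data), not yet completed
        self.completed = []  # writes completed since the last flush
        self.log = []        # one list of (block, data) per flush
        self._next_id = 0

    def submit(self, block, data):
        request_id = self._next_id
        self._next_id += 1
        self.in_flight[request_id] = (block, data)
        return request_id

    def complete(self, request_id):
        self.completed.append(self.in_flight.pop(request_id))

    def flush(self):
        # Only completed writes reach the log; anything still in flight
        # at flush time is dropped, as described above.
        self.log.append(self.completed)
        self.completed = []
        self.in_flight.clear()


def replay(log, flushes):
    """Rebuild the disk contents as of the given number of flushes."""
    disk = {}
    for batch in log[:flushes]:
        for block, data in batch:
            disk[block] = data
    return disk
```

Replaying the log up to each flush in turn yields every state the disk could have been in after a power failure at a flush boundary, and a filesystem checker can be run against each one.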
There was a suggestion that blktrace could be changed to log the data that is being written. Bacik seemed to be leaning toward dropping his tool in favor of that, but Chris Mason wondered about maintaining the ordering of writes using blktrace. One attendee said that blktrace has sequence numbers that are maintained per-CPU but are not synchronized, so the order of the writes may not be preserved. Since the device mapper does preserve that order, Bacik concluded that he would finish up that tool, rather than switch.
Reservations for must-succeed memory allocations
When the schedule for the 2015 Linux Storage, Filesystem, and Memory Management Summit was laid out, its authors optimistically set aside 30 minutes on the first day for the thorny issue of memory-allocation problems in low-memory situations. That session (covered here) didn't get past the issue of whether small allocations should be allowed to fail, so the remainder of the discussion, focused on finding better solutions for the problem of allocations that simply cannot fail, was pushed into a plenary session on the second day.
Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag, which is used to mark allocation requests that must succeed at any cost. But doing so, it turns out, just drives developers to put infinite retry loops into their own code rather than using the allocator's version. That, he noted dryly, is not a step forward. Retry loops spread throughout the kernel are harder to find and harder to fix, and they hide the "must succeed" nature of the request from the memory-management code.
Getting rid of those loops is thus, from the point of view of the memory-management developers, a desirable thing to do. So Michal asked the gathered developers to work toward their elimination. Whenever such a loop is encountered, he said, it should just be replaced by a __GFP_NOFAIL allocation. Once that's done, the next step is to figure out how to get rid of the must-succeed allocation altogether. Michal has been trying to find ways of locating these retry loops automatically, but attempts to use Coccinelle to that end have shown that the problem is surprisingly hard.
Johannes Weiner mentioned that he has been working recently to improve the out-of-memory (OOM) killer, but that goal proved hard to reach as well. No matter how good the OOM killer is, it is still based on heuristics and will often get things wrong. The fact that almost everything involved with the OOM killer runs in various error paths does not help; it makes OOM-killer changes hard to verify.
The OOM killer is also subject to deadlocks. Whenever code requests a memory allocation while holding a lock, it is relying on there being a potential OOM-killer victim task out there that does not need that particular lock. There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock. On such systems, a low-memory situation that brings the OOM killer into play may well lead to a full system lockup.
Rather than depend on the OOM killer, he said, it is far better for kernel code to ensure that the resources it needs are available before starting a transaction or getting into some other situation where things cannot be backed out. To that end, there has been talk recently of creating some sort of reservation system for memory. Reservations have downsides too, though; they can be more wasteful of memory overall. Some of that waste can be reduced by placing reclaimable pages in the reserve; that memory is in use, but it can be reclaimed and reallocated quickly should the need arise.
James Bottomley suggested that reserves need only be a page or so of
memory, but XFS maintainer Dave Chinner was quick to state that this is not
the case. Imagine, he said, a transaction to create a file in an XFS
filesystem. It starts with allocations to create an inode and update the
directory; that may involve allocating memory to hold and manipulate
free-space bitmaps. Some blocks may need to be allocated to hold the
directory itself; it may be necessary to work through 1MB of stuff to find
the directory block that can hold the new entry. Once that happens, the
target block can be pinned.
This work cannot be backed out once it has begun. Actually, it might be possible to implement a robust back-out mechanism for XFS transactions, but it would take years and double the memory requirements, making the actual problem worse. All of this is complicated by the fact that the virtual filesystem (VFS) layer will have already taken locks before calling into the filesystem code. It is not worth the trouble to implement a rollback mechanism, he said, just to be able to handle a rare corner case.
Since the amount of work required to execute the transaction is not known ahead of time, it is not possible to preallocate all of the needed memory before crossing the point of no return. It should be possible, though, to produce a worst-case estimate of memory requirements and set aside a reserve in the memory-management layer. The size of that reserve, for an XFS transaction, would be on the order of 200-300KB, but the filesystem would almost never use it all. That memory could be used for other purposes while the transaction is running as long as it can be grabbed if need be.
XFS has a reservation system built into it now, but it manages space in the transaction log rather than memory. The amount of concurrency in the filesystem is limited by the available log space; on a busy system with a large log he has seen 7,000-8,000 transactions active at once. The reservation system works well and is already generating estimates of the amount of space required; all that is needed is to extend it to memory.
A couple of developers raised concerns about the rest of the I/O stack; even if the filesystem knows what it needs, it has little visibility into what the lower I/O layers will require. But Dave replied that these layers were all converted to use mempools years ago; they are guaranteed to be able to make forward progress, even if it's slow. Filesystems layered on top of other filesystems could add some complication; it may be necessary to add a mechanism where the lower-level filesystem can report its worst-case requirement to the upper-level filesystem.
The reserve would be maintained by the memory-management subsystem. Prior to entering a transaction, a filesystem (or other module with similar memory needs) would request a reservation for its worst-case memory use. If that memory is not available, the request will stall at this point, throttling the users of reservations. Thereafter, a special GFP flag would indicate that an allocation should dip into the reserve if memory is tight. There is a slight complication around demand paging, though: as XFS is reading in all of those directory blocks to find a place to put a new file, it will have to allocate memory to hold them in the page cache. Most of the time, though, the blocks are not needed for any period of time and can be reclaimed almost immediately; these blocks, Dave said, should not be counted against the reserve. Actual accounting of reserved memory should, instead, be done when a page is pinned.
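No code for this scheme exists yet, so the following is only a sketch of the interface as described: reserve the worst case up front, charge the reservation when a page is pinned, and flag any overrun. All names are hypothetical, and where a real implementation would stall the caller, this model simply returns None.

```python
class ReservePool:
    """Single pool of reserved pages shared by all reservation users."""

    def __init__(self, pages: int):
        self.free = pages

    def reserve(self, worst_case_pages: int):
        """Set aside the worst case, or return None where a caller would stall."""
        if worst_case_pages > self.free:
            return None
        self.free -= worst_case_pages
        return Reservation(self, worst_case_pages)


class Reservation:
    def __init__(self, pool: ReservePool, pages: int):
        self.pool = pool
        self.pages = pages
        self.pinned = 0
        self.overrun = False

    def pin_page(self):
        # Accounting happens at pin time, not when a page-cache page that
        # can be reclaimed almost immediately is first allocated.
        self.pinned += 1
        if self.pinned > self.pages:
            self.overrun = True  # a real system would log this loudly

    def release(self):
        """End of transaction: return the whole reservation to the pool."""
        self.pool.free += self.pages
        self.pages = 0
```

Throttling happens at reserve() time, before the transaction has passed its point of no return, rather than in the middle of it.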
Johannes pointed out that all reservations would be managed in a single, large pool. If one user underestimates their needs and allocates beyond their reservation, it could ruin the guarantees for all users. Dave answered that this eventuality is what the reservation accounting is for. The accounting code can tell when a transaction overruns its reservation and put out a big log message showing where things went wrong. On systems configured for debugging it could even panic the system, though one would not do that on a production system, of course.
The handling of slab allocations brings some challenges of its own. The way forward there seems to be to assume that every object allocated from a slab requires a full page allocation to support it. That adds a fair amount to the memory requirements — an XFS transaction can require as many as fifty slab allocations.
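As a rough illustration of what that assumption costs, take a 4KB page size (an assumption; the page size was not stated in this part of the discussion) and the fifty-allocation figure given above.

```python
PAGE_SIZE = 4 * 1024         # assumed page size
SLAB_ALLOCATIONS = 50        # upper figure quoted for an XFS transaction

# One full page charged per slab object, regardless of the object's size.
slab_reserve_bytes = SLAB_ALLOCATIONS * PAGE_SIZE
slab_reserve_kb = slab_reserve_bytes // 1024   # 200
```

Under those assumptions the slab portion of the reservation alone comes to 200KB per transaction.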
Many (or most) transactions will not need to use their full reservation to complete. Given that there may be a large number of transactions running at any given time, it was suggested, perhaps the kernel could get away with a reservation pool that is smaller than the total number of pages requested in all of the active reservations. But Dave was unenthusiastic, describing this as another way of overcommitting memory that would lead to problems eventually.
Johannes worried that a reservation system would add a bunch of complexity to the system. And, perhaps, nobody will want to use it; instead, they will all want to enable overcommitting of the reserve to get their memory and (maybe) performance back. Ted Ts'o also thought that there might not be much demand for this capability; in the real world, deadlocks related to low-memory conditions are exceedingly rare. But Dave said that the extra complexity should be minimal; XFS, in particular, already has almost everything that it needs.
Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time. Do we really want to add the extra complexity just to make things work better on under-resourced systems? Ric Wheeler responded that we really shouldn't have a system where unprivileged users can fire off too much work and crash the box. Dave agreed that such problems can, and should, be fixed.
Even if there is a reserve, Ted said, administrators will often turn it off in order to eliminate the performance hit from the reservation system (which he estimated at 5%); they'll do so with local kernel patches if need be. Dave agreed that it should be possible to turn the reservation system off, but doubted that there would be any significant runtime impact. Chris Mason agreed, saying that there is no code yet, so we should not assume that it will cause a performance hit. Dave said that the real effect of a reservation would be to move throttling from the middle of a transaction to the beginning; the throttling happens in either case. James was not entirely ready to accept that, though; in current systems, he said, we usually muddle through a low-memory situation, while with a reservation we will be actively throttling requests. Throughput could well suffer in that situation.
The only reliable way to judge the performance impact of a reservation system, though, will be to observe it in operation; that will be hard to do until this system is implemented. Johannes closed out the session by stating the apparent consensus: the reservation system should be implemented, but it should be configurable for administrators who want to turn it off. So the next step is to wait for the patches to show up.
Heterogeneous memory management
Jérôme Glisse started an LSFMM 2015 memory-management track session on heterogeneous memory management (HMM) by noting that memory bandwidth for CPUs has increased slowly in recent years. There is little motivation for faster progress, since not many workloads sustain maximum memory bandwidth; instead, CPU access patterns are relatively random, and latency is usually the determining factor in the performance of any given workload.
When one looks at graphical processing units (GPUs), the story is a bit different. Contemporary GPUs are designed for good performance with up to 10,000 running threads; to get there, they can have a maximum memory bandwidth that exceeds CPU-memory bandwidth by a factor of ten. Even so, a good GPU can saturate that bandwidth. GPUs, in other words, can do some things extremely quickly.
Increasingly, Jérôme said, we are seeing systems where the CPU and the GPU are placed on the same die, both with access to the same memory. The GPU is useful for "light" gaming, user-interface rendering, and more. On such systems, most of the memory bandwidth is used by the GPU.
The HMM code exists to allow the CPU and GPU to share the same memory and
the same address space; it could eventually be useful for other devices
with access to memory as well. The GPU gains software capabilities similar
to those the CPU has; it runs its own page table, can incur page faults,
and more. The key is to provide a way to manage the ownership of a given
block of memory to avoid race conditions. And that is what HMM does; it
provides a way to "migrate" memory between the CPU and the GPU, with only
one side having access at any given time. If, say, the CPU attempts to
access memory that currently belongs to the GPU, it will incur a page
fault. The fault-handling code can then migrate the memory back and allow
the CPU's work to proceed.
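The ownership rule can be pictured with a toy model in which each page belongs to exactly one side and an access from the other side "faults" and migrates it. This is a conceptual sketch only; the names are invented and nothing here reflects the actual HMM patches.

```python
class SharedPage:
    """A page owned by exactly one of "cpu" or "gpu" at any time."""

    def __init__(self):
        self.owner = "cpu"
        self.migrations = 0


def access(page: SharedPage, who: str) -> bool:
    """Access the page as `who`; return True if a fault and migration occurred."""
    if who not in ("cpu", "gpu"):
        raise ValueError("unknown device")
    if page.owner == who:
        return False
    # The fault handler migrates the memory to the side that wants it.
    page.owner = who
    page.migrations += 1
    return True
```

Repeated accesses from the same side are free; it is only alternation between the CPU and the GPU that triggers migrations.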
Implementing this functionality requires the ability to keep page tables synchronized on both sides; that is done on the CPU side through the use of a memory-management unit (MMU) notifier callback. Whenever the status of a block of memory changes, the appropriate page-table invalidations can be done. There is one catch, though: to work properly, the notifier needs to be able to sleep, which is not something that MMU notifiers are currently allowed to do. That has been a sticking point for the acceptance of this patch so far.
Andrew Morton jumped in to express some concerns about the generality of this system. GPUs are changing rapidly, he said; we could easily reach a point where, five years from now, nobody is using the HMM code anymore, but it still must be maintained. Jérôme responded that he believes the system is sufficiently general to be useful for GPUs, digital signal processors, and other devices for a long time.
Jérôme finished up by saying that HMM support is needed in order to provide full, transparent GPU support to applications. The compiler projects are working on the ability to vectorize loops for execution on the GPU; when this works, applications will be able to use the GPU without even knowing about it.
Rik van Riel asked if the group had any issues with the HMM code that needed discussion. Mel Gorman asked how many people had actually read the patch set; it turned out that not many had. Rik had reviewed an older version and didn't find any real issues with it. Andrew noted that there have not been a whole lot of reviews of the HMM code in general, and there do not appear to be many other users waiting in the wings.
The session finished with some scattered discussion of various HMM details. How is the migration of anonymous pages to a device handled? The answer is that the device looks like a special type of swap file. The trick here is in handling of fork(); in this case, all of the relevant memory must be migrated back to the CPU first. Atomic access by the device is handled by mapping the relevant page(s) as read-only on the CPU; subsequent write faults look a lot like copy-on-write faults. It would be nice to be able to handle file-backed pages in the HMM system; that would require the creation of a special entry type in the page cache. That brings up a problem similar to the MMU-notifier issue: the filesystem code assumes that page-cache lookups are atomic, but, in this case, the code will need to sleep. It is not clear how to handle that one; adding HMM-specific code to each filesystem was mentioned, but that does not appear to be an appealing option.
Current issues with memory control groups
The memory controller for control groups has often been a prominent topic at the annual Linux Storage, Filesystem, and Memory Management Summit. At the 2015 event, control groups were mostly notable by their absence, suggesting that the worst of the problems have been solved. That said, there was time for a brief session where some of the remaining issues were discussed.
Initially, memory control groups ("memcgs") only tracked user-space memory. Over time, the tracking of kernel-space memory has been added, but, until recently, this feature was acknowledged to not be in particularly good shape. Vladimir Davydov spent quite a lot of time fixing it up, and things work better now. One of the biggest problems was the fact that, while the controller could track and limit kernel memory use, it had no way of reclaiming memory. So, when a particular group hit its limit, things simply came to a stop. Vladimir added per-memcg least-recently-used (LRU) lists for heavily used data structures like dentries and inodes, and kernel-space reclaim now works.
Much of the remaining discussion centered on whether administrators really need the separate kmem.limit_in_bytes knob that controls how much kernel-space memory a control group can use, or whether an overall limit for both kernel-space and user-space memory is sufficient. Michal Hocko noted that kernel-space limits are often used to throttle forking processes, a task that might be better handled in other ways. Perhaps it should be possible to apply ordinary Unix-style resource limits to control groups. Peter Zijlstra said that a number of users want that feature; it will need to be provided or people will continue to propose other control-group-based solutions.
That left the group without an answer to the question of whether a separate knob for kernel-space memory limits is needed. In the end, there were not a lot of strong feelings on the subject. It will come down to collecting the use cases and seeing whether any are strong enough to warrant adding another knob.
The final topic discussed was where the biggest holes are in the accounting of kernel memory usage. The most prominent one at this point, it would seem, is tracking the memory used for page tables. So that may be where the next round of memcg development effort is targeted.
Memory-management scalability
One of the drivers of memory-management development is scalability: performing well on ever-larger systems. So it is not surprising that scalability is a perennial discussion topic at kernel development gatherings; the 2015 Linux Storage, Filesystem, and Memory Management Summit was no exception. Andi Kleen and Peter Zijlstra led the first of two sessions on virtual memory scalability during the memory-management track at that event.
Andi started by pointing out that systems were growing, not only in the number of CPU cores available, but also in the amount of attached memory. The number of cores per NUMA node is on the rise, which is bringing out some new scalability problems.
One of the well-used scalability tactics found in the kernel is per-CPU variables; when each CPU has its own data, there can be no contention between them. But, Andi asserted, as the number of CPUs grows, it no longer makes sense to do things on a per-CPU basis. It just adds a lot of work whenever it becomes necessary to touch every CPU's version of a variable. Instead, data should be made local to groups of N cores (where N was not specified).
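The tradeoff can be seen in a simple counter that keeps one slot per group of cores. This is a sketch with made-up names; Python cannot show cache-line contention, so it only illustrates how the number of slots that must be visited shrinks as the group size grows.

```python
class GroupedCounter:
    """Counter with one slot per group of `group_size` CPUs.

    A group_size of 1 gives the classic per-CPU layout.
    """

    def __init__(self, ncpus: int, group_size: int):
        self.group_size = group_size
        nslots = (ncpus + group_size - 1) // group_size  # round up
        self.slots = [0] * nslots

    def add(self, cpu: int, value: int = 1):
        self.slots[cpu // self.group_size] += value

    def total(self) -> int:
        # Cost is proportional to the number of slots, not the number of CPUs.
        return sum(self.slots)
```

On a 64-CPU system, a group size of 8 means reading a total touches 8 slots rather than 64, at the cost of some sharing among the cores within each group.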
Christoph Lameter said that a lot of these scaling problems can be addressed by limiting subsystems to specific cores. Andi replied that this approach works great at installations where there is an experienced person configuring the system. In the absence of that person, it does not work quite so well.
Mel Gorman asked the group what other scalability problems are being experienced now. Christoph complained about I/O bandwidth; in particular, he said, he is unable to push more than about 2GB/second to a filesystem. The problems come down to locking and the handling of 4KB pages in the XFS filesystem. Writeback tends to slow things down, since a lot of CPU time is spent making it happen.
That led to a discussion of batching operations — another tried-and-true scalability technique. It was noted that the reverse-mapping code, which maintains data structures to enable the kernel to tell which processes have references to a given physical page, takes its locks on a per-page basis. Fixing that, evidently, is not hard, but it will require some reorganization of the code.
The current least-recently-used (LRU) lists track memory in units of 4KB pages. That is considered at this point to be overly fine-grained; there is no need for LRU accuracy at that level. There was talk of implementing a "bucket LRU" that would track larger groups of pages.
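What a "bucket LRU" would look like was not described, so the following is purely one guess at the idea: pages are grouped into fixed-size buckets by page-frame number, and recency is tracked per bucket rather than per page. The class name and the bucket size are assumptions.

```python
from collections import OrderedDict


class BucketLRU:
    """Track recency per bucket of pages instead of per 4KB page."""

    def __init__(self, pages_per_bucket: int = 16):
        self.pages_per_bucket = pages_per_bucket
        self.buckets = OrderedDict()  # bucket id -> set of page-frame numbers

    def touch(self, pfn: int):
        """Record a reference to a page; its whole bucket becomes most recent."""
        bucket = pfn // self.pages_per_bucket
        self.buckets.setdefault(bucket, set()).add(pfn)
        self.buckets.move_to_end(bucket)

    def evict(self):
        """Remove the least recently used bucket and return its pages."""
        _, pages = self.buckets.popitem(last=False)
        return sorted(pages)
```

The list has one entry per bucket, so it is shorter and updated less precisely, which is the point: accuracy at the level of individual pages is given up in exchange for less bookkeeping.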
Inter-processor interrupts (IPIs) for translation lookaside buffer flushes
have long been seen as a potential scalability problem. But, it seems
that, while people worry about IPIs, it is hard to find a workload where
they create a bottleneck. Usually the much-maligned mmap_sem
semaphore gets in the way first.
There was some vague talk of other scalability issues; memory compaction was mentioned as a problem on large systems. If compaction tries to migrate a lot of pages, that can lead to large latencies in process execution. Mel Gorman said that compaction shouldn't be doing that, though, so it is not clear where the problem is.
The session wound down without coming to any real conclusions. The scalability topic returned on the second day, though, when Davidlohr Bueso led a session focused on mmap_sem in particular. This semaphore controls access to a process's page tables, along with a number of other, not always well-defined things; it has been on the list of things to fix for some time now. Davidlohr stated a wish to walk out with some tangible action items for improving the situation.
He started by looking back at past action items, especially those that came out of the LSFMM 2014 locking session. One of the concerns then was use of mmap_sem in drivers and other code outside of the memory-management subsystem. Jan Kara has been working on getting drivers to use the gup_fast() variant of get_user_pages() in order to eliminate dependencies on mmap_sem; the biggest problem he is facing at the moment is a deadlock problem in the media subsystem.
Jan would also like to get mmap_sem out of the filesystem code. Al Viro wondered, though, about how virtual memory area (VMA) structures would be protected in its absence. Peter said he has a patch that shifts the protection of VMAs to sleepable RCU if anybody wanted to push that work forward. Meanwhile, Jan hopes to get his driver patches submitted soon.
Davidlohr said that his focus is moving stuff out from under mmap_sem entirely and, eventually, breaking up the lock into something finer-grained. The problem with that, as Peter pointed out, is that what's protected by the lock now is not entirely clear. The way to start, he said, would be to document what's protected by mmap_sem; after that, one can start thinking about better locking schemes.
One problem with mmap_sem is that it protects a process's entire address space. Concurrency could be increased by locking only portions of that space instead. The concept of "range locks" is thus of interest here. Michal Hocko suggested that developers could start by replacing mmap_sem with a range lock that still covers the entire address space; the locking could then be made more precise in an incremental manner.
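The incremental path Michal described can be illustrated with a small user-space sketch. This is not the kernel's range-lock code; the RangeLock class below, its exclusive-only semantics, and its linear overlap scan are assumptions made to keep the example short. Two callers conflict only if their half-open ranges overlap, so locking the whole address space behaves like a single big lock, while narrower ranges allow concurrency.

```python
import threading


class RangeLock:
    """Exclusive lock over half-open [start, end) address ranges."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # (start, end) tuples currently locked

    def _overlaps(self, start, end):
        # Two half-open ranges overlap if each starts before the other ends.
        return any(s < end and start < e for s, e in self._held)

    def try_acquire(self, start, end):
        """Lock [start, end) if no held range overlaps it; never blocks."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def acquire(self, start, end):
        """Block until [start, end) can be locked, then lock it."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        """Unlock a range previously locked with the same bounds."""
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()


# Hypothetical "whole address space" range for a 64-bit process.
FULL_RANGE = (0, 2**64)
```

A first conversion step would pass FULL_RANGE everywhere, which keeps today's behavior; call sites could then be narrowed one at a time. The sketch ignores fairness and reader/writer distinctions, both of which a real replacement for mmap_sem would need.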
Hugh Dickins, though, wondered if that was the right approach and what problems, exactly, were being solved with range locks. His impression was that the top priority was to get page-fault handling out from under mmap_sem entirely. The answer was that there are, in fact, two different issues to be addressed regarding mmap_sem: it protects too much, and the hold times are too long. Range locks are one attempt to address the first part of the problem. Peter added that, among other things, range locking would allow concurrent mmap() calls to proceed, which is important for some threaded workloads.
There was some concern about surprises that can pop up when it turns out that an unexpected corner of the code was relying on mmap_sem. In extreme cases, Hugh said, user-space code may even rely on it. He described a complaint from a user about a change in mlock() semantics. Changes in the kernel increased mlock() concurrency and, in the process, exposed a lack of locking on the user-space side. Sympathy for the affected user was relatively low in this case, but, Hugh said, it would be wise to be prepared for nasty surprises.
In the end, Davidlohr's desire for tangible action items went mostly unfulfilled. About the only firm conclusion was that the range-lock code will be cleaned up and posted in the near future.
Memory-management testing and debugging
Memory-management problems can be hard to identify and track down; this is true for bugs that affect either correctness or performance. Quite a bit of work has been done in recent years to develop tools that can help with this task, though. The 2015 LSFMM gathering had a number of sessions dedicated to this area; like a large array on a virtual-memory system, though, they were scattered throughout the program. This article provides a virtual view of the entire discussion in one place.
Testing
Davidlohr Bueso started a session on testing by saying that he has been working on the mmtests benchmark suite to improve its ability to detect changes across kernel versions. To that end, he has looked at a couple of test suites that are being used in academia: Mosbench and Parsec. There were questions about how well these tests worked for testing the kernel in particular, but, Davidlohr said, these suites do contain some useful tests.
Andi Kleen said there is a new suite out there that is promising despite being named, inevitably, "cloudbench."
Davidlohr asked if anybody else had workload tests that they would like to contribute to mmtests. Laura Abbott said that she would like to see a good set of tests for mobile systems. Scalability tests, she said, tend to be oriented toward scaling up, but mobile developers need tests that focus on scaling down.
Firm conclusions from this session were hard to come by; Davidlohr will continue to work on integrating and documenting other tests aimed at memory-management scalability.
Debugging
Memory-management debugging was the topic of another session run by Dave Jones, Sasha Levin, and Dave Hansen. Dave Hansen started off by saying that, while developers have added a number of debugging features to the memory-management subsystem, they have so far left an important technology on the table. He was talking about Intel's MPX mechanism, which is able to check pointer accesses and ensure, in hardware, that they don't go outside a set of defined boundaries. The nice thing about MPX is that it has almost no runtime cost, so it can be enabled on production systems.
Of course, developers may have some excuse for not making much use of MPX so far. It requires the (not yet released) GCC 5 compiler to instrument code properly, and hardware that actually implements MPX is not yet available. So, he said, there is still time to get our act together.
There was some immediate interest in using MPX with the slab allocator in the kernel. That would take some work, though, since the kernel would have to be changed to load the appropriate MPX registers before accessing a given slab object. Christoph Lameter asked if access to all slab objects could be monitored with MPX. It turns out that there's a small practical difficulty there: a typical running kernel has many thousands of slab-allocated objects, but there are only four sets of registers in the MPX hardware. So tracking more than four objects requires juggling information into and out of those registers.
Peter Zijlstra suggested that MPX could be applied to the kernel stack. It is not clear, though, that MPX-based stack checking would provide advantages over the explicit stack-overflow checks done in the kernel now. Still, it may be possible to dedicate one of the registers to the kernel stack and gain some extra protection.
Andy Lutomirski asked if the MPX registers could be written to while running in atomic context. That turns out to be tricky, since setting up these registers involves doing a memory allocation. Andy also suggested that MPX could be used to block direct access to user-space addresses from the kernel. Laura asked about checking of DMA operations, but MPX only applies to accesses from the CPU.
Sasha shifted the discussion to the VM_BUG_ON() macro. This macro, which comes in a few variants, dumps out a bunch of information specific to the memory-management subsystem; it is thus useful for identifying memory-management bugs. Sasha would like to add more VM_BUG_ON() instances in the kernel, but he is worried about complaints of false positives. These complaints have kept debugging code out in the past; the result, he said, was that users suffered from a number of race conditions that could otherwise have been caught.
There was some talk about additional information that could be printed out by VM_BUG_ON(), but few conclusions. It was suggested that a full kernel memory dump would be helpful — but that, of course, is rather a large amount of data to print into the kernel log. Dave Jones would, instead, like more information about how the system got into the bad state; that would require adding some sort of transaction log. It was suggested that Intel's upcoming Processor Trace functionality could be helpful in this regard.
Dave Hansen then asked if there were any developers with sets of memory-management tracepoints that could be considered for merging. It seems that some exist, but Andi said that, rather than adding more tracepoints now, it would be better to focus on improving the documentation of existing tracepoints. Andrea Arcangeli questioned the value of memory-management tracepoints in general; he does his memory-management development on virtualized systems and wonders why anybody would do anything else. When a system is run under virtualization, it can be examined with an ordinary debugger. But others argued that there are a lot of problems that only show up on bare-metal systems, so there will always be a place for debugging infrastructure that works in that environment.
Fernando Vasquez Cao noted that his group uses SystemTap heavily for memory-management debugging. Among other things, it is handy for injecting faults at specific locations, making it easier to get at hard-to-reproduce problems. Dave Jones agreed that the tools have made life better; it is, he said, a miracle that we were able to solve anything five years ago. He also wondered why there was not more use of the existing fault-injection framework; when he turns it on, he said, "everything breaks," so he concludes that nobody else is doing so. Fernando responded that the injection framework does not allow sufficiently specific fault injection. Besides, he said, when you turn it on everything else breaks, making it hard to focus on the specific problem at hand. It was agreed that somebody (currently unnamed) should fix those problems.
KASan
One tool that has been merged relatively recently is the kernel address sanitizer (or KASan). This tool uses a "shadow memory" array to track which memory the kernel should legitimately be accessing; it can then throw an error whenever the kernel goes out of bounds. KASan developer Andrey Ryabinin led a session on this tool and how it might be improved.
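The shadow-memory idea can be shown with a toy model. The sketch below is a deliberate simplification and an assumption-laden illustration, not KASan's implementation: it keeps one shadow byte per byte of "memory" (the real tool uses a more compact encoding), and the class and exception names are invented for this example.

```python
class OutOfBoundsAccess(Exception):
    """Raised when an access touches memory marked inaccessible."""


class ShadowMemory:
    """Toy shadow array: one flag per byte, 1 = accessible, 0 = poisoned."""

    def __init__(self, size):
        self._valid = bytearray(size)  # everything starts out poisoned

    def _in_range(self, addr, length):
        return addr >= 0 and length > 0 and addr + length <= len(self._valid)

    def unpoison(self, addr, length):
        """Mark [addr, addr + length) accessible, as an allocator hook would."""
        if not self._in_range(addr, length):
            raise ValueError("range outside tracked memory")
        self._valid[addr:addr + length] = b"\x01" * length

    def poison(self, addr, length):
        """Mark [addr, addr + length) inaccessible, as a free hook would."""
        if not self._in_range(addr, length):
            raise ValueError("range outside tracked memory")
        self._valid[addr:addr + length] = b"\x00" * length

    def is_accessible(self, addr, length):
        """Return True only if every byte of the access is accessible."""
        if not self._in_range(addr, length):
            return False
        return all(self._valid[addr:addr + length])

    def check_access(self, addr, length):
        """Raise OutOfBoundsAccess if any byte of the access is poisoned."""
        if not self.is_accessible(addr, length):
            raise OutOfBoundsAccess(f"bad access at {addr:#x}, length {length}")
```

In this model the compiler-inserted instrumentation corresponds to a check_access() call before each load or store; an access that runs past the end of an allocation, or touches it after poison(), is reported.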
The first idea that came out was to enable KASan to properly validate accesses to memory obtained with vmalloc(). Doing so would require putting hooks into vmalloc() itself and creating a new, dynamic shadow memory array. The amount of work required is not huge; it is much like tracking slab allocations, except that shadow memory for slab can be allocated at boot time. There were, unsurprisingly, no objections, so this work should go forward soon.
A slightly trickier problem is memory that is freed and quickly reallocated to a new user. That memory looks fine to KASan, but quick reallocation can mask use-after-free bugs in the code that previously owned it. The proposed solution here is to put freed memory into a "quarantine" area for a period, delaying its availability to the rest of the system. Memory would emerge from quarantine after a defined period; alternatively, a shrinker could be used to remove memory from quarantine when the system starts to run low. There are concerns that delaying free operations in this way could create a certain amount of memory fragmentation. Andrey is not quite sure how to move forward with this feature, and the group did not appear to have a lot of fresh ideas to share.
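A rough sketch of the quarantine idea follows. It is not the proposed kernel implementation, which had not been settled at the time of the session; the Quarantine class, the really_free callback, and the explicit `now` timestamps are assumptions made so the example is self-contained. It models both release paths mentioned above: expiry after a fixed hold time, and a shrinker-style early release under memory pressure.

```python
from collections import deque


class Quarantine:
    """Hold freed objects for a while before handing them back for reuse."""

    def __init__(self, hold_time, really_free):
        self.hold_time = hold_time       # how long a freed object is held
        self._really_free = really_free  # callback that performs the real free
        self._queue = deque()            # (expiry_time, addr), oldest first

    def free(self, addr, now):
        """Called in place of the real free; the object is only parked."""
        self._queue.append((now + self.hold_time, addr))

    def is_quarantined(self, addr):
        """True while `addr` is parked; accesses to it are use-after-free."""
        return any(a == addr for _, a in self._queue)

    def expire(self, now):
        """Release every object whose hold time has passed; return the count."""
        released = 0
        while self._queue and self._queue[0][0] <= now:
            _, addr = self._queue.popleft()
            self._really_free(addr)
            released += 1
        return released

    def shrink(self, count):
        """Shrinker-style path: release up to `count` oldest objects early."""
        released = 0
        while self._queue and released < count:
            _, addr = self._queue.popleft()
            self._really_free(addr)
            released += 1
        return released
```

While an object sits in the queue its memory stays poisoned, so a stale pointer is caught rather than silently reading a new owner's data. The fragmentation concern raised in the session is visible here too: parked objects keep their memory unavailable until expire() or shrink() runs.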
Then there is the possibility of catching reads of uninitialized memory. It is possible to get the compiler to instrument code to make this testing possible, but the results include a lot of false positives that are hard to get rid of. Among other things, memory initialized in assembly code must be annotated manually. Andrey has tried doing this and found the result difficult to support. He's afraid that developers will turn the feature on, see all the false positives, and just give up on the whole thing.
Another possibility is using KASan to find data races; there are some tools out there to help with this now. But, he said, it involves some "crazy overhead" — four bytes of shadow memory for every byte of normal memory. There's also a need for a lot of manual annotation; large numbers of false positives are also a problem. The end result is that this feature does not appear to be useful for now.
Other ideas for the more distant future include a quarantine for the page allocator (and not just the slab allocator), and the instrumentation of some inline assembly operations like the atomic bit operators.
Sasha made a plea for developers to enable KASan when they are running their own tests. It has turned up a lot of bugs, he said; the code is in the upstream kernel, it's easy to turn on, and the overhead is low. The only catch is that GCC 5 is needed to gain all of the features, though 4.9 works with reduced functionality.
The final question in this session was: now that we have KASan, is there still a need to maintain the older kmemcheck utility? Kmemcheck only works on single-processor systems, it is painful to use, and it is slow. It seems that nobody is actually making use of it. The consensus of the group was that kmemcheck should be removed. (It should be noted that Sasha's attempt to implement this decision ran into some opposition from developers who still use kmemcheck, so it may stay around for a while yet.)
Improving page reclaim
Dave Hansen started a brief LSFMM 2015 memory-management track session on page-reclaim performance by saying that we have a problem: over the years, the kernel's memory-management and swap subsystems have been designed around the use of slow secondary storage devices. But now we are heading toward an era increasingly driven by the availability of massive nonvolatile memory, and we are not fully prepared for it.
The fundamental question, he said, was how to integrate these technologies into the Linux kernel. We have a number of subsystems like DAX that can provide high-speed access to persistent memory devices, but they require applications to be changed. If we run current kernels over such devices without using special interfaces, swapping is no faster than it is with older, slower devices. There is just too much overhead in the memory-management layer, and, in particular, in the manipulation of the least-recently-used (LRU) lists that track reclaimable pages in the system. The LRU, he said, is a fancy system to find the best eviction candidate at any given time, but, in this situation, perhaps it would be better to use something else?
Christoph Lameter suggested that users who care about performance should just put their entire application into memory and be done with it. But Dave was not so easily deterred; he would like to find ways for existing applications to get better performance on persistent-memory devices without changes.
Andrea Arcangeli said that we should not be worrying about memory in 4KB units when we are dealing with devices that can hold 100GB or more. Swapping pages in 2MB units would, he said, go a long way toward solving the problem. Andi Kleen agreed to a point — but he felt that 2MB was still far too small. In general, he said, we need to move toward managing memory in larger chunks or just do away with the LRU lists altogether.
Dave suggested that there are a number of opportunities to run the LRU lists in a more relaxed mode. One idea, he said, was to add a third LRU level for pages that are ready to be swapped out. (The kernel currently manages two levels of LRU lists, one for active pages and one for pages that seem to be inactive and should be considered for eviction.) Perhaps some sort of "scanaround" algorithm could be applied to that third level to batch up pages for writing out to the swap device. Johannes Weiner answered that he had tried something similar a few years ago. It didn't work well, he said, due to disk seek issues, but it might work better on truly random-access devices.
Hugh Dickins expressed skepticism toward the entire idea, though. To him, it looks like an attempt to reduce memory-management overhead by adding even more complex algorithms to cluster things. That is increasing the complexity of the system rather than reducing it. Batching things up may help to speed things up, but you still have to deal with items individually to make up the batches.
As things wound down, Dave said that he was going away with a couple of interesting ideas to explore.
Huge pages and persistent memory
One of the final sessions in the memory-management track of LSFMM 2015 had to do with the intersection of persistent memory and huge pages. Since persistent memory looks set to come in huge sizes, using huge pages to deal with it looks like a performance win on a number of levels. But support for huge pages on these devices is not yet in the kernel.
Matthew Wilcox started off by saying that he has posted a patch set adding huge-page support to the DAX subsystem. But, he said, only one other developer reviewed the code. The biggest complaint was the introduction of the pmd_special() function, which tracks a "special" bit at the page middle directory (PMD) level in the page table hierarchy, which is where huge pages are managed.
Some background: the kernel allows architecture-level code to mark specific pages as being "special" by providing a pte_special() function. These pages have some characteristic that causes them to behave differently than ordinary memory. In cases where the architecture has enough bits available in its page table entries, pte_special() just checks a bit there; otherwise things get more complicated. The core memory-management code treats so-marked pages, well, specially; for example, virtual memory areas containing "special" pages should also have a find_special_page() method to get the associated struct page.
Back to the discussion: adding pmd_special() requires that the "specialness" of the huge page be tracked at the PMD level. It is not clear that every architecture has a free bit available in the PMD to track that state. In theory, free bits should abound there since as many as 20 bits in the lower part of the entry are not needed to map to a page frame number, but some quick searching by developers in the room revealed that, on x86 at least, the "extra" bits must be set to zero. For now, though, Matthew is using the same bit that pte_special() uses, so his code should work on every architecture that supports pte_special().
In the case of huge pages backed by persistent memory, the pmd_special() bit indicates to the memory-management code that there is no associated page structure. Andrea Arcangeli asked why a special bit was needed to mark that condition; Matthew responded that it's because he doesn't really understand the memory-management subsystem, so he implemented something he knew he could make work.
This code may eventually be pushed in a direction where pmd_special() is no longer needed. But there are some other issues that come up. Matthew raised one: what happens when an application creates a MAP_PRIVATE mapping of a file into memory, then writes to a page in that mapping? The write will cause the memory-management code to allocate anonymous memory to replace the 2MB huge page being written to; the question is: should it allocate and copy a full 2MB page, or just copy the 4KB page that was actually written? Andy Lutomirski suggested that the answer had to be to copy 4KB; copying the full 2MB for each single-page change would be too expensive. But Kirill Shutemov replied that copy-on-write for huge pages does a 2MB copy now; the behavior with persistent memory, he said, should be consistent.
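The cost difference behind that disagreement is simple arithmetic, sketched below. The function name and its simplifying assumption (all written pages fall within a single 2MB huge page, and the 4KB strategy copies each distinct page exactly once) are ours, not something stated in the session.

```python
HUGE_PAGE = 2 * 1024 * 1024  # 2MB huge page
BASE_PAGE = 4 * 1024         # 4KB base page


def bytes_copied(distinct_pages_written, copy_huge):
    """Bytes copied by copy-on-write for writes inside one huge page.

    Assumes every written page lies in the same 2MB huge page.
    """
    if distinct_pages_written == 0:
        return 0
    if copy_huge:
        return HUGE_PAGE  # one fault copies the whole huge page
    return distinct_pages_written * BASE_PAGE  # one 4KB copy per page


# A single-page write costs 512 times more with the 2MB strategy.
single_write_ratio = bytes_copied(1, True) // bytes_copied(1, False)
```

The two strategies only converge when all 512 base pages in the huge page end up being written, at which point the 4KB approach has copied the same 2MB, spread across 512 separate faults.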
Matthew moved on to the topic of in-kernel uses for persistent memory. There will be some interesting ones, he thought, but how it should all work has yet to be worked out. HP, for example, is using ioremap() to map persistent memory into the kernel as if it were device memory; Matthew said that seems like the wrong approach to him. We should, he said, be using logical interfaces to persistent memory rather than direct physical interfaces like ioremap(). So he would like to see the creation of some sort of mapping interface implemented within the virtual filesystem layer that would allow persistent memory to be mapped into the kernel's address space.
Andy said that the pstore mechanism could benefit from directly-mapped persistent memory. There was also talk of maybe being able to load kernel modules from persistent memory without the need to copy them into "regular" memory. It might be possible to even map the entire kernel, but there is one little catch: the kernel patches its own code for a number of reasons, including use of optimal instructions for the specific hardware in use and turning tracepoints on and off. If the kernel were mapped from persistent memory, that patching would change the version stored in the device as well — probably not the desired result.
Finally, Matthew said, there have been requests for the ability to use extra-huge, 1GB pages as well as 2MB pages. He is looking at adding that functionality, but he has been struck by the amount of code duplication that exists at each of the four page table levels. He has some thoughts about creating a level-independent "virtual page table entry" abstraction that could be used to get rid of much of that duplication. The reaction from the assembled memory-management developers was cautiously positive; Matthew was encouraged to implement this abstraction within the DAX code. If it works out well there, it can then spread into the rest of the memory-management code.
Investigating a performance bottleneck
In a short plenary session near the end of day one of the 2015 LSFMM Summit, Chuck Lever and Peter Zijlstra led a discussion on performance bottlenecks. The original idea for the session was to look at various performance problems, one of which came from Lever and others that would be offered up by those in attendance. As it turned out, though, only Lever's problem was discussed, perhaps due to low energy after a long day.
Lever described a problem he is seeing in NFS on low-latency transports, which have latencies an order of magnitude less than Ethernet. For his test, the latency added by the RPC infrastructure is on the order of 20µs and the round-trip network time is around 25µs. On idle clients, the performance is much what he expects, but if he loads the client with, say, a kernel build, these RPC tests start taking 300µs.
Lever has narrowed the problem down to wake_up_bit(). That function is taking "too bloody long", Zijlstra said. There is some contention on waking, he continued, but it is not clear what that could be.
Dave Chinner suggested using the latency tracer in ftrace to help further narrow it down.
Chris Mason noted that he has started benchmarking newer kernels at Facebook and had not run into anything surprising yet.
Lever said that it is not just a spinlock that is being contended, as the resources are being held far longer than that. Zijlstra said that the wakeup itself should not be that expensive. Perhaps it is the runqueue locks that are being contended in that situation.
Andy Lutomirski wondered if inter-processor interrupts (IPIs) take longer to send in this case. There is a different path in the code when the system is under load, he said. Mel Gorman suggested testing with a maximum cstate value set to zero to ensure that power management wasn't affecting things. At the end, Zijlstra suggested gathering more data and said that he and others would have a look then.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Page editor: Jonathan Corbet
Distributions
Making MythTV manageable with LinHES 8.3
Recent years have seen an explosion in the number and variety of streaming-media services available to consumers. Regrettably, the overwhelming majority of those services are tied to restrictive DRM and access-control methods. If users want full control over their media content, a local digital-video recorder (DVR) is essentially the only option, and for those who also care about software freedom, MythTV remains the only truly viable choice. But MythTV is not easy to configure or use, which is why specialty distribution projects like LinHES (for Linux Home Entertainment System) were created. LinHES recently released an update that takes a lot of the pain and frustration out of working with MythTV—and adds some nice features of its own, to boot.
For those unfamiliar with it, MythTV is a free-software DVR system, which is to say that it enables scheduling, recording, and playing back broadcast video. There are other "media center" applications with wider install bases (such as Kodi), but they tend to focus primarily on Internet-accessible media. A handful of other DVR projects have emerged over the years, but they rarely last long. Given the large array of broadcast formats, tuning and recording hardware, and channel-listing services that such a project has to cope with, the momentum has remained with the largest project, MythTV.
But MythTV installation and configuration are a considerable hassle: users have to make a great many decisions up front that can be hard to change later and must manually enter page after page of settings that, if not entirely correct, can cause incompatibilities and failures further down the line. MythTV updates are also renowned for introducing backward-incompatible changes (to, say, the internal database schema), making upgrades a less-than-pleasant pastime.
On the plus side, that level of complexity creates a market for projects like LinHES. The project released its latest update, version 8.3, on February 18. The release is available as an x86-64 ISO image that can be used to boot any machine into a MythTV frontend system or to install a permanent MythTV instance running as a frontend, a backend, or both. Frontends, for those uninitiated, are the "media center" systems used to play back content; backends are used for recording, storage, and post-processing.
LinHES was started in 2003 under the name KnoppMyth. As that name suggests, those early releases were built on top of the Knoppix distribution. The project adopted the name LinHES in 2009, at which point it migrated from Knoppix to Arch Linux as its underlying distribution.
Version 8.3 ships with MythTV 0.27.4, which is the latest upstream release. In addition to MythTV, Kodi and Plex are available as alternative media-center applications (both have add-ons to let them play back MythTV recordings).
The distribution also includes a suite of auxiliary services preconfigured to make a LinHES box function as a media-centric appliance. NFS and Samba are included for sharing files, remote administration is available with Webmin and VNC (the latter because another of MythTV's charming quirks is that even backend machines must be configured with a GUI tool), plus there are various remote system-monitoring tools and extra packages for supporting Bittorrent and other file-transfer methods. These auxiliary services are not automatically started without the user's permission, for those who are concerned about the security risks involved.
"Appliance" is really the key word, though. A LinHES machine cannot easily be used as something other than a MythTV box (even using Kodi and Plex requires making serious alterations). The Enlightenment desktop environment is provided, as are terminal emulators and web browsers, but they are only there as system-administration tools. Booting LinHES (whether in Live-CD mode or from hard disk) launches straight into the MythTV frontend.
More importantly, however, booting for the very first time launches LinHES's GUI configuration tool, which is built with the same toolkit as MythTV itself. The result is as close to a seamless set-up experience as one will find for MythTV. And the set-up and configuration process actually works; a lot of effort has gone into making automatic detection and configuration of the necessary settings function correctly. This includes setting up recording hardware, configuring infrared receivers and remote controls, and connecting input sources to channel line-ups. All three of these set-up tasks are, historically, on the tricky side. LinHES does not make them automatic, but it does a lot to simplify matters.
The framework used for the LinHES-specific configuration steps is called MythVantage, and it is coupled with some custom theming work on the MythTV frontend menu system. The theming work even integrates with the live-CD mode's welcome screen, which results in a largely boot-it-and-run experience for live-CD frontends.
The MythVantage tools even permit configuring LinHES's non-MythTV options and settings (like file-sharing or package management) without exiting MythTV itself. In contrast, most of the other MythTV distributions (like MythBuntu) offer their custom configuration tools in standalone applications. While that might not bother Linux veterans, it complicates the system, and detracts from the "appliance" feel.
That said, there are limits to how much LinHES can automate. In the United States, for example, program guide data for MythTV is accessible only through a non-profit service called Schedules Direct, which requires the user to set up a paid account. Nevertheless, the 8.3 release notes indicate that there has been some work to automate configuring guide data for other regions of the world, so perhaps others will find the experience even simpler.
Once up and running, a LinHES MythTV system is (naturally enough) much like any other MythTV system of similar vintage. But LinHES offers a few nice add-on features here as well. For example, two MythTV add-on packages are included that are not part of the standard, upstream MythTV system: MythExpress and MythExport. The first is a browser-based MythTV frontend; it allows the user to play back recordings in a web browser, largely negating the need to install the MythTV frontend package on every machine.
The second is a video-transcoding tool designed to shrink recordings to a size suitable for long-term storage or mobile-friendly playback; it can be run as a background job to transcode every recording automatically. Since most of the major video-broadcast standards today use some variation on MPEG-2 (which consumes considerably more storage than do more recent formats), saving space by transcoding is an obvious plus. It might seem like that functionality would be an obvious feature to build into MythTV itself—in fact, transcoding is built-in, in theory, but the upstream implementation has been broken for years. The typical response to asking about transcoding on an official MythTV forum is that it no longer makes sense, since hard disks are so cheap. Such a reply rarely satisfies the question-asker; LinHES simply adds the requested feature that the user wants, rather than telling the user to stop wanting it.
Other niceties in this category include a system back-up-and-restore tool. LinHES automatically creates database backups every day, along with backups of /etc and /home, to be stored locally. Recorded videos (which in MythTV are generally not stored in /home, because MythTV expects an entire volume to be reserved for storage) can also be backed up but, because of the space required, this is not configured by default. LinHES also builds in configuration tools for setting up dynamic DNS, Bluetooth headsets, and a number of other options that are not integrated with the vanilla MythTV package.
LinHES does still have some shortcomings worth pointing out. The biggest is that the latest releases have support only for 64-bit Intel architectures. While 32-bit Intel systems are probably not worth worrying about, the lack of an ARM release is more problematic. More and more users are using small, ARM-based Linux boards as media-center set-top boxes; not supporting that option is something that LinHES will likely have to reconsider in future releases. Another, similar limitation is that LinHES focuses on support for NVIDIA graphics cards, including NVIDIA's non-free drivers. Smooth HD video playback no longer requires the latest and greatest GPU (or proprietary drivers), so this is a peculiar choice.
Nevertheless, LinHES offers a noticeably better setup and administration experience than the vanilla MythTV package one finds in a typical Linux distribution. This comes at a cost: the appliance-like setup necessarily restricts what else the MythTV box can be used for. But the reduction in MythTV-related headaches one gets in exchange may make a lot of MythTV users currently running other distributions do some serious thinking.
Brief items
Distribution quotes of the week
Major Updates for Qubes + Whonix
Qubes + Whonix is the combination of QubesOS and Whonix OS; it focuses on the twin goals of security and anonymity. Qubes + Whonix has seen some major updates recently. "The Qubes + Whonix port has been fundamentally upgraded to a native seamless architecture (ProxyVM + AppVM)."
Distribution News
Debian GNU/Linux
Debian Project Leader election
We are now in the campaigning period for the DPL election. There are three candidates: Mehdi Dogguy, Gergely Nagy, and Neil McGovern. The vote page contains more information and links to the candidates' platforms. See this week's quotes section for links to interviews with the candidates.
Ubuntu family
Ubuntu 10.04 (Lucid Lynx) reaches End of Life
Ubuntu 10.04 (Lucid Lynx) will reach its end of life on April 30, 2015. There will be no more updates after that time. "The supported upgrade path from Ubuntu 10.04 is via Ubuntu 12.04. Users are encouraged to evaluate and upgrade to our latest 14.04 LTS release via 12.04."
Newsletters and articles of interest
Distribution newsletters
- Debian Project News (March 12)
- DistroWatch Weekly, Issue 601 (March 16)
- 5 things in Fedora this week (March 14)
- Ubuntu Weekly Newsletter, Issue 408 (March 15)
A Linux distro for education: UberStudent (Opensource.com)
Opensource.com reviews UberStudent. "I will be honest, I am not normally a fan of specialized distributions that are basically Ubuntu plus extra pieces. I have tried out too many where they end up being nothing more than a poorly customized theme, a few extra bundled applications, and a new name (despite the Ubuntu branding still appearing everywhere in the distribution), which is not enough to make a distribution worth using in place of Ubuntu or one of the official variants. However, UberStudent is most definitely not one of those slipshod distributions; it is well thought out and implemented, very polished, and works great. I highly recommend UberStudent to anyone interested in a distribution customized for education."
Page editor: Rebecca Sobol
Development
Host-key rotation and more in OpenSSH 6.8
OpenSSH 6.8 was released on March 18. As usual, the update adds several additional features to the ssh client and sshd server; some of the changes are meant to ease the configuration or management of systems, while some are geared primarily toward better usability (a factor that, for SSH, can have genuine security implications). But there are other changes that introduce new functionality altogether, such as the ability to securely migrate from one SSH key to another, or the ability to require multiple keys to authenticate to a server.
The portable version of OpenSSH (that is, the package intended for operating systems other than OpenSSH's parent OS, OpenBSD) is available for download in source form. It will likely be a brief matter of time before most Linux distributions have packages available.
New major features
The new feature in version 6.8 that has prompted the most discussion is support for host-key rotation. Host-key rotation is an attempt to solve a longstanding problem: from time to time, servers need to retire an old SSH key and replace it with a new one, but swapping out keys without warning can leave clients unable to connect. Sometimes, the key replacement is precautionary (such as migrating to a stronger key algorithm), but key replacement may also be necessary in a hurry if a key is believed to be compromised.
With OpenSSH's rotation scheme, once a client has authenticated to a server, the server can send over a list of all of its supported keys. The client can store the list locally in its known_hosts file. Since each key record indicates the algorithm used, the next time a client connects, it can authenticate using a newer or stronger available key. The server, in turn, can eventually pull an old key out of the list and retire it. The client, when it connects with the new key, would update its list again and remove the now-absent old key from known_hosts.
This feature is experimental, though. In the comments on OpenSSH maintainer Damien Miller's initial blog post about the subject, some readers pointed out potential exploits. An attacker could slip an extra key into the list, for example, then subsequently proxy-connect clients to a different server. By trusting implicitly that the keys in known_hosts belong to who they claim to, the client would not know that the SSH session had been redirected. Miller then added a signature-checking step to the scheme, so that the client will verify that the key belongs to the server.
To do the signature check, the client sends a request (including a session identifier) for each new key that it sees. The server signs each of these requests with the private key that corresponds to the requested public key. That addition seems to have satisfied most of the commenters, but the story serves as a reminder that some real-world testing is highly advisable before deploying such a new feature in the wild.
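The bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not OpenSSH's code: the HostKey type, the sync_host_keys() function, and the proves_possession callback are hypothetical names invented for illustration, and the callback merely stands in for the signed-request exchange rather than performing any cryptography.

```python
from typing import NamedTuple


class HostKey(NamedTuple):
    algorithm: str  # e.g. "ssh-rsa" or "ssh-ed25519"
    blob: str       # placeholder for the encoded public key


def sync_host_keys(known, presented, advertised, proves_possession):
    """Return (updated, added, removed) key sets for one host.

    known             -- keys currently stored for the host
    presented         -- key the server authenticated with this session
    advertised        -- full key list the server sent afterward
    proves_possession -- hypothetical callback standing in for the
                         signature check; True if the server proved
                         that it holds the matching private key
    """
    known = set(known)
    if presented not in known:
        # The list is only considered after the host has authenticated
        # with a key that the client already trusts.
        raise ValueError("unrecognized host key; not updating")
    advertised = set(advertised)
    # Learn only those new keys that the server can prove it holds.
    added = {key for key in advertised - known if proves_possession(key)}
    # Forget keys that the server no longer lists.
    removed = known - advertised
    return (known - removed) | added, added, removed


old_rsa = HostKey("ssh-rsa", "old-key")
new_ed = HostKey("ssh-ed25519", "new-key")
injected = HostKey("ssh-ed25519", "attacker-key")

# First connection: the server lists a new key; an attacker adds one too.
known, added, removed = sync_host_keys(
    {old_rsa}, old_rsa, [old_rsa, new_ed, injected],
    proves_possession=lambda key: key != injected,
)

# Later connection: the old key has been retired on the server.
known_later, added_later, removed_later = sync_host_keys(
    known, new_ed, [new_ed], proves_possession=lambda key: True
)
```

In this model the injected key is never stored, because the server cannot prove possession of it, while the retired RSA key disappears from the client's list on the later connection.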
Another new feature is support for multi-key authentication. In OpenSSH 6.2, the sshd daemon gained support for the AuthenticationMethods configuration directive, with which the server administrator can specify a multi-step authentication process. "publickey,password", for example, would require connecting clients to authenticate with a key, then with a password.
As of OpenSSH 6.8, "publickey,publickey" is a supported authentication combination. It requires clients to authenticate with two separate keys. Other combinations with additional directives are possible, too, as is requiring three or more keys.
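To make the ordering rule concrete, here is a small Python model of how such a method list might be evaluated. It is a sketch under assumptions, not sshd's implementation: the function names are invented, the space-separated alternative lists follow the format described in the sshd_config documentation, and the model tracks only method names, so it does not check that two "publickey" steps use different keys.

```python
def parse_methods(value):
    """Split an AuthenticationMethods value into lists of method names.

    Alternative lists are separated by spaces; the methods within one
    list are separated by commas.
    """
    return [entry.split(",") for entry in value.split()]


def authentication_complete(value, successes):
    """Return True once `successes` finishes at least one method list.

    `successes` is the ordered sequence of method names the client has
    completed. A success counts toward a list only when it is the next
    outstanding method in that list.
    """
    remaining = parse_methods(value)
    for method in successes:
        for methods in remaining:
            if methods and methods[0] == method:
                methods.pop(0)
        if any(not methods for methods in remaining):
            return True
    return False


one_key = authentication_complete("publickey,publickey", ["publickey"])
two_keys = authentication_complete(
    "publickey,publickey", ["publickey", "publickey"]
)
wrong_order = authentication_complete(
    "publickey,password", ["password", "publickey"]
)
```

Under this model a single key is not enough for "publickey,publickey", and a password offered before the key does not count toward "publickey,password".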
There is also one important change in 6.8 that may require server administrators to alter their sshd configurations. In older versions of OpenSSH, the sshd daemon would perform reverse DNS lookups on connecting clients (logging suspicious results). There were a few objections to this. For one thing, high-traffic servers were doing a lot of DNS queries (adding to system load). For another, as Daniel Kahn Gillmor pointed out in November 2014, the lookups added no real security benefit. In fact, they could even pose a security risk, he said in a follow-up, since buggy DNS resolvers could be used to mask an attacker's activity. As of OpenSSH 6.8, then, the DNS lookup feature has been turned off by default. Servers that make use of it will need to have their configuration files updated to switch the feature back on.
New minor features
While the host-key rotation and multi-key authentication features permit OpenSSH users to implement some new functionality, there are a great many more improvements in the new release that merely simplify configuration or make day-to-day usage a better experience. For instance, several enhancements were made to host-based authentication. Both the client and server configuration files can now include a directive specifying what public key types are used to connect for host-based authentication, and Ed25519 keys are supported.
Key-revocation lists (KRLs) were another feature introduced in version 6.2, and were also the target of some small enhancements. Up through OpenSSH 6.7, the use of KRLs required that OpenSSH be compiled with OpenSSL support; this is now no longer needed. A RevokedHostKeys option was added to the ssh client, allowing the user to revoke keys with a KRL or with a text file. KRLs can also revoke certificates and, as of version 6.8, they can do so without also needing to specify the certificate authority (CA) that issued each certificate.
Both the ssh client and sshd server have a new FingerprintHash option, available as a command-line flag and as a configuration-file option, that lets users specify the algorithm used to generate a key fingerprint. In conjunction with this change, the format OpenSSH uses to print out a key fingerprint has been updated; it now prepends the name of the algorithm used, for easy reading.
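As a rough illustration of the prefixed format, the following Python sketch computes fingerprints in both styles. The fingerprint() helper is a hypothetical name, and the encoding details (unpadded base64 for SHA-256, colon-separated hex for MD5) are stated here as assumptions about the output format rather than taken from the release announcement. The sample input is not a real key.

```python
import base64
import hashlib


def fingerprint(public_key_b64, algorithm="sha256"):
    """Return an algorithm-prefixed fingerprint for a base64 key blob."""
    blob = base64.b64decode(public_key_b64)
    if algorithm == "sha256":
        digest = hashlib.sha256(blob).digest()
        # Assumed format: base64 of the digest, trailing padding removed.
        encoded = base64.b64encode(digest).decode("ascii").rstrip("=")
        return "SHA256:" + encoded
    if algorithm == "md5":
        hexdigest = hashlib.md5(blob).hexdigest()
        # Assumed format: the familiar colon-separated hex pairs.
        pairs = [hexdigest[i:i + 2] for i in range(0, len(hexdigest), 2)]
        return "MD5:" + ":".join(pairs)
    raise ValueError("unsupported algorithm: " + algorithm)


# Placeholder data standing in for the base64 field of a public key file.
sample = base64.b64encode(b"not a real key").decode("ascii")
sha_fp = fingerprint(sample)
md5_fp = fingerprint(sample, "md5")
```

The prefix makes it obvious at a glance which hash produced a given fingerprint, which matters when comparing values printed by different tools or versions.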
Anyone still using version 1 of the SSH protocol (which is hopefully not a large group) can rest easier in at least one respect tonight: OpenSSH 6.8 adds a workaround that blocks the new Bleichenbacher side-channel attack disclosed by Christopher Meyer and associates in 2014. At the other end of the ancient-to-contemporary spectrum, users who use IPv6 addresses on their machines will be happy to hear that version 6.8 fixes an annoying bug in which OpenSSH tried to parse some IPv6 addresses as hostnames.
There are, of course, many more small changes and updates not addressed here. Partial authentication successes are no longer counted as authentication failures against the MaxAuthTries limit, ssh matching rules now support the negation operator (e.g., Match !foo), and so forth.
Moving forward, it will be worth paying attention to the real-world feedback generated by users testing out the host-key rotation feature. System administrators have dealt with key rotation in a variety of ways in the past, with no real standard, so OpenSSH's venture into the fray could have a lasting impact. In the meantime, there are enough new additions to OpenSSH to make it worth exploring for its other improvements as well.
Brief items
Quotes of the week
Qt 5.5 Alpha Available
Qt 5.5 alpha has been released. "With Qt 5.5, Canvas 3D is fully supported and a technology preview of long awaited Qt 3D is included. Qt 5.5 also introduces mapping support with a Qt Location technology preview. Qt 5.5 Alpha is the first step towards Qt 5.5 final release planned to be available in May." Check out the New Features in Qt 5.5 page for more details.
StoryText 3.12 released
Version 3.12 of the StoryText GUI-testing tool is now available. StoryText supports "PyGTK, Tkinter, wxPython, Swing and SWT along with a Python framework for testing GUIs in general." The new release adds support for GTK+3 and features several enhancements to Eclipse support.
KDE Frameworks 5.8.0 released
Version 5.8.0 of the KDE Frameworks add-on library collection is now available. New frameworks in this release include KPeople, which "provides access to all contacts and the people who hold them" and KXmlRpcClient, for interacting with XMLRPC services. There are changes to be found in many of the individual libraries; developers are encouraged to read the release notes thoroughly.
Newsletters and articles
Development newsletters from the past week
- What's cooking in git.git (March 11)
- What's cooking in git.git (March 14)
- What's cooking in git.git (March 17)
- GNU Toolchain Update (March 15)
- LLVM Weekly (March 16)
- OCaml Weekly News (March 17)
- OpenStack Community Weekly Newsletter (March 13)
- Perl Weekly (March 16)
- PostgreSQL Weekly News (March 15)
- Python Weekly (March 12)
- Ruby Weekly (March 12)
- This Week in Rust (March 16)
- Tor Weekly News (March 18)
- Wikimedia Tech News (March 16)
NTP's Fate Hinges On 'Father Time' (InformationWeek)
InformationWeek has a lengthy look at the maintenance of the network time protocol (NTP) code. "Not all is well within the NTP open source project. The number of volunteer contributors -- those who submit code for periodic updates, examine bug reports, and write fixes -- has shrunk over its long lifespan, even as its importance has increased. Its ongoing development and maintenance now rest mostly on the shoulders of [Harlan] Stenn, and that's why NTP faces a turning point. Stenn, who also works sporadically on his own consulting business, has given himself a deadline: Garner more financial support by April, 'or look for regular work.'"
OpenSCAD 2015.03 released with text objects support (Libre Graphics World)
Libre Graphics World has a look at the new release of OpenSCAD, the 3D solid-modeling tool often used in conjunction with 3D printers. The new features include support for complex text layout, offset functions for manipulating polygons, and the ability to generate height maps from PNG images. "The user interface got a few improvements as well: new startup dialog to quickly open recent files or examples from a library, new QScintilla-based code editor with folding support, SVG and AMF exporting, and more."
KDE and The Semantic Desktop
Vishesh Handa has written a detailed recap of KDE's history with the "semantic desktop" paradigm, in which the Resource Description Framework (RDF) format was used to store all data, and the Nepomuk component was provided to index and search it. "Having a huge central store was limiting, and using RDF just made it harder. Some of the notable applications were - Amarok, Bangarang, Rekonq, and KGet. However, Nepomuk was almost always optional, and not part of the core feature set." Eventually, Handa notes, Nepomuk was removed, and KDE had to design a new search engine to replace it. "This project was often sold under the misnomer of being KDE's new Semantic Search engine. I often feel that the description, while containing a ton of buzz words, really does stray away from what it really meant to be Semantic." Rather, "in Plasma 5, The Baloo project is just a file indexing and searching solution. Nothing more."
Hall: Preview of GNOME usability results
Jim Hall has posted a preview of the recent usability work done by GNOME OPW participant Sanskriti Dawle (to whom Hall has been acting as mentor). "I can make a few initial observations from this data. Looks like testers had the most difficulty with tasks Gedit.6 and Photos.3 and Photos.4, with noticeable difficulty in tasks Notes.1 and Photos.2. There's some interesting data around tasks Gedit.1 and Music.1 that might reflect testers 9, 11, and 12", Hall notes. "I encourage you to watch Sanskriti's blog for the final results, which I hope to see in the next week as she wraps up her work in the internship."
Page editor: Nathan Willis
Announcements
Brief items
Google Code shutting down
Google has announced that the Google Code repository is shutting down. "As developers migrated away from Google Code, a growing share of the remaining projects were spam or abuse. Lately, the administrative load has consisted almost exclusively of abuse management. After profiling non-abusive activity on Google Code, it has become clear to us that the service simply isn’t needed anymore." New project creation has been stopped already; the final pulling of the plug will be in January 2016.
Articles of interest
The GNU Manifesto Turns Thirty (New Yorker)
The New Yorker notes the 30th anniversary of the GNU Manifesto. "Stallman was one of the first to grasp that, if commercial entities were going to own the methods and technologies that controlled computers, then computer users would inevitably become beholden to those entities. This has come to pass, and in spades. Most computer users have become dependent on proprietary code provided by companies like Apple, Facebook, and Google, the use of which comes with conditions we may not condone or even know about, and can’t control; we have forfeited the freedom to adapt such code according to our needs, preferences, and personal ethics."
Utah software company’s decade-old suit against IBM revived (SL Tribune)
The Salt Lake Tribune reports that the SCO Group's lawsuit against IBM is once again alive and moving in Federal court. "In addition to its claims of IBM misappropriation of code, SCO alleges that IBM executives and lawyers directed the company's Linux programmers to destroy source code on their computers after SCO made its allegations. The company's other remaining claims are that IBM's actions amounted to unfair competition and interference with its contracts and business relations with other companies."
Calls for Presentations
Kolab Summit 2015
Kolab Summit will be held May 2-3 in The Hague, Netherlands, co-located with the openSUSE conference. The call for papers ends April 1. "Keynotes from Georg Greve, CEO of Kolab Systems AG, and Jeroen van Meeuwen, lead Kolab architect, will open the speaking schedule, providing a look at what is coming for Kolab in 2015 and beyond. Lead developers from Roundcube, the worlds most popular webmail application, along with key participants from KDE Kontact, cyrus-imap, Seafile, OpenChange and more will in attendance and presenting talks."
EuroPython 2015: Call for Proposals
EuroPython will be held in Bilbao, Spain, July 20-26. Proposals must be submitted by April 14. "We’re looking for proposals on every aspect of Python: programming from novice to advanced levels, applications and frameworks, or how you have been involved in introducing Python into your organization. EuroPython is a community conference and we are eager to hear about your experience."
PostgresOpen 2015 - Call For Papers
PostgresOpen will be held September 16-18 in Dallas, TX. The deadline for submitting talks is May 17. "We're looking for presentations on any topic related to PostgreSQL including, but not limited to, case studies, experiences, tools and utilities, migration stories, existing features, new feature development, benchmarks, performance tuning, and more!"
DebConf15: Call for Proposals
The DebConf Content team has announced the call for proposals for DebConf15, which will be held in Heidelberg, Germany August 15-22. The deadline is June 15.
CFP Deadlines: March 19, 2015 to May 18, 2015
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
| Deadline | Event Dates | Event | Location |
|---|---|---|---|
| March 31 | July 25-31 | Akademy 2015 | A Coruña, Spain |
| March 31 | May 4-5 | CoreOS Fest | San Francisco, CA, USA |
| April 3 | May 2-3 | Kolab Summit 2015 | The Hague, Netherlands |
| April 4 | May 30-31 | Linuxwochen Linz 2015 | Linz, Austria |
| April 6 | May 20-22 | SciPy Latin America 2015 | Posadas, Misiones, Argentina |
| April 14 | April 14-15 | Palmetto Open Source Software Conference | Columbia, SC, USA |
| April 15 | June 12-14 | Southeast Linux Fest | Charlotte, NC, USA |
| April 17 | June 11-12 | infoShare 2015 | Gdańsk, Poland |
| April 28 | July 20-26 | EuroPython 2015 | Bilbao, Spain |
| April 30 | August 7-9 | GNU Tools Cauldron 2015 | Prague, Czech Republic |
| May 1 | August 17-19 | LinuxCon North America | Seattle, WA, USA |
| May 1 | September 10-13 | International Conference on Open Source Software Computing 2015 | Amman, Jordan |
| May 1 | August 19-21 | KVM Forum 2015 | Seattle, WA, USA |
| May 1 | August 19-21 | Linux Plumbers Conference | Seattle, WA, USA |
| May 2 | August 12-15 | Flock | Rochester, New York, USA |
| May 3 | August 7-9 | GUADEC | Gothenburg, Sweden |
| May 3 | May 23-24 | Debian/Ubuntu Community Conference Italia - 2015 | Milan, Italy |
| May 8 | July 31-August 4 | PyCon Australia 2015 | Brisbane, Australia |
| May 15 | September 28-30 | OpenMP Conference | Aachen, Germany |
| May 17 | September 16-18 | PostgresOpen 2015 | Dallas, TX, USA |
| May 17 | August 13-17 | Chaos Communication Camp 2015 | Mildenberg (Berlin), Germany |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Announcing Libre Graphics Meeting 2015
Libre Graphics Meeting will be held April 29-May 2 in Toronto, Canada. "Developers and prominent users of software projects like Inkscape, Gimp, Scribus, Fontforge, My Paint, Blender, and Krita—among many others— come together to show off their projects and discuss them with the larger Libre Graphics community. Not only is Libre Graphics Meeting an exciting and motivating moment for developers and users of all kinds, from typographers to illustrators, designers and video artists, it's is also a unique moment for users and developers of free software to collide and share ideas beyond the confined space of mailing-lists, bug trackers or forums."
Events: March 19, 2015 to May 18, 2015
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| March 17-19 | OpenPOWER Summit | San Jose, CA, USA |
| March 21-22 | LibrePlanet 2015 | Cambridge, MA, USA |
| March 21-22 | Kansas Linux Fest | Lawrence, Kansas, USA |
| March 23-25 | Android Builders Summit | San Jose, CA, USA |
| March 23-25 | Embedded Linux Conference | San Jose, CA, USA |
| March 24-26 | FLOSSUK DevOps Conference | York, UK |
| March 25-27 | PGConf US 2015 | New York City, NY, USA |
| March 26 | Enlightenment Developers Day North America | Mountain View, CA, USA |
| March 28-29 | Journées du Logiciel Libre | Lyon, France |
| April 9-12 | Linux Audio Conference | Mainz, Germany |
| April 10-12 | PyCon North America 2015 | Montreal, Canada |
| April 11-12 | Lyon mini-DebConf 2015 | Lyon, France |
| April 13-17 | SEA Conference | Boulder, CO, USA |
| April 13-17 | ApacheCon North America | Austin, TX, USA |
| April 13-14 | AdaCamp Montreal | Montreal, Quebec, Canada |
| April 13-14 | 2015 European LLVM Conference | London, UK |
| April 14-15 | Palmetto Open Source Software Conference | Columbia, SC, USA |
| April 16-17 | Global Conference on Cyberspace | The Hague, Netherlands |
| April 17-19 | Dni Wolnego Oprogramowania / The Open Source Days | Bielsko-Biała, Poland |
| April 21 | pgDay Paris | Paris, France |
| April 21-23 | Open Source Data Center Conference | Berlin, Germany |
| April 23 | Open Source Day | Warsaw, Poland |
| April 24 | Puppet Camp Berlin 2015 | Berlin, Germany |
| April 24-25 | Grazer Linuxtage | Graz, Austria |
| April 25-26 | LinuxFest Northwest | Bellingham, WA, USA |
| April 29-May 2 | Libre Graphics Meeting 2015 | Toronto, Canada |
| May 1-4 | openSUSE Conference | The Hague, Netherlands |
| May 2-3 | Kolab Summit 2015 | The Hague, Netherlands |
| May 4-5 | CoreOS Fest | San Francisco, CA, USA |
| May 6-8 | German Perl Workshop 2015 | Dresden, Germany |
| May 7-9 | Linuxwochen Wien 2015 | Wien, Austria |
| May 8-10 | Open Source Developers' Conference Nordic | Oslo, Norway |
| May 12-13 | PyCon Sweden 2015 | Stockholm, Sweden |
| May 12-14 | Protocols Plugfest Europe 2015 | Zaragoza, Spain |
| May 13-15 | GeeCON 2015 | Cracow, Poland |
| May 14-15 | SREcon15 Europe | Dublin, Ireland |
| May 16-17 | 11th Intl. Conf. on Open Source Systems | Florence, Italy |
| May 16-17 | MiniDebConf Bucharest 2015 | Bucharest, Romania |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol
