Brief items

The latest 3.1 prepatch was released on October 4. The kernel repository is back in its old location on kernel.org; Linus also has started using a new signing key with this release. "On the kernel front, not a huge amount of changes. That said, by now, there had better not be - and I definitely wouldn't have minded having even fewer changes. But the fixes that are here are generally pretty small, and the diffstat really doesn't look all that scary - there really aren't *big* changes anywhere."
Use of plugins in this way allows significant changes to be made to the kernel without actually having to change the code.
On the other hand, plugins of this type can increase the distance between the code one sees and what is actually run in the kernel; it is easy to imagine that leading to some real developer confusion at some point. Still, says PaXTeam, "the cost/benefit ratio of the plugin approach is excellent and there's a lot more in the pipeline." It is not too hard to imagine other uses that are not necessarily tied to security.
(Amusingly, the plugins are licensed under GPLv2, meaning that they do not qualify for the GCC runtime library exemption. The kernel does not need that library, though, so all is well.)
Kernel development news

The community was informed that kernel.org, the primary repository for kernel code in various stages of development, had been compromised - though developers with access to the site had been notified a few days prior. The site was shut down for "maintenance" when that notice went out, leaving the community without an important hosting and distribution point. Kernel development has slowed as a result; the 3.1 kernel, which would have been expected by now, remains unreleased. Kernel.org is on its way back, but it will almost certainly never be quite the same.
On October 3, a basic kernel.org returned to the net. Git hosting is back, but only for a very small number of trees: mainline, stable, and linux-next. The return of the other trees is waiting for the relevant developers to reestablish their access to the site - a process that involves developers verifying the integrity of their own systems, then generating a new PGP/GPG key, integrating it into the web of trust, and forwarding the public key to the kernel.org maintainers. This procedure could take a while; it is not clear how many developers will be able to regain their access to kernel.org before the 3.2 merge window opens.
The front-page web interface is back, though, as of this writing, it is not being updated to reflect the state of the git trees. Most other kernel.org services remain down; some could stay that way for some time. It is worth remembering that kernel.org only has one full-time system administrator, a position that has been funded by the Linux Foundation since 2008. That administrator, along with a number of volunteers, is likely to be quite busy; some of the less-important services may not return anytime soon.
A full understanding of what happened is also likely to take some time. Even in the absence of a report on this intrusion, though, there are some conclusions that can be drawn. The first is obvious: the threat is real. There are attackers out there with time, resources, motivation, and skills. Given the potential value of either putting a back door into the kernel or adding a trojan that would run on developers' machines, we have to assume that there will be more attacks in the future. If the restored kernel.org is not run in a more secure manner, it will be compromised again in short order.
The site's administrators have already announced that shell accounts will not be returning to the systems where git trees are hosted. Prior to the break-in, there were on the order of 450 of those accounts; that is a lot of keys to the front door to have handed out. No matter how careful all those developers may be - and some are more careful than others - the chances of one of them having a compromised machine approach 100%. Keeping all those shell accounts off the system is clearly an important step toward a higher level of security.
Kernel.org has its roots in the community and was run the way kernel developers often run their machines. So, for example, kernel.org tended to run mainline -rc kernels - a good exercise in dogfooding, perhaps, but it also exposed the system to bleeding-edge bugs, and, perhaps more importantly, obscured the real cause of kernel panics experienced last August, delaying the realization that the system had been compromised. The kernel currently running on the new systems has not been announced; one assumes it is something a little better tested, better supported, and more stable. (No criticism is intended by pointing this out, incidentally. Kernel.org has been run very well for a long time; the point here is that the environment has changed, so practices need to change too.)
At this point it seems clear that a single administrator for such a high-profile site is not an adequate level of staffing. Given the resources available in our community, it seems like it should be possible to increase the amount of support available to kernel.org. There are rumors that this is being worked on, but nothing has been announced.
Developers are going to have to learn to pay more attention to the security of their systems. There are scattered reports of kernel developers turning up compromised systems; in some cases, they may have been infected as the result of excessive trust in kernel.org. Certain practices will have to change; for that reason, the Fedora project's announcement of a zero-tolerance policy toward private keys on Fedora systems is welcome. Developers are on the front line here: everybody is depending on them to keep their code - and the infrastructure that distributes that code - secure.
There is an interesting question related to that: will kernel developers move back to kernel.org? These developers have had to find new homes for their git repositories during the outage; some of them are likely to decide that leaving those repositories in their new location is easier than establishing identities in the web of trust and getting back into kernel.org. Linus has said in the past that he sees the presence of a kernel.org-hosted tree in a pull request as a sign that the request is more likely to be genuine. Requiring that repositories be hosted at kernel.org seems like an unlikely step for this community, though. It is not entirely clear whether trees distributed around the net increase the security risk to the kernel, or whether putting all the eggs into the kernel.org basket would be worse.
One other conclusion would seem to jump out at this point: kernel.org got hit this time, but there are a lot of other important projects and hosting sites out there. Any of those projects is just as likely to be a target as the kernel. If we are not to have a long series of embarrassing compromises, some with seriously unfortunate consequences, we're going to have to take security more seriously everywhere. Doing so without ruining our community's openness is going to be a challenge, to say the least, but it is one we need to take on. Security is a pain, but being broken into and used to attack your users and developers is even more so.
One would think that, once the hrtimer code starts running in response to a timer interrupt, it would make sense to run every timer event whose soft expiration time has passed. But that is not what current kernels do. It is an interesting exercise to look at why that is, and how a recent patch from Venkatesh Pallipadi changes that behavior.
For the sake of simplicity, let us imagine a set of timers that we'll call "A" through "G", each expiring 10µs after its predecessor. The hard expiration times are regular, but the timers have wildly differing soft expiration times; plotted on a timeline, the example timers look like this:
As can be seen here, timer "A" has a hard expiration 10µs in the future, but it could expire any time after 5µs. Timer "B" can be expired anytime from 7.5µs to 20µs in the future; the kernel can thus expire them both at 10µs and eliminate the need to schedule a timer interrupt at 20µs. Further in the future, timer "D" has a hard expiration 40µs ahead, but it is quite flexible and could, like timer "B", legitimately be expired 7.5µs from now.
If the kernel is interrupted by a hardware timer in 10µs, it might be expected to call the expiration function for timers "A", "B", and "D". In reality, though, the expiration function for "D" will not be called at that time. To understand why, consider that hrtimers, within the kernel, are stored in a red-black tree with the hard expiration time as the key. The resulting tree will look something like this:
When the timer interrupt happens, the timer code performs a depth-first traversal of this tree for as long as it finds timers whose soft expiration time has passed. In this case, it will encounter "A" and "B" but, once it hits "C", the soft expiration time is in the future and the traversal stops. The organization of the data structure is such that the code cannot find the other events whose soft expiration time has passed without searching the whole tree.
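That early stop can be sketched in a few lines of Python. The timers and their soft/hard expiration values below are the illustrative ones from the example above; the soft time for "C" is an assumption, since the text gives only its hard expiration.

```python
from dataclasses import dataclass

@dataclass
class Timer:
    name: str
    soft: float  # earliest allowed expiration (µs from now)
    hard: float  # latest allowed expiration (µs from now)

# Timers in hard-expiration order, as an in-order rbtree walk
# would visit them; soft times vary widely ("C" is an assumption).
timers = [
    Timer("A", soft=5.0,  hard=10.0),
    Timer("B", soft=7.5,  hard=20.0),
    Timer("C", soft=25.0, hard=30.0),
    Timer("D", soft=7.5,  hard=40.0),
]

def expire_in_order(timers, now):
    """Walk timers in hard-expiration order, stopping at the first
    one whose soft expiration is still in the future - the behavior
    of the unmodified hrtimer code."""
    fired = []
    for t in timers:
        if t.soft > now:
            break  # stop: later timers might legitimately be too early
        fired.append(t.name)
    return fired

print(expire_in_order(timers, now=10.0))  # ['A', 'B'] - "D" is missed
```

As the output shows, "D" is soft-expired at 10µs but is never reached, because the traversal gives up as soon as it encounters "C".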
When the hrtimer code was extended to support range timers, searching for all soft-expired timers looked like it would require the addition of a second tree over the existing tree. That was deemed to be too expensive, especially since it may not actually save any wakeups. With the current code, "D" will be expired after 30µs, when "C" hits its hard expiration. Expiring "D" sooner will not eliminate the need for a wakeup at 30µs, so it didn't seem worth the effort to expire "D" sooner.
Venkatesh thought this through and decided that he could come up with a couple of exceptions to that reasoning. It may well be that, at 10µs, the system will be less heavily loaded than at 30µs; in that case, it makes sense to get more work done sooner. Running the timer sooner also could save a wakeup if "C" is deleted prior to expiration. So he wrote up a patch to implement a "greedy hrtimer walk" that would run all soft-expired hrtimers on a timer interrupt.
He was helped by the addition of augmented red-black trees (also done by Venkatesh) in 2010. Essentially, an augmented tree allows the addition of a bit of extra metadata to each node; when a change is made to the tree, that extra information can be percolated upward. The greedy hrtimer walk patch turns the hrtimer tree into an augmented red-black tree; each node then stores the earliest soft expiration time to be found at that level of the tree or below. With the timer example given above, the new tree would look like this:
The new numbers in red tell the tree-traversal logic what the soonest soft-expiration time is in each subtree. Using those numbers, a search of the tree 10µs in the future could prune the search at "F", since all soft expiration times will be known to be at least 25µs further in the future at that time. That takes away much of the cost of searching the tree for soft-expired timers that are not on the left side.
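A rough Python model of that pruning follows. The tree shape and the soft times for "E" through "G" are made up for illustration (the article details only the left side of the example); the augmented `min_soft` field plays the role of the numbers in red.

```python
class Node:
    """Timer node keyed on hard expiration, augmented with the
    minimum soft expiration found anywhere in its subtree."""
    def __init__(self, name, soft, hard, left=None, right=None):
        self.name, self.soft, self.hard = name, soft, hard
        self.left, self.right = left, right
        self.min_soft = min([soft] +
                            [c.min_soft for c in (left, right) if c])

def greedy_walk(node, now, fired):
    """Collect every timer whose soft expiration has passed, pruning
    any subtree whose earliest soft expiration is still ahead."""
    if node is None or node.min_soft > now:
        return
    greedy_walk(node.left, now, fired)
    if node.soft <= now:
        fired.append(node.name)
    greedy_walk(node.right, now, fired)

# A tree shaped like the example: "D" at the root, "A"-"C" on the
# left, "E"-"G" on the right (soft times on the right are assumptions).
tree = Node("D", soft=7.5, hard=40,
            left=Node("B", soft=7.5, hard=20,
                      left=Node("A", soft=5, hard=10),
                      right=Node("C", soft=25, hard=30)),
            right=Node("F", soft=35, hard=60,
                       left=Node("E", soft=40, hard=50),
                       right=Node("G", soft=50, hard=70)))

fired = []
greedy_walk(tree, now=10, fired=fired)
print(fired)  # ['A', 'B', 'D'] - "D" is no longer left behind
```

The subtree rooted at "F" is never entered, because its `min_soft` of 35 is known to be in the future; only the "C" leaf is skipped on the left side.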
One might still wonder if that extra work is worthwhile on the off-chance that running timer events sooner will be advantageous. After all, in the absence of specific knowledge or a crystal ball, it is just as likely that the system will be less loaded at the later expiration time; in that case, expiring the timer sooner would make things worse. Venkatesh's patch avoids that issue by only performing the greedy hrtimer walk if the CPU is idle when the timer interrupt happens. If work is being done, soft-expired timers that are not immediately accessible are left in the tree, but, if the CPU has nothing better to do, it performs the full search.
Venkatesh benchmarked this work by looking at the number of times the scheduler migrated tasks between CPUs on a given workload. Migrations are a sign of contention for the processor; they can also be expensive since processes can leave their memory cache behind when they move. Given the right workload (80% busy with a number of threads), the number of migrations was cut to just over half its previous value; other workloads gave less impressive results, but the patch never seemed to hurt. Given that, the comments on the patch were mostly focused on the details - like whether the greedy behavior should be controlled by a sysctl knob or not. Chances are this feature will show up in the 3.2 kernel.
Google's requirements for systems running in its cluster have been discussed in public a number of times at this point; the recent Linux Plumbers Conference session on control groups is an example. The company does everything it can to pack as much work onto each system as possible to ensure that its hardware is fully utilized. One aspect of this packing is the need to make the best use possible of system memory. Michel Lespinasse's recently posted idle page tracking patch set is one piece of Google's solution to this problem.
The "fake NUMA" mechanism is currently used to control memory use within a single system, but Google is trying to move to the control-group memory controller instead. The memory controller can put limits on how much memory each group of processes can use, but it is unable to automatically vary those limits in response to the actual need shown by those groups. So some control groups may have a lot of idle memory sitting around while others are starved. Google would like to get a better handle on how much memory each group actually needs so that the limits can be adjusted on the fly - responding to changes in load - and more jobs can be crammed onto each box.
Determining a process's true memory needs can be hard, but one fairly clear clue is the existence of pages in the process's working set that have not been touched in some time. If there are a lot of idle pages around, it is probably safe to say that the process is not starved for memory; this idea is based, of course, on the notion that the kernel's page replacement algorithm is working reasonably well. It follows that, if you would like to know how memory usage limits can be tweaked to optimize the use of memory, it makes sense to track the number of idle pages in each control group. The kernel does not currently provide that information - a gap that Michel's patch set tries to fill.
The memory management code has a function (page_referenced() and a number of variants) that can be used to determine whether a given page has been referenced since the last time it was checked. It is used in a number of memory management decisions, such as the quick aging of pagecache pages that are only referenced once. Michel's patch makes use of this mechanism to find idle pages, but this use has some slightly different needs: Michel needs to know more about the pages in question, and he must avoid interfering with other users of page_referenced(). To meet these needs, Michel has to make some changes to the core memory management code.
For the first problem, the page_referenced() interface is changed to take a new structure (struct page_referenced_info) where the additional information can be recorded. Avoiding interference with existing users of page_referenced(), meanwhile, requires adding a couple of new page flags. Since page flags are in short supply on 32-bit architectures, using more of them is strongly discouraged. This patch set gets around that problem by disabling the feature altogether on 32-bit machines; anybody wanting idle page tracking will need to run in 64-bit mode.
Systems where idle page tracking is in use will have a new kernel thread running under the name kstaled. Its job is to scan through all of memory (once every two minutes by default) and count the number of pages that have not been referenced since the previous scan. Such pages are deemed to be idle; each one is traced back to its owning control group and that group's statistics are adjusted. The patch adds a new "page age" data structure - an array containing one byte for every page in the system - to track how long each page has been idle, up to 255 scan cycles. The results are boiled down to counters showing how many pages have been idle for 1, 2, 5, 15, 30, 60, 120, and 240 cycles. Idle pages are further broken down into a few categories: clean pages, dirty file-backed pages, and dirty anonymous pages. These counters, which are only updated at the end of each scan, can be found in the memory controller's control directory for each group.
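The scan-and-bucket bookkeeping might be modeled, very roughly, like this; the threshold list comes from the counters described above, while everything else (function names, the referenced-bit representation) is illustrative.

```python
# Illustrative sketch of kstaled-style aging: one byte of age per
# page, bucketed into the idle-age counters described above.
THRESHOLDS = (1, 2, 5, 15, 30, 60, 120, 240)

def kstaled_scan(page_age, referenced):
    """One scan pass: reset the age of referenced pages, age the
    rest (saturating at 255), and return the per-threshold counts."""
    for i, ref in enumerate(referenced):
        page_age[i] = 0 if ref else min(page_age[i] + 1, 255)
    return {t: sum(1 for age in page_age if age >= t)
            for t in THRESHOLDS}

ages = bytearray(4)                  # one byte per page, as in the patch
kstaled_scan(ages, [False] * 4)      # first scan: nothing was referenced
counts = kstaled_scan(ages, [True, False, False, False])
print(counts[1], counts[2])          # 3 3 - three pages idle for two scans
```

The `bytearray` mirrors the one-byte-per-page design: ages cannot exceed 255, and a single reference resets a page's age to zero.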
Since the statistics are only updated at the end of each scan, and since the scans are two minutes apart, the resulting numbers are likely to lag reality by some time. Imagine that a given page is scanned toward the beginning of a cycle and seen to be in use; clearly it will not be counted as idle. If it is referenced one last time just after the scan, it will still appear to be in use at the next scan, nearly two minutes later, when the "referenced" bit will be reset. It is only after another two minutes that kstaled will decide that the page is unused - nearly four minutes after its last reference. That is not necessarily a problem; a decision to shrink a group of processes because they are not using all of their memory probably should not be made in haste.
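The nearly-four-minute figure falls out of a little arithmetic on the scan schedule; this hypothetical helper computes it for the default two-minute period.

```python
SCAN_PERIOD = 120  # seconds between kstaled scans (the default)

def first_idle_scan(last_ref, scan_times):
    """Return the scan at which a page last referenced at last_ref is
    first seen as idle: the first scan after the reference merely
    clears the referenced bit; only the next one finds it still clear."""
    later = [t for t in scan_times if t > last_ref]
    return later[1]

scans = [i * SCAN_PERIOD for i in range(4)]  # 0, 120, 240, 360
print(first_idle_scan(1, scans) - 1)  # 239 seconds: nearly four minutes
```

A page touched just after the scan at t=0 is still marked referenced at t=120, so it is not counted as idle until the t=240 scan.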
There are times when more current information is useful, though. In particular, Google's management code would like to know when a group of processes suddenly starts making heavier use of its memory so that its limits can be expanded before its processes begin to thrash. To handle this case, the patch introduces the notion of "stale" pages: a page is stale if it is clean and if it has been idle for more than a given (administrator-defined) number of scan cycles. The presence of stale pages indicates that a control group is not under serious memory pressure. If that control group's memory needs suddenly increase, though, the kernel will start reclaiming those stale pages. So a sudden drop in the number of stale pages is a good indication that something has changed.
When kstaled determines that a given page is stale, one of the new page flags (PG_stale) will be used to mark it. Tests have been sprinkled throughout the memory management code to notice when a stale page is dirtied, referenced, locked, or reclaimed; when that happens, the owning control group's count of stale pages will be decremented on the spot. Stale pages are not detected any more quickly than idle pages, but a reduction in the number of stale pages can be noticed immediately. That provides an early-warning system that can flag control groups whose memory use is on the increase.
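A hypothetical sketch of that accounting, with the PG_stale flag modeled as a dictionary entry and the per-group counter as a plain attribute: marking happens only at scan time, but the counter drops the instant a stale page is touched.

```python
# Hypothetical model of stale-page accounting (names are illustrative).
class MemCg:
    """Stand-in for a memory control group's statistics."""
    def __init__(self):
        self.nr_stale = 0

def mark_stale(cg, page):
    """Scanner side: flag a clean page that has been idle longer
    than the administrator-defined threshold."""
    if not page.get("PG_stale"):
        page["PG_stale"] = True
        cg.nr_stale += 1

def page_accessed(cg, page):
    """Access-path side (dirty/reference/lock/reclaim): clear the
    flag and update the group's counter on the spot."""
    if page.pop("PG_stale", False):
        cg.nr_stale -= 1

cg, page = MemCg(), {}
mark_stale(cg, page)
page_accessed(cg, page)
print(cg.nr_stale)  # 0 - the drop is visible immediately, between scans
```

The asymmetry is the point: staleness is only ever *gained* at scan granularity, but it is *lost* immediately, which is what makes the counter useful as an early-warning signal.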
The patch has been through a couple of iterations; there have been comments pointing out things to fix, but no fundamental opposition to the idea. That said, memory management patches are not known for their speed getting into the mainline; when - and whether - we'll see this feature in mainline kernels remains to be seen.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds