Kernel development

Brief items

Kernel release status

The current development kernel is 4.8-rc8, released on September 25. Linus said: "Things actually did start to calm down this week, but I didn't get the feeling that there was no point in doing one final rc, so here we are. I expect the final 4.8 release next weekend, unless something really unexpected comes up."

The September 25 4.8 regression list has 15 entries.

Stable updates: 4.7.5 and 4.4.22 were released on September 24. The 4.7.6 and 4.4.23 updates are in the review process as of this writing; they can be expected on or after September 30.


Kernel development news

A look at the 4.8 development cycle

By Jonathan Corbet
September 28, 2016

As of this writing, the 4.8 development cycle is nearing its end. Linus has let it be known that a relatively unusual -rc8 release candidate will be required before the final release, but that still means that the cycle will only require 70 days, fitting into the usual pattern. Taking a look at the development statistics for the release at about this point also fits the pattern.

With regard to the release cycle, it has become boringly regular in recent years. The 3.8 kernel, released on February 18, 2013, came out on a Sunday, as has every subsequent release with the exception of 3.11, which was released on Monday, September 2, 2013. In these last few years, the only cycle that has taken longer than 70 days was 3.13, which required 77 days. The extra week that time around was forced by Linus's travels, rather than anything inherent in that cycle itself. Since then, every cycle has taken 63 or 70 days, with the sole exception of 3.16, which showed up in 56 (and one could quibble that it was really a 63-day cycle as well — that was the time Linus experimented with opening the merge window before the previous final release had been made).

In this 70-day cycle, we have seen the addition of 13,253 non-merge changesets from 1,578 developers — so far; the numbers will increase slightly before the end. It is thus a busy cycle, though the record for the busiest (3.15, with 13,722 commits) remains unchallenged. Those developers grew the kernel by 350,000 lines this time around. The most active developers in this cycle were:

Most active 4.8 developers

By changesets

Mauro Carvalho Chehab 347 2.6%

Chris Wilson 266 2.0%

Arnd Bergmann 180 1.4%

Daniel Vetter 144 1.1%

Geert Uytterhoeven 139 1.0%

Wei Yongjun 129 1.0%

Hans Verkuil 121 0.9%

Arnaldo Carvalho de Melo 117 0.9%

James Hogan 107 0.8%

Paul Gortmaker 100 0.8%

Trond Myklebust 98 0.7%

David Hildenbrand 92 0.7%

Christoph Hellwig 90 0.7%

Krzysztof Kozlowski 88 0.7%

Ville Syrjälä 86 0.6%

Daniel Lezcano 82 0.6%

Ben Dooks 80 0.6%

Linus Walleij 76 0.6%

Wolfram Sang 75 0.6%

Christian König 75 0.6%

By changed lines

Mauro Carvalho Chehab 110741 13.2%

Markus Heiser 77196 9.2%

Hans Verkuil 17868 2.1%

Wolfram Sang 15211 1.8%

Moni Shoua 13039 1.6%

Christoph Hellwig 12535 1.5%

Yuval Mintz 12467 1.5%

Jani Nikula 12397 1.5%

Chris Wilson 11003 1.3%

Darrick J. Wong 7453 0.9%

Arnaldo Carvalho de Melo 7204 0.9%

Marc Zyngier 6514 0.8%

Daniel Vetter 6499 0.8%

Megha Dey 5844 0.7%

Florian Fainelli 5697 0.7%

Krzysztof Kozlowski 5600 0.7%

Gavin Shan 5343 0.6%

Bryant G. Ly 5019 0.6%

Arnd Bergmann 4914 0.6%

Adrian Hunter 4906 0.6%

Mauro Carvalho Chehab, the maintainer for the media subsystem, is traditionally a highly active developer. To understand his position at the top of both columns this time around, one need only look back to the 4.8-rc1 announcement, where Linus said:

The merge window has been fairly normal, although the patch itself looks somewhat unusual: over 20% of the patch is documentation updates, due to conversion of the drm and media documentation from docbook to the Sphinx doc format.

Many of those documentation updates, part of the transition in the kernel's formatted documentation subsystem, came from Mauro, who jumped on the task of converting the (considerable) media documentation with gusto. Other developers at the top of the "by changesets" column include Chris Wilson, whose work was focused on the Intel i915 driver; Arnd Bergmann who, when he's not maintaining the arm-soc subsystem, stays busy eliminating warnings from the kernel build; Daniel Vetter, an active DRM developer; and Geert Uytterhoeven, who did a lot of system-on-chip support work.

In the "changed lines" column, Markus Heiser worked on the media document conversion and contributed a fair amount of code to make the new documentation system work. Hans Verkuil did a lot of media driver work (including removing some unused drivers), Wolfram Sang spent time on the ks7010 driver in the staging tree (along with maintaining the I2C subsystem), and Moni Shoua contributed a single patch adding the "RDMA over converged Ethernet" driver to the InfiniBand subsystem.

Normally, work in the staging tree figures prominently in these statistics, but it is almost absent this time around. Indeed, only 386 patches have been applied to the staging tree in the 4.8 cycle, far fewer than the 916 seen in 4.7, or the 1,852 in 4.6. One might be tempted to think that the staging tree is slowing down, but that seems likely to be a temporary state of affairs. Indeed, it appears that the 4.9 development cycle will see over 2,300 staging commits for the addition of the greybus subsystem alone.

Work on the 4.8 kernel was supported by 217 employers that we were able to identify. The most active employers this time around were:

Most active 4.8 employers

By changesets

Intel 1960 14.8%

Red Hat 1143 8.6%

(Unknown) 806 6.1%

(None) 746 5.6%

Linaro 662 5.0%

IBM 654 4.9%

Samsung 637 4.8%

SUSE 338 2.6%

Google 294 2.2%

AMD 281 2.1%

Oracle 259 2.0%

Texas Instruments 258 1.9%

Mellanox 243 1.8%

Renesas Electronics 223 1.7%

Broadcom 217 1.6%

ARM 204 1.5%

Huawei Technologies 170 1.3%

NVidia 166 1.3%

NXP Semiconductors 163 1.2%

(Consultant) 157 1.2%

By lines changed

Samsung 120693 14.4%

Intel 104291 12.4%

(None) 102848 12.3%

Red Hat 48563 5.8%

IBM 42298 5.0%

Mellanox 29226 3.5%

(Unknown) 27671 3.3%

Linaro 22960 2.7%

Broadcom 18040 2.2%

Cisco 17868 2.1%

MediaTek 16292 1.9%

QLogic 15986 1.9%

ARM 14397 1.7%

Renesas 14283 1.7%

(Consultant) 14146 1.7%

Free Electrons 11227 1.3%

Oracle 10982 1.3%

Texas Instruments 9789 1.2%

Google 9534 1.1%

Renesas Electronics 9482 1.1%

The documentation work has shifted the numbers around here a bit but, for the most part, this table is as boring and unsurprising as usual. Samsung's position at the top of the "lines changed" column is, once again, the result of the formatted documentation transition.

In summary, this would appear to be another relatively normal, busy development cycle. The kernel development machine appears to continue to hum along smoothly, with no serious process problems evident at this level, though, as the recent discussion on backporting showed, there are issues elsewhere in the community. Both the 4.8 kernel and the community that produces it appear to be working well.


A low-level hibernation bug hunt

September 28, 2016

This article was contributed by Rafael J. Wysocki

This is a story about how several obscure and nasty hibernation bugs were fixed over the last few months and how hibernation on x86-64 was made to work correctly with kernel address space layout randomization (KASLR) at the same time. It is a success story, but it did not look like that in the beginning. That success would not have been possible without a series of bug reports that happened to appear in just the right order, one after another. Fortunately enough, in each case the bug in question was reliably reproducible on at least one system, which allowed it to be narrowed down to a particular kernel change or a specific piece of code. It also would not have been possible without the persistence and determination of the bug reporters and developers involved.

For me, it started with a problem report from Logan Gunthorpe forwarded to the Linux power-management development list by Ingo Molnar. In that report, Gunthorpe said that hibernation broke for him after a security-related change that had made the kernel set the "no execute" (NX) flag on memory pages in the gap between the kernel code and the read-only data section following it.

My initial idea about why that change might cause hibernation to fail was related to how resume from hibernation worked on x86-64, so let me explain that briefly to begin with.

Hibernation on x86-64

Hibernation is generally regarded as a power-management feature, but it really is a checkpoint/restore mechanism working on the system as a whole. When triggered, it creates a snapshot of all memory pages in use at that time and saves it in persistent storage. Of course, the snapshot of each page has to be saved along with the number of the page frame occupied by it, so that it can be put into the same page frame later on. All of that information combined is referred to as a "hibernation image".

Next, the system is turned off (that can be done in a few different ways which are not relevant here). When turned on again later, it undergoes full initialization, starting with the platform firmware, which invokes the bootloader that, in turn, loads a new kernel (that is what happens in Linux; the resume control flow in other operating systems may be different). That new kernel is then responsible for loading the hibernation image created earlier back into memory and for restoring its previous state, so it will be referred to as the "restore kernel" in what follows. In turn, the kernel that created the hibernation image and, therefore, is included in it will be referred to as the "image kernel".

Of course, the restore kernel is always different from the image kernel, but it may come from the same kernel binary, in which case the kernel code is the same in both of them. That is not a requirement on x86-64, though. Moreover, even if the kernel code (often referred to as the "kernel text") is the same, the layout of code and data in memory created by the restore kernel may be different from what the image kernel had used. For instance, if kernel address space layout randomization is in use, the physical location of the kernel code in the restore and image kernels usually will be different. Moreover, in Linux 4.8-rc1 (and later) KASLR will cause the virtual base address of the kernel identity mapping (the one that maps the entire physical address space of the system into the kernel's virtual address space) to be different in each of them as a rule.

When the restore kernel runs, it will first initialize itself and the hardware; then it will look for a hibernation image header. If it finds one, it reads image description data from there and, if all looks good, it will start to load the image.

The goal here is to put each memory page included in the image into the page frame it occupied before hibernation and pass control to the image kernel, which can take over from that point on (as the memory will then look the same as before hibernation to it). That is not as straightforward as it sounds, however, because at least some of the page frames in question will be occupied by the restore kernel itself or its data. To overcome that difficulty, the restore kernel takes several steps that each get it closer to its goal.

First of all, it allocates enough memory to hold all of the data pages and metadata (basically consisting of the page frame numbers to put those data pages into eventually) from the image. It uses two bitmaps to track the memory allocated in this step, to keep a record of (1) which page frames have been allocated and (2) which of them were in use before hibernation. The allocated ones that were not used before hibernation (i.e. their numbers are not included in the image metadata) are referred to as "safe", because they won't be overwritten with data coming from the image going forward.

Second, all of the image data pages are loaded into the allocated memory. The trick here is to store as many data pages from the image as possible in the page frames they occupied before hibernation; the bitmaps mentioned above are used for that. Namely, before loading a data page from the image, the page frame it occupied before hibernation is looked up in the bitmaps and, if it is present there (i.e. it was allocated in the previous step), the data page is loaded into it directly without the need to remember where it has been stored. If the page frame occupied by that data page before hibernation was not allocated in the previous step, the data page has to be stored in a safe page frame whose number has to be recorded along with the "target" location of the data page stored in it.

The next step is to quiesce all devices and all CPUs except for one and, having done that, the restore kernel prepares to copy all of the image data pages stored in "safe" page frames previously to their "target" locations. That has to be done in an architecture-specific way and it has to take into account the fact that the restore kernel itself and its data will be overwritten in the process, so the following step will not be reversible.

On x86-64, the restore kernel creates temporary page tables consisting of safe pages only, so that they will not get overwritten with image data. These page tables only need to cover two mappings: the identity mapping necessary for the image data pages copying operation itself and the kernel text mapping allowing the restore kernel to pass control back to the image kernel. This transfer of control is done by jumping to an address representing the image kernel's entry point (that can be read from the image header). In addition, the code that will copy the image data pages and perform the final jump to the image kernel's entry point has to be relocated to a safe page in order to prevent it from overwriting itself inadvertently; the page it has been relocated to must be marked as executable. With all that in place, the restore kernel only needs to jump to the relocated code that will switch over to the temporary page tables, copy the image data pages still held in "safe" page frames to their "target" locations, and jump to the image kernel's entry point.
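The constraint at the heart of this step is that nothing the copy loop depends on may sit in a page frame that image data will be copied into. A minimal Python model of that invariant follows; memory is a dictionary keyed by page frame number, and `final_copy`, `relocated`, and `critical_frames` are hypothetical names chosen for this sketch rather than anything in the kernel.

```python
def final_copy(memory, relocated, critical_frames):
    """Model of the last, irreversible step of image restoration.

    memory          -- dict mapping PFN -> page contents
    relocated       -- dict mapping safe PFN -> target PFN
    critical_frames -- PFNs holding the temporary page tables and the
                       relocated copy loop; they must survive the copy
    """
    targets = set(relocated.values())
    overlap = targets & set(critical_frames)
    if overlap:
        # Overwriting these frames would destroy the code or page tables
        # that the copy itself is running on.
        raise RuntimeError(f"critical frames would be overwritten: {sorted(overlap)}")
    for safe_pfn, target_pfn in relocated.items():
        memory[target_pfn] = memory[safe_pfn]
    return memory


# Hypothetical layout: frames 1 and 5 belong to the restore kernel and
# are about to be replaced by image data parked in frames 7 and 8.
memory = {
    1: "restore-kernel data",
    5: "restore-kernel text",
    7: "image page for PFN 1",
    8: "image page for PFN 5",
    9: "temporary page tables + copy loop",
}
final_copy(memory, relocated={7: 1, 8: 5}, critical_frames={9})
```

After the call, frames 1 and 5 hold image data and frame 9 is untouched, which is why building the page tables and the relocated code from safe pages matters.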

Where things went wrong

That should sound reasonable enough, but it is what the restore kernel does today. At the time of Gunthorpe's bug report, however, the code in question was somewhat less straightforward.

Namely, it also created temporary page tables but, while the identity mapping covered by those tables was set up from scratch, the restore kernel's own text mapping was reused by hooking it up directly into the topmost page directory of the new page tables. That allowed the restore kernel to switch over to the temporary page tables before jumping to the relocated code, but it also imposed serious limitations on the final jump to the image kernel's entry point such that it would only work in quite specific conditions. As it turned out, those conditions were not guaranteed to be met in general; that was the source of the problem seen by Gunthorpe.

My first idea about what might have gone wrong was that, perhaps, the security change identified by Gunthorpe as the one that introduced the problem caused the page containing the image kernel's entry point to become non-executable in the restore kernel's text mapping. With that in mind I prepared a patch that would mark that page as executable at the right time and asked Gunthorpe to test it, but it did not make any difference.

That caused me to look at the addresses involved more closely; I quickly realized that reusing the restore kernel's text mapping in the temporary page tables was a mistake, because that mapping might very well be corrupted in the process of copying image data pages to their target locations. If that happened, the final jump to the image kernel's entry point would go nowhere, triggering a page fault that couldn't be handled at that point. Clearly, the temporary page tables needed a kernel text mapping set up from scratch consisting of only safe pages, just like the identity mapping. I noticed, though, that it didn't have to cover the entire kernel text. In fact, it didn't have to cover the kernel text at all. It only had to cover the image kernel's entry point itself.

That was the case because the code performing the final jump to the image kernel's entry point would be relocated and it would be running from a page covered by the identity mapping, so it didn't need the kernel text mapping to run. Moreover, the virtual address of the image kernel's entry point passed in the image header had to be mapped to the physical address of its location in memory, but that might not match the restore kernel's text mapping. Hence, the kernel text mapping used for the final jump to the image kernel's entry point had to be based on the information provided by the image kernel. For that reason, I changed the image header format to include the physical address of the image kernel's entry point too.
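To make the idea concrete, here is a small Python sketch of a "minimal" text mapping built only from the two addresses carried in the image header. It assumes the mapping is a single large (2M) page, and the addresses used are made up; `minimal_text_mapping` and `translate` are illustrative names, not kernel functions.

```python
PMD_SIZE = 2 * 1024 * 1024       # one large page (assumed granularity)
PMD_MASK = ~(PMD_SIZE - 1)


def minimal_text_mapping(entry_virt, entry_phys):
    """Return (virt_base, phys_base) of one large page covering the entry point."""
    if (entry_virt & ~PMD_MASK) != (entry_phys & ~PMD_MASK):
        raise ValueError("offsets within the large page differ")
    return entry_virt & PMD_MASK, entry_phys & PMD_MASK


def translate(mapping, virt):
    """Translate a virtual address through the single-page mapping."""
    virt_base, phys_base = mapping
    if not virt_base <= virt < virt_base + PMD_SIZE:
        raise LookupError("address not covered by the minimal mapping")
    return phys_base + (virt - virt_base)


# Hypothetical header contents: the image kernel's entry point as a
# virtual address and as the physical address it occupied.
ENTRY_VIRT = 0xffffffff81234560
ENTRY_PHYS = 0x3a234560

mapping = minimal_text_mapping(ENTRY_VIRT, ENTRY_PHYS)
print(hex(translate(mapping, ENTRY_VIRT)))
```

Because both addresses come from the image kernel, the translation is right regardless of where the restore kernel's own text happens to live.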

It didn't take me too much time to come up with a patch implementing that idea. With that patch, however, the restore kernel would still switch over to the temporary page tables before jumping to the relocated code, so its text mapping still had to be reused to start with. It would be replaced with a new minimum kernel text mapping that covered the image kernel's entry point just prior to the final jump to it.

The plot thickens

That patch fixed the resume problem for Gunthorpe, but it wasn't perfect. Namely, Borislav Petkov reported that it introduced a strange memory corruption during resume from hibernation for him. That new problem occurred on every resume from hibernation on his system and manifested itself as a corruption of the context of a user-space process that attempted to run after the image kernel had brought all CPUs back online and had completed the resume of I/O devices.

That was really unusual, so we spent quite a lot of time on trying to understand why and how it might happen. Linus Torvalds suspected that the problem might be related to the way the patch played with the kernel-text mapping and he clearly didn't like that part of it anyway, so I decided to change the code flow to first jump to the relocated code and then switch over to the temporary page tables from there. That still allowed the kernel-text mapping in the temporary page tables to be minimal, but it avoided the need to replace one version of the kernel-text mapping with another one on the fly which, admittedly, had been an ugly hack.

I posted a patch created along these lines and, again, it worked for Gunthorpe, but it still triggered memory corruption during resume from hibernation for Petkov, so we went into a long debug session trying to figure out what was going on. Theories taken into consideration included platform firmware involvement, a hardware issue, or a bitmap implementation error in the hibernation core, but there were substantial weaknesses in every one of them.

Eventually, we were able to narrow the breakage down to a single line of code in a new function added by my patch, but it was completely unclear why that particular line of code would lead to the observed symptoms. Since that line of code looked like it might be using a local variable on the stack, I decided to check whether changing the new function to use fewer local variables would make any difference (the theory was that the stack might have been corrupted somehow, although how exactly that could have happened was still a mystery). Surprisingly enough, that change appeared to fix the problem for Petkov (in fact, it only hid the problem, but that was found to be the case quite a bit later). It did that so effectively that the memory corruption went away and could not be reproduced on Petkov's machine any more.

In the meantime, Yu Chen analyzed Gunthorpe's original report in detail and explained why the security-related kernel commit identified as the one that introduced the problem could actually make a difference. According to Chen, the setting of the NX flag on the gap between the kernel text and the read-only data was not as straightforward as it looked because it might cause kernel page tables to be split. Specifically, if the end of the kernel text fell into a large (2M) page, that page had to be split into normal (4K) pages for the NX bit to be set on the gap only. That required more page-table memory to be allocated dynamically; that allocation happened within the kernel-text mapping that would be overwritten by image data during resume from hibernation, so reusing it in the restore kernel's temporary page tables would lead to an unrecoverable error.
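The arithmetic behind the split can be illustrated with a short Python sketch. It assumes, for simplicity, that the gap runs from the end of the kernel text to the end of the 2M page containing it; the function name `split_for_nx` and the sample addresses are hypothetical.

```python
PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024
ENTRIES_PER_TABLE = PAGE_2M // PAGE_4K   # 512 entries in one page table


def split_for_nx(text_end):
    """Describe how the 2M page containing text_end must be split.

    Returns None when text_end is 2M-aligned (no split is needed);
    otherwise returns (executable_entries, nx_entries) for the new
    4K-granular page table. Assumes the gap runs to the end of the 2M page.
    """
    offset = text_end % PAGE_2M
    if offset == 0:
        return None
    # 4K pages wholly or partly holding text must stay executable.
    executable = -(-offset // PAGE_4K)   # ceiling division
    return executable, ENTRIES_PER_TABLE - executable


print(split_for_nx(0xa03000))   # text ends three 4K pages into a 2M page
print(split_for_nx(0xa00000))   # text ends exactly on a 2M boundary
```

Whenever the result is not `None`, one extra page-table page has to be allocated to hold the 512 small-page entries, and that dynamically allocated page is the one that can later be overwritten by image data.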

In addition to that, Kees Cook reported that the fix for the issue reported by Gunthorpe also made hibernation work with KASLR on x86-64. At that time, KASLR worked on the kernel's text mapping only and randomized its physical base. As a result, the physical address of the base of the kernel text mapping used by the restore kernel would be different from what the image kernel had used most of the time. That prevented the restore kernel from mapping the virtual address of the image kernel's entry point (passed in the image header) to the correct physical address and resume from hibernation didn't work. That changed with the introduction of the minimal kernel-text mapping used for the final jump to the image kernel's entry point in my patch, because it mapped virtual addresses to physical addresses in the same way as the image kernel did.

In the face of this, and because the memory corruption seen by Petkov was apparently not reproducible with the last version of the resume fix (and I was quite confident that it could not be introduced by that fix itself anyway), I decided to go ahead with the fix and it finally landed in Linux 4.7 as kernel commit 65c0554b73c9. While the immediate problem was fixed, it was quite possible that the previous versions of the resume fix simply uncovered some obscure latent bug, so I made a few changes in the hibernation core to make it easier to debug in case the memory corruption problem or anything similar to it showed up again in the future. When I did that, though, I wasn't expecting the memory corruption issue to reappear a few days later in a report pointing to the kernel commit that was the true source of it. But, first, another problem had to be solved.

`MWAIT` vs. `HLT`

Meanwhile, my attention had been caught by another serious bug related to resume from hibernation on x86-64, but limited to Intel CPUs. At that point it had already been investigated for several weeks by Chen who had posted a couple of RFC patches to address it, but the reviewers looking at them pointed out some valid concerns to him.

That issue was related to the use of the MONITOR and MWAIT instructions of the CPU in the code that takes CPUs offline, in particular during resume from hibernation. CPU offlining is a complicated matter that involves migrating tasks and interrupts from the CPU going offline to ensure that it won't have anything to do from that point on. The last stage of the process is to make the CPU appear as though it is not functional from a software perspective. That is achieved by making it execute a "wait for something to happen" instruction in a tight endless loop with locally disabled interrupts.

There are two flavors of such "wait for something to happen" instructions in the Intel processors' instruction set. The first one is the old-school HLT instruction that causes the CPU to go into a relatively shallow low-power state and wait for an interrupt; if interrupts are locally disabled on the CPU, it will become almost completely unresponsive after executing that instruction (the only interrupts that can "revive" the CPU then are the non-maskable ones, but those are only used in very special situations). The second type of a "wait for something to happen" instruction is MWAIT, which goes together with MONITOR.

First, MONITOR takes an address identifying a range of memory that corresponds to a single line in the CPU's cache. Next, the MWAIT following it causes the CPU to enter a low-power state (and that state may be much deeper than the HLT-induced one) and wait for an event like an interrupt or a write to one of the MONITORed memory locations from another CPU in the system. Thus, from an energy consumption perspective, the MONITOR/MWAIT combination is much better than HLT, but that really wasn't important in the resume from hibernation case since CPUs stay offline for a very short time then. The important fact was that, during resume from hibernation, the memory locations MONITORed by the offline CPUs were almost guaranteed to be written to by the only online CPU that carries out the final resume stages described earlier.

Recall that, during those stages, the image data pages still held in safe page frames are copied into their target locations, which generally overlap with memory occupied by the restore kernel itself and by its data. In particular, with CPUs offline using MONITOR/MWAIT, they might (and usually did) overlap with the memory MONITORed by those offline CPUs. That was a recipe for disaster; because the page tables used by those CPUs might have been overwritten too at that point, an attempt to fetch the next instruction by any of them would lead to a page fault that could not be handled, so the kernel would panic and crash. Worse yet, the code those CPUs would be executing if woken up from the MWAIT-induced state inadvertently might have been overwritten at that point too.
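The overlap condition is simple enough to state as code. The following Python sketch checks whether the final copy would write to the page containing an address being monitored by an offline CPU; the 4K page size and the names `wakes_offline_cpu` and `target_pfns` are assumptions made for this illustration.

```python
PAGE_SIZE = 4096


def wakes_offline_cpu(monitored_addr, target_pfns):
    """True if copying image pages into target_pfns writes to the page
    containing monitored_addr, and therefore to the monitored cache line
    (whole pages are copied, so every line in a target page is written).
    """
    return monitored_addr // PAGE_SIZE in set(target_pfns)


# Hypothetical example: the monitored line sits 0x40 bytes into frame 5.
monitored = 5 * PAGE_SIZE + 0x40
print(wakes_offline_cpu(monitored, target_pfns={1, 5}))
print(wakes_offline_cpu(monitored, target_pfns={1, 2}))
```

With HLT there is no monitored address at all, so this condition cannot arise, whatever the copy overwrites.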

The problem was figured out and a rough consensus about how to fix it had formed during the review of Chen's patches: everyone involved seemed to agree that, during resume from hibernation, the CPU offline code should use the HLT instruction instead of MONITOR/MWAIT. The question was how to implement that idea in the cleanest way possible.

Chen had already posted a couple of patches going in that direction when I started to look at the details of the code in question, but none of those approaches had been particularly attractive. My first attempts at fixing this issue were not any better, until I realized that the function to execute at the last stage of CPU offline was a callback pointed to by the play_dead field in the smp_ops structure, so replacing that callback temporarily with a special one using HLT during resume from hibernation would do the trick. The change needed for that was relatively isolated and, most importantly, it didn't add any overhead to the CPU offline code, so it was approved by Molnar and the final patch making the change shipped in Linux 4.8-rc1 as kernel commit 406f992e4a37.
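The shape of that fix, temporarily swapping one callback in a table of operations and restoring it afterward, can be modeled in Python as follows. The kernel does this in C; `SmpOps`, `hlt_during_resume`, and the two stand-in callbacks are hypothetical names, and the functions merely return strings instead of idling a CPU.

```python
from contextlib import contextmanager


class SmpOps:
    """Toy stand-in for a table of CPU-management callbacks."""

    def __init__(self, play_dead):
        self.play_dead = play_dead


def mwait_play_dead():
    return "MONITOR/MWAIT loop"


def hlt_play_dead():
    return "HLT loop"


@contextmanager
def hlt_during_resume(ops):
    """Use the HLT-based callback only while the image is being restored."""
    saved = ops.play_dead
    ops.play_dead = hlt_play_dead
    try:
        yield ops
    finally:
        # Put the original callback back, even if the resume fails.
        ops.play_dead = saved


smp_ops = SmpOps(play_dead=mwait_play_dead)
with hlt_during_resume(smp_ops):
    during = smp_ops.play_dead()
after = smp_ops.play_dead()
print(during, "/", after)
```

The appeal of this design is that the ordinary CPU offline path pays nothing for it: no extra flag is tested there, because only the pointer changes.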

The mystery bug returns

At that point, I was thinking that the worst problems related to resume from hibernation on x86-64 were fixed, but I had forgotten about the mystery memory corruption issue previously reported by Petkov. To my surprise, just then it was reported again by Andre Reinke. For Reinke, however, it was a regression introduced in Linux 4.6 and he was able to identify kernel commit ef0f3ed5a4ac as the source of it.

In retrospect, it was quite obvious that resume from hibernation would be broken by that commit, because it added a FRAME_BEGIN macro to the assembly code that would run as the first thing after the restore kernel had jumped to the image kernel's entry point. Among other things, that macro generated a PUSH instruction that would be executed before writing the address of the original image kernel's page tables into the CR3 register of the CPU. Thus the CPU would still be using the temporary page tables created by the restore kernel when executing it and the value of its stack pointer would contain the address of a memory area that might contain image data now. In that case, the PUSH instruction would corrupt those image data pages by overwriting them with a stale value read from another CPU register.
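What a stray PUSH does to restored memory can be shown with a tiny Python model. Memory is a byte array filled with a recognizable pattern standing in for image data, and the stack pointer value is arbitrary; `push`, `stale_rsp`, and the constants are invented for this sketch.

```python
import struct


def push(memory, rsp, value):
    """Model of a 64-bit PUSH: decrement the stack pointer, then store."""
    rsp -= 8
    memory[rsp:rsp + 8] = struct.pack("<Q", value)
    return rsp


memory = bytearray(b"\xAA" * 64)   # pretend this is restored image data
stale_rsp = 32                     # stack pointer left over from the restore kernel
before = bytes(memory)

new_rsp = push(memory, stale_rsp, 0x1122334455667788)

# Offsets whose contents no longer match the image data.
corrupted = [i for i in range(len(memory)) if memory[i] != before[i]]
print(new_rsp, corrupted)
```

Eight bytes just below the stale stack pointer are silently replaced, which matches the kind of damage that only shows up later, when something finally reads the affected page.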

Ironically enough, the FRAME_BEGIN macro was there all the time when the memory corruption reported by Petkov was being investigated and nobody saw the problem with it then. It looks like everyone, myself included, was mentally blinded by the fact that it was a macro and no one could see the real sequence of CPU instructions it was resolving to. Had the PUSH instruction been located directly in that code, the issue probably would have been resolved earlier without a need for a pointer to the kernel commit that introduced it. That pointer did help a lot, though, because it made everyone look at the right places in the code and the bug was readily fixed by Josh Poimboeuf. His fix went into Linux 4.8-rc1 as kernel commit 4ce827b4cc58.

That would have ended the x86-64 hibernation saga, had KASLR not been extended during the Linux 4.8-rc1 merge window. That did happen, however, and it affected Petkov again, breaking resume from hibernation for him on another machine. He noticed that unsetting the new CONFIG_RANDOMIZE_MEMORY kernel configuration option (set by default) made hibernation work again on that system, so the investigation of the problem focused on the interactions between hibernation and the new KASLR-related changes.

After those changes, KASLR on x86-64 randomizes not only the (physical) base address of the kernel text mapping, but also the (virtual) base address of the kernel identity mapping, among other things. That obviously might not play well with resume from hibernation which, in principle, might not be prepared to deal with differences in kernel identity mapping base address between the restore and image kernels. Indeed, that turned out to be the case; two problems in that area were quickly found by KASLR developer Thomas Garnier, who posted prototype patches to fix them.

First, the assembly code carrying out the switch over to temporary page tables during resume from hibernation contained a direct reference to the __PAGE_OFFSET symbol, used with the assumption that it would always resolve to a number. However, with CONFIG_RANDOMIZE_MEMORY set that symbol resolves to a variable name and the code generated in that case was invalid. Clearly, it was necessary to avoid using __PAGE_OFFSET this way, but Garnier's prototype patch did that with the help of preprocessor directives, which wasn't particularly clean. There was a better way: pass the physical rather than the virtual address of the page tables to the assembly code. That physical address might be computed by the code written in C and passed to the assembly in the same variable that previously had been used to pass the virtual address of the temporary page tables. With that, the problematic reference to __PAGE_OFFSET from assembly would simply go away, so I posted a patch making that change which landed in Linux 4.8-rc1 as kernel commit c226fab47429.
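The computation moved out of assembly amounts to one subtraction, provided the base of the identity mapping can be read at run time. A Python sketch of it follows; the function name `page_tables_phys` and all of the addresses are illustrative assumptions, not values taken from any particular system.

```python
def page_tables_phys(pgd_virt, page_offset):
    """Physical address of page tables that live in the identity mapping.

    page_offset is looked up at run time, which is easy in C (or Python)
    but awkward in assembly that expects a build-time constant.
    """
    if pgd_virt < page_offset:
        raise ValueError("address is below the identity mapping base")
    return pgd_virt - page_offset


# Illustrative values only: a fixed base and a randomized one.
FIXED_BASE = 0xffff880000000000
RANDOM_BASE = 0xffff9a4000000000
PGD_PHYS = 0x1234000

print(hex(page_tables_phys(FIXED_BASE + PGD_PHYS, FIXED_BASE)))
print(hex(page_tables_phys(RANDOM_BASE + PGD_PHYS, RANDOM_BASE)))
```

Either way the assembly receives the same physical address, so it no longer needs to know where the identity mapping starts.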

Second, the kernel_ident_mapping_init() function called by the low-level code that creates temporary page tables during resume from hibernation made an assumption regarding the alignment of the base address of the kernel identity mapping that generally wasn't satisfied with CONFIG_RANDOMIZE_MEMORY set. That was easy enough to fix, but Garnier's prototype patch overlooked a corner case that was pointed out by Yinghai Lu, who posted his own version of that fix. Lu's patch worked, but it increased the complexity of the code in question which wasn't strictly necessary, so I prepared and posted yet another version of it that was approved by everyone involved and went into Linux 4.8-rc2 as kernel commit e4630fdd4763.

Still, those two fixes turned out to be insufficient to make the issue reported by Petkov go away. Moreover, the same issue was reported by Jiri Kosina in the meantime (the symptom seemed to be a triple fault during resume meaning, probably, an unhandled page fault). It was puzzling because it was reproducible on the affected systems 100% of the time, while other, similar, systems hibernated and resumed without any problems at all.

Fortunately, I had a test system that was similar to Petkov's failing one, so I was able to use his configuration file to generate a kernel for it. That allowed me to reproduce the problem locally and to verify that it was triggered by setting the CONFIG_DEBUG_LOCK_ALLOC configuration option. It still was not particularly clear why and how that option might lead to the observed failure, but Garnier was also able to reproduce it, and he found the reason why it appeared. That turned out to be a bug in the hibernation core introduced during the Linux 3.16 development cycle that caused a tracing function to be called before the processor state had been restored completely. As a result, a stale value of the GS register was used by that tracing function; that led to the observed triple fault, which Garnier was able to fix by simply changing the ordering of the code in question. That fix went into Linux 4.8-rc2 as kernel commit 62822e2ec4ad.
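The ordering problem can be modeled in a few lines of user-space C. This is a sketch only: the structure, the functions, and the use of a NULL pointer to stand for a stale GS value are all hypothetical, and where real hardware would fault the model simply returns an error.

```c
#include <assert.h>
#include <stddef.h>

/* Models the part of the CPU state the tracing hook depends on. */
struct cpu_sketch {
    long *percpu_base;  /* stands in for the GS base; NULL means stale */
};

static long saved_percpu_area_sketch;

/* Restore the state that was saved before hibernation. */
static void restore_processor_state_sketch(struct cpu_sketch *cpu)
{
    assert(cpu != NULL);
    cpu->percpu_base = &saved_percpu_area_sketch;
}

/* A tracing hook that touches per-CPU data through the base pointer. */
static int trace_hook_sketch(const struct cpu_sketch *cpu)
{
    if (cpu->percpu_base == NULL)
        return -1;  /* on real hardware this would be a fault */
    (*cpu->percpu_base)++;
    return 0;
}

/* Buggy ordering: the hook runs while the state is still stale. */
static int resume_buggy_order_sketch(struct cpu_sketch *cpu)
{
    int ret = trace_hook_sketch(cpu);

    restore_processor_state_sketch(cpu);
    return ret;
}

/* Fixed ordering: restore everything first, then trace. */
static int resume_fixed_order_sketch(struct cpu_sketch *cpu)
{
    restore_processor_state_sketch(cpu);
    return trace_hook_sketch(cpu);
}
```

The two resume functions contain the same calls; only their order differs, which mirrors the nature of the fix described above.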

Working, at last

That finally made hibernation work for Petkov and Kosina again, even with both CONFIG_RANDOMIZE_MEMORY and CONFIG_DEBUG_LOCK_ALLOC set; only one thing remained unknown: why had CONFIG_DEBUG_LOCK_ALLOC made a difference before? That was explained by Kosina, who looked at the assembly output generated by the compiler for the affected code both with and without CONFIG_DEBUG_LOCK_ALLOC set and found that it was different in those two cases. Next, he was able to track the difference down to the definition of the __DECLARE_TRACE() macro, which generated additional code with CONFIG_DEBUG_LOCK_ALLOC set; that additional code used GS-relative addressing, which would lead to the observed failure if the GS value was stale.
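How a configuration option can change what a macro-generated function touches is easy to show in miniature. The following is a hypothetical sketch, not the kernel's __DECLARE_TRACE(): SKETCH_DEBUG_LOCK_ALLOC stands in for the real option, and a pointer that may be NULL stands in for per-CPU data reached through GS.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the configuration option; enabled by default here. */
#ifndef SKETCH_DEBUG_LOCK_ALLOC
#define SKETCH_DEBUG_LOCK_ALLOC 1
#endif

static int *percpu_base_sketch;     /* NULL models a stale GS value */
static int tracepoint_hits_sketch;

#if SKETCH_DEBUG_LOCK_ALLOC
/* With the option set, the generated hook also touches per-CPU data. */
#define DECLARE_TRACE_SKETCH(name)                              \
static int trace_##name(void)                                   \
{                                                               \
    if (percpu_base_sketch == NULL)                             \
        return -1;  /* on real hardware this would be a fault */ \
    (*percpu_base_sketch)++;                                    \
    tracepoint_hits_sketch++;                                   \
    return 0;                                                   \
}
#else
/* Without it, the hook never looks at per-CPU data. */
#define DECLARE_TRACE_SKETCH(name)                              \
static int trace_##name(void)                                   \
{                                                               \
    tracepoint_hits_sketch++;                                   \
    return 0;                                                   \
}
#endif

DECLARE_TRACE_SKETCH(resume_sketch)

/* Usage check: a stale base only matters when the option is set. */
static void demo_sketch(void)
{
    int counter = 0;

    percpu_base_sketch = NULL;      /* state not restored yet */
    assert(trace_resume_sketch() == (SKETCH_DEBUG_LOCK_ALLOC ? -1 : 0));

    percpu_base_sketch = &counter;  /* state restored */
    assert(trace_resume_sketch() == 0);
}
```

Building with SKETCH_DEBUG_LOCK_ALLOC defined as 0 makes the early call harmless, which is the same reason the failure was invisible on kernels built without the debugging option.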

In the end, as of Linux 4.8-rc3, resume from hibernation on x86-64 works at last, and it works with KASLR enabled. It took a couple of months to get to this point due to the nature of the bugs that needed to be fixed and the complexity of the affected code. As I said at the beginning, that wouldn't have been possible without all of the developers and bug reporters involved; in particular, I'd like to thank the following contributors for their input that shaped the final code changes: Logan Gunthorpe, Ingo Molnar, Borislav Petkov, Linus Torvalds, Chen Yu, Kees Cook, Andre Reinke, Josh Poimboeuf, Thomas Garnier, Yinghai Lu, and Jiri Kosina.


Patches and updates

Kernel trees

Linus Torvalds Linux 4.8-rc8 Sep 25

Greg KH Linux 4.7.5 Sep 24

Con Kolivas linux-4.7-ck5 Sep 23

Greg KH Linux 4.4.22 Sep 24

Architecture-specific

vijay.kilari@gmail.com [PATCH v7 0/7] arm/arm64: vgic: Implement API for vGICv3 live migration Sep 23

Pratyush Anand ARM64: Uprobe support added Sep 27

Core kernel code

Con Kolivas BFS CPU scheduler v0.502 for linux-4.7 with skip list. Sep 23

Device drivers

Claudiu Manoil Freescale DPAA 1.x QBMan Drivers Sep 22

Markus Mayer Broadcom AVS CPUfreq driver Sep 23

Rich Felker J-Core timer support Sep 24

Erin Lo Add clock support for Mediatek MT2701 Sep 26

Stefan Wahren net: qualcomm: add QCA7000 UART driver Sep 26

Amir Levy thunderbolt: Introducing Thunderbolt(TM) Networking Sep 27

Adit Ranadive Add Paravirtual RDMA Driver Sep 24

Ram Amrani QLogic RDMA Driver (qedr) RFC Sep 26

Device driver infrastructure

Damien Le Moal ZBC / Zoned block device support Sep 26

Filesystems and block I/O

Bart Van Assche Introduce blk_quiesce_queue() and blk_resume_queue() Sep 26

Memory management

zi.yan@sent.com mm: THP migration support Sep 26

Vlastimil Babka followups to reintroduce compaction feedback for OOM decisions Sep 26

Ross Zwisler re-enable DAX PMD support Sep 27

Networking

David Howells rxrpc: Preparation for slow-start algorithm Sep 22

Toshiaki Makita Support envelope frames (802.3as) Sep 27

Ursula Braun net/smc: Shared Memory Communications - RDMA Sep 27

Security-related

Mat Martineau Make keyring link restrictions accessible from userspace Sep 26

Miscellaneous

Jiri Olsa perf c2c: Add new tool to analyze cacheline contention on NUMA systems Sep 22

Wang Nan perf clang: Support compiling BPF script use builtin clang Sep 23

Page editor: Jonathan Corbet