
An unexpected perf feature

By Jake Edge
May 21, 2013

Local privilege escalations seem to be regularly found in the Linux kernel these days, but they usually aren't quite so old—more than two years since the release of 2.6.37—or backported into even earlier kernels. But CVE-2013-2094 is just that kind of bug, with a now-public exploit that apparently dates back to 2010. It (ab)uses the perf_event_open() system call, and the bug was backported to the 2.6.32 kernel used by Red Hat Enterprise Linux (and its clones: CentOS, Oracle, and Scientific Linux). While local privilege escalations are generally considered less worrisome on systems without untrusted users, it is easy to forget that UIDs used by network-exposed services should also qualify as untrusted—compromising a service, then using a local privilege escalation, leads directly to root.

The bug was found by Tommi Rantala when running the Trinity fuzz tester and was fixed in mid-April. At that time, it was not recognized as a security problem; the release of an exploit in mid-May certainly changed that. The exploit is dated 2010 and contains some possibly "not safe for work" strings. Its author expressed surprise that it wasn't seen as a security problem when it was fixed. That alone is an indication (if one was needed) that people in various colored hats are scrutinizing kernel commits—often in ways that the kernel developers are not.

The bug itself was introduced in 2010, and made its first appearance in the 2.6.37 kernel in January 2011. It treated the 64-bit perf event ID differently in an initialization routine (perf_swevent_init() where the ID was sanity checked) and in the cleanup routine (sw_perf_event_destroy()). In the former, it was treated as a signed 32-bit integer, while in the latter as an unsigned 64-bit integer. The difference may not seem hugely significant, but, as it turns out, it can be used to effect a full compromise of the system by privilege escalation to root.

The key piece of the puzzle is that the event ID is used as an array index in the kernel. It is a value that is controlled by user space, as it is passed in via the struct perf_event_attr argument to perf_event_open(). Because it is sanity checked as an int, the upper 32 bits of event_id can be anything the attacker wants, so long as the lower 32 bits are considered valid. Because event_id is used as a signed value, the test:

    if (event_id >= PERF_COUNT_SW_MAX)
            return -ENOENT;
doesn't exclude negative IDs, so anything with bit 31 set (i.e. 0x80000000) will be considered valid.

The exploit code itself is rather terse, obfuscated, and hard to follow, but Brad Spengler has provided a detailed description of the exploit on Reddit. Essentially, it uses a negative value for the event ID to cause the kernel to change user-space memory. The exploit uses mmap() to map an area of user-space memory that will be targeted when the negative event ID is passed. It sets the mapped area to zeroes, then calls perf_event_open(), immediately followed by a close() on the returned file descriptor. That triggers:

    static_key_slow_dec(&perf_swevent_enabled[event_id]);
in the sw_perf_event_destroy() function. The code then looks for non-zero values in the mapped area, which can be used (along with the event ID value and the size of the array elements) to calculate the base address of the perf_swevent_enabled array.

But that value is just a steppingstone toward the real goal. The exploit gets the base address of the interrupt descriptor table (IDT) by using the sidt assembly language instruction. From that, it targets the overflow interrupt vector (0x4), using the increment in perf_swevent_init():

    static_key_slow_inc(&perf_swevent_enabled[event_id]);
By setting event_id appropriately, it can turn the address of the overflow interrupt handler into a user-space address.

The exploit arranges to mmap() the range of memory where the clobbered interrupt handler will point and fills it with a NOP sled followed by shellcode that accomplishes its real task: finding the UID/GIDs and capabilities in the credentials of the current process so that it can modify them to be UID and GID 0 with full capabilities. At that point, in what almost feels like an afterthought, it spawns a shell—a root shell.

Depending on a number of architecture- or kernel-build-specific features (not least x86 assembly) makes the exploit itself rather fragile. It also contains bugs, according to Spengler. It doesn't work on 32-bit x86 systems because it uses a hard-coded system call number (298) passed to syscall(), which is different (336) for 32-bit x86 kernels. It also won't work on Ubuntu systems because the size of the perf_swevent_enabled array elements is different. The following will thwart the existing exploit:

    echo 2 > /proc/sys/kernel/perf_event_paranoid
But a minor change to the flags passed to perf_event_open() will still allow the privilege escalation. None of these is a real defense of any sort against the vulnerability, though they do defend against this specific exploit. Spengler's analysis has more details, both of the existing exploit as well as ways to change it to work around its fragility.

The code uses syscall(), presumably because perf_event_open() is not (yet?) available in the GNU C library, but it could also be done to evade any argument checks done in the library. Any sanity checking done by the library must also be done in the kernel, because using syscall() can avoid the usual system call path. Kernels configured without support for perf events (i.e. CONFIG_PERF_EVENTS not set) are unaffected by the bug as they lack the system call entirely.

There are several kernel hardening techniques that would help keep this kind of bug from leading to system compromise. The grsecurity UDEREF mechanism would prevent the kernel from dereferencing the user-space addresses, so the perf_swevent_enabled base address could not be calculated. The PaX/grsecurity KERNEXEC technique would prevent the user-space shellcode from executing. While these techniques can inhibit this kind of bug from allowing privilege escalation, they impose costs (e.g. performance) that have made them unattractive to the mainline developers. Suitably configured kernels on hardware that supports it would be protected by supervisor mode access prevention (SMAP) and supervisor mode execution protection (SMEP); the former would prevent access to the user-space addresses much like UDEREF, while the latter would prevent execution of user-space code as KERNEXEC does.

This is a fairly nasty hole in the kernel, in part because it has existed for so long (and apparently been known by some, at least, for most of that time). Local privilege escalations tend to be somewhat downplayed because they require an untrusted local user, but web applications (in particular) can often provide just such a user. Dave Jones's Trinity has clearly shown its worth over the last few years, though he was not terribly pleased with how long it took for fuzzing to find this bug.

Jones suspects there may be "more fruit on that branch somewhere", so more and better fuzzing of the perf system calls (and the kernel as a whole) is indicated. In addition, the exploit author suggests that he has more exploits waiting in the wings (not necessarily in the perf subsystem); it is quite likely that others do as well. Finding and fixing these security holes is an important task; auditing the commit stream to help ensure that these kinds of problems aren't introduced in the first place would be quite useful. One hopes that companies using Linux find a way to fund more work in this area.



An unexpected perf feature

Posted May 21, 2013 23:01 UTC (Tue) by PaXTeam (guest, #24616) [Link] (24 responses)

> While these techniques can inhibit this kind of bug from allowing privilege escalation,

sadly, this bug is the textbook example of the 'almost arbitrary write' kind and it *is* exploitable under PaX/grsecurity (well, after the attacker achieved arbitrary code execution in userland), albeit harder (needs a powerful enough kernel infoleak bug).

> they impose costs (e.g. performance) that have made them unattractive to the mainline developers.

i have my doubts that mainline devs have ever cared let alone known about these techniques ;), besides these features are pretty much free on i386, only amd64 sucks for performance.

An unexpected perf feature

Posted May 22, 2013 10:35 UTC (Wed) by vivo (subscriber, #48315) [Link] (23 responses)

> these features are pretty much free on i386, only amd64 sucks for performance.
The other way around right? amd64 is ok i386 suck

An unexpected perf feature

Posted May 22, 2013 10:54 UTC (Wed) by patrick_g (subscriber, #44470) [Link] (22 responses)

> The other way around right? amd64 is ok i386 suck

I don't think so. According to this paper:

The PaX-protected kernel exhibits a latency ranging between 5.6% and 257% (average 84.5%) on the x86, whereas on x86-64, the latency overhead ranges between 19% and 531% (average 172.2%). Additionally, (..) overhead for process creation (in both architectures) lies between 8.1% to 56.3%.

And :

On x86, PaX offers protection against ret2usr attacks by utilizing the segmentation unit for isolating the kernel from user space. In x86-64 CPUs, where segmentation is not supported by the hardware, it temporarily remaps user space into a different location with non-execute permissions. (...) the lack of segmentation in x86-64 results in higher performance penalty.

An unexpected perf feature

Posted May 22, 2013 11:02 UTC (Wed) by dlang (guest, #313) [Link] (21 responses)

now is that because the AMD64 is so bad, or because the stock kernel is tuned to be so much faster on AMD64 and these patches ruin that tuning.

but either way, the attitude that the performance problem doesn't matter because it's not bad on the i386 port, only on the amd64 port is ignoring the fact that amd64 is becoming the common case, and people aren't going to use features that make their highest performing hardware suffer that much.

An unexpected perf feature

Posted May 22, 2013 11:29 UTC (Wed) by PaXTeam (guest, #24616) [Link]

first, note that the kguard paper has very questionable numbers (they contradict my own measurements i've been doing for over a decade now), and when i tried to reproduce them, their gcc plugin didn't even compile so i don't know what exactly they'd done. i've been trying to fix it up ever since but it's a low priority project, so i'll blog about this topic some time later i guess ;).

second, when i said 'these features' i specifically referred to Jake's sentence where he cites two PaX features that would protect against specific exploit techniques utilized by the public exploit (and there're more features that would protect against other exploit techniques but i digress).

third, what attitude are you talking about? i never said anything about ignoring performance, just corrected Jake's statement that sounded like as if these features (remember, still talking about PaX) were universally bad for performance whereas both i386 and arm (recent addition by spender from earlier this year, see http://forums.grsecurity.net/viewtopic.php?f=7&t=3292) have a very efficient implementation. in fact, if there's anyone who most appreciates the slow but steady income of hardware support for long existing PaX features (SMEP/SMAP for KERNEXEC/UDEREF, respectively) then it's me.

An unexpected perf feature

Posted May 22, 2013 11:42 UTC (Wed) by cesarb (subscriber, #6266) [Link] (16 responses)

It is because it depends for performance on an obsolete feature of the 32-bit x86 ISA (segmentation), which was almost completely removed in the cleanup during the design of the x86-64 ISA.

An unexpected perf feature

Posted May 22, 2013 12:09 UTC (Wed) by PaXTeam (guest, #24616) [Link] (15 responses)

i would not call the effective removal of segmentation from amd64 a cleanup, more like the proverbial case of the baby going with the bathwater. sure, conforming code segments and call gates could be called obsolete, but the ability to define windows on the virtual address space is very useful (AMD had to add some of it back temporarily for VMware before hw virtualization caught on) and is a real shame that it got almost completely removed. it's a design mistake, not something to be proud of.

An unexpected perf feature

Posted May 22, 2013 16:43 UTC (Wed) by nix (subscriber, #2304) [Link]

Quite. It didn't even make the silicon appreciably simpler, because the CPU still has to drag it all around for 32-bit code, even while the CPU is in long mode (unlike vm86 which it can skip entirely and doesn't need to make particularly efficient in any case). The most they could do was drop optimizations for non-maximally-sized segments, and they did *that* before x86-64 was even thought of. It frees up some opcodes that they could reuse, is all.

An unexpected perf feature

Posted May 22, 2013 19:37 UTC (Wed) by helge.bahmann (subscriber, #56804) [Link] (13 responses)

i386-style segmentation requires an add of the segment base for every address generation (or using a virtual cache which is a can of worms on its own), and while this is already not funny to do it on 32 bit without wrecking access latency (it is just barely manageable because it is possible to restrict the fast path to upper 20 bit adds), it becomes just very expensive going to 48 bit addresses (not even thinking of what would happen going to full 64 bit addresses). And yes, scrapping this add even on 32 bit is a saving significant enough to the point that current-gen CPUs short-circuit base=0 and pay a penalty otherwise.

Segmentation anyways is a poor substitute for what you *really* want: a full "world separation" between kernel and user, and this has since ages been possible e.g. on Sparc without any segmentation and much more efficiently using address space identifiers. And while amd64 took away segmentation, it also brought with virtualisation the ASIDs (admittedly, both are unfortunately annoyingly closely tied together).

It might make more sense to look forward and maybe figure out if the new facilities can be used to do the isolation properly (and that may perhaps include talking to chip makers), rather than looking backward and mourning the loss of an oddball capability that enabled an incomplete hack.

An unexpected perf feature

Posted May 22, 2013 20:02 UTC (Wed) by dlang (guest, #313) [Link]

Also, x86 is not the entire world, it never was on the high-end, and with ARM and MIPS appliances, it's increasingly less so on the low-end. Right now x86 is the middle ground, but it's getting squeezed from the bottom the same way that amd64 is squeezing out the former high-end

An unexpected perf feature

Posted May 22, 2013 20:09 UTC (Wed) by PaXTeam (guest, #24616) [Link] (11 responses)

first, the cost of non-0 bases is independent of width, the CPU already has full 64 bit adders just for this purpose (think support for 'lea' and fs/gs overrides). however for my purposes what is important is the ability to define segment limits and flip the meaning of that limit (lower or upper, for expand-down segments), so non-0 bases are pretty much irrelevant.

second, segment limits are pretty much the best way to implement world separation, they require the least amount of hw resources: one parallelizable compare on the virtual address (i.e., it can be done before or in parallel to the virtual->physical translation) vs. tons of paging related caching while resolving the physical address. so no, i maintain that ditching all of segmentation was a design mistake and sparc or VT-x style ASIDs are not equivalent replacements (just ask Bromium for their performance numbers ;).

third, while SMEP/SMAP are useful performance enhancements for amd64, they require quite a bit more kernel infrastructure to make them robust. in particular, since they rely on paging, *all* of the paging related structures must be properly protected against memory corruption bugs which is quite a bit larger attack surface than the GDTs (and no kernel i know does this work except for PaX). so while i 'mourn the loss of segmentation' (which is not as oddball as you think, and it's never been an incomplete hack) i've been doing the extra work for a decade now to make paging based replacements actually secure as well ;).

An unexpected perf feature

Posted May 22, 2013 21:06 UTC (Wed) by helge.bahmann (subscriber, #56804) [Link] (10 responses)

First, AFAICT fs/gs overrides (as well as lea) go through ALU and pay the access penalty exactly because there is no dedicated adder anymore, it would be a total waste of resources otherwise as segment overrides are in practice used just for TLS which is not that much of a fast path. (It is actually even more disturbing what happens with a base!=0 cs as it totally messes up the btb).

Second, TLBs tagged with address space identifiers are actually cheaper and more generic than segment limits: You can save the comparators (even though they don't add latency because they can operate in parallel), and address space layout can be done at will. Since for separation the ranges are intended to be disjoint, both approaches will actually have the same TLB footprint, so no advantages here for segment limits either.

VT-x style ASIDs are only poor due to their tie-in with, err, VT-x, their equivalents work just fine on every architecture that has been designed with address space identifiers (under their various names, as context/thread/process/... identifiers) to begin with (and BTW have more general applications for fast thread switching etc.). As for the paging-related caching: You don't get rid of that using segment limits, you just pile another layer on top.

As for the attack surface regarding page tables: Why do you think it is easier to protect page tables mapping virtual address space [kern_start;kern_end) in the segment limit case, than it is to protect page tables mapping asid=kern in the asid case? (Rhetorical question, as there is no difference, so the whole argument regarding paging is a red herring).

And yes x86 segmentation is oddball in that it has never been designed for what it is now being used for, is difficult to make efficient in hardware, while address-space-based methods are both easy to make efficient and can more explicitly support the intended separation semantic. Considering segmentation to be the best solution to the problem is suffering quite a bit from Stockholm syndrome ;)

Really I don't question your accomplishments, but segmentation is a shallow local minimum, and while infinite effort can be spent trying to micro-optimize beside this minimum, there are far deeper local minima (and many of them outside x86).

An unexpected perf feature

Posted May 23, 2013 23:33 UTC (Thu) by PaXTeam (guest, #24616) [Link] (9 responses)

1. the ALU concept is so 80's ;). seriously, 'modern' CPUs are a tad bit more complex, Intel and AMD CPUs have all dedicated adders for address calculations (not just for the already mentioned purposes but also for rip-relative addressing). TLS not being a fast path is probably news to everyone spending their time on multithreaded applications, and Intel, in the grand conspiracy of schemes, must have added dedicated fs/gs base manipulating insns to their latest CPUs in order to slow these workloads down even more. as for the BTB, isn't it indexed by virtual and not logical addresses?

2. ASIDs cannot by definition be cheaper, paging related caches and checks will always require more circuits and cycles than a simple comparator. not sure what address space layout decisions have to do with this though, when you have ASIDs by definition you have full address spaces for each ASID. if you meant mixing different ASIDs in the same virtual address space (how?), then nobody does that.

conceptually ASIDs are indeed more generic except this fact is utterly irrelevant, there isn't a mainstream OS out there that would make use of this ability (i.e., mix user and kernel pages in arbitrary ways in the same address space). in practice everyone simply divides the virtual address space into two parts between userland and the kernel, so simple limit checking would do just fine (vs. checking access rights at every level of the paging hierarchy).

3. ASIDs do have their uses indeed, in fact i would love to have a better mechanism on Intel/AMD CPUs to implement some of my ideas but for simple user/kernel separation a segment limit check has no match.

4. to understand the difference in the security level provided by a segmentation and paging based non-exec/no-access kernel protection scheme we have to consider the attacker's assumed ability. against an arbitrary read and write capability they're equivalent. however this is the ideal attacker model only we use to evaluate theoretical boundaries of protection schemes, in practice we rarely get such bugs and that's exactly where the difference becomes important. in particular, the segmentation based approach can achieve a certain level of self-protection by simply ensuring that the top-level page tables enforce the read-only property on the GDTs whereas doing the same for page tables themselves is much harder - this is the attack surface difference. that said, KERNEXEC (on i386/amd64 so far) does attempt to minimize the exposure of top-level page tables but it's far from being a closed system yet (breaking up kernel large page mappings, tracking virtual aliases, etc have non-negligible performance impact).

5. what has segmentation been designed for then? surely there's only so much you can do with a data or code segment ;). why is it difficult to make it efficient in hardware? and which particular bit (there're many descriptor types)? why would paging related data structures be easier to handle in hw than segmentation ones? and how do you imagine beating a simple comparator? so far you haven't offered any facts to make me think otherwise.

An unexpected perf feature

Posted May 24, 2013 8:53 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link] (7 responses)

1. Sure it's not a single 80s style alu anymore, rather it is a set of independently operating arithmetic units; typically two are adders, and whatever is needed is scheduled to them (addr gen or general arithmetic alike). Doing anything else is wasteful (which is to say that it is done nevertheless occasionally *if* it can speed up a fast-path, which for address calculations it cannot).

2. TLS is not a fast path, profiling shows around 1 in 1000 to 10000 instructions is TLS, so no one bothers paying a 1 cycle penalty for that.

3. ASIDs are cheaper because they just become part of the tag in the TLB. You do the TLB lookup (which you do anyways) and compare the tag for equality which is cheaper than a "greater than" comparison against an address space limit (which, incidentally, is just another adder), end of story.

4. Enforcing read-only is easily done at the page level, so what's the point?

5. What segmentation has been designed for? To map the concept of program object segments (the name may be a hint, right?) directly to hardware, facilitate sharing and relocatability this way. This is also where the security model for them originates from. It was conceived when people were somehow not yet certain paging was scalable, but I am too young to have been involved back then, you would have to ask the 'bearded guys'.

And to answer your last question: Paging is cheaper because hashed lookup (TLB) and equality comparison are cheaper than "less than" comparisons in hardware. Segmentation is a conceptual dead-end, live with it :)

An unexpected perf feature

Posted May 24, 2013 12:00 UTC (Fri) by ras (subscriber, #33059) [Link] (6 responses)

As a "bearded guy" (I have no beard) who has designed and written BIOSes and protected-mode operating systems for x86, and who had to look at the x86 architecture again in mind-numbing detail (as in reading the four Intel x86 "data sheets" several times in order to port the linux-abi system to AMD64), I recall my thoughts at the time as being "my - AMD has cleaned this mess up".

For those of you defending Intel's decisions at this time - I lived through it. At the time Intel regarded all programmers as idiots, and decided to solve the problem with hardware. Thus we have the absurdly complex designs we see today, with x86 interrupts taking 2000 cycles (?!?!? - that was back when Intel was game enough to publish cycles) and it wasn't the slowest instruction. Can anyone remember a Task Gate Descriptor?

Yet that wasn't the worst of it. The worst of the worst died. It was a new Intel architecture called iAPX432. It caused more excitement than Haswell in its day. I am sure its forebears would prefer we forgot it entirely. It remains in my mind the ultimate testimony to the arrogance caused by ignorance, in this case the Electrical Engineers thinking they could tell Software Engineers how we should do our jobs. But I exaggerate. Back then we weren't allowed to call ourselves Engineers.

Still they got their revenge with x86. In it they made it plain we could not be trusted to swap between two tasks quickly. Only 30 years later with the advent of ARM has their folly been made plain to everyone.

An unexpected perf feature

Posted May 24, 2013 13:40 UTC (Fri) by dlang (guest, #313) [Link] (4 responses)

you aren't going nearly far enough back

segmentation on the x86 came about with the 80286 (or possibly even the 80186, I'm not sure) CPU.

It was intended as a way to allow programs to continue to use 16 bit 8086 addressing but be able to use more than 64K of ram in a system (by setting a segment offset and then running the 16 bit code inside that segment)

It never was an effective security measure, because to avoid wasting tons of memory on programs that didn't need it, the segments overlap, allowing one program to access memory of another program.

When the 80386 came out and supported paging and real memory protection, everyone stopped using segments real fast

An unexpected perf feature

Posted May 24, 2013 23:31 UTC (Fri) by ras (subscriber, #33059) [Link] (3 responses)

> segmentation on the x86 came about with the 80286 (or possibly even the 80186, I'm not sure) CPU.

8086 actually. It was their way of extending the address space of the 8080 to 20 bits.

It was just a slightly more sophisticated take on the way CPU designers have gone about extending the range of addressable memory beyond the natural size of internal registers for eons. It had the advantage you could use separate address spaces for both code and data, giving you 64K for each. So by using a small amount of extra silicon they effectively doubled the amount of address space you had over simple "page extension" hack everybody else was doing. Kudos to them.

So you are saying the 80286 is where the rot set in. In the 80286 they had enough silicon to give the programmer true isolation between processes. They could have gone the 32 bits + page table route everybody else did, but no, we got segmentation (without page tables) instead and retained 16 bit registers. Why? Well this was also the time the hardware designers had enough silicon to implement microcode - so they could become programmers too! And they decided they could do task switching, ACL's and god knows what else better than programmers, so they did. In other words somehow they managed to forget their job was to provide hardware that ran programs quickly, and instead thought their job was to do operating system design.

It was a right royal mess. Fortunately for them the transition from DOS to Windows / OS2 (which is where the extra protections and address space matter) took a while, and by then the i386 was released. It added 32 bits and paging, so we could ignore all that 16 bit segmentation rubbish and get on with life. It turned out the transition to multitasking operating systems wasn't waiting on programmers figuring out how to do it (who would have thunk it?), but rather the price of the RAM needed to hold several tasks at once had to come down.

People here defending the segmentation model should try writing for the 80286, which could only address 64K of code and 64K of data at any one time. There was no reason for it. The 68000 family had a much better memory model at the time, so it wasn't silicon constraints. Well there would have been enough silicon if they hadn't devoted so much of it to creating their own operating system on a chip.

Intel finally came to their senses with Itanium. With it they used all that extra silicon to do what hardware designers should be doing - make programs run fast. Sadly it came along too late.

Back to your point - the time line. The 80286 was released in 1982. The iAPX432 was meant to be released in 1981. The 80286 was the fall back position. As Wikipedia points out, this is a part of its history Intel strives to forget. You will find no mention of the iAPX432 on their web site, for instance.

An unexpected perf feature

Posted May 24, 2013 23:50 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

> 8086 actually. It was their way of extending the address space of the 8080 to 20 bits.

That's right, I forgot about that. And IIRC, with the 286 they only expanded it to 24 bits

but in any case, my point was that the dawn of the AMD64 chip is eons past the point where segmentation was introduced, failed, and was rejected

P.S. I disagree with you about the Itanium. It wasn't a good design. They took far more silicon than other processors and ended up being less productive with it.

In part, the speed increases were the cause of this failure. As the speed differences between memory and the CPU core get larger, having a compact instruction set where each instruction can do more becomes more important, and for all its warts, the x86 instructions do rate pretty well on the size/capability graph.

but in part the programmers really should NOT have to change their programs to work well on a new CPU where the hardware designers have implemented things differently. By hiding such changes in the microcode, instructions that get used more can be further optimized, and ones that aren't can be emulated so they take less silicon. It's a useful layer of abstraction.

An unexpected perf feature

Posted May 25, 2013 0:21 UTC (Sat) by ras (subscriber, #33059) [Link]

> P.S. I disagree with you about the Itanium. It wasn't a good design. They took far more silicon than other processors and ended up being less productive with it.

I wasn't commenting whether it was a good design. I don't know, as I haven't used it. I was just saying Itanium marked the point when Intel went back to sticking to the knitting - in other words trying to design a new architecture whose sole goal was to run programs fast. If you say they failed despite having that explicit goal then that's a shame.

As I recall the Itanium tried to improve its speed in a RISC-like way - ie by keeping things simple on the hardware side and offloading decisions to the compiler. In the Itanium's case those decisions were about parallelism.

I think it's pretty clear now that tuning the machine language to whatever hardware is available at the time is a mistake when you are going for speed. It might work when the hardware is first designed, but then the transistor budget doubles and all those neat optimisations don't make so much sense anymore, but you are welded to them because they are hardwired into the instruction set. Instead the route we have gone down is to implement a virtual CPU, rather like the JVM. The real CPU then compiles the instruction set on the fly into something that can be run fast with today's transistor budget. With tomorrow's transistor budget it might be compiled into something different.

The x86 instruction set is actually pretty good in this scenario - better than ARM. It's compact, and each instruction gives lots of opportunities to execute bits of it in parallel. If this is true, then Intel ending up with an architecture that can run fast is just dumb luck, as they weren't planning for it 30 years ago. Now that I think about it, the VAX instruction set would probably be better still.

An unexpected perf feature

Posted May 25, 2013 10:36 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

192k of data (point DS, ES, SS to different segments).

An unexpected perf feature

Posted May 27, 2013 22:06 UTC (Mon) by nix (subscriber, #2304) [Link]

As a guy who used to do stuff with Arcs, '30 years later with the advent of ARM' is bizarre. ARMs of a sort were around in the early 90s :) but, sure, they hadn't set the world on fire yet.

An unexpected perf feature

Posted May 24, 2013 9:16 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link]

Just as an addendum... Don't get me wrong, you are right to complain that a mechanism that was usable for security purposes was removed while nobody bothered to add a suitable substitute, and that pretty architectural ideas that had been present and demonstrated to be workable for 25+ years were ignored. But the "proper" way forward is not to revive segmentation; it is to implement comparable semantics with mechanisms that are performance-neutral. SMEP and SMAP are actually quite "easy" from a hardware conceptual point of view, so it is kind of annoying that it took so long.

An unexpected perf feature

Posted May 24, 2013 7:35 UTC (Fri) by ballombe (subscriber, #9523) [Link] (2 responses)

From a security point of view amd64 _is_ bad, especially compared to older architectures like SPARC, which provide fully separated kernel and user address spaces.

An unexpected perf feature

Posted May 24, 2013 7:48 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

if they were so fully separated, how could you pass data between them?

Even if true, it just shows that price and performance trump low probability security benefits once again, so what's new?

An unexpected perf feature

Posted May 24, 2013 12:43 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link]

Sparc has instructions "load from foreign address space" and "store to foreign address space" (and even "compare and swap in foreign address space").

An unexpected perf feature

Posted May 21, 2013 23:39 UTC (Tue) by deater (subscriber, #11746) [Link] (1 responses)

> Kernels configured without support for perf events (i.e.
> CONFIG_PERF_EVENTS not set) are unaffected by the bug as they lack
> the system call entirely.

As far as I know, it's not possible to disable perf_event on x86 since about 2.6.37 or so, because it is automatically enabled to get debugger support. I'd be glad to be proven wrong, though.

As far as trinity support, better perf_event_open() coverage that I contributed was merged today, so everyone can fuzz away.

An unexpected perf feature

Posted May 21, 2013 23:43 UTC (Tue) by deater (subscriber, #11746) [Link]

> As far as trinity support, better perf_event_open() coverage that I
> contributed was merged today, so everyone can fuzz away.

Also, the new trinity support isn't as complete as it could be. Check out my perf_event_open() manpage linked above. The perf_event system call is fantastically complex with over 40 inter-dependent arguments that interact in complex ways.

I still sometimes wish that a thinner perf counter interface had been merged. perfctr and perfmon2 were both much thinner wrappers over the perf counter MSRs. The "put everything in the kernel" strategy of perf_event makes it very hard to validate correctness.

An unexpected perf feature

Posted May 21, 2013 23:45 UTC (Tue) by gerdesj (subscriber, #5446) [Link] (21 responses)

>> This is a fairly nasty hole in the kernel

Or perhaps a fairly nasty hole in a subsystem within the kernel which you may or may not be using.

For a laugh (I'm back in a hotel having fixed a knackered RHEL box quicker than anticipated) I ran that sheep.c code on my laptop. I'm no C expert but after a read through it looked innocuous enough apart from the job it has to do.

It died badly and a quick play with strace showed why.

Yes: It was bloody Gentoo not forcing me to enable every possible feature that I may never use. Bloody non Enterprise ready OS distros.

Cheers
Jon

PS Actually a temporary KVM on my laptop with a basic copy of the host FS to play with.

PPS Must get around to looking into perf - perhaps I'll start installing it and the kernel bits when I need it.

An unexpected perf feature

Posted May 22, 2013 0:05 UTC (Wed) by gerdesj (subscriber, #5446) [Link] (20 responses)

On reflection, and now that I've bothered to check, I DO have perf enabled. What about treating kernel modules in a similar way to, say, PHP extras like cURL in Ubuntu? (I only say Ubuntu because I have some experience there - I'm sure others have similar policies.)

There must be a "distro-required subset" that needs to be installed at initial installation time, and then only add the extra modules when a package is installed?

I admit this would probably add rather a lot of extra work if too granular, but it might assist in mitigating unforeseen future snags.

For example, have a distro basic kernel and exclude all V4Lx drivers on a server; even on a server version, install the V4L modules if, say, Myth was installed.

Cheers
Jon

PS Wonder why the sheep thing didn't work then although it was described as fragile.

An unexpected perf feature

Posted May 22, 2013 1:29 UTC (Wed) by deater (subscriber, #11746) [Link] (18 responses)

> On reflection and now I've bothered to check I DO have perf enabled,
> what about treating kernel modules in a similar way to say PHP extras
> like cURL in Ubuntu? (I only say Ubuntu because I have some experience
> there - I'm sure others have similar policies)

People keep going on about disabling perf_event or using it as a module.

perf_event *cannot* be configured as a module, and it *cannot* be disabled on x86. It's always there, and you can't disable it after boot either.
It looks like this was introduced explicitly by commit cc2067a51424dd25 in Nov 2010, but the dependency existed before that, because HAVE_HW_BREAKPOINT (default on x86) depends on PERF_EVENTS.

Maybe it's time for a /proc/sys/perf_event_paranoid value of "3" meaning no perf_events at all.

I must admit I'm of two minds about this. As an HPC researcher, the main positive feature of perf_event was the fact that it was available everywhere by default, negating the previous hassle of having to get sysadmin intervention to do performance analysis. It would be sad to return to those days, but I guess security trumps convenience.

An unexpected perf feature

Posted May 22, 2013 7:39 UTC (Wed) by nix (subscriber, #2304) [Link] (17 responses)

Quite. This API is too damn complicated (and people complained that dtrace was too complicated, bah!). I don't plan to do performance analysis on my little memory-constrained network-facing firewall: why am I obliged to carry the perf machinery there?

An unexpected perf feature

Posted May 22, 2013 9:11 UTC (Wed) by Frej (guest, #4165) [Link] (16 responses)

I'd blame the language rather than the API. The best innovation in any kernel would be to stop using unsafe languages, or at least to limit their usage. Of course this won't help with every security issue, but these days every crash seems to be one.

The research is out there and has been for a long time (see Cyclone); hopefully Rust (somewhat similar in its ideas) will show a viable way forward.

An unexpected perf feature

Posted May 22, 2013 10:34 UTC (Wed) by dgm (subscriber, #49227) [Link] (9 responses)

A "safe language" would in fact be equivalent to a "bug-free language", which is absurd, of course.

An unexpected perf feature

Posted May 22, 2013 11:46 UTC (Wed) by iq-0 (subscriber, #36655) [Link]

There is a whole lot of ground between a hypothetical "bug-free language", a language that makes it harder to write problematic constructs, a language that warns about dubious constructs, and a language in the land where all people carry BFGs (without a safety) and walk around in size 150 shoes.

An unexpected perf feature

Posted May 22, 2013 20:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

"Safe language" is actually well-defined. It means a language in which it's impossible to cause memory corruption - and such languages are certainly possible.

An unexpected perf feature

Posted May 23, 2013 9:17 UTC (Thu) by dgm (subscriber, #49227) [Link] (6 responses)

There's no way to prevent memory corruption as long as I can write to the wrong variable (ooops!). Memory protection is a feature of the underlying machine, not the language.

An unexpected perf feature

Posted May 23, 2013 12:32 UTC (Thu) by renox (guest, #23785) [Link]

Uh? Your definition of memory protection is different from the normal one, which makes your post quite useless.

An unexpected perf feature

Posted May 23, 2013 12:46 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Write a wrong value _where_?

And yes, it's totally possible to create languages (hint: Java, C#) that do NOT require hardware memory protection at all to isolate misbehaving applications.

An unexpected perf feature

Posted May 27, 2013 17:13 UTC (Mon) by dgm (subscriber, #49227) [Link] (3 responses)

> Write a wrong value _where_?

Anywhere it was not intended to be, but mostly strings, especially strings that are interpreted by the program.

> And yes, it's totally possible to create a language (hint: Java, C#) that do NOT require hardware memory protection at all to isolate misbehaving applications.

It's not the language but the virtual _machine_ they run on that isolates misbehaving applications.

An unexpected perf feature

Posted May 27, 2013 23:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Anywhere it was not intended to be, but mostly strings, specially strings that are interpreted by the program.
That won't cause memory damage.

> It's not the language but the virtual _machine_ they run on what isolates misbehaving applications.
Nope. If your language allows unrestricted pointer arithmetic then it doesn't matter at all if you are running it inside the most secure VM.

And if your language simply doesn't have a way to express pointer arithmetic then you can't use it to do memory damage.

An unexpected perf feature

Posted May 28, 2013 8:59 UTC (Tue) by etienne (guest, #25256) [Link] (1 responses)

> And if your language simply doesn't have a way to express pointer arithmetic then you can't use it to do memory damage.

IMHO you can't use that language to manage memory either, i.e. you can't write malloc()/free(), nor what the kernel needs - for instance "allocate a contiguous DMA-able buffer accessible to a DMA32 PCI card, fail if a wait would be needed, and give back its physical address", or "free memory blocks after the DMA hardware has finished sending them", or any variation of it.
Obviously, if you had perfect hardware and a processor with an "allocate_memory" and a "free_memory_after_both_software_and_DMA_have_finished_with_it" assembly instruction, then maybe...

An unexpected perf feature

Posted May 28, 2013 9:06 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

You can certainly do it. You won't be able to do free() unless you can statically prove its correctness, but doing malloc() and using GC to collect unused objects is possible.

You also certainly can allocate buffers with special properties, and you can even do stuff like reading from within buffers by using fat pointers (i.e. a pointer with a length).

In fact, it's all been done in the past, even for OS kernels. It's a question of practicality, not possibility.

An unexpected perf feature

Posted May 22, 2013 16:40 UTC (Wed) by nix (subscriber, #2304) [Link]

I'd blame the language *as well* as the API. C makes it very easy to write insecure and buggy code, but the API being ridiculously complex gives you a lot of corners to make undetected-until-too-late mistakes in. I don't see how anyone could be confident that anything of that complexity was secure :/

(This is really the same complaint I have about SELinux policies. I *like* complexity, but Schneier is right: it is the enemy of security -- and the entire programming profession is helping out. Including me. Ah well.)

An unexpected perf feature

Posted May 22, 2013 18:07 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (3 responses)

Show me a type system that I can accurately type functions written in assembly, and I will believe there are safe languages that could be used.

Until we can stop escaping the type-system in a kernel there is no such thing as a safe language.

An unexpected perf feature

Posted May 22, 2013 20:51 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

>Show me a type system that I can accurately type functions written in assembly
Your wish is my command: http://www.cs.cornell.edu/talc/

An unexpected perf feature

Posted May 23, 2013 4:34 UTC (Thu) by ebiederm (subscriber, #35028) [Link] (1 responses)

TALx86 cannot accurately type programs encoded in assembly.

There are no dependent types to allow the removal of bounds checks in array updates; instead, magic array macros must be used. (Not supporting general memory accesses is a significant failure in adding types to assembly language.)

There is no support for multiprocessing.

There is no support for manual memory management; TALx86 requires a garbage collector.

Which means a large number of common kernel constructs cannot be encoded in this assembler. We are unfortunately quite a ways from safe languages that can be used for kernel programming.

An unexpected perf feature

Posted May 23, 2013 4:39 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Dependent types can feasibly be added, and fat pointers are already supported in hardware (on a couple of exotic architectures, but still).

I don't see many problems with multithreading, though formalizing the memory model should be quite interesting.

A garbage collector of some sort seems inevitable in any case, perhaps with some kind of region inference to help with short-lived allocations. In some limited cases it may be possible to use static proofs of correctness.

An unexpected perf feature

Posted May 29, 2013 5:41 UTC (Wed) by schabi (guest, #14079) [Link]

I agree.

Interpreting an unsigned 64-bit ID as a signed 32-bit one is exactly the sort of thing I'd expect to generate at least a compiler warning. Even in C, this should have been caught by good compilers or static analysis tools.

So, what went wrong?

An unexpected perf feature

Posted May 26, 2013 13:03 UTC (Sun) by spender (guest, #23067) [Link]

I've uploaded the definitive exploit for the vulnerability here:
http://grsecurity.net/~spender/exploits/enlightenment.tgz

It should work on any distro, x86 or x64, with any combination (or lack) of CONFIG_MODULES and CONFIG_JUMP_LABEL. I've personally tested it on RHEL, Ubuntu, Debian, and Gentoo, custom kernels and distro kernels: 2.6.32 (RHEL), 2.6.38, 3.0, 3.2, 3.5, 3.8. It requires no System.map or /proc/kallsyms on x64 (even though a System.map could be trivially obtained, or the symbols extracted from the visible kernel image in /boot instead). Once it gains control in the kernel it resolves symbols internally. Its generic ring0 payload (reusable with any other kernel exploit where the attacker controls eip/rip) disables SELinux, AppArmor, IMA -- all LSMs. It breaks out of any chroot or mnt namespace. It breaks out of vserver and OpenVZ. It creates no logs and leaves the system in a consistent state.

The initial port was completed last week:
http://www.youtube.com/watch?v=WI0FXZUsLuI
http://www.youtube.com/watch?v=llqxbMgIztk

I delayed publication a week to give people more time to update, but this exploit should be considered a demonstration of the true risk of depending on patching individual bugs as a means to security or in using shared-kernel virtualization without any kind of kernel self-protection. The techniques in the exploit, some of which have never been published before, are the kinds of techniques that are used and sold in private.

-Brad

Bug class

Posted May 22, 2013 2:18 UTC (Wed) by cesarb (subscriber, #6266) [Link] (5 responses)

Would it be possible to add a check to something like sparse to detect this class of bug (a value being truncated by being assigned from a larger type to a smaller one)? And the related sign-confusion class of bug (assigning from unsigned to signed or vice versa without checking the range first)?

Bug class

Posted May 22, 2013 5:25 UTC (Wed) by smurf (subscriber, #17840) [Link]

I suspect that there are a whole heap of places in the kernel where this happens; most are probably perfectly sane and don't need to be checked, because the resulting value is not used as an offset to index anything.

Long-term, of course, that's not an excuse not to do the work.

Bug class

Posted May 22, 2013 9:28 UTC (Wed) by mcfrisk (guest, #40131) [Link]

Wouldn't Coverity find this already?

Bug class

Posted May 22, 2013 12:28 UTC (Wed) by PaXTeam (guest, #24616) [Link] (1 responses)

> [...]to detect this class of bug (value being truncated by being assigned from a larger type to a smaller type)

it's not true that casting a value from a wider type to a narrower one is a bug in general (think how often long->int happens when error codes are returned in the kernel) so detecting that construct would produce immense amounts of false positives (based on our own experiments last year we're talking about many thousands of instances for allyesconfig/amd64). and even if the value is not preserved during the narrowing cast it may not be a bug but intended behaviour and it's very hard to tell for a compiler.

with that said, Emese Revfy wrote a gcc plugin for us that tries to detect this specific instance of unchecked signed array index usage. the results are somewhat more manageable: there are about 200 instances on allmod/amd64, most of which are false positives (interprocedural analysis, etc would help eliminate most of them, but that's not a half-an-hour project as this one was), however some of them look like genuine bugs, albeit nothing security related so far (of those that i checked, that is).

> And the related sign confusion class of bug (assigning from unsigned to
> signed or vice versa without checking the range first)?

the size overflow plugin from Emese (http://forums.grsecurity.net/viewtopic.php?f=7&t=3043) does this but we had to scale it back due to the amount of false positives (even gcc creates these itself during canonicalization), as it's just not feasible to eliminate such constructs for now.

Bug class

Posted May 22, 2013 13:35 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> and even if the value is not preserved during the narrowing cast it may not be a bug but intended behaviour and it's very hard to tell for a compiler.

If the narrowing losing the higher bits is intended behavior, the programmer should either use an explicit cast instead of an implicit one (x = (int)y instead of int x = y), or explicitly mask out the upper bits (x = y & 0xffffffffu). This would be similar to having to add an extra pair of parentheses when using bitwise operators within an if/while/for, and would also make it more visible to readers of the code that the narrowing is taking place. So this class of false positives (intentionally masking by narrowing) is not a problem.

However, I do agree that the rest of the false positives would be a problem. What would be needed is a way to somehow track the range of the variables, so the checker would know that, for instance, the variable being narrowed is in the range -1..-4095 (the error code case).

> however some of them look genuine bugs, albeit nothing security related so far (of those that i checked that is).

Will they be reported upstream? A bug is a bug, so even if they are not security related, they should be fixed.

Bug class

Posted May 23, 2013 10:41 UTC (Thu) by error27 (subscriber, #8346) [Link]

Smatch theoretically can find buffer underflows.

In this case it missed the bug for several reasons: 1) the value wasn't being marked as untrusted user data; 2) the cross-function tracking wasn't working; 3) the underflow check was ignoring cases where we capped the upper value. *eyeroll*

I haven't pushed all the Smatch changes yet, but it's sort of working now, to the point where it would have found this bug. It also found a couple of problems in ATM network drivers.

An unexpected perf feature

Posted May 22, 2013 13:45 UTC (Wed) by fuhchee (guest, #40059) [Link]

"At that time, it was not recognized as a security problem ..."

The passive voice underlines the uncertainty. One can certainly take Greg at his word that he didn't realize it. However, the absence of formal treatment at the security@ mailing list by others is just as consistent with unawareness as with sweeping-under-the-rug.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds