
An unexpected perf feature

Posted May 22, 2013 20:09 UTC (Wed) by PaXTeam (guest, #24616)
In reply to: An unexpected perf feature by helge.bahmann
Parent article: An unexpected perf feature

first, the cost of non-0 bases is independent of width, the CPU already has full 64 bit adders just for this purpose (think support for 'lea' and fs/gs overrides). however for my purposes what is important is the ability to define segment limits and flip the meaning of that limit (lower or upper, for expand-down segments), so non-0 bases are pretty much irrelevant.

second, segment limits are pretty much the best way to implement world separation, they require the least amount of hw resources: one parallelizable compare on the virtual address (i.e., it can be done before or in parallel to the virtual->physical translation) vs. tons of paging related caching while resolving the physical address. so no, i maintain that ditching all of segmentation was a design mistake and sparc or VT-x style ASIDs are not equivalent replacements (just ask Bromium for their performance numbers ;).

third, while SMEP/SMAP are useful performance enhancements for amd64, they require quite a bit more kernel infrastructure to make them robust. in particular, since they rely on paging, *all* of the paging related structures must be properly protected against memory corruption bugs, which is quite a bit larger an attack surface than the GDTs (and no kernel i know of does this work except for PaX). so while i 'mourn the loss of segmentation' (which is not as oddball as you think, and it's never been an incomplete hack), i've been doing the extra work for a decade now to make paging based replacements actually secure as well ;).



An unexpected perf feature

Posted May 22, 2013 21:06 UTC (Wed) by helge.bahmann (subscriber, #56804) [Link] (10 responses)

First, AFAICT fs/gs overrides (as well as lea) go through the ALU and pay the access penalty precisely because there is no dedicated adder anymore; it would be a total waste of resources otherwise, as segment overrides are in practice used just for TLS, which is not that much of a fast path. (What happens with a base!=0 cs is actually even more disturbing, as it totally messes up the BTB.)

Second, TLBs tagged with address space identifiers are actually cheaper and more generic than segment limits: You can save the comparators (even though they don't add latency because they can operate in parallel), and address space layout can be done at will. Since for separation the ranges are intended to be disjoint, both approaches will actually have the same TLB footprint, so no advantages here for segment limits either.

VT-x style ASIDs are only poor due to their tie-in with, err, VT-x, their equivalents work just fine on every architecture that has been designed with address space identifiers (under their various names, as context/thread/process/... identifiers) to begin with (and BTW have more general applications for fast thread switching etc.). As for the paging-related caching: You don't get rid of that using segment limits, you just pile another layer on top.

As for the attack surface regarding page tables: Why do you think it is easier to protect page tables mapping virtual address space [kern_start;kern_end) in the segment limit case, than it is to protect page tables mapping asid=kern in the asid case? (Rhetorical question, as there is no difference, so the whole argument regarding paging is a red herring).

And yes, x86 segmentation is oddball in that it was never designed for what it is now being used for and is difficult to make efficient in hardware, while address-space-based methods are both easy to make efficient and can more explicitly support the intended separation semantics. Considering segmentation the best solution to the problem is suffering quite a bit from Stockholm syndrome ;)

Really, I don't question your accomplishments, but segmentation is a shallow local minimum, and while infinite effort can be spent micro-optimizing around this minimum, there are far deeper local minima (and many of them are outside x86).

An unexpected perf feature

Posted May 23, 2013 23:33 UTC (Thu) by PaXTeam (guest, #24616) [Link] (9 responses)

1. the ALU concept is so 80's ;). seriously, 'modern' CPUs are a tad bit more complex, Intel and AMD CPUs all have dedicated adders for address calculations (not just for the already mentioned purposes but also for rip-relative addressing). TLS not being a fast path is probably news to everyone spending their time on multithreaded applications, and Intel, in the grand conspiracy of schemes, must have added dedicated fs/gs base manipulating insns to their latest CPUs in order to slow these workloads down even more. as for the BTB, isn't it indexed by virtual and not logical addresses?

2. ASIDs cannot by definition be cheaper, paging related caches and checks will always require more circuits and cycles than a simple comparator. not sure what address space layout decisions have to do with this though, when you have ASIDs by definition you have full address spaces for each ASID. if you meant mixing different ASIDs in the same virtual address space (how?), then nobody does that.

conceptually ASIDs are indeed more generic except this fact is utterly irrelevant, there isn't a mainstream OS out there that would make use of this ability (i.e., mix user and kernel pages in arbitrary ways in the same address space). in practice everyone simply divides the virtual address space into two parts between userland and the kernel, so simple limit checking would do just fine (vs. checking access rights at every level of the paging hierarchy).

3. ASIDs do have their uses indeed, in fact i would love to have a better mechanism on Intel/AMD CPUs to implement some of my ideas but for simple user/kernel separation a segment limit check has no match.

4. to understand the difference in the security level provided by a segmentation vs. a paging based non-exec/no-access kernel protection scheme we have to consider the attacker's assumed abilities. against an arbitrary read and write capability they're equivalent. however that is only the ideal attacker model we use to evaluate the theoretical boundaries of protection schemes; in practice we rarely get such bugs, and that's exactly where the difference becomes important. in particular, the segmentation based approach can achieve a certain level of self-protection by simply ensuring that the top-level page tables enforce the read-only property on the GDTs, whereas doing the same for page tables themselves is much harder - this is the attack surface difference. that said, KERNEXEC (on i386/amd64 so far) does attempt to minimize the exposure of top-level page tables but it's far from being a closed system yet (breaking up kernel large page mappings, tracking virtual aliases, etc. have non-negligible performance impact).

5. what has segmentation been designed for then? surely there's only so much you can do with a data or code segment ;). why is it difficult to make it efficient in hardware? and which particular bit (there're many descriptor types)? why would paging related data structures be easier to handle in hw than segmentation ones? and how do you imagine beating a simple comparator? so far you haven't offered any facts to make me think otherwise.

An unexpected perf feature

Posted May 24, 2013 8:53 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link] (7 responses)

1. Sure it's not a single 80s style alu anymore, rather it is a set of independently operating arithmetic units; typically two are adders, and whatever is needed is scheduled to them (addr gen or general arithmetic alike). Doing anything else is wasteful (which is to say that it is done nevertheless occasionally *if* it can speed up a fast-path, which for address calculations it cannot).

2. TLS is not a fast path, profiling shows around 1 in 1000 to 10000 instructions is TLS, so no one bothers paying a 1 cycle penalty for that.

3. ASIDs are cheaper because they just become part of the tag in the TLB. You do the TLB lookup (which you do anyways) and compare the tag for equality which is cheaper than a "greater than" comparison against an address space limit (which, incidentally, is just another adder), end of story.

4. Enforcing read-only is easily done at the page level, so what's the point?

5. What was segmentation designed for? To map the concept of program object segments (the name may be a hint, right?) directly to hardware, and to facilitate sharing and relocatability that way. This is also where the security model for them originates. It was conceived when people were somehow not yet certain paging was scalable, but I am too young to have been involved back then; you would have to ask the 'bearded guys'.

And to answer your last question: Paging is cheaper because hashed lookup (TLB) and equality comparison are cheaper than "less than" comparisons in hardware. Segmentation is a conceptual dead-end, live with it :)

An unexpected perf feature

Posted May 24, 2013 12:00 UTC (Fri) by ras (subscriber, #33059) [Link] (6 responses)

As a "bearded guy" (I have no beard) who has designed and written BIOSes and protected-mode operating systems for x86, and who had to look at the x86 architecture again in mind-numbing detail (as in reading the four Intel x86 "data sheets" several times in order to port the linux-abi system to AMD64), I recall my thoughts at the time being "my - AMD has cleaned this mess up".

For those of you defending Intel's decisions of that era - I lived through it. At the time Intel regarded all programmers as idiots, and decided to solve the problem with hardware. Thus we have the absurdly complex designs we see today, with x86 interrupts taking 2000 cycles (?!?!? - that was back when Intel was game enough to publish cycle counts), and that wasn't even the slowest instruction. Can anyone remember a Task Gate Descriptor?

Yet that wasn't the worst of it. The worst of the worst died. It was a new Intel architecture called the iAPX432. It caused more excitement than Haswell in its day. I am sure its forebears would prefer we forgot it entirely. It remains in my mind the ultimate testimony to the arrogance caused by ignorance, in this case the Electrical Engineers thinking they could tell Software Engineers how we should do our jobs. But I exaggerate. Back then we weren't allowed to call ourselves Engineers.

Still they got their revenge with x86. In it they made it plain we could not be trusted to swap between two tasks quickly. Only 30 years later with the advent of ARM has their folly been made plain to everyone.

An unexpected perf feature

Posted May 24, 2013 13:40 UTC (Fri) by dlang (guest, #313) [Link] (4 responses)

you aren't going nearly far enough back

segmentation on the x86 came about with the 80286 (or possibly even the 80186, I'm not sure) CPU.

It was intended as a way to allow programs to continue to use 16 bit 8086 addressing but be able to use more than 64K of ram in a system (by setting a segment offset and then running the 16 bit code inside that segment)

It never was an effective security measure, because to avoid wasting tons of memory on programs that didn't need it, the segments overlap, allowing one program to access memory of another program.

When the 80386 came out and supported paging and real memory protection, everyone stopped using segments real fast.

An unexpected perf feature

Posted May 24, 2013 23:31 UTC (Fri) by ras (subscriber, #33059) [Link] (3 responses)

> segmentation on the x86 came about with the 80286 (or possibly even the 80186, I'm not sure) CPU.

8086 actually. It was their way of extending the address space of the 8080 to 20 bits.

It was just a slightly more sophisticated take on the way CPU designers have gone about extending the range of addressable memory beyond the natural size of internal registers for eons. It had the advantage that you could use separate address spaces for code and data, giving you 64K for each. So by using a small amount of extra silicon they effectively doubled the amount of address space you had over the simple "page extension" hack everybody else was doing. Kudos to them.

So you are saying the 80286 is where the rot set in. In the 80286 they had enough silicon to give the programmer true isolation between processes. They could have gone the 32 bits + page table route everybody else did, but no, we got segmentation (without page tables) instead, and retained 16 bit registers. Why? Well, this was also the time the hardware designers had enough silicon to implement microcode - so they could become programmers too! And they decided they could do task switching, ACLs and god knows what else better than programmers, so they did. In other words, somehow they managed to forget their job was to provide hardware that ran programs quickly, and instead thought their job was to do operating system design.

It was a right royal mess. Fortunately for them the transition from DOS to Windows / OS2 (which is where the extra protections and address space mattered) took a while, and by then the i386 had been released. It added 32 bits and paging, so we could ignore all that 16 bit segmentation rubbish and get on with life. It turned out the transition to multitasking operating systems wasn't waiting on programmers figuring out how to do it (who would have thunk it?), but rather on the price of the RAM needed to hold several tasks at once coming down.

People here defending the segmentation model should try writing for the 80286, which could only address 64K of code and 64K of data at any one time. There was no reason for it. The 68000 family had a much better memory model at the time, so it wasn't silicon constraints. Well, there would have been enough silicon if they hadn't devoted so much of it to creating their own operating system on a chip.

Intel finally came to their senses with Itanium. With it they used all that extra silicon to do what hardware designers should be doing - make programs run fast. Sadly it came along too late.

Back to your point - the time line. The 80286 was released in 1982. The iAPX432 was meant to be released in 1981. The 80286 was the fall back position. As Wikipedia points out, this is a part of its history Intel strives to forget. You will find no mention of the iAPX432 on their web site, for instance.

An unexpected perf feature

Posted May 24, 2013 23:50 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

> 8086 actually. It was their way of extending the address space of the 8080 to 20 bits.

That's right, I forgot about that. And IIRC, with the 286 they only expanded it to 24 bits

but in any case, my point was that the dawn of the AMD64 chip is eons past the point where segmentation was introduced, failed, and was rejected

P.S. I disagree with you about the Itanium. It wasn't a good design. They took far more silicon than other processors and ended up being less productive with it.

In part, the speed increases were the cause of this failure. As the speed differences between memory and the CPU core get larger, having a compact instruction set where each instruction can do more becomes more important, and for all its warts, the x86 instructions do rate pretty well on the size/capability graph.

but in part the programmers really should NOT have to change their programs to work well on a new CPU where the hardware designers have implemented things differently. By hiding such changes in the microcode, instructions that get used more can be further optimized, and ones that aren't can be emulated so they take less silicon. It's a useful layer of abstraction.

An unexpected perf feature

Posted May 25, 2013 0:21 UTC (Sat) by ras (subscriber, #33059) [Link]

> P.S. I disagree with you about the Itanium. It wasn't a good design. They took far more silicon than other processors and ended up being less productive with it.

I wasn't commenting whether it was a good design. I don't know, as I haven't used it. I was just saying Itanium marked the point when Intel went back to sticking to the knitting - in other words trying to design a new architecture whose sole goal was to run programs fast. If you say they failed despite having that explicit goal then that's a shame.

As I recall the Itanium tried to improve its speed in a RISC-like way - i.e. by keeping things simple on the hardware side and offloading decisions to the compiler. In the Itanium's case those decisions were about parallelism.

I think it's pretty clear now that tuning the machine language to whatever hardware is available at the time is a mistake when you are going for speed. It might work when the hardware is first designed, but then the transistor budget doubles and all those neat optimisations don't make so much sense anymore, yet you are welded to them because they are hardwired into the instruction set. Instead the route we have gone down is to implement a virtual CPU, rather like the JVM. The real CPU then compiles the instruction set on the fly into something that can be run fast with today's transistor budget. With tomorrow's transistor budget it might be compiled into something different.

The x86 instruction set is actually pretty good in this scenario - better than ARM. It's compact, and each instruction gives lots of opportunities to execute bits of it in parallel. If this is true, then Intel ending up with an architecture that can run fast is just dumb luck, as they weren't planning for it 30 years ago. Now that I think about it, the VAX instruction set would probably be even better again.

An unexpected perf feature

Posted May 25, 2013 10:36 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

192k of data (point DS, ES, SS to different segments).

An unexpected perf feature

Posted May 27, 2013 22:06 UTC (Mon) by nix (subscriber, #2304) [Link]

As a guy who used to do stuff with Arcs, '30 years later with the advent of ARM' is bizarre. ARMs of a sort were around in the early 90s :) but, sure, they hadn't set the world on fire yet.

An unexpected perf feature

Posted May 24, 2013 9:16 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link]

Just as an addendum... Don't get me wrong, you are right to complain that a mechanism that was usable for security purposes was removed while nobody bothered to add a suitable substitute, and that all the pretty architectural ideas that had already been present and demonstrated to be workable for 25+ years were ignored, but the "proper" way forward is not to revive segmentation but to implement comparable semantics with mechanisms that are performance-neutral. SMEP and SMAP are actually quite "easy" from a hardware conceptual point of view, so it is kind of annoying that it took so long.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds