
x16 ABI coming up


Posted Apr 2, 2012 2:47 UTC (Mon) by ncm (subscriber, #165)
Parent article: The 3.4 merge window is closed

While a 32-bit address space is needed for many programs, many others would run fine in a 16-bit space. With 16-bit pointers and 16-bit int, data structures would be much smaller than on x64 or x32. Better yet, the complete data sets of dozens of programs would fit in L3 cache. A machine with no RAM at all, just a CPU, would be useful in many applications where the need for separate RAM and a RAM controller adds prohibitive expense.

Carefully designed, an x16 mode would enable four-slice SIMD programming on commodity hardware, using a quarter of each register for each slice. Indeed, x32 could run with two slices. A kernel confined to 4G is little inconvenienced, but a kernel that can do twice as many operations in many cycles may be noticeably faster. Gcc already generates code for Itanic; can sliced x32 be difficult to add?



x16 ABI coming up

Posted Apr 2, 2012 3:05 UTC (Mon) by dmarti (subscriber, #11625) [Link]

"Carefully designed" -- I'm putting that in for Understatement of the Year. It would be an amazing project though.

x16 ABI coming up

Posted Apr 4, 2012 4:55 UTC (Wed) by ncm (subscriber, #165) [Link]

Probably you'd only get to use half of each register -- odd-numbered registers for odd slices, even for even -- when doing carry arithmetic.

A great advantage of an SSE ABI is that the kernel promises not to touch those registers. If it could then be persuaded not to push the other registers, context switches ought to get very quick -- great for interrupt latency. That is, until you build the kernel using the SSE ABI, too...

x16 ABI coming up

Posted Apr 2, 2012 3:57 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246) [Link]

L3? A 64K working set could fit in L1 in many modern processors, and in L2 in the rest. Ok, technically it's "64K 'elements'" where an "element" might be a char, short, int, long or long long. That gives you a potential working set up to 512K in the case of long long and a potential working set of 128K for the more common int. That's still within the bounds of most L2s, though.
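
The arithmetic above can be sketched in a few lines of C. This is only an illustration: it assumes the 16-bit int proposed in the parent comment (so int is 2 bytes) and an 8-byte long long, and the function name x16_working_set_bytes is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct values a 16-bit pointer can hold. */
#define X16_ELEMENTS 65536u

/* Upper bound on the working set, in bytes, if a 16-bit pointer indexes
 * "elements" of the given size rather than bytes (hypothetical x16 ABI). */
size_t x16_working_set_bytes(size_t element_size)
{
    assert(element_size > 0);
    return (size_t)X16_ELEMENTS * element_size;
}
```

With these assumptions, x16_working_set_bytes(1) is 64K for char, x16_working_set_bytes(2) is 128K for a 16-bit int, and x16_working_set_bytes(8) is 512K for long long.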

Carefully designed, an x16 mode would enable four-slice SIMD programming on commodity hardware, using a quarter of each register for each slice.

Provided you figure out how to make them all branch together. ;-)

I'm sure whatever you come up with will be Lirpa 1 compliant, should you attempt it.

x16 ABI coming up

Posted Apr 2, 2012 4:44 UTC (Mon) by ncm (subscriber, #165) [Link]

Note "dozens". But yes, it's a good idea to fit your whole program and all its data structures in L1 cache.

Speaking with entire seriousness, I have read of a brilliant implementation of AES that uses 8 128-bit SSE registers - one register per bit - and enciphers 128 bytes in 256 cycles. It's sadly obsoleted by AES-NI instructions.

x16 ABI coming up

Posted Apr 2, 2012 5:48 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246) [Link]

Ah yes, bitslicing. I was able to implement a certain stream cipher in about 1 cycle per bit on our DSP by doing 32 parallel blocks in a bit-slice configuration. This was about 25x as fast as the best non-bitsliced implementation. At the time (and to my knowledge), our implementation was the only software implementation they ever certified.

(I won't say which, but the company that owns the algorithm will happily sell you a synthesizable accelerator, and their algorithm is in the standard. Furthermore, they're responsible for system certification, so... The software implementation was practical because of how I sped it up. It was certifiable because of various hardware security features we had developed.)

Still... 2 cycles/bit seems a bit slow. IIRC, our DSP achieves that on AES without bitslicing. Granted, though, that's assuming everything is all in cache a priori...

And yes, I noted "dozens." I refrained from trying to list them...

x16 ABI coming up

Posted Apr 2, 2012 5:49 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246) [Link]

I should say "2 cycles/bit seems slow for a bit-sliced AES." Of course, I've never tried to write / benchmark AES on an x86.

x16 ABI coming up

Posted Apr 3, 2012 0:22 UTC (Tue) by ncm (subscriber, #165) [Link]

Two cycles per _byte_, on the other hand, is stellar.

x16 ABI coming up

Posted Apr 3, 2012 1:30 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246) [Link]

Ah yes, with that I must agree. For some reason I read that as 128 bits, not 128 bytes. Mea culpa.

One of these days I'll have to give something like that a try on one of our newer processors that have wider operations. The bitsliced stream cipher I mentioned only had 32 lanes because I used 32-bit registers. The magical thing about bitslicing algorithms like this is that you can go even faster (at least for parallelizable blocks) just by making the variables wider.
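
A minimal sketch of the bitslicing idea, not the cipher discussed above: each variable holds one bit position from many independent lanes, so a single logic instruction works on all lanes at once. The names slice_t, LANES, set_lane, get_lane, sliced_full_add and sliced_add4 are invented for this illustration. Widening slice_t (say, to a 128-bit vector type) would add lanes without changing the logic.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t slice_t;   /* one bit from each of 64 independent lanes */
#define LANES 64

/* Store a 4-bit value into one lane; v[i] holds bit i of every lane. */
void set_lane(slice_t v[4], int lane, unsigned value)
{
    assert(lane >= 0 && lane < LANES && value < 16);
    for (int i = 0; i < 4; i++) {
        v[i] &= ~((slice_t)1 << lane);
        v[i] |= (slice_t)((value >> i) & 1u) << lane;
    }
}

/* Read the 4-bit value back out of one lane. */
unsigned get_lane(const slice_t v[4], int lane)
{
    assert(lane >= 0 && lane < LANES);
    unsigned value = 0;
    for (int i = 0; i < 4; i++)
        value |= (unsigned)((v[i] >> lane) & 1u) << i;
    return value;
}

/* Full adder on all lanes at once: pure AND/OR/XOR, no lookup tables. */
void sliced_full_add(slice_t a, slice_t b, slice_t cin,
                     slice_t *sum, slice_t *cout)
{
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (cin & (a ^ b));
}

/* Add two 4-bit numbers in every lane (result wraps modulo 16). */
void sliced_add4(const slice_t x[4], const slice_t y[4], slice_t out[4])
{
    slice_t carry = 0;
    for (int i = 0; i < 4; i++)
        sliced_full_add(x[i], y[i], carry, &out[i], &carry);
}
```

One call to sliced_add4 performs 64 independent 4-bit additions using four full-adder steps; the cost is that inputs must first be transposed into this bit-per-variable layout.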

Thinking about AES specifically... The S-box must've been a bear! LUTs don't work very well in a bitslice world, and IIRC the AES S-boxes are 8-input, 8-output, so rendering them as a system of binary functions of 8 variables can also be messy. (I haven't looked to see just how reducible they are or aren't, but I suspect they're pretty tough.)

Reducing the logic functions for S-boxes is an enterprise in its own right. For the unnamed algorithm I mentioned previously, I was able to take the total logic operations for all its S-boxes from around 280 down to around 140 using a special solver that tried to find minimal tree-like sequences of instructions to evaluate all possible boolean functions of five variables. (The 280 vs. 140 was measured across the entire set of S-boxes.) I did this after multiple compilers and synthesis tools failed to reduce the logic operation count below ~280.

Of course, I found out after-the-fact that Donald Knuth was playing in the same space at about the same time, and came up with an even better approach than mine.

Aaaaanyway... I'm horribly off topic. I'll stop now.

x16 ABI coming up

Posted Apr 4, 2012 2:32 UTC (Wed) by ncm (subscriber, #165) [Link]

To list them, use ps(1).

Probably I should have written "dozens of processes", instead.

Has anybody booted Linux on a desktop CPU with no RAM, yet, using only L3 cache for volatile storage? Maybe it's still possible to show off.

x16 ABI coming up

Posted Apr 4, 2012 4:13 UTC (Wed) by Fowl (subscriber, #65667) [Link]

The bootloaders burned into most motherboards are a bit fussy about RAM being installed, unfortunately.

x16 ABI coming up

Posted Apr 4, 2012 4:45 UTC (Wed) by ncm (subscriber, #165) [Link]

Yes, but as noted below, coreboot might be made more forgiving. Probably any DMA must be avoided...

x16 ABI coming up

Posted Apr 2, 2012 7:22 UTC (Mon) by elanthis (guest, #6227) [Link]

While I'm about 87.3% sure you're foolin'... it's impossible to add, due to the nature of how the x86 instruction set actually works, and "slicing" as you imply it is impossible to do without a lot of instruction overhead to emulate it. x32 does not allow for addressing just the upper half of a 32-bit register, in particular. You'd have to copy values into another register, shift and mask them, modify them, shift them back, combine them with the destination register, and then finally store them. The extra x86_64 register space would be offset by the excess registers needed to get anything done.
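
The copy, shift, mask, modify and recombine sequence described above looks roughly like this in C. It is a sketch of the emulation overhead, not a proposal for how a compiler would do it; slice_get, slice_set, slice_mul and slices_add are hypothetical helper names. The last function shows the usual masking trick for adding both 16-bit halves in one pass without letting a carry leak from the lower half into the upper one.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit slice idx (0 = low half, 1 = high half) of a 32-bit value. */
uint16_t slice_get(uint32_t reg, unsigned idx)
{
    assert(idx < 2);
    return (uint16_t)(reg >> (16 * idx));
}

/* Replace slice idx with v, leaving the other slice untouched. */
uint32_t slice_set(uint32_t reg, unsigned idx, uint16_t v)
{
    assert(idx < 2);
    unsigned shift = 16 * idx;
    return (reg & ~((uint32_t)0xFFFFu << shift)) | ((uint32_t)v << shift);
}

/* Multiply one slice by k: copy out, shift, modify, mask, shift back, merge. */
uint32_t slice_mul(uint32_t reg, unsigned idx, uint16_t k)
{
    uint32_t v = slice_get(reg, idx);
    return slice_set(reg, idx, (uint16_t)(v * k));
}

/* Add both slices at once, each wrapping modulo 2^16 independently. */
uint32_t slices_add(uint32_t a, uint32_t b)
{
    /* Add the low 15 bits of each slice; no carry can cross a slice boundary. */
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    /* Fold in the top bit of each slice separately. */
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

For example, slices_add(0x0001FFFF, 0x00010001) gives 0x00020000: the lower slice wraps to zero and the upper slice is not disturbed. Even this simple addition takes several logic operations where a native 16-bit add would take one.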

Also, I really doubt that many useful applications can fit into a 16-bit address space anymore, given that the code size of many essential system libraries is already larger than 64k. Where the data sets and the algorithms really are small, the code can just be written with 16-bit ints and 16-bit offsets into buffers. This is vastly different from the 32-bit world, which is still large enough to handle large data sets and very large, complex codebases.

x32 is just about giving the ISA improvements to applications that perform better with 32-bit addressing. x16 would be about inventing a new retarded emulated ISA for applications that would perform better with 32-bit addressing.

x16 ABI coming up

Posted Apr 2, 2012 7:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

You can do slicing just fine with SIMD instructions.

In fact, there's an ultra-fast XML parser that works based on this technique: http://parabix.costar.sfu.ca/

And yes, it's really really fast.

x16 ABI coming up

Posted Apr 2, 2012 13:28 UTC (Mon) by etienne (subscriber, #25256) [Link]

Maybe you would not be able to memory-map the libraries you would use in 64 Kbytes, but sometimes I miss variable-size pointers in C, like:
char *ptr; // 32 bits (in fact system default)
char *short shortptr; // 16 bits
char *long longptr; // 64 bits
The shortptr is useful for ia32 "mov (%bx),%eax" but also for RISC processors where you cannot load a 32-bit immediate value into a register in a single instruction: " lis r9,0x1234 ; ori r9,r9,0x5678 ".
Sometimes you know that the upper 16 bits will not change in two different pointers, so the lower 16 bits would be a "short pointer".
Also, in some C structures defined by a standard, some addresses are 64 bits wide even in a 32-bit environment; it would be nice to be able to declare those fields as 64-bit pointers...
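
In standard C today, the short pointer idea can be approximated by storing a 16-bit offset from a base address that is known not to change. A minimal sketch, assuming every target lives inside one window of at most 64 KiB; shortptr_t, to_short and from_short are invented names, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "short pointer": an offset within a single 64 KiB window. */
typedef uint16_t shortptr_t;

/* Convert a full pointer into an offset from base; p must lie in the window. */
shortptr_t to_short(const unsigned char *base, const void *p)
{
    ptrdiff_t off = (const unsigned char *)p - base;
    assert(off >= 0 && off <= UINT16_MAX);
    return (shortptr_t)off;
}

/* Rebuild the full pointer from the window base and the stored offset. */
void *from_short(unsigned char *base, shortptr_t sp)
{
    return base + sp;
}
```

A structure that stores shortptr_t fields instead of char * fields spends 2 bytes per reference, at the price of an explicit from_short call at every dereference, which is exactly the bookkeeping a language-level short pointer would hide.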

x16 ABI coming up

Posted Apr 3, 2012 13:37 UTC (Tue) by jengelh (subscriber, #33263) [Link]

>char *ptr; // 32 bits (in fact system default)
>char *short shortptr; // 16 bits
>char *long longptr; // 64 bits

Why not just reuse "near char *shortptr" and "far char *longptr" :-)

x16 ABI coming up

Posted Apr 4, 2012 10:52 UTC (Wed) by etienne (subscriber, #25256) [Link]

Well, I did not want to add the concept of the old FAR pointer (16-bit segment + 16-bit offset), and pointer attributes (const, volatile) are already written at the position I proposed:
char *const constptr;
But source code would be simpler (less asm("") statements) if we had more choices of pointers, like:
pointers to I/O space (inb/outb)
pointers to MSR space (rdmsr/wrmsr; mfspr/mtspr)
pointers to PCI space
pointers to segmented space (16+32 bits gs:(%ebx) )
pointers to kernel/user space (even for x86 architecture)
pointers to physical memory vs virtual memory
Maybe the source code of GCC would not be as simple...

x16 ABI coming up

Posted Apr 4, 2012 12:42 UTC (Wed) by PaXTeam (subscriber, #24616) [Link]

gcc 4.6+ has some support for the named address space feature from the Embedded C technical report (ISO/IEC TR 18037); some on your list could be simulated that way (in PaX there's a plugin that (ab)uses this mechanism to implement __user/__kernel/etc).

x16 ABI coming up

Posted Apr 2, 2012 18:20 UTC (Mon) by njs (guest, #40338) [Link]

I think this is what the GPU programming folks are actually doing.

x16 ABI coming up

Posted Apr 2, 2012 18:50 UTC (Mon) by khim (subscriber, #9252) [Link]

Not just GPU programmers. But that's not the same. Addresses are usually kept as 32bit or 64bit in this scheme. Only data is reduced to 16bit.

x16 ABI coming up

Posted Apr 3, 2012 0:31 UTC (Tue) by ncm (subscriber, #165) [Link]

I, also, am only 87.3% sure I was fooling. An ABI that (mostly?) only used the SSE registers could be interesting, given good compiler support.

But it's getting harder and harder to write April Fools' jokes. Perhaps the death knell for the form was The Onion's c.2000 headline "Long National Nightmare of Peace and Prosperity Finally Over".

x16 ABI coming up

Posted Apr 3, 2012 0:50 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Hm!

There IS such an ABI and a compiler. coreboot uses it for the code that initializes the RAM controller.

x16 ABI coming up

Posted Apr 3, 2012 4:26 UTC (Tue) by ncm (subscriber, #165) [Link]

See, this is exactly what I mean. There's no room for japes any more. Linux on 6502? Done. Linux on an x86 emulator coded in JavaScript running under Firefox? Done. Probably this is a direct corollary of Rule 34.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds