The word-at-a-time interface

Posted Jun 14, 2012 6:17 UTC (Thu) by jzbiciak (guest, #5246)
Parent article: The word-at-a-time interface

Nifty. I'm a big fan of sub-word arithmetic without the benefit of SIMD extensions. (And, in kernel mode, I'm certain there's a strong impetus to stay entirely in the integer part of the pipe for these things.)

It might help to walk through some examples on the pieces to really understand what the 32-bit x86 version does. At least, it helped me to mentally do these gymnastics. I'll first just stick to the byte-at-a-time version, and then extend to the word oriented version, expanding our our dear editor's notes. (Most or all of this is packed into the description in the article; I'm unpacking it for my benefit as much as for others'.)

The sequence expanded out is:

tmp1 = byte - 1
tmp2 = tmp1 & ~byte
rslt = tmp2 & 0x80

If you start with byte not equal to 0, then byte - 1 will not equal 0xFF. The right-most 1 will change from a 1 to a 0, and the zeros to the right of that will change to 1s. Therefore, at the end of step 1, if byte != 0, the MSB of tmp1 will either be the same as it was for byte, or it will (in the case where byte == 0x80) have changed from 1 to 0.

So, when you get to step 2, the MSB of tmp2 will be 0, since step 2 ANDs the result of step 1 with the NOT of the original byte. There are two cases here: In case 1 (byte != 0x80 && byte != 0x00) the MSB of tmp1 is the same as it was for byte so you're ANDing opposite values, whereas in case 2 (byte == 0x80) the MSB of both tmp1 and ~byte are both 0. Either way, you end up with a 0 in the MSB of tmp2. (BTW, the byte == 0x80 case is the reason you need AND-NOT, rather than XOR.)

Step 3 just extracts this bit, which should give rslt == 0x00 for all values of byte != 0x00.

If you start with byte == 0x00, at the end of step 1 tmp1 = 0xFF. When you reach step 2, then, tmp2 == 0xFF & ~(0x00) == 0xFF. Thus, at the end of step 3, rslt == 0x80 when byte == 0x00.

Things get only slightly more involved when you pack these bytes into words. All the same math works with one caveat: Any zero-byte in the word will corrupt the result for bytes to the left of it in the word. That's OK though -- the first byte is all you need, and in little-endian, that also happens to be the rightmost byte. On a big-endian machine, you're still OK, although you may have a little more work ahead of you to determine which byte was zero from the resulting packed mask.

Here's an example of where you run might into problems: (Here 'byte' is actually four bytes in a 32-bit word.)

byte = 0xAA0100AA
tmp1 = 0xAA0100AA - 0x01010101 = 0xA9FFFFA9
tmp2 = 0xA9FFFFA9 & 0x55FFFF55 = 0x01FFFF01
rslt = 0x01FFFF01 & 0x80808080 = 0x00808000

On little endian, it's enough to consider the rightmost 0x80 as the "first zero", and ignore the "ghost zero" that showed up due to the borrow. (You only get borrows when there's actually a zero.) On big-endian, if you ever have 0x80s in consecutive bytes of the result, you have to scan to find the real 0. Discontinuous 0x80s are real zeros. Consider:

byte = 0xAA00AA00
tmp1 = 0xAA00AA00 - 0x01010101 = 0xA8FFA8FF
tmp2 = 0xA8FFA8FF & 0x55FF55FF = 0x00FF00FF
rslt = 0x00FF00FF & 0x80808080 = 0x00800080

There you didn't end up with a "ghost zero" because the non-zero byte between them ate the carry. But any time you have 0x80s in consecutive bytes, you have more work to do:

byte = 0xAA000100
tmp1 = 0xAA000100 - 0x01010101 = 0xA8FEFFFF
tmp2 = 0xA8FEFFFF & 0x55FFFEFF = 0x00FEFEFF
rslt = 0x00FEFEFF & 0x80808080 = 0x00808080

This has entirely to do with the fact that big-endian orders bytes left-to-right in the word, but carries and borrows go right-to-left, and so only the rightmost result is reliable.

That gets us through determining a word has a zero. What about the next part? This comment is already long enough, so I'll start a new one for that...