
# 1000 single bit operations per FLOP


Posted Jun 20, 2008 17:42 UTC (Fri) by mikov (subscriber, #33179)
Parent article: The Kernel Hacker's Bookshelf: Ultimate Physical Limits of Computation

```I wonder how accurate the 1000 single-bit operations estimate is. Mind you, I am not an
electronic design expert, so this is just rambling.

I've been thinking that a 1-bit adder has three inputs, so it has an 8-entry truth table. (Each
entry looks something like "o=i1 & ~i2 & i3".)
After optimizations, I am counting 18 single bit operations ("ands" and "nots"). It has two
outputs, so I guess that just doubles the whole thing to 36.

So for a 32-bit integer adder we have 32*36=1152 single-bit operations. The
straightforward multiplier, which they taught us in school, will do an addition per set bit,
so if we assume 16 set bits on average, we get 18432 single bit ops, ignoring the shifts.

So, when using the most simplistic algorithms, the 1000 operations number seems far too
optimistic.

Can someone do a similar estimation for the super-clever algorithms actually used in CPUs?
(There is a detailed description in "Computer Architecture: A Quantitative Approach", but I
never understood it well enough to actually keep it in my head, and the book is not with me.
Oh yes, and I am lazy.)

```

1000 single bit operations per FLOP

Posted Jun 23, 2008 6:56 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246) [Link]

First, let's assume that each computation is done in the minimum number of steps possible. Many of the transistors in functional units today are spent on making a single copy of the operation go faster. When computing at an atomic scale, this probably no longer makes sense. It probably makes more sense to make each operation as simple and economical as possible so you can put down as many copies as you like precisely where they're needed.

With volume computing you'll have so much more connectivity to neighbors that economical implementations would likely win out. For instance, a carry-select adder, which computes two versions of the result speculatively and throws away one, wouldn't make sense since it's much, much larger than a ripple-carry adder. (IIRC, a state of the art 32-bit adder is around 3000 transistors, whereas a 32-bit ripple carry adder should be around 700 or so.)
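The carry-select idea can be sketched in a few lines of Python (a little-endian bit-list model of my own, purely illustrative): the upper half of the word is computed twice, once for each possible carry-in, and the lower half's carry-out picks the winner.

```python
def ripple_add(a, b, carry):
    """Ripple-carry add over little-endian bit lists; returns (sum_bits, carry_out)."""
    out = []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (x & carry) | (y & carry)
    return out, carry

def carry_select_add(a, b, split):
    """Carry-select: the upper half is computed speculatively for both
    possible carry-ins; the lower half's carry-out muxes one result in
    and the other copy is thrown away."""
    lo, c = ripple_add(a[:split], b[:split], 0)
    hi0, c0 = ripple_add(a[split:], b[split:], 0)   # speculate carry-in = 0
    hi1, c1 = ripple_add(a[split:], b[split:], 1)   # speculate carry-in = 1
    return lo + (hi1 if c else hi0), (c1 if c else c0)

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))
```

Both halves finish in parallel, which is how the real circuit buys latency at the cost of roughly duplicating the upper adder.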

A 1 bit full-adder—3 inputs, 2 outputs—consists of two pieces: The sum computation, which is two two-input XOR gates, and the carry computation, which is 3 two-input AND gates and 2 two-input OR gates. The logic equations are:

• sum_out = a XOR b XOR carry_in
• carry_out = (a AND b) OR (a AND carry_in) OR (b AND carry_in)

If you assume a ripple carry implementation, then that's it. 7 operations per bit if you allow the full complement of boolean operations. So, a 32-bit add will be a mere 224 boolean operations. Even if you penalize XOR and count them as 3 bit operations each, that's still only 352 boolean operations. Of course, IEEE-754 floating point isn't built around 32-bit adds. You go through the following major steps for a single-precision floating point addition:

(For now I assume a parallel implementation since it doesn't really affect the number of SBOPs. Also, I assume the more conservative 11 SBOP number for a single bit full adder.)

1. Subtract the exponents to get the "alignment difference" between the two numbers. This is an 8-bit subtract, IIRC, so would cost 88 SBOPs.
2. Shift the numbers to align them. To be conservative, I'll say this comprises the following steps in a simplistic implementation:
1. Swap the two numbers based on the sign of the difference computed in step 1. This requires two sets of 24 2:1 muxes (one per swapped output), and each 2:1 mux costs 3 SBOPs, for a cost of 2 × 24 × 3 = 144 SBOPs.
2. Shifting the smaller magnitude number to the right by up to 24 positions. If you structure this as 5 layers of 2:1 muxes (which is overkill, since some of the inputs will be zero and so this could be optimized), you get 5 × 24 × 3 = 360 SBOPs.
3. Apply the sign to both numbers. IEEE-754 is stored in sign-magnitude form. To apply the sign, you merely need to XOR the number with the sign and inject an appropriate carry when you do the addition. Thus, we can estimate this as 48 XORs, which, if you count XOR as 3 SBOPs, is 144 SBOPs.
4. Actually add the two 24 bit mantissas. If we go with 11 SBOPs per bit, this is 264 SBOPs.
5. Strip the sign from the result. Again, you could simply XOR with the result's sign bit (which should be the carry out of the last adder). 3 × 24 = 72 SBOPs.
6. Count the number of leading zeros so that we can renormalize the result. I don't know off the top of my head how many SBOPs this ought to take.
7. Shift the final result to normalize it. Conservatively, let's use the crummy shifter above, which is 5 × 24 × 3 = 360 SBOPs.
8. Update the exponent. This is another 8-bit add. 8 × 11 = 88 SBOPs.
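The crummy mux-layer shifter assumed in steps 2.2 and 7 can be sketched like this (my own little-endian bit-list model, for illustration): layer k shifts by 2^k when bit k of the shift amount is set, so 5 layers cover any shift from 0 to 31.

```python
def mux(sel, a, b):
    """2:1 mux: picks a when sel is 1, else b. Modeled at ~3 SBOPs
    (2 ANDs + 1 OR; the NOT of sel is shared across the whole word)."""
    return a if sel else b

def shift_right(bits, amount):
    """Barrel shifter over a little-endian bit list: 5 layers of 2:1 muxes.
    For 24 bits that's 5 layers * 24 muxes * 3 SBOPs = 360 SBOPs."""
    width = len(bits)
    for k in range(5):
        sel = (amount >> k) & 1
        shifted = bits[2 ** k:] + [0] * min(2 ** k, width)   # shift by 2**k, zero-fill
        bits = [mux(sel, s, b) for s, b in zip(shifted, bits)]
    return bits

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))
```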

So where does that put us in this really basic, if perhaps naive implementation? Around 1520 SBOPs plus the cost of the priority encoder used for normalization.
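Tallying the per-step estimates confirms the arithmetic (labels are mine; step 6, the leading-zero count, is left out since no estimate was given for it):

```python
# Per-step SBOP estimates from the list above (step 6 omitted: no estimate given).
sbop = {
    "step 1: 8-bit subtract":        8 * 11,      # 88
    "step 2.1: operand swap":        2 * 24 * 3,  # 144
    "step 2.2: alignment shift":     5 * 24 * 3,  # 360
    "step 3: apply signs (48 XORs)": 48 * 3,      # 144
    "step 4: 24-bit mantissa add":   24 * 11,     # 264
    "step 5: strip sign (24 XORs)":  24 * 3,      # 72
    "step 7: normalization shift":   5 * 24 * 3,  # 360
    "step 8: final 8-bit add":       8 * 11,      # 88
}
total = sum(sbop.values())  # 1520, plus the priority encoder for step 6

shifters = sbop["step 2.2: alignment shift"] + sbop["step 7: normalization shift"]  # 720
halved = total - shifters // 2  # 1160: the figure if the shifters cost half
```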

Sure, we haven't handled overflow, underflow, denormals, NaNs and so on. But, I also have assumed clumsy implementations for some of the more expensive bits, such as the shifters. If the shifters were half the cost, for example, the total drops to 1160. Overall, I'd say this quick and dirty analysis validates that Val's SWAG (Scientific Wild Ass Guess) is in the right ball park, and certainly well within an order of magnitude. Since nearly all the rest of the numbers are expressed in orders of magnitude, being off by 50% ain't so bad. The log10 of 1.5 is pretty small.

1000 single bit operations per FLOP

Posted Jun 23, 2008 7:07 UTC (Mon) by jzbiciak (✭ supporter ✭, #5246) [Link]

Regarding the multiplier... It turns out that floating point multipliers can get rid of most of the alignment steps that the adder needs to do. (See my other post.) You also don't need to apply the sign to your input or your output. So, you're just left with the addition steps.

A naive 24 × 24 multiplier would effectively do 24 24-bit adds. That's 6336 SBOPs. Assume for a moment, though, that we can eliminate about half of those because only 24 output bits are needed. This brings us down to 3168 SBOPs for the multiplier. Remaining steps:

1. Adding the exponents: another 8-bit add, so 88 SBOPs.
2. Aligning the result (shift by up to 3 positions, IIRC, implemented as 2 levels of 2:1 mux): 2 × 24 × 3 = 144 SBOPs.

This gives a total of ~3400 SBOPs. About 2× to 3× the cost of an adder.
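The same kind of tally works for the multiply (again, the names are mine, just bookkeeping):

```python
SBOP_PER_ADD_BIT = 11  # the conservative full-adder cost used throughout

partial_products = 24 * 24 * SBOP_PER_ADD_BIT  # 6336: 24 rows of 24-bit adds
kept = partial_products // 2                   # 3168: ~half, since only 24 output bits are needed
add_8bit = 8 * SBOP_PER_ADD_BIT                # 88: the 8-bit add on the exponents
normalize = 2 * 24 * 3                         # 144: 2 mux levels over 24 bits
total = kept + add_8bit + normalize            # 3400
```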

Oh, and that reminds me, you can eliminate one of the argument inversions in my adder estimate above. That squeezes another 60-70 SBOPs out. (You do add some other bit inversions here and there on the sign bits, which is why you don't get all 72 back.)

1000 single bit operations per FLOP

Posted Jun 24, 2008 22:49 UTC (Tue) by nix (subscriber, #2304) [Link]

```That came off the top of your head?!

Allow me to briefly say 'wow', if so. :)
```

1000 single bit operations per FLOP

Posted Jun 25, 2008 0:06 UTC (Wed) by mikov (subscriber, #33179) [Link]

```Wow, thanks! I have nothing to say except to retract my idiotic suggestion of implementing the
adder with a truth table :-)

I have to agree with "nix" that this is very impressive. Your post deserves an article of its
own.
```