
Rethinking optimization for size

Posted Feb 1, 2013 21:02 UTC (Fri) by khim (subscriber, #9252)
In reply to: Rethinking optimization for size by epa
Parent article: Rethinking optimization for size

It won't, for example, turn on or off loop unrolling depending on how memory-bound a particular CPU tends to be.

It's impossible to predict that unless you know what your program is doing - and the compiler deals with functions. Even puny Atoms have 32KiB of L1 cache, and functions (even with their loops unrolled) are usually smaller than that.

If the compiler could be just a bit smarter, that would be a big win.

This way lies madness. The difference between contemporary CPUs is much smaller than you think. The aforementioned L1 cache (which presumably should be handled differently in different cases) is between 32KiB (on most Intel CPUs, from Atoms to Xeons) and 64KiB (on AMD); L2 differs more substantially, but the difference is not large enough to matter at a small (function-sized) scale - and LTO is not yet in wide use and not all that mature besides.

But most programmers don't want to spend time testing each optimization separately; they tend to just pick -O2 and let the compiler decide.

What's even worse: they often abuse the "premature optimization is the root of all evil" mantra to introduce 3x-5x-10x slowdowns. What the compiler does after that is more-or-less irrelevant. If people care about efficiency they should start thinking about efficiency first and not hope that the compiler will magically make 10 levels of indirection disappear. No, they don't disappear, and the compiler can very rarely do anything about them - but it looks like developers (except for kernel developers, of course) don't understand that. Instead most books explain how you can use them to nicely "encapsulate" and "separate" stuff - usually without ever mentioning their price.



Rethinking optimization for size

Posted Feb 8, 2013 17:50 UTC (Fri) by daglwn (guest, #65432) [Link]

> The difference between contemporary CPUs is much smaller than you think.

But not as small as you think. I work on a production compiler and we are always tuning for new targets. Even something as simple as moving from Sandy Bridge to Ivy Bridge can result in a change in strategy.

> What's even worse: they often abuse "premature optimization is the root
> of all evil" mantra to introduce 3x-5x-10x slowdown.

But worse than that is programmers trying to out-guess the compiler and do hand loop unrolling, converting of array accesses to pointer arithmetic and the like. This *kills* the compiler's ability to analyze the program and thus make transformations to improve the code.

It is essential for the developer to take a high-level view of performance. Algorithm and data structure choice is the #1 performance decision to make. After that, the ROI decreases rapidly for the programmer. Yes, the programmer should be aware of the cost of abstractions when appropriate but we should not throw away those abstractions on a whim. They save expensive programmer time. Do hand performance tuning *only* after a proven need via profiling.

Rethinking optimization for size

Posted Feb 8, 2013 21:41 UTC (Fri) by khim (subscriber, #9252) [Link]

Do hand performance tuning *only* after a proven need via profiling.

By that time it's often much too late to do anything. The one thing a programmer should keep in mind is a few numbers. If you introduce a nice level of indirection (to facilitate future expansion or something like that) and this indirection triggers an access to main memory, then you are losing about 500-600 ticks right there. A contemporary CPU can move hundreds of sequential bytes and do thousands of operations in that time! If your program is built around a bazillion tiny objects, it's too late to do anything at this point: to remove these useless levels of indirection you need to basically rewrite the program from scratch.
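
A minimal sketch of the arithmetic khim is describing (the types and names here are made up for illustration): every extra level of indirection is another dependent load, and each one that misses cache costs hundreds of cycles, while the flat layout streams through memory the prefetcher handles well.

/* Hypothetical example: the same data reached through two levels of
 * indirection vs. stored contiguously. */
#include <stddef.h>

struct value   { double v; };
struct wrapper { struct value *value; };    /* one extra level of indirection */

double sum_indirect(struct wrapper **items, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += items[i]->value->v;            /* three dependent loads per element */
    return s;
}

double sum_flat(const double *values, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += values[i];                     /* one sequential load per element */
    return s;
}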

But not as small as you think. I work on a production compiler and we are always tuning for new targets. Even something as simple as moving from Sandy Bridge to Ivy Bridge can result in a change in strategy.

Sure, but how much can you hope to win? I speak from experience: just recently we rewrote a piece of code - it went from a nice "modern" structure with five or six independent layers and a couple of dozen structures to one (autogenerated) function with 20,000 lines of code and a dozen simple local variables. The speedup was about 10x (8x in one mode, 12x in another). Do you really believe you can achieve something like that with compiler options or small tweaks after a profiler run?

P.S. Actually we could squeeze out an additional ~30% with PGO and some other compiler tricks, but in the end we decided that it complicates our build system too much and accepted the "mere" 8x/12x speedup.

Rethinking optimization for size

Posted Feb 9, 2013 0:00 UTC (Sat) by dlang (subscriber, #313) [Link]

it depends on what you define as 'hand optimization'

unrolling loops is something that is almost always better left to the compiler.

But changing the algorithm from using a pointer-heavy set of linked lists to an implementation using small offsets in a buffer is not something the compiler will ever do, but it can result in huge speedups.
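
A minimal sketch of the kind of change dlang means, with made-up names: the list nodes live in one contiguous buffer and link to each other by small indices instead of heap pointers, so the structure stays compact and cache-friendly.

/* Hypothetical example: a singly linked list stored as offsets into one
 * buffer rather than as individually allocated, pointer-linked nodes. */
#include <stdint.h>

#define NIL UINT32_MAX                  /* end-of-list marker */

struct node {
    int32_t  value;
    uint32_t next;                      /* index of the next node, not a pointer */
};

struct list {
    struct node *nodes;                 /* one big allocation holding every node */
    uint32_t     head;
    uint32_t     used;
    uint32_t     capacity;
};

/* Walking the list never leaves the buffer. */
static int64_t list_sum(const struct list *l)
{
    int64_t s = 0;
    for (uint32_t i = l->head; i != NIL; i = l->nodes[i].next)
        s += l->nodes[i].value;
    return s;
}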

"trust the compiler, write whatever you want" doesn't work well in real life, and by the time you get things under a real load to find the bottlenecks, it's frequently too late to change them short of a re-write.

You need to keep end efficiency in mind as you go along. This requires that you keep up to date with what sorts of things are cheap to do and what are expensive to do. If you get it right, you are an unsung hero (you seldom get thanks for things that don't crumple under load), if you get it wrong you get ridiculed.

Rethinking optimization for size

Posted Feb 9, 2013 10:32 UTC (Sat) by khim (subscriber, #9252) [Link]

Are you sure you meant to reply to my post? Because it looks like you are saying almost exactly the same thing I did: the compiler can usually optimize your code well enough, but it can't do anything about your data structures - and that is where the major inefficiencies lurk. And by major I mean major: 10x slowdowns, 100x slowdowns, sometimes even 1000x slowdowns!

Rethinking optimization for size

Posted Feb 9, 2013 23:37 UTC (Sat) by daglwn (guest, #65432) [Link]

I believe I said exactly the same.

Rethinking optimization for size

Posted Feb 10, 2013 0:16 UTC (Sun) by dlang (subscriber, #313) [Link]

> converting of array accesses to pointer arithmetic and the like.

You said this is bad behavior on the programmer's side

however, this may be exactly the algorithm/data structure change that produces the multiple orders of magnitude performance improvements, and the type of thing the developer should keep in mind

Rethinking optimization for size

Posted Feb 10, 2013 3:59 UTC (Sun) by daglwn (guest, #65432) [Link]

> > converting of array accesses to pointer arithmetic and the like.

> You said this is bad behavior on the programmer's side

> however, this may be exactly the algorithm/data structure change that
> produces the multiple orders of magnitude performance improvements, and
> the type of thing the developer should keep in mind

I find that extremely hard to believe. Do you have an example? I can't think of any possible reason that converting array index operations into pointer arithmetic is ever a win.

Now if you're talking about changing out an array data structure for some other kind of data organization, that's something else entirely.

Rethinking optimization for size

Posted Feb 10, 2013 11:34 UTC (Sun) by khim (subscriber, #9252) [Link]

I can't think of any possible reason that converting array index operations into pointer arithmetic is ever a win.

It may be a win because to access data using an index you need two variables (the array address and the index), but to access data using a pointer you need just one (the pointer is enough). When callbacks are involved, the extra variable can be the difference between code which passes everything around in registers and code which has to use a structure in memory - in that case the win can be substantial (speaking from experience). But of course such cases are rare.

But you are saying that conversion to pointer arithmetic is somehow bad? In my experience it's usually net neutral. What kind of optimizations can the compiler apply if I'm using indexes? How often can it apply them? I'm not saying it never happens, I just don't see it as a common occurrence.
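
For concreteness, the two styles being compared, in toy form (this example is mine, not from the posts): with a current GCC or Clang at -O2 both loops typically compile to the same machine code, which is why the source-level choice is usually a wash.

/* Index-based vs. pointer-based traversal of the same array. */
#include <stddef.h>

long sum_indexed(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)      /* "two variables": base address and index */
        s += a[i];
    return s;
}

long sum_pointer(const long *a, size_t n)
{
    long s = 0;
    for (const long *p = a, *end = a + n; p != end; p++)   /* "one variable": the pointer */
        s += *p;
    return s;
}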

Rethinking optimization for size

Posted Feb 11, 2013 17:21 UTC (Mon) by daglwn (guest, #65432) [Link]

> It may be a win because to access data using an index you need two
> variables (the array address and the index), but to access data using a
> pointer you need just one (the pointer is enough).

At code-generation time the compiler knows well enough how to convert from array notation to pointer arithmetic so no extra registers are used. Indeed it must do so for most architectures. By doing that too early, the compiler loses valuable information about array stride accesses and the like that are critical for transformations like vectorization.

In addition, pointer arithmetic has all sorts of nasty aliasing properties that hamper the compiler's ability to analyze the code.

"But arrays and pointers are the same in C!" some may cry. Well, that's not entirely true and even if it were there's nothing that prohibits the compiler from representing the two very differently in its internal data structures.

> When callbacks are involved, the extra variable can be the difference
> between code which passes everything around in registers and code which
> has to use a structure in memory.

This sounds like an ABI issue. Most sane ABIs pass small structs via registers.

> What kind of optimizations can the compiler apply if I'm using indexes?

Anything that involves analyzing loop inductions and stride patterns.

- Vectorization
- Loop interchange
- Cache blocking
- Strip mining
- Loop collapse

and about a dozen other transformations. These are the "big improvement" optimizations. Petty things like CSE and copy propagation are important but are done more to enable these big-gain transformations than for their code improvement in and of themselves.
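
As a concrete (and hedged) illustration of one item from that list, here is what loop interchange looks like when done by hand; a compiler that can see the index expressions, and can prove the arrays do not overlap, may perform the same interchange itself.

/* Column sums over a matrix.  The first version walks memory with stride N
 * (cache hostile); the interchanged version walks it with stride 1.  Both
 * compute the same result; 'out' must be zeroed by the caller. */
#define N 1024

void col_sums_naive(double a[N][N], double *out)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j] += a[i][j];          /* stride-N accesses */
}

void col_sums_interchanged(double a[N][N], double *out)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j] += a[i][j];          /* stride-1 accesses */
}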

In rare cases the compiler can convert pointer arithmetic back to array indices but this often involves extra bit-twiddling or other arithmetic to recover the original indices and more likely than not aliasing rules get in the way and make it impossible to recover the critical information.

Pointer arithmetic is just generally bad for the compiler. Avoid it if possible.

Incidentally, this is also why it's a very *bad* idea to use unsigned as a loop counter. One is not "giving the compiler more information." On the contrary, by using unsigned one is moving the arithmetic away from normal algebraic rules to the realm of modulo arithmetic. The compiler must then make all kinds of pessimistic assumptions about loop termination and access patterns. This also kills many of the loop transformations listed above.
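
A short illustration of the unsigned-counter point (my example, not daglwn's): in the first loop the trip count is n + 1 computed in modulo arithmetic, so if n happens to be UINT_MAX the counter wraps and the loop never terminates, and the compiler must allow for that; in the second, signed overflow is undefined behaviour, so the compiler may assume the counter simply counts up from 0 to n.

/* 'a' is assumed to hold at least n + 1 elements in both cases. */
void fill_unsigned(float *a, float v, unsigned n)
{
    for (unsigned i = 0; i <= n; i++)   /* i may legally wrap back to 0 */
        a[i] = v;
}

void fill_signed(float *a, float v, int n)
{
    for (int i = 0; i <= n; i++)        /* overflow would be UB, so no wrap to consider */
        a[i] = v;
}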

Of course I am speaking in generalities. One can always find a counter-example. However, the vast majority of codes behave as I describe.

Rethinking optimization for size

Posted Feb 11, 2013 18:45 UTC (Mon) by Aliasundercover (subscriber, #69009) [Link]

In times past I often reviewed the code output from compilers I was working with. It was helpful to calibrate what idioms carried what costs and track how they changed with compilers.

I still get the itch to look now and then but mostly recoil in horror at how hard it is to relate the code GCC gives me to what I wrote. Typically I have some function I want to review but give up after 10 minutes of searching the listing without finding it. The listing is of course fabulously ugly but that is no change from any other compiler I ever used. I just don't find my code. Well, I do, but it is all in a huge block of raw C code without the corresponding output.

No doubt some cool optimization hoisted my code out of its place in the C code I wrote and dropped it some place which makes sense to the compiler. No doubt all this transformation is good for run time performance or size or something. Perhaps listing quality just isn't an interesting bit of work for compiler writers.

I used to know what those old Windows and SunOS compilers would do with my code but now I have no similar feeling for GCC. I can read what people write about indexing vs. pointers and signed vs. unsigned but I used to really know from first hand observation.

I wish the listing got more respect. It can be a powerful optimization tool.

Maybe I just do it wrong. My makefile has the following incantation for generating listings.

%.lst: %.c
	$(CC) $(CFLAGS) $(CPPFLAGS) -g -Wa,-a,-ad -c $< > $@

Rethinking optimization for size

Posted Feb 11, 2013 19:25 UTC (Mon) by daglwn (guest, #65432) [Link]

> The listing is of course fabulously ugly but that is no change from any
> other compiler I ever used.

This is actually a quality-of-implementation issue. There are compilers out there that give fantastic listings, even down to a mostly-readable highish-level decompilation even after very aggressive code transformations. This is not very easy to obtain and really has to be baked in at the beginning of the compiler design.

But you're right in that a good compiler will make your code unrecognizable. :)

> I wish the listing got more respect. It can be a powerful optimization
> tool.

Indeed. I don't think there's much one can do with gcc to get a decent listing. It just doesn't carry enough information. Neither does LLVM, unfortunately, at least AFAIK.

Rethinking optimization for size

Posted Feb 11, 2013 21:09 UTC (Mon) by PaXTeam (guest, #24616) [Link]

maybe you want gcc -S -fverbose-asm for commented assembly output? also -fdump-tree-all and -fdump-rtl-all if you want to see what happens to your C and also relate the internal GIMPLE variable names to the comments in the verbose asm output.

Rethinking optimization for size

Posted Feb 12, 2013 16:28 UTC (Tue) by Aliasundercover (subscriber, #69009) [Link]

Well, you got me there. I tried those dump options and got 151 spam files in my source directory.

-fverbose-asm seems useful. It doesn't help with finding where my code went but the comments are nice.

I wish compiler listings got more respect.

Rethinking optimization for size

Posted Feb 11, 2013 19:54 UTC (Mon) by khim (subscriber, #9252) [Link]

Anything that involves analyzing loop inductions and stride patterns.
- Vectorization
- Loop interchange
- Cache blocking
- Strip mining
- Loop collapse
and about a dozen other transformations.

Which are not applicable at all if you have arrays of various complex objects. Sure, I can understand that in some tight loops where "vectorization", "loop interchange" and other cool things can be applied, indexes may be beneficial. But for typical high-level code (that is: for 90% of the code in a given project, if not 99%) changing from indexes to pointers and back makes absolutely no difference: any function call tends to break all these nice techniques - and there are a lot of them.

Sure, for inner loops it may be interesting, but those are usually faster when implemented with the appropriate CPU intrinsics anyway (the changes in data structures needed to use them efficiently are impossible for the compiler), so in practice I rarely see any observable difference. Of course nowadays it's often better to use CUDA or OpenCL to push all that to the GPU - where different rules apply altogether and where traditional pointers make no sense at all - but these are specialized applications.

This sounds like an ABI issue. Most sane ABIs pass small structs via registers.

Yup. Up to six registers in the x86-64 case. And if you have a couple of arrays plus some kind of "options" argument plus some callback_data... you've already used all of them. Add one additional argument - and a spill is inevitable.
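
To make the register arithmetic concrete (every name below is invented; the register count is from the x86-64 System V ABI, which passes the first six integer or pointer arguments in rdi, rsi, rdx, rcx, r8 and r9): the callback below already uses all six, so one more argument, or any extra per-call state, has to go through the stack or a structure in memory.

/* A callback invoked in a tight loop with two arrays, a length, an options
 * pointer, opaque callback data and a flags word: six arguments, i.e. every
 * integer argument register on x86-64 System V. */
#include <stddef.h>

struct options;                         /* details irrelevant here */

typedef void (*row_cb)(const int *in, int *out, size_t n,
                       const struct options *opts, void *cb_data, int flags);

void for_each_row(const int *in, int *out, size_t rows, size_t cols,
                  const struct options *opts, void *cb_data,
                  int flags, row_cb cb)
{
    for (size_t r = 0; r < rows; r++)
        cb(in + r * cols, out + r * cols, cols, opts, cb_data, flags);
}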

You may say that code which calls a callback in a tight loop is hopeless in the first place, but that's the problem: quite often I cannot afford to do anything else. It's just too expensive to have one function for buttons, another for lists and so on: code gets pushed out of the L1 (or sometimes even L2) cache and all these benchmark-friendly optimizations actually slow the code down.

Rethinking optimization for size

Posted Feb 11, 2013 21:41 UTC (Mon) by daglwn (guest, #65432) [Link]

> Which are not applicable at all if you have arrays of various complex
> objects.

How did you come to that conclusion? Sometimes it is not worth it but for smallish structs it can be a win. Complex numbers are a good example.
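
A hedged sketch of the complex-number case: an array of small two-field structs accessed with plain indices in a simple loop, which a vectorizing compiler can often handle even though the element type is not a scalar.

/* Element-wise complex multiplication over arrays of a small struct. */
#include <stddef.h>

struct cplx { double re, im; };

void cplx_mul(struct cplx *restrict out, const struct cplx *restrict a,
              const struct cplx *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        out[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}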

> any function call tends to break all these nice techniques

Only if the function call is still there. And even then the compiler can sometimes accomplish the task, depending on how aggressive the developers were.

> Of course nowadays it's often better to use CUDA or OpenCL to push all
> that to the GPU

No. There is never a case where CUDA or OpenCL are a good idea. Only poor compilers have led us down that path. Look at what's being done with OpenACC. A good compiler can match CUDA performance pretty easily and can often dramatically outperform it.

> Add one additional argument - and spill is inevitable.

In the grand scheme of things, one write to cache/read from cache isn't usually critical. Sure, there are cases where this kind of optimization is very important -- if the callback is called in a tight loop for example. But in that case a better solution is often to refactor the code and/or rework the data structures. It should not be necessary to hand-linearize array accesses to get performance.

> You may say that code which calls a callback in a tight loop is hopeless
> in a first place

Actually, I would say that's often good design for the reasons you give. But then this kind of code usually isn't dealing with numerical arrays or arrays of very small structs which might be well vectorized. I suppose that is what you were thinking of with your first statement.

If that's the kind of code you're concerned about then yes, hand-linearizing array accesses is probably not going to hurt much. But it won't really help either. I've no problem with a loop that iterates over such arrays using a single pointer. I am more concerned about people who take multidimensional arrays and translate accesses to complex pointer arithmetic.

Still, if the objects are small enough, vectorization targeting efficient load/store of the data can often be a win, even if the actual data manipulation isn't vectorized.

