
Rethinking optimization for size

Posted Feb 11, 2013 19:54 UTC (Mon) by khim (subscriber, #9252)
In reply to: Rethinking optimization for size by daglwn
Parent article: Rethinking optimization for size

> Anything that involves analyzing loop inductions and stride patterns.
> - Vectorization
> - Loop interchange
> - Cache blocking
> - Strip mining
> - Loop collapse
> and about a dozen other transformations.

Which are not applicable at all if you have arrays of various complex objects. Sure, I can understand that in some tight loops where "vectorization", "loop interchange" and other cool things can be applied, indexes may be beneficial. But for typical high-level code (that is: for 90% of code, if not 99%, in a given project) changes from indexes to pointers and back make absolutely no difference: any function call tends to break all these nice techniques - and there are a lot of them.
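
To make the point concrete, here is a rough sketch (types and names are made up) of the kind of loop I mean - the opaque call dominates whether you iterate by index or by pointer:

    #include <stddef.h>

    struct widget {
        int    kind;
        double value;
        void  *payload;
    };

    /* Defined in another translation unit, so the compiler cannot
       see through it, let alone vectorize across it. */
    void widget_update(struct widget *w);

    /* Index-based and pointer-based forms compile to essentially
       the same code here: the opaque call is what matters. */
    void update_all(struct widget *arr, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            widget_update(&arr[i]);
    }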

Sure, for inner loops it may be interesting, but those usually end up faster when implemented with the appropriate CPU intrinsics anyway (the changes in data structures needed to use them efficiently are impossible for the compiler to make), so in practice I rarely see any observable difference. Of course nowadays it's often better to use CUDA or OpenCL to push all that to the GPU - where different rules apply altogether and where traditional pointers make no sense at all - but those are specialized applications.
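
For the inner loops that do matter, the intrinsics version ends up looking roughly like this (a minimal SSE sketch; it assumes the data has already been rearranged into a flat float array, which is exactly the kind of restructuring a compiler cannot do for you):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum a flat array of floats four at a time with SSE.
       Assumes n is a multiple of four; unaligned loads are used,
       so no particular alignment is required. */
    float sum4(const float *data, size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(&data[i]));

        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }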

> This sounds like an ABI issue. Most sane ABIs pass small structs via registers.

Yup. Up to six integer registers in the x86-64 case. And if you have a couple of arrays plus some kind of "options" argument plus some callback_data... you've already used all of them. Add one more argument - and a spill is inevitable.
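
Roughly what I have in mind (the signature is invented, but it shows the System V x86-64 register budget running out):

    #include <stddef.h>

    struct options;                        /* the "options" argument */
    typedef void (*callback_t)(void *data, size_t i);

    /* Under the System V x86-64 ABI the first six integer/pointer
       arguments travel in rdi, rsi, rdx, rcx, r8 and r9.  The
       seventh one below no longer fits and goes to the stack. */
    void process(const double *in, double *out, size_t n,
                 const struct options *opts,
                 callback_t cb, void *cb_data,
                 int extra_flag);          /* the inevitable spill */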

You may say that code which calls a callback in a tight loop is hopeless in the first place, but that's the problem: quite often I cannot afford to do anything else. It's just too expensive to have one function for buttons, another for lists and so on: the code gets pushed out of L1 (or sometimes even the L2 cache) and all these benchmark-friendly optimizations actually slow the code down.
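
The shape of the code I mean, in sketch form (names invented): one generic walker shared by buttons, lists and everything else, so the shared code stays resident in L1/L2 at the price of an indirect call per element:

    #include <stddef.h>

    typedef void (*visit_fn)(void *item, void *cb_data);

    /* One generic walker reused for every widget type instead of a
       specialised loop per type.  The indirect call costs something,
       but there is only one copy of this code competing for cache. */
    void for_each(void **items, size_t n, visit_fn visit, void *cb_data)
    {
        for (size_t i = 0; i < n; i++)
            visit(items[i], cb_data);
    }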



Rethinking optimization for size

Posted Feb 11, 2013 21:41 UTC (Mon) by daglwn (guest, #65432)

> Which are not applicable at all if you have arrays of various complex
> objects.

How did you come to that conclusion? Sometimes it is not worth it, but for smallish structs it can be a win. Complex numbers are a good example.
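
For instance, a loop like the following sketch (my own made-up example) is exactly the arrays-of-smallish-structs case where the vectorizer can win:

    #include <stddef.h>

    struct cplx { double re, im; };

    /* No calls, a regular 16-byte element, restrict-qualified
       pointers: the sort of loop a vectorizer can handle even
       though the elements are structs rather than plain scalars. */
    void cmul(struct cplx *restrict out, const struct cplx *restrict a,
              const struct cplx *restrict b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            out[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
            out[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
        }
    }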

> any function call tend to break all these nice techniques

Only if the function call is still there - and even then the compiler can sometimes accomplish the task, depending on how aggressive the compiler developers were.
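
For example (a hypothetical sketch): if the callee is visible in the same translation unit, the compiler may inline it and then optimize the loop as if the call had never been there:

    #include <stddef.h>

    static double square(double x)       /* visible to the compiler */
    {
        return x * x;
    }

    /* With the callee defined locally, an optimizing compiler can
       inline square() and then vectorize the loop; an opaque
       external call would have blocked that. */
    void square_all(double *a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = square(a[i]);
    }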

> Of course nowadays it's often better to use CUDA or OpenCL to push all
> that to the GPU

No. There is never a case where CUDA or OpenCL is a good idea. Only poor compilers have led us down that path. Look at what's being done with OpenACC. A good compiler can match CUDA performance pretty easily and can often dramatically outperform it.
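
A minimal sketch of the directive-based style I mean (assuming an OpenACC-capable compiler; the kernel itself is made up):

    #include <stddef.h>

    /* The pragma asks the compiler to offload the loop; the data
       clauses describe the host/device transfers.  No CUDA or
       OpenCL boilerplate is needed. */
    void saxpy(float *restrict y, const float *restrict x,
               float a, size_t n)
    {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }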

> Add one additional argument - and spill is inevitable.

In the grand scheme of things, one write to cache/read from cache isn't usually critical. Sure, there are cases where this kind of optimization is very important -- if the callback is called in a tight loop, for example. But in that case a better solution is often to refactor the code and/or rework the data structures. It should not be necessary to hand-linearize array accesses to get performance.
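
One common reworking of the data structures (my example, not necessarily what you have in mind) is switching from an array of structs to a struct of arrays, so a loop that touches only one field streams through exactly the data it needs:

    #include <stddef.h>

    /* Array-of-structs layout: every element drags its unrelated
       fields into the cache along with the one you want. */
    struct particle_aos { double x, y, z, mass; };

    /* Struct-of-arrays reworking: each field is contiguous, which
       is both cache-friendly and trivial for the vectorizer. */
    struct particles_soa {
        double *x, *y, *z, *mass;
        size_t  n;
    };

    void advance_x(struct particles_soa *p, const double *vx, double dt)
    {
        for (size_t i = 0; i < p->n; i++)
            p->x[i] += vx[i] * dt;
    }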

> You may say that code which calls a callback in a tight loop is hopeless
> in a first place

Actually, I would say that's often good design for the reasons you give. But then this kind of code usually isn't dealing with numerical arrays or arrays of very small structs which might be well vectorized. I suppose that is what you were thinking of with your first statement.

If that's the kind of code you're concerned about, then yes, hand-linearizing array accesses is probably not going to hurt much. But it won't really help either. I've no problem with a loop that iterates over such arrays using a single pointer. I am more concerned about people who take multidimensional arrays and translate accesses into complex pointer arithmetic.
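
To illustrate the distinction (a toy contrast, not code from anywhere in particular):

    #include <stddef.h>

    enum { N = 1024 };

    /* The style I have no problem with: the induction variables and
       strides are plain for the compiler to see. */
    void scale2d(double a[N][N], double s)
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                a[i][j] *= s;
    }

    /* The hand-linearized version: the same memory is touched, but
       the strides are hidden behind manual pointer bumping, which is
       what obscures the access pattern from the loop analyses. */
    void scale2d_by_pointer(double *a, double s)
    {
        double *p = a;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                *p++ *= s;
    }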

Still, if the objects are small enough, vectorization targeting efficient load/store of the data can often be a win, even if the actual data manipulation isn't vectorized.
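
Something along these lines (a contrived sketch): there is no arithmetic to vectorize at all, but because the structs are small and contiguous, the compiler can still move them with wide vector loads and stores:

    #include <stddef.h>
    #include <stdint.h>

    struct pixel { uint8_t r, g, b, a; };   /* four bytes, packed */

    /* Only data movement here, yet vectorized loads/stores let the
       compiler copy many pixels per instruction instead of going
       element by element. */
    void copy_row(struct pixel *dst, const struct pixel *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }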

