From: Linus Torvalds <torvalds-AT-linux-foundation.org>
To: "H. Peter Anvin" <hpa-AT-zytor.com>
Subject: Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig
Date: Sat, 26 Jan 2013 11:43:21 -0800
Cc: Borislav Petkov <bp-AT-alien8.de>, Ingo Molnar <mingo-AT-kernel.org>,
    Linux Kernel Mailing List <linux-kernel-AT-vger.kernel.org>,
    Arjan van de Ven <arjan-AT-linux.intel.com>,
    Jan Beulich <jbeulich-AT-suse.com>, ling.ml-AT-alipay.com,
    Steven Rostedt <rostedt-AT-goodmis.org>,
    Andrew Morton <akpm-AT-linux-foundation.org>,
    Thomas Gleixner <tglx-AT-linutronix.de>
On Sat, Jan 26, 2013 at 7:18 AM, H. Peter Anvin <email@example.com> wrote:
> On the CPUs Ling is testing on, the downsides of -Os probably matter
> less, in particular since rep movsb works well.
> It is questionable as a generic default, though.
So being the person who really pushed for -Os to begin with (I think
I$ footprint and instruction decode bandwidth are among the most
fundamental limits to CPU performance), I wouldn't mind it if we
reintroduced it.
It wasn't just "rep movs". The thing that killed -Os for me was that
it makes it impossible to try to optimize hot code, because -Os seems
to throw out branch prediction information. So when you use "likely()"
etc to try to teach the compiler to lay out code a certain way so that
code that never really gets executed isn't even brought into the I$,
-Os then screws it up completely.
Of course, maybe newer versions of gcc might not suck so horribly with
-Os, I haven't actually tried in a while.
[ Just tested. Still does it ]
Also, I doubt Ling was testing a SB CPU. Because "rep movsb" still
sucks pretty bad on SB. What core *is* Ling testing? Haswell?
Ugh. We could make it depend on the optimization target. I'd also wish
there was some way to just tune gcc -Os to be closer to reasonable. Or
make -O2 not do some of the excessive crap it does (it aligns code
*much* too much, for example - who cares if you can do it with a
single instruction, if that instruction is so long that it uses up
half your decode bandwidth?)
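One way to approximate the "tuned -O2" being wished for here: gcc's -falign-* flags are real knobs that cap exactly the alignment padding complained about above. A sketch, not a tested kernel configuration (hot_path.c and the specific values are illustrative):

```shell
# Keep -O2's optimizer but stop gcc from padding functions, loops,
# jump targets and labels out to large alignment boundaries, which
# costs size and decode bandwidth for long NOP/prefix sequences.
gcc -O2 \
    -falign-functions=1 \
    -falign-jumps=1 \
    -falign-loops=1 \
    -falign-labels=1 \
    -c hot_path.c -o hot_path.o
```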
The problem, of course, is that most -O2 code generation is done
assuming hot loops that don't show much if any I$ issues. And the -Os
thing is done *purely* for size, not taking any performance into
account at all. There's no balanced middle ground, which is what _we_
want.