Re: [RFC v4 PATCH 12/13] x86: implementation for HARDENED_ATOMIC

From:		Peter Zijlstra <peterz-AT-infradead.org>
To:		"Reshetova, Elena" <elena.reshetova-AT-intel.com>
Subject:		Re: [RFC v4 PATCH 12/13] x86: implementation for HARDENED_ATOMIC
Date:		Fri, 11 Nov 2016 11:29:13 +0100
Message-ID:		<20161111102913.GE3117@twins.programming.kicks-ass.net>
Cc:		Kees Cook <keescook-AT-chromium.org>, "kernel-hardening-AT-lists.openwall.com" <kernel-hardening-AT-lists.openwall.com>, Arnd Bergmann <arnd-AT-arndb.de>, Thomas Gleixner <tglx-AT-linutronix.de>, Ingo Molnar <mingo-AT-redhat.com>, "Anvin, H Peter" <h.peter.anvin-AT-intel.com>, Will Deacon <will.deacon-AT-arm.com>, Hans Liljestrand <ishkamiel-AT-gmail.com>, David Windsor <dwindsor-AT-gmail.com>

On Fri, Nov 11, 2016 at 09:32:45AM +0000, Reshetova, Elena wrote:
> It is covered in the documentation file (Documentation/security/
> hardened-atomic.txt) included in the first patch.
> There are just so many details that including them all in commit
> messages will produce pages long commit messages, so some
> things are moved to the documentation only. 

No, never skimp on Changelogs. Nobody reads documentation.

Also, this really should have a very explicit code comment;
non-atomic constructs in atomic.h are 'surprising' at the very least.

> >Now is this harmless? If you have two increments racing like:
> >
> >	  inc
> >	  jno 1 // overflow
> >
> >			  inc
> >			  jno 1 // !overflow
> >
> >	  dec
> >	1:		1:
> >
> >The second thread will still affect your wrap and not BUG.

This still wants an answer, because attackers never exploit races?
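
To make the interleaving above concrete, here is a minimal sketch that
replays it in program order on a single thread. It assumes C11
<stdatomic.h> instead of the kernel's atomic_t, and every name in it
(counter, locked_inc_overflowed(), locked_dec(), replay_race()) is
hypothetical; it models the inc/jno/dec sequence, it is not the patch's
code.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model only, not the patch's code: the counter is kept as an
 * unsigned bit pattern so that wrapping past INT_MAX is well defined in C. */
static atomic_uint counter;

/* "LOCK inc; jno 1": returns true when this increment set the overflow
 * flag, i.e. when it moved the counter from INT_MAX to the wrapped value. */
static bool locked_inc_overflowed(void)
{
    unsigned int old = atomic_fetch_add(&counter, 1u);
    return old == (unsigned int)INT_MAX;
}

/* "LOCK dec": the undo on the overflow path. */
static void locked_dec(void)
{
    atomic_fetch_sub(&counter, 1u);
}

/* True while the bit pattern corresponds to a negative (wrapped) value. */
static bool counter_is_wrapped(void)
{
    return atomic_load(&counter) > (unsigned int)INT_MAX;
}

/* Replays the quoted interleaving in program order on one thread.
 * Returns true if thread B's increment went unnoticed and the counter is
 * still wrapped after thread A's undo. */
static bool replay_race(void)
{
    atomic_store(&counter, (unsigned int)INT_MAX);

    bool a_overflowed = locked_inc_overflowed(); /* A: inc, OF set    */
    bool b_overflowed = locked_inc_overflowed(); /* B: inc, OF clear  */
    if (a_overflowed)
        locked_dec();                            /* A: dec (the undo) */

    return a_overflowed && !b_overflowed && counter_is_wrapped();
}
```

Each individual step is atomic, but the check-and-undo sequence as a
whole is not: after A's dec the counter holds INT_MAX + 1 as a bit
pattern, and B never saw an overflow.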

> >I have a benchmark somewhere, I can run numbers tomorrow, but it
> >really shows once you get a bit of contention going. Once you hit 4
> >nodes contending on a variable it's completely out there IIRC.

> This would help to get more numbers on this, thank you. 

                LOCK addl               LOCK cmpxchg-addl

1-node           1:   22.038250          1:   41.572270
                 2:  174.019700          2:  198.965635
                 3:  185.852060          3:  274.293927
                 4:  389.169783          4:  266.738485
                 6:  347.827897          6:  454.785715
                 8:  369.649510          8:  463.125426

2-nodes          2:  428.448130          2: 1422.221850
                 4:  616.203497          4: 1166.427205
                 6:  855.639025          6: 1424.131080
                 8: 1083.613291          8: 1402.484560

4-nodes          4: 1180.591315          4: 1830.301125
                 8: 1480.023056          8: 2043.418720
                16: 2602.128429         16: 2611.188079

Results are in cycles:u, average of 100000 loops.

As measured on a 4 socket IVB-EX (E7-4890 v2).

> >The unconditional atomic ops really are loads faster than cmpxchg
> >loops.
> 
> Yes, and this is what we saw when doing performance measurements. So,
> as a result we went with a faster method, which we believed still has
> a low risk of getting into a race.  If your numbers prove otherwise, then
> we have to reconsider.

It's fundamentally a question of semantics though. These are _atomic_
ops, they really should be, well, atomic. No exceptions.
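
For contrast, a minimal sketch of an overflow-checked increment that
stays atomic, with the check and the update done in one compare-exchange
loop. It assumes C11 <stdatomic.h> rather than the kernel's atomic_t
API, and checked_inc() is a hypothetical name, not something from the
patch series.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: increment *v unless it already holds INT_MAX.
 * Returns false, leaving *v untouched, when the increment would overflow. */
static bool checked_inc(atomic_int *v)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    do {
        if (old == INT_MAX)
            return false; /* would overflow: refuse, nothing to undo */
        /* On failure the compare-exchange reloads 'old' with the current
         * value, so the overflow check is redone against fresh data. */
    } while (!atomic_compare_exchange_weak(v, &old, old + 1));

    return true;
}
```

No other thread can ever observe a wrapped value here, because the
wrapped value is never stored. This is the cmpxchg-loop shape, so it
pays the contention cost shown in the right-hand column of the numbers
above.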

If you want to play funny games, don't call them atomic.