LWN.net Logo

Transactional memory for GCC 4.7.0

Transactional memory for GCC 4.7.0

Posted Nov 10, 2011 15:48 UTC (Thu) by tvld (subscriber, #59052)
Parent article: Transactional memory for GCC 4.7.0

Some more information about the C/C++ constructs for TM can be found here (including links to lists where one can discuss the specification):
http://transform.t-labs.tu-berlin.de/tw11/torvalds.pdf

GCC implements two transaction keywords (__transaction_atomic and __transaction_relaxed instead of just __transaction), and this is also what the next version of the specification will require. Furthermore, there is only static checking of atomic transactions, exception specifications have been replaced with (optional) noexcept specifications as in C++11, and cancel-and-throw is not implemented by GCC currently.

Exception handling does not get more complicated because of the default commit-on-throw semantics, meaning that even if you add transactions, the behavior of code that throws exceptions will not change.

A certain variable can be accessed with and without transactions, as long as the program ensures using synchronization that there will be no concurrent transactional and nontransactional accesses of which at least one is a write (i.e., data-race-free code is required, and "concurrent" is as defined by the C++11 memory model extended by transactions). For example, you can use a transaction to remove an element from a shared data structure, and after the transaction has finished, access this element like any other private data (both privatization and publication are safe, provided that there are no data races). So, this is in line with the rest of the C++11 memory model.

We have been planning to write a TM tutorial that explains the specification in a more easily accessible way, but this just hasn't happened yet.

It is true that the current implementation uses TM algorithms that synchronize using a single global lock, but this is due to just not having enough time yet to include other algorithms. However, GCC's TM runtime library (libitm) has all the code in place to easily add further algorithms and switch between them at runtime. As a side note, using a single lock is not necessarily bad but depends on the workload at runtime; if transactions are executed rather infrequently, are often read-only, or have frequent data conflicts anway, this performs better than other approaches.


(Log in to post comments)

Intel TM ABI requires a lot of function calls

Posted Nov 10, 2011 17:21 UTC (Thu) by scottt (subscriber, #5028) [Link]

I was really surprised that the Intel TM ABI requires all load and store operations in transactions be translated into function calls that can't be inlined. Since the STM library, libitm, would like be built as a shared object on Linux this introduces extra jumps and register moves for each load/store operation.

tvld, do you think most pthread applications could be converted to this TM ABI without performance lost?

Intel TM ABI requires a lot of function calls

Posted Nov 11, 2011 11:57 UTC (Fri) by tvld (subscriber, #59052) [Link]

Transactions have to work across all code in a process, so we have to make sure that a mix of code compiled by different compilers works. Otherwise, you would have to have a single TM runtime library implementation. However, even though this is the default, it should still be possible to statically link an application and use link-time optimizations to inline everything. Alternatively, we have been discussing generating additional code paths for each transaction that run a particular TM algorithm/implementation but this hasn't been implemented yet.

Regarding load/store runtime overheads, the function calls are part of it but synchronization overheads can be still higher, depending on TM algorithm and workload (e.g., if you get a cache miss when reading data modified previously on another core).

You can use TM with applications that handle threading via pthreads, so I guess your question was about performance of TM vs. locking via pthread mutexes. There is no general answer to that because it really depends on what kind of locking scheme you have, and what kind of synchronization workload your application encounters at runtime. Example 1:, well-tuned fine-granular locking can often perform better than generic SW-only implementations of TM (STM) -- unless you have a lot of read-sharing and pay for contention or cache ping-pong on your readlocks (in this case, the "invisible reads" used in many STMs are usually better). Example 2: Course-granular locking probably often has worse performance than a typical STM -- unless the thing you're protecting with the lock really is a sequential bottleneck, and there is no parallelism in there (in this case, better execute it as fast as possible). Overall, this might sound tricky, but you're facing the same issues as well when just using locks (should you use fine-granular or coarse-granular locking? should you use RW locks?).

What TM tries to do is to give the programmer a simple way to synchronize between nontrivial operations, with best-effort performance. Given that optimizing synchronization is tricky and time-consuming, TM might even give you better performance than implementing something custom when having a normal (aka small) development-effort budget for synchronization. With TM, you declare what synchronizes, but not how this is implemented. In turn, you can benefit from improvements in the generic TM implementations without changing anything in your program. So, TM is supposed to give you better options in the trade-off between development effort and performance. And if you would get HW support for TM (e.g., like on IBM BlueGene), then performance differences between locking and TM will become even smaller and development effort might be the significant factor.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds