Transactions have to work across all code in a process, so we have to make sure that a mix of code compiled by different compilers works together. Otherwise, everyone would be forced to use a single TM runtime library implementation. Even though calling into a shared runtime library is the default, it is still possible to statically link an application and use link-time optimization to inline everything. Alternatively, we have been discussing generating additional code paths for each transaction, each specialized for a particular TM algorithm/implementation, but this hasn't been implemented yet.
Regarding load/store runtime overheads, the function calls are part of it, but synchronization overheads can still be higher, depending on the TM algorithm and the workload (e.g., if you get a cache miss when reading data that was previously modified on another core).
You can use TM with applications that handle threading via pthreads, so I guess your question was about the performance of TM vs. locking via pthread mutexes. There is no general answer to that because it really depends on what kind of locking scheme you have, and what kind of synchronization workload your application encounters at runtime. Example 1: well-tuned fine-grained locking can often perform better than generic software-only implementations of TM (STM) -- unless you have a lot of read-sharing and pay for contention or cache ping-pong on your read locks (in that case, the "invisible reads" used in many STMs are usually better). Example 2: coarse-grained locking probably often performs worse than a typical STM -- unless the code you're protecting with the lock really is a sequential bottleneck with no parallelism in it (in that case, it is better to just execute it as fast as possible). Overall, this might sound tricky, but you face the same issues when just using locks (should you use fine-grained or coarse-grained locking? should you use RW locks?).
What TM tries to do is give the programmer a simple way to synchronize between nontrivial operations, with best-effort performance. Given that optimizing synchronization is tricky and time-consuming, TM might even give you better performance than implementing something custom when you have a normal (i.e., small) development-effort budget for synchronization. With TM, you declare what synchronizes, but not how this is implemented. In turn, you can benefit from improvements in the generic TM implementations without changing anything in your program. So, TM is supposed to give you better options in the trade-off between development effort and performance. And if you get HW support for TM (e.g., as on IBM BlueGene), the performance differences between locking and TM become even smaller, and development effort might be the more significant factor.