Intel's upcoming transactional memory feature

Here is a posting on the Intel software network describing the "transactional synchronization extensions" feature to be found in the future "Haswell" processor.

With transactional synchronization, the hardware can determine dynamically whether threads need to serialize through lock-protected critical sections, and perform serialization only when required. This lets the processor expose and exploit concurrency that would otherwise be hidden due to dynamically unnecessary synchronization. At the lowest level with Intel TSX, programmer-specified code regions (also referred to as transactional regions) are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other logical processors. A processor makes architectural updates performed within the region visible to other logical processors only on a successful commit, a process referred to as an atomic commit.

Needless to say, there should be interesting ways to use such a feature in the kernel if it works well, but other projects (PyPy, for example) have also expressed interest in transactional memory.



Intel's upcoming transactional memory feature

Posted Feb 8, 2012 21:13 UTC (Wed) by landley (subscriber, #6789) [Link]

Huh, at a wild guess you batch all your cache line read/writes and then only post them if none of those cachelines has changed since you read them.

Clever. I wonder when the patent expires?

Intel's upcoming transactional memory feature

Posted Feb 8, 2012 21:50 UTC (Wed) by josh (subscriber, #17465) [Link]

The explicit transaction mechanism (RTM) looks interesting, but Hardware Lock Elision (HLE) looks much more immediately useful. It provides a pair of instruction prefixes to apply to the instructions which acquire and release a lock variable. The processor can then try to perform all the memory operations within that lock as a single atomic operation, and if so it will not bother touching the lock variable at all, removing contention. Best of all, those instruction prefixes don't require the HLE feature; on older CPUs they'll just make the instruction encoding slightly longer with no other effect. (The HLE prefixes correspond to the repe/repne prefixes used with string instructions, which have no effect on the non-string instructions commonly used for lock operations.)
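
A minimal sketch of what "adding the prefixes" could look like from C, assuming GCC's `__atomic` builtins and its x86-specific `__ATOMIC_HLE_ACQUIRE` / `__ATOMIC_HLE_RELEASE` memory-model modifiers. The `__HLE__` guard, the `elidable_*` names, and the `LOCK_*` macros are illustrative assumptions; without HLE support in the compiler the code falls back to an ordinary test-and-set lock.

```c
#include <stdbool.h>

/* Assumption: GCC targeting x86 with -mhle defines __HLE__ and accepts the
 * HLE modifiers. Anywhere else we use plain acquire/release ordering. */
#if defined(__HLE__) && defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#define LOCK_ACQUIRE (__ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)
#define LOCK_RELEASE (__ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE)
#else
#define LOCK_ACQUIRE __ATOMIC_ACQUIRE
#define LOCK_RELEASE __ATOMIC_RELEASE
#endif

/* Hypothetical lock type: word is 0 when free, 1 when held. */
typedef struct { int word; } elidable_lock;

/* One attempt to take the lock; true if the lock was free. */
static bool elidable_trylock(elidable_lock *l)
{
    return __atomic_exchange_n(&l->word, 1, LOCK_ACQUIRE) == 0;
}

/* Spin until the lock is taken, reading without writing while it is held. */
static void elidable_lock_acquire(elidable_lock *l)
{
    while (!elidable_trylock(l)) {
        while (__atomic_load_n(&l->word, __ATOMIC_RELAXED) != 0) {
            /* wait for the holder to release */
        }
    }
}

/* The release stores back the value the word had before acquisition (0). */
static void elidable_unlock(elidable_lock *l)
{
    __atomic_store_n(&l->word, 0, LOCK_RELEASE);
}
```

The critical section itself is unchanged; only the acquiring exchange and the releasing store carry the hint, which is what makes this approach attractive for existing lock implementations.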

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 17:53 UTC (Thu) by kjp (subscriber, #39639) [Link]

Yeah. This post explains the big picture a lot better.
http://software.intel.com/en-us/blogs/2012/02/07/coarse-g...

HLE looks pretty cool, just wondering how large a 'transaction' can be in practice. If it gets too big I guess the cpu can just abort and force a lock.

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 6:56 UTC (Fri) by Tuna-Fish (guest, #61751) [Link]

The practical limit is most likely that the transaction needs to fit in its entirety in the L1 cache. At first, that sounds like a lot (16kB-ish per thread), but remember that the L1 cache is typically only 8-way associative. So, as long as accesses don't align badly you can touch quite a bit of memory without issue, but when you do the 9th access at a 4k alignment, it faults. (Or likely *before* that -- the system will very likely want to leave some ways for the other thread.)
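
The arithmetic behind the "9th access at a 4k alignment" can be sketched as follows. The cache geometry here (32 KiB total, 8 ways, 64-byte lines) is an assumption chosen to match the figures in the comment, not a statement about what Haswell will ship, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache geometry. */
enum {
    LINE_BYTES  = 64,
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    SETS        = CACHE_BYTES / (LINE_BYTES * WAYS)   /* 64 sets */
};

/* Which set a byte address maps to: line number modulo the set count. */
static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % SETS);
}

/* Touch n addresses spaced `stride` bytes apart (stride >= LINE_BYTES, so
 * each one is a distinct line). Returns how many were placed before some
 * set needed more than WAYS lines, or n if everything fit. */
static size_t lines_before_overflow(uintptr_t base, size_t stride, size_t n)
{
    unsigned used[SETS] = { 0 };
    for (size_t i = 0; i < n; i++) {
        unsigned s = set_index(base + i * stride);
        if (used[s] == WAYS)
            return i;
        used[s]++;
    }
    return n;
}
```

With these numbers, addresses 4096 bytes apart all land in the same set, so `lines_before_overflow(0, 4096, 16)` stops at 8, while 512 consecutive lines (the full 32 KiB) fit exactly.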

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 19:03 UTC (Fri) by Ben_P (guest, #74247) [Link]

How small can transactional regions be? Hopefully smaller than a page?

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 19:12 UTC (Fri) by josh (subscriber, #17465) [Link]

As small as one cache line, as far as I can tell. The spec mentions that other operations within the same cache line may cause spurious transaction conflicts.
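
One common way to avoid that kind of spurious conflict is to keep independently updated data on separate cache lines. A small sketch, assuming a 64-byte line (the `CACHE_LINE` constant and both struct names are illustrative):

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Two counters that share a line: a transaction touching `a` can conflict
 * with an unrelated write to `b`. */
struct packed_counters {
    long a;
    long b;
};

/* The same counters, each forced onto its own line. */
struct padded_counters {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* True if two offsets within a line-aligned object fall on the same line. */
static int same_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE == off2 / CACHE_LINE;
}
```

The cost is memory: the padded struct occupies two full lines instead of a fraction of one.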

Intel's upcoming transactional memory feature

Posted Feb 8, 2012 21:59 UTC (Wed) by zomonto (guest, #82108) [Link]

Intel's upcoming transactional memory feature

Posted Feb 8, 2012 22:12 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

Should be some very interesting ways to use it in userspace, as well. Various forms of shared-memory IPC, for instance, or zero-copy message passing.

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 8:38 UTC (Thu) by colo (subscriber, #45564) [Link]

Sounds like I'll consider upgrading my desktop when "Haswell" has launched.

I just hope that Intel will eventually get its act together and (re-)enable ECC support in their mainstream desktop platforms, because I don't feel like paying extra for a Xeon when all I really need is what AMD has been providing even with their Sempron CPUs for the last couple of years.

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 8:54 UTC (Thu) by k8to (subscriber, #15413) [Link]

Does Sempron, etc., provide all that you would want?

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 10:26 UTC (Thu) by colo (subscriber, #45564) [Link]

Yeah, AMD CPUs tend to have all the features I want (you can build an AMD machine supporting an IOMMU, ECC memory and also recent SSE extensions really cheaply these days), but their CPUs' power consumption is quite high in comparison to, e.g., Sandy Bridge. I also like Intel's onboard IGPs and their driver support quite a lot, although AMD's FOSS drivers seem to be catching up more and more.

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 21:08 UTC (Thu) by Jonno (subscriber, #49613) [Link]

Regarding the integrated graphics, I'm using both a Sandy Bridge (Intel) CPU and a Stars (AMD) CPU on different computers, and I find that the GPU drivers are comparable in features and stability. The Intel driver is better optimized, but the AMD hardware is so much faster that you still get way better performance using the AMD system. Of course, the Stars CPU draws more than twice the power of Sandy Bridge, so on a laptop Intel has an advantage on battery life...

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 7:03 UTC (Fri) by Tuna-Fish (guest, #61751) [Link]

Intel has been making the all-around better processors for quite a few years now. However, they like to artificially segment the market by fusing off features that are typically used by companies and professionals who are willing and able to fork over more money for their rigs. (These features have little or no marginal cost to implement, and are in fact present in all Intel processors but disabled in the cheaper ones.)

Unfortunately, many of those features only available in +$500 cpus are the ones that a lot of people here would be interested in.

There really is nothing *wrong* in this, per se, it's just unfortunate for us. Right now, in the reasonable price class, the choice is between a fast and power-efficient cpu, and one that has all the interesting features.

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 9:38 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

So load-link/store-conditional-style instructions finally come to x86, and in a few years ABA bugs (http://en.wikipedia.org/wiki/ABA_problem), caused by improper use of compare-and-swap, will be a thing of the past.
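
For readers unfamiliar with the ABA problem, here is a minimal single-threaded sketch using C11 atomics. The interleaving is simulated sequentially, and the function names and the version-tag layout are illustrative assumptions. A plain compare-and-swap only checks the current value, so it cannot tell that the cell changed and changed back; pairing the value with a version counter is one common workaround.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a plain CAS succeeds even though the cell went 1 -> 2 -> 1
 * between the read and the CAS (the ABA case). */
static bool plain_cas_misses_aba(void)
{
    atomic_int cell;
    atomic_init(&cell, 1);

    int seen = atomic_load(&cell);   /* "thread 1" reads A */
    atomic_store(&cell, 2);          /* "thread 2" writes B ... */
    atomic_store(&cell, 1);          /* ... and then A again */

    return atomic_compare_exchange_strong(&cell, &seen, 3);
}

/* Pack a 32-bit version counter above a 32-bit value. */
static uint64_t pack(uint32_t version, uint32_t value)
{
    return ((uint64_t)version << 32) | value;
}

/* Every store bumps the version, so no two writes produce the same word. */
static void tagged_store(_Atomic uint64_t *cell, uint32_t value)
{
    uint64_t old = atomic_load(cell);
    while (!atomic_compare_exchange_weak(
               cell, &old, pack((uint32_t)(old >> 32) + 1, value))) {
        /* old was refreshed by the failed CAS; try again */
    }
}

/* Returns true if the stale CAS is rejected once versions are attached. */
static bool tagged_cas_detects_aba(void)
{
    _Atomic uint64_t cell;
    atomic_init(&cell, pack(0, 1));

    uint64_t seen = atomic_load(&cell);
    tagged_store(&cell, 2);
    tagged_store(&cell, 1);

    return !atomic_compare_exchange_strong(
        &cell, &seen, pack((uint32_t)(seen >> 32) + 1, 3));
}
```

A transactional region or an LL/SC pair sidesteps the issue differently: any intervening write to the watched location causes the operation to fail, regardless of the value written.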

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 13:25 UTC (Thu) by ms (subscriber, #41272) [Link]

Sadly, it doesn't support Retry, let alone OrElse, which makes it fairly useless (or at the very least massively limits its use), and nested transactions are basically ignored - the inner transaction boundaries are ignored and the whole thing just becomes one monolithic transaction. Still, it's a start.

Once we have transactional harddiscs, network cards, keyboards, mice and monitors, use of TM might actually become widespread...

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 7:52 UTC (Fri) by Tuna-Fish (guest, #61751) [Link]

> Sadly, it doesn't support Retry, let alone OrElse

Why would it need to? Those features can be perfectly well provided by the software. When a transaction fails, it jumps to a pre-defined failure destination, where most languages would place some sort of exponential backoff retry mechanism, but nothing stops you from providing an OrElse instead.
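
A rough sketch of that software layer: retry the optimistic attempt with exponential backoff, then run an alternative. Everything here is a stand-in. `attempt_fn` represents "try the transaction once" (with real hardware it would wrap the transactional region), `fallback_fn` represents the OrElse branch or a lock-based path, and the `flaky` helpers exist only to exercise the loop.

```c
#include <stdbool.h>

typedef bool (*attempt_fn)(void *ctx);   /* true = committed, false = aborted */
typedef void (*fallback_fn)(void *ctx);  /* alternative path, must not fail */

/* Busy-wait for roughly `spins` iterations. */
static void backoff(unsigned spins)
{
    for (volatile unsigned i = 0; i < spins; i++) {
        /* spin */
    }
}

/* Try `attempt` up to max_attempts times, doubling the pause after each
 * abort. If none commits, run `or_else` once. Returns true if an attempt
 * committed, false if the alternative ran. */
static bool run_or_else(attempt_fn attempt, fallback_fn or_else,
                        void *ctx, int max_attempts)
{
    unsigned spins = 16;
    for (int i = 0; i < max_attempts; i++) {
        if (attempt(ctx))
            return true;
        backoff(spins);
        spins *= 2;
    }
    or_else(ctx);
    return false;
}

/* Demo stand-ins: an attempt that aborts the first `fail_first` times. */
struct flaky {
    int fail_first;
    int attempts;
    int fallbacks;
};

static bool flaky_attempt(void *ctx)
{
    struct flaky *f = ctx;
    f->attempts++;
    return f->attempts > f->fail_first;
}

static void count_fallback(void *ctx)
{
    struct flaky *f = ctx;
    f->fallbacks++;
}
```

Note that this covers retry-on-abort only; it does not give the blocking Retry of STM systems, which waits for the read set to change.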

Modern CPUs don't run microcode any faster than your code. (Because of the way it blocks the frontend, a reasonable case could be made that they run long and complex microcode *a lot slower* than your code.) There would be no value *at all* in providing those complex operations in the cpu itself.

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 10:19 UTC (Fri) by ms (subscriber, #41272) [Link]

> > Sadly, it doesn't support Retry, let alone OrElse
> Why would it need to? Those features can be perfectly well provided by the software.

Erm, so the txn hits a retry, which will have to have been rewritten as an abort. The purpose of retry is to suspend the txn until a value in the readset has been changed (and then restart the txn afresh), but you have no means to access the readset. So you'll then have to rerun the whole txn in software only in order to establish the readset, and then you'll have to go into some sort of backoff mode polling for changes to the readset. That on its own may introduce all sorts of cache thrashing issues. It's a bit fugly to do any sort of polling - STM tends to implement retry as an observer pattern. I wouldn't be too surprised if the broadcasts going on as part of the cache coherency protocol were pretty close to being sufficient to implement such an observer pattern in hardware, though I understand it would depend heavily on what type of CCP you're doing.

Intel's upcoming transactional memory feature

Posted Feb 9, 2012 14:39 UTC (Thu) by slashdot (guest, #22014) [Link]

"An XRELEASE prefixed instruction must restore the value of the elided lock to the value it had before the lock acquisition"

This means that Intel's HLE feature won't work with Linux ticketlocks by just adding the hints.

It seems unclear whether it's possible to have fair spinlocks and take advantage of HLE at the same time.
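
To make the conflict with the XRELEASE rule concrete, here is a simplified sketch contrasting a ticket lock with a test-and-set lock. The struct layouts are illustrative: the kernel's ticket lock packs both counters into one word, while this version keeps them separate for readability. The point is only that a ticket lock's release advances a counter rather than writing back the pre-acquisition value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified ticket lock: take a ticket, wait until it is being served. */
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock;

static void ticket_acquire(ticket_lock *l)
{
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my) {
        /* spin until our ticket comes up */
    }
}

static void ticket_release(ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}

/* Test-and-set lock: 0 = free, 1 = held. */
typedef struct {
    atomic_uint held;
} tas_lock;

static void tas_acquire(tas_lock *l)
{
    while (atomic_exchange(&l->held, 1) != 0) {
        /* spin */
    }
}

static void tas_release(tas_lock *l)
{
    atomic_store(&l->held, 0);
}

/* Does one acquire/release cycle leave the lock state as it started? */
static bool ticket_state_restored(void)
{
    ticket_lock l;
    atomic_init(&l.next_ticket, 0);
    atomic_init(&l.now_serving, 0);
    ticket_acquire(&l);
    ticket_release(&l);
    return atomic_load(&l.next_ticket) == 0 && atomic_load(&l.now_serving) == 0;
}

static bool tas_state_restored(void)
{
    tas_lock l;
    atomic_init(&l.held, 0);
    tas_acquire(&l);
    tas_release(&l);
    return atomic_load(&l.held) == 0;
}
```

After one cycle the ticket lock reads (1, 1) rather than (0, 0), so its release cannot satisfy the quoted requirement as written, whereas the test-and-set lock returns to 0.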

Intel's upcoming transactional memory feature

Posted Feb 10, 2012 12:16 UTC (Fri) by csamuel (✭ supporter ✭, #2624) [Link]

It will be interesting to hear how this compares to the transactional memory in IBM's BlueGene/Q - that was announced at Hot Chips last August and the first 4 racks (for Sequoia at LLNL) have already shipped. Seems like there might be more details for the Intel version even though it's farther in the future (not to mention that BG/Q compute nodes don't run Linux, or are unlikely to appear in consumer gear).

http://arstechnica.com/hardware/news/2011/08/ibms-new-tra...

Intel's upcoming transactional memory feature

Posted Feb 11, 2012 1:15 UTC (Sat) by vomlehn (subscriber, #45588) [Link]

I have spent a bit of time thinking about this, and I'm not sure that this is going to be much of a boon to the Linux kernel. As a rule, we get performance either by using fine-grained granularity or going with lockless algorithms. The transactional memory feature means that, in situations where you don't touch too much memory, the hardware allows you to perform pretty well with coarse-grained locking.

On Intel processors things could benefit. However, kernel code has to work on quite a range of processors. If performance is inadequate, we're going to move to fine-grained locking so that everyone wins. This means that all this fancy hardware is going to reduce to a load locked/store conditional functionality.

Of course, lockless algorithms don't need the fancy hardware, at all.

I'm sure that Intel would like this to help lock Linux onto Intel architectures, but ARM is coming on very strong. Over three decades I've seen small, even "toy", technologies unseat the reigning champion. Linux is, of course, one of those. It looks to me that, unless ARM gets a feature similar enough to this that we can use a single API to cover them both, the power of this feature will be largely underutilized.

Or maybe I'm missing something...

Intel's upcoming transactional memory feature

Posted Feb 11, 2012 4:02 UTC (Sat) by deater (subscriber, #11746) [Link]

> I have spent a bit of time thinking about this, and I'm not sure that this
> is going to be much of a boon to the Linux kernel. As a rule, we get
> performance either by using fine-grained granularity or going with
> lockless algorithms. The transactional memory feature means that, in
> situations where you don't touch too much memory, the hardware allows you
> to perform pretty well with coarse-grained locking.

There was a paper presented at ASPLOS 2009, "Maximum Benefit from a Minimal HTM", that describes how, even though converting a 2.4 Linux kernel to use transactional memory led to a large increase in performance, doing the same for 2.6 had no benefit at all because the locking had gotten so much better by then.

So it will be interesting to see if anything useful comes out of this work by Intel. From what I understand, Transactional Memory is mostly there to help people who aren't very good at writing multithreaded code; it can almost always be beaten by hand-tuned locking written by an expert.

Intel's upcoming transactional memory feature

Posted Feb 17, 2012 0:06 UTC (Fri) by rise (guest, #5045) [Link]

I believe the paper you're referencing is the Hofmann, Rossbach, and Witchel[0] one? It's interesting, but from a quick read it appears all their results are from a machine simulator and describe a type of HTM that requires explicit conversion of parts of the kernel. I'm not at all sure it's directly comparable or evidence that Intel's model of HTM will provide "no benefit at all". They looked only at the kernel (already highly optimized around locking, disregarding other potential users), used a massively simplified in-order x86 model[1], and the handling of more complex & advanced effects in the far-from-ideal Intel hardware - say on instruction caches/prefetching, reordering, speculative execution, etc. - is outside the paper's focus. It's good research, but it's hardly conclusive evidence about real-world performance of a somewhat different implementation on spectacularly different hardware.

[0] http://www.cs.utexas.edu/users/witchel/pubs/hofmann09aspl...
[1] "Because exercising an operating system imposes a heavy burden on simulation time, we use Simics’ in-order processor model at 1 cycle per instruction"

Intel's upcoming transactional memory feature

Posted Feb 17, 2012 3:21 UTC (Fri) by deater (subscriber, #11746) [Link]

I'm usually not one to like results done purely in a simulator either, but this work was likely done in 2008... what actual hardware could they have used to do the experiment? I was just impressed they tackled an actual real-world example like the Linux kernel and not just some set of microbenchmarks.

It will be interesting to see similar work re-done now that actual implementations exist of TM. I just don't think anyone has ever shown performance benefits from TM that couldn't also be realized with well-written traditional locking.

Intel's upcoming transactional memory feature

Posted Feb 16, 2012 3:07 UTC (Thu) by slashdot (guest, #22014) [Link]

Well, this feature provides no guarantees that transactions will complete, so a traditional fallback path needs to be present even for Intel x86, meaning that at least in the current iteration it's useless for any kind of customer lock-in.


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds