LWN.net Logo

ARM SoC launched with Linux support (Linux Devices)

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 3:00 UTC (Fri) by robert_s (subscriber, #42402)
In reply to: ARM SoC launched with Linux support (Linux Devices) by ncm
Parent article: ARM SoC launched with Linux support (Linux Devices)

"When there's only one processor, all operations are atomic."

No. If you're using a preemptive operating system, you can never trust that two consecutive instructions in your program, say T (test) and U (update), have actually been executed consecutively or whether the operating system ran instructions X , Y and Z from another process between T and U. The instructions may or may not have affected the state of the memory you're updating, so it has made the test invalid, but you have no way of knowing that.

Load-linked/store-conditional solves this by allowing code to 'mark' memory locations and only store if the location hasn't been touched since the test.


(Log in to post comments)

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 5:50 UTC (Fri) by ncm (subscriber, #165) [Link]

OK, I get you: user-space. I gather that the "load-locked/store conditional" instruction pair has fallen from favor, of late, and more constrained, single atomic instructions (e.g. compare-and-swap) have come back in vogue.

It's hard to understand how any modern CPU design could lack such primitives, or at least an uninterruptible increment; I thought surely PA-RISC would be the last.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 9:48 UTC (Fri) by simlo (subscriber, #10866) [Link]

Why doesn't Linux support a preemptive off primitive in user space?

It could be made cheaply: Each thread has a counter, read-writeable from userspace. Just as in the kernel, "preempoff" is done by counter++, and preempton by the counter--. In the rare case where the kernel wants to preempt the thread it checks if counter>0. If so it does not preempt the user space task, but sets up a timer to force a preemption. It also sets
a flag to the task which the userspace have to check on --counter == 0 to
call a volentarely schedule(). Just as is done in the kernel right now.

That would effectively make atomic operations for uniprocessors systems in usersspace on ARMv5 and other architectures with no atomic instructions.
Maybe it will even be cheaper on other UP systems with native atomic instructions.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 10:24 UTC (Fri) by ncm (subscriber, #165) [Link]

It seems simpler, safer, and faster for the scheduler to look at the instruction at the return address on the stack of the process being pre-empted, and if it's in the middle of what ought to be an atomic sequence (e.g. a naïve implementation of compare-and-swap) just complete it and update the stacked status word and return address.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 12:37 UTC (Fri) by simlo (subscriber, #10866) [Link]

"It seems simpler, safer, and faster"

I can't see any of the 3 to be right:
simpler: How do you determine if this is the case?
safer: What if you where wrong and missed one? What if you where wrong
and some attacher could abuse getting something run in kernel mode?
faster: How can it be faster to make a complicated search into the binary
code than checking the content of one address?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 20:22 UTC (Fri) by ncm (subscriber, #165) [Link]

Time spent in the kernel scheduler doesn't count, because it happens so rarely. What counts is time spent in user code that isn't pre-empted; incrementing and decrementing a memory word (never mind fooling with a timer!) are expensive. By contrast, code in the kernel to compare a couple of memory words against a pattern is simple, and updating the process state if they match is also simple. So, you just need an instruction sequence that does the right thing very cheaply when it doesn't get pre-empted, and is easy to recognize and patch when it does.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 22:21 UTC (Fri) by jgg (guest, #55211) [Link]

It is actually a pretty good idea if you combine it with kernel controlled code in the VDSO. Clearly operating on arbitrary user code is not a good idea..

Basically, the kernel has useful atomic_* function calls exported in its VDSO, if ever there is a preemption then during the context switch back to the suspended task the kernel checks if the PC is within the bad portion of the VDSO (two compares) and if so then it inspects the situation in more detail and adjusts the PC to return to the start of the atomic op function and tries again.

Similar precautions can be taken for signals and other cases. The basic structure of the atomic functions in the VDSO would be of the load locked/store conditional type with the kernel providing a loop back to start if the 'reservation' is lost.

User space simply links to these calls like normal VDSO linking. The VDSO is readonly to user space and has a unique physical and virtual address.

If generalized properly it could be efficiency neutral for userspace by being optimized to a specific CPU model. Certainly on ARM it has got to be a win if you need to support all CPU types. PPC and others with ll/sc it might be neutral, if the vdso functions are thick enough to replace what would have been a function call anyhow. Probably not useful on x86..

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 13:52 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Because this is an unspeakable horror ?

Your approach grants one of the following:

• It's OK to give unprivileged userspace processes an unlimited quantum. Who needs an operating system anyway ?

• When the timer fires the atomic operation will be interrupted anyway, thus adding a complex feature which doesn't solve the problem.

The Linux kernel team presumably don't think either of these options is sane.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 15:17 UTC (Fri) by simlo (subscriber, #10866) [Link]

As you appearently understood from the idea, the timer should prevent the
user from getting unlimited CPU. What should happen if he does anyway? Send him a SIGILL and let him deal with it.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 15:45 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

So, the user must absolutely avoid racing your timer, or he'll be punished with SIGILL. Thus the "feature" of disabling pre-emption becomes a thin wrapper around simple atomic ops only.

But, it incurs a full system call for every atomic op to let the kernel know that pre-emption can be re-enabled (if the user tries to avoid the system call, he's racing the timer and will SIGILL)

So now your solution has /worse/ performance than the existing solutions, and it's needlessly more complicated.

You might want to check the link nearby (about 0xffff0fc0) in this thread for what Linux actually provides to ARM users, and consider if your solution doesn't look a bit silly by comparison.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 14:33 UTC (Fri) by robert_s (subscriber, #42402) [Link]

"I gather that the "load-locked/store conditional" instruction pair has fallen from favor, of late, and more constrained, single atomic instructions (e.g. compare-and-swap) have come back in vogue."

Not as far as I've noticed. Thing is, it's quite easy to implement compare&swap etc. with ll/sc. So a programmer may still be using a c&s primitive which is ll/sc underneath.

This sums up ARMv5 well in this respect:

http://0pointer.de/blog/projects/atomic-rt.html

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 19:03 UTC (Fri) by ncm (subscriber, #165) [Link]

Well, yes, it's easy to emulate one with the other, but what matters is, what is not too inefficient to implement on a die shared with dozens of other processors?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 10, 2009 13:22 UTC (Sat) by alonz (subscriber, #815) [Link]

Well, in practice, LL/SC is quite easy to implement—it is almost a natural extension of the MESI protocol used for synchronizing the L1 caches in a multiprocessor chip.

IIRC, "load-locked" ensures the line is in the cache; "store-conditional" simply tests the line is currently not in "Invalid" state (i.e., it has not been written by another processor).

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 23:42 UTC (Fri) by endecotp (guest, #36428) [Link]

> http://0pointer.de/blog/projects/atomic-rt.html

Yes, that's a good page - but do read it all, the important bit is down in the comments where Andrew Haley mentions the "magic" code in high memory that does it all automagically (and fairly efficiently).

In my experience, with some thought you can often get quite reasonable code by trying to implement the algorithm directly in terms of atomic swap (i.e. the one atomic operation that ARMv5 has), rather than by writing the code using e.g. atomic add and then trying to implement atomic add on top of swap. I think I had futex-based mutex and condition code that did this.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 10, 2009 9:59 UTC (Sat) by Ze (subscriber, #54182) [Link]

>>In my experience, with some thought you can often get quite reasonable code by trying to implement the algorithm directly in terms of atomic swap (i.e. the one atomic operation that ARMv5 has), rather than by writing the code using e.g. atomic add and then trying to implement atomic add on top of swap. I think I had futex-based mutex and condition code that did this.

I'm going by memory for this but I seem to remember you are limited to a fixed number of atomic locks at compile time. Whether you use Peterson's algorithm or Lamport's Bakery algorithm.

Thinking about it now I don't see why you can't do a binary semaphore using atomic swap , something like the following pseudo code. I'd be happy if somebody told me I was wrong though since I could be a bit rusty on my atomic primitives.

binary_semaphore_lock()
{
thread_local = 1;
do
{
atomic_swap(thread_local,lock);
}
while(thread_local==1);
//If we get to here then local = 0 , and thus lock must've been equal to 0 before we swapped 1 into it.
}
binary_semaphore_unlock()
{
lock = 0;
}

If you used Peterson's or Lamport's it would depend upon the particular workload and environment but I wouldn't be surprised if the patching method in the kernel worked quicker in most cases. It'd depend upon number of locks and the ratio of lock/unlocks to kernel interruptions/pre-emptions though.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 9:41 UTC (Tue) by endecotp (guest, #36428) [Link]

> Peterson's algorithm or Lamport's Bakery algorithm.

Those are the algorithms that you have to resort to when you don't even have atomic swap, as I recall. With ARMv5 you can use code like your example. I have had success with something like this:

bool locked;

void lock() {
while (atomic_swap(locked,true)) {
sched_yield();
}
}

void unlock() {
locked = false;
}

This works well IFF:

- The probability of contention is low.
- You have tens, not thousands, of threads.
- You care only about average delays, not worst case.
- You don't care about RT threads, priority inversion etc.

The really great thing about this code is that the fast (uncontended) path is just a couple of instructions as long as the compiler inlines it, making it maybe an order of magnitude faster than something like pthread_mutex_lock().

I was trying to recall how to extend this to use a futex() call instead of sched_yield(). It gets tricky....

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 21:08 UTC (Tue) by npitre (subscriber, #5680) [Link]

> In my experience, with some thought you can often get quite reasonable
> code by trying to implement the algorithm directly in terms of atomic
> swap (i.e. the one atomic operation that ARMv5 has), rather than by
> writing the code using e.g. atomic add and then trying to implement
> atomic add on top of swap. I think I had futex-based mutex and
> condition code that did this.

That certainly works for things like semaphores where you're expected to go to sleep on contention. However, for simple atomic primitives such as atomic adds and the like, relying on futexes for arbitration is rather heavy weight, whereas the technique currently implemented in ARM Linux for those is extremely light and has no contention issue at all.

Nevertheless, a simple swap instruction, along with futexes, allows for pretty efficient semaphore support. For example, see my implementation here (if I'm not mistaken, you are the one I replied to in that message). That implementation, unfortunately, is not POSIX compliant as it is not safe wrt concurrent usage from signal handlers.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 22:09 UTC (Tue) by endecotp (guest, #36428) [Link]

Yes that's me! It's a bit embarrassing that that exchange was not much more than a year ago yet I seem to have forgotten most of the details...

User space atomic ops on ARMv5 and earlier

Posted Jan 12, 2009 19:58 UTC (Mon) by npitre (subscriber, #5680) [Link]

> This sums up ARMv5 well in this respect:
>
> http://0pointer.de/blog/projects/atomic-rt.html

This page is not completely accurate.

I'm the author of the __kernel_cmpxchg facility. I simply disagree with the claim that true and efficient atomic operations are not possible on ARMv5.

The trick is very simple: you do your cmpxchg operation in user mode using standard instructions, without any lock, syscall, exception trap, etc. However this should be a controlled set of instruction at a fixed location. Those instructions are provided by the kernel for this purpose and made read-only to user space.

So what you have on pre-ARMv6 at 0xffff0fc0 is this:

1:      ldr     r3, [r2]        @ load current val
        subs    r3, r3, r0      @ compare with oldval
2:      streq   r1, [r2]        @ store newval if eq
        rsbs    r0, r3, #0      @ set return val and C flag
        bx      lr              @ or "mov pc, lr" if no thumb support

This is about the fastest you can get, even when comparing this to ARMv6 with its ll-sc instructions.

Then, upon entry in the kernel which has the potential to schedule another thread, only suffice to perform a simple test on the saved user space pc. value. If it is above 0xc0000000 then execution was possibly interrupted while executing that code, and that may be only due to an interrupt, or a page fault when attempting to dereference the provided pointer. So in those exception handlers, this simple test is added:

        cmp     r2, #TASK_SIZE  @ saved user space pc value
        blhs    kuser_cmpxchg_fixup

The out-of-line kuser_cmpxchg_fixup code determines if pc actually corresponds to the code located between 1: and 2: labels above, meaning that the atomicity cannot be guaranteed. In that case the saved user space pc value is simply rewound to 1: so to restart the operation entirely the next time this thread is scheduled. Suffice to say that this has extremely low probability to happen therefore having next to zero overhead, but when it happens then full "atomicity" is preserved.

This works on non SMP system only, of course. But none of the existing ARMv5 implementations out there are SMP anyway. And on SMP capable ARM systems, the kernel replaces the above code by another version which is SMP safe by using ARMv6 ldrex/strex instructions, making this interface portable.

All this to say that perfect atomic operations are possible and even fast on ARMv5 and earlier with no problem at all. This works even for RT tasks, is signal safe, and if currently this trick is not implemented on uClinux, there is no inherent limitation preventing this to be usable there as well.

This interface may look awkward for user space programs, but the purpose of standard libraries is actually to encapsulate and hide those things. Here's for example an optimized atomic_add() implementation based on the above:

#define atomic_add(ptr, val) \
     ({ register unsigned int *__ptr asm("r2") = (ptr); \
        register unsigned int __result asm("r1"); \
        asm volatile ( \
            "1: @ atomic_add\n\t" \
            "ldr     r0, [r2]\n\t" \
            "mov     r3, #0xffff0fff\n\t" \
            "add     lr, pc, #4\n\t" \
            "add     r1, r0, %2\n\t" \
            "add     pc, r3, #(0xffff0fc0 - 0xffff0fff)\n\t" \
            "bcc     1b" \
            : "=&r" (__result) \
            : "r" (__ptr), "rIL" (val) \
            : "r0","r3","ip","lr","cc","memory" ); \
        __result; })

And so on.

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 16:44 UTC (Tue) by endecotp (guest, #36428) [Link]

It would be great if gcc could make use of this when you use its __sync_* atomic builtins, or if glibc used it in its pthread_* implementations. I'm pretty sure that neither does at present, though I'd love to be corrected if they do!

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 19:39 UTC (Tue) by npitre (subscriber, #5680) [Link]

NPTL support for ARM in glibc certainly does. I developed the kernel part in collaboration with the person who did the glibc part.

As to gcc, I don't see any specific ARM support for the __sync_* primitives, not even for ARMv6+ which has native instructions that could be used for that purpose. However this should be easy to implement following the PA model. Incidentally, the file gcc/config/pa/linux-atomic.c contains this note:

/* Linux-specific atomic operations for PA Linux.
   Copyright (C) 2008 Free Software Foundation, Inc.
   Based on code contributed by CodeSourcery for ARM EABI Linux.
   Modifications for PA Linux by Helge Deller <deller@gmx.de>
[...]

So maybe that support does exist somewhere already?

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 22:18 UTC (Tue) by endecotp (guest, #36428) [Link]

I've just done a quick test and it looks like gcc emits a function call when you invoke __sync_* for ARM. So presumably it would be possible to write a small library of things with the right names that calls the kernel helper code. Then near-optimal portable code would be possible.

Maybe someone has already done this....

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds