LWN.net Logo

ARM SoC launched with Linux support (Linux Devices)

Linux Devices covers Marvell's launch of the PXA168 system-on-chip device. "The PXA168 is said to offer the processing capabilities of an "entry-level laptop," on devices that still often run single-purpose real-time operating systems (RTOSes), says the company. Touted features include instant-on web surfing and widgets, multi-format video, Adobe Flash-based playback, image processing, video conferencing, and advanced graphical user interfaces (GUIs)."
(Log in to post comments)

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 8, 2009 22:42 UTC (Thu) by robert_s (subscriber, #42402) [Link]

Please god let ARMv5 die along with its inadequate atomic operation support.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 1:31 UTC (Fri) by ncm (subscriber, #165) [Link]

When there's only one processor, all operations are atomic. Or are you talking about interaction with the various coprocessors on the chip?

It's disappointing that they omit Vorbis and Dirac codecs on the reference platform.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 2:17 UTC (Fri) by jengelh (subscriber, #33263) [Link]

And given that at least the Linux implementation of the Flash plugin is so horribly slow, I wonder how they can keep the real-time.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 3:00 UTC (Fri) by robert_s (subscriber, #42402) [Link]

"When there's only one processor, all operations are atomic."

No. If you're using a preemptive operating system, you can never trust that two consecutive instructions in your program, say T (test) and U (update), have actually been executed consecutively or whether the operating system ran instructions X , Y and Z from another process between T and U. The instructions may or may not have affected the state of the memory you're updating, so it has made the test invalid, but you have no way of knowing that.

Load-linked/store-conditional solves this by allowing code to 'mark' memory locations and only store if the location hasn't been touched since the test.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 5:50 UTC (Fri) by ncm (subscriber, #165) [Link]

OK, I get you: user-space. I gather that the "load-locked/store conditional" instruction pair has fallen from favor, of late, and more constrained, single atomic instructions (e.g. compare-and-swap) have come back in vogue.

It's hard to understand how any modern CPU design could lack such primitives, or at least an uninterruptible increment; I thought surely PA-RISC would be the last.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 9:48 UTC (Fri) by simlo (subscriber, #10866) [Link]

Why doesn't Linux support a preemptive off primitive in user space?

It could be made cheaply: Each thread has a counter, read-writeable from userspace. Just as in the kernel, "preempoff" is done by counter++, and preempton by the counter--. In the rare case where the kernel wants to preempt the thread it checks if counter>0. If so it does not preempt the user space task, but sets up a timer to force a preemption. It also sets
a flag to the task which the userspace have to check on --counter == 0 to
call a volentarely schedule(). Just as is done in the kernel right now.

That would effectively make atomic operations for uniprocessors systems in usersspace on ARMv5 and other architectures with no atomic instructions.
Maybe it will even be cheaper on other UP systems with native atomic instructions.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 10:24 UTC (Fri) by ncm (subscriber, #165) [Link]

It seems simpler, safer, and faster for the scheduler to look at the instruction at the return address on the stack of the process being pre-empted, and if it's in the middle of what ought to be an atomic sequence (e.g. a naïve implementation of compare-and-swap) just complete it and update the stacked status word and return address.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 12:37 UTC (Fri) by simlo (subscriber, #10866) [Link]

"It seems simpler, safer, and faster"

I can't see any of the 3 to be right:
simpler: How do you determine if this is the case?
safer: What if you where wrong and missed one? What if you where wrong
and some attacher could abuse getting something run in kernel mode?
faster: How can it be faster to make a complicated search into the binary
code than checking the content of one address?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 20:22 UTC (Fri) by ncm (subscriber, #165) [Link]

Time spent in the kernel scheduler doesn't count, because it happens so rarely. What counts is time spent in user code that isn't pre-empted; incrementing and decrementing a memory word (never mind fooling with a timer!) are expensive. By contrast, code in the kernel to compare a couple of memory words against a pattern is simple, and updating the process state if they match is also simple. So, you just need an instruction sequence that does the right thing very cheaply when it doesn't get pre-empted, and is easy to recognize and patch when it does.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 22:21 UTC (Fri) by jgg (guest, #55211) [Link]

It is actually a pretty good idea if you combine it with kernel controlled code in the VDSO. Clearly operating on arbitrary user code is not a good idea..

Basically, the kernel has useful atomic_* function calls exported in its VDSO, if ever there is a preemption then during the context switch back to the suspended task the kernel checks if the PC is within the bad portion of the VDSO (two compares) and if so then it inspects the situation in more detail and adjusts the PC to return to the start of the atomic op function and tries again.

Similar precautions can be taken for signals and other cases. The basic structure of the atomic functions in the VDSO would be of the load locked/store conditional type with the kernel providing a loop back to start if the 'reservation' is lost.

User space simply links to these calls like normal VDSO linking. The VDSO is readonly to user space and has a unique physical and virtual address.

If generalized properly it could be efficiency neutral for userspace by being optimized to a specific CPU model. Certainly on ARM it has got to be a win if you need to support all CPU types. PPC and others with ll/sc it might be neutral, if the vdso functions are thick enough to replace what would have been a function call anyhow. Probably not useful on x86..

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 13:52 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Because this is an unspeakable horror ?

Your approach grants one of the following:

• It's OK to give unprivileged userspace processes an unlimited quantum. Who needs an operating system anyway ?

• When the timer fires the atomic operation will be interrupted anyway, thus adding a complex feature which doesn't solve the problem.

The Linux kernel team presumably don't think either of these options is sane.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 15:17 UTC (Fri) by simlo (subscriber, #10866) [Link]

As you appearently understood from the idea, the timer should prevent the
user from getting unlimited CPU. What should happen if he does anyway? Send him a SIGILL and let him deal with it.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 15:45 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

So, the user must absolutely avoid racing your timer, or he'll be punished with SIGILL. Thus the "feature" of disabling pre-emption becomes a thin wrapper around simple atomic ops only.

But, it incurs a full system call for every atomic op to let the kernel know that pre-emption can be re-enabled (if the user tries to avoid the system call, he's racing the timer and will SIGILL)

So now your solution has /worse/ performance than the existing solutions, and it's needlessly more complicated.

You might want to check the link nearby (about 0xffff0fc0) in this thread for what Linux actually provides to ARM users, and consider if your solution doesn't look a bit silly by comparison.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 14:33 UTC (Fri) by robert_s (subscriber, #42402) [Link]

"I gather that the "load-locked/store conditional" instruction pair has fallen from favor, of late, and more constrained, single atomic instructions (e.g. compare-and-swap) have come back in vogue."

Not as far as I've noticed. Thing is, it's quite easy to implement compare&swap etc. with ll/sc. So a programmer may still be using a c&s primitive which is ll/sc underneath.

This sums up ARMv5 well in this respect:

http://0pointer.de/blog/projects/atomic-rt.html

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 19:03 UTC (Fri) by ncm (subscriber, #165) [Link]

Well, yes, it's easy to emulate one with the other, but what matters is, what is not too inefficient to implement on a die shared with dozens of other processors?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 10, 2009 13:22 UTC (Sat) by alonz (subscriber, #815) [Link]

Well, in practice, LL/SC is quite easy to implement—it is almost a natural extension of the MESI protocol used for synchronizing the L1 caches in a multiprocessor chip.

IIRC, "load-locked" ensures the line is in the cache; "store-conditional" simply tests the line is currently not in "Invalid" state (i.e., it has not been written by another processor).

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 23:42 UTC (Fri) by endecotp (guest, #36428) [Link]

> http://0pointer.de/blog/projects/atomic-rt.html

Yes, that's a good page - but do read it all, the important bit is down in the comments where Andrew Haley mentions the "magic" code in high memory that does it all automagically (and fairly efficiently).

In my experience, with some thought you can often get quite reasonable code by trying to implement the algorithm directly in terms of atomic swap (i.e. the one atomic operation that ARMv5 has), rather than by writing the code using e.g. atomic add and then trying to implement atomic add on top of swap. I think I had futex-based mutex and condition code that did this.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 10, 2009 9:59 UTC (Sat) by Ze (guest, #54182) [Link]

>>In my experience, with some thought you can often get quite reasonable code by trying to implement the algorithm directly in terms of atomic swap (i.e. the one atomic operation that ARMv5 has), rather than by writing the code using e.g. atomic add and then trying to implement atomic add on top of swap. I think I had futex-based mutex and condition code that did this.

I'm going by memory for this but I seem to remember you are limited to a fixed number of atomic locks at compile time. Whether you use Peterson's algorithm or Lamport's Bakery algorithm.

Thinking about it now I don't see why you can't do a binary semaphore using atomic swap , something like the following pseudo code. I'd be happy if somebody told me I was wrong though since I could be a bit rusty on my atomic primitives.

binary_semaphore_lock()
{
thread_local = 1;
do
{
atomic_swap(thread_local,lock);
}
while(thread_local==1);
//If we get to here then local = 0 , and thus lock must've been equal to 0 before we swapped 1 into it.
}
binary_semaphore_unlock()
{
lock = 0;
}

If you used Peterson's or Lamport's it would depend upon the particular workload and environment but I wouldn't be surprised if the patching method in the kernel worked quicker in most cases. It'd depend upon number of locks and the ratio of lock/unlocks to kernel interruptions/pre-emptions though.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 9:41 UTC (Tue) by endecotp (guest, #36428) [Link]

> Peterson's algorithm or Lamport's Bakery algorithm.

Those are the algorithms that you have to resort to when you don't even have atomic swap, as I recall. With ARMv5 you can use code like your example. I have had success with something like this:

bool locked;

void lock() {
while (atomic_swap(locked,true)) {
sched_yield();
}
}

void unlock() {
locked = false;
}

This works well IFF:

- The probability of contention is low.
- You have tens, not thousands, of threads.
- You care only about average delays, not worst case.
- You don't care about RT threads, priority inversion etc.

The really great thing about this code is that the fast (uncontended) path is just a couple of instructions as long as the compiler inlines it, making it maybe an order of magnitude faster than something like pthread_mutex_lock().

I was trying to recall how to extend this to use a futex() call instead of sched_yield(). It gets tricky....

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 21:08 UTC (Tue) by npitre (subscriber, #5680) [Link]

> In my experience, with some thought you can often get quite reasonable
> code by trying to implement the algorithm directly in terms of atomic
> swap (i.e. the one atomic operation that ARMv5 has), rather than by
> writing the code using e.g. atomic add and then trying to implement
> atomic add on top of swap. I think I had futex-based mutex and
> condition code that did this.

That certainly works for things like semaphores where you're expected to go to sleep on contention. However, for simple atomic primitives such as atomic adds and the like, relying on futexes for arbitration is rather heavy weight, whereas the technique currently implemented in ARM Linux for those is extremely light and has no contention issue at all.

Nevertheless, a simple swap instruction, along with futexes, allows for pretty efficient semaphore support. For example, see my implementation here (if I'm not mistaken, you are the one I replied to in that message). That implementation, unfortunately, is not POSIX compliant as it is not safe wrt concurrent usage from signal handlers.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 13, 2009 22:09 UTC (Tue) by endecotp (guest, #36428) [Link]

Yes that's me! It's a bit embarrassing that that exchange was not much more than a year ago yet I seem to have forgotten most of the details...

User space atomic ops on ARMv5 and earlier

Posted Jan 12, 2009 19:58 UTC (Mon) by npitre (subscriber, #5680) [Link]

> This sums up ARMv5 well in this respect:
>
> http://0pointer.de/blog/projects/atomic-rt.html

This page is not completely accurate.

I'm the author of the __kernel_cmpxchg facility. I simply disagree with the claim that true and efficient atomic operations are not possible on ARMv5.

The trick is very simple: you do your cmpxchg operation in user mode using standard instructions, without any lock, syscall, exception trap, etc. However this should be a controlled set of instruction at a fixed location. Those instructions are provided by the kernel for this purpose and made read-only to user space.

So what you have on pre-ARMv6 at 0xffff0fc0 is this:

1:      ldr     r3, [r2]        @ load current val
        subs    r3, r3, r0      @ compare with oldval
2:      streq   r1, [r2]        @ store newval if eq
        rsbs    r0, r3, #0      @ set return val and C flag
        bx      lr              @ or "mov pc, lr" if no thumb support

This is about the fastest you can get, even when comparing this to ARMv6 with its ll-sc instructions.

Then, upon entry in the kernel which has the potential to schedule another thread, only suffice to perform a simple test on the saved user space pc. value. If it is above 0xc0000000 then execution was possibly interrupted while executing that code, and that may be only due to an interrupt, or a page fault when attempting to dereference the provided pointer. So in those exception handlers, this simple test is added:

        cmp     r2, #TASK_SIZE  @ saved user space pc value
        blhs    kuser_cmpxchg_fixup

The out-of-line kuser_cmpxchg_fixup code determines if pc actually corresponds to the code located between 1: and 2: labels above, meaning that the atomicity cannot be guaranteed. In that case the saved user space pc value is simply rewound to 1: so to restart the operation entirely the next time this thread is scheduled. Suffice to say that this has extremely low probability to happen therefore having next to zero overhead, but when it happens then full "atomicity" is preserved.

This works on non SMP system only, of course. But none of the existing ARMv5 implementations out there are SMP anyway. And on SMP capable ARM systems, the kernel replaces the above code by another version which is SMP safe by using ARMv6 ldrex/strex instructions, making this interface portable.

All this to say that perfect atomic operations are possible and even fast on ARMv5 and earlier with no problem at all. This works even for RT tasks, is signal safe, and if currently this trick is not implemented on uClinux, there is no inherent limitation preventing this to be usable there as well.

This interface may look awkward for user space programs, but the purpose of standard libraries is actually to encapsulate and hide those things. Here's for example an optimized atomic_add() implementation based on the above:

#define atomic_add(ptr, val) \
     ({ register unsigned int *__ptr asm("r2") = (ptr); \
        register unsigned int __result asm("r1"); \
        asm volatile ( \
            "1: @ atomic_add\n\t" \
            "ldr     r0, [r2]\n\t" \
            "mov     r3, #0xffff0fff\n\t" \
            "add     lr, pc, #4\n\t" \
            "add     r1, r0, %2\n\t" \
            "add     pc, r3, #(0xffff0fc0 - 0xffff0fff)\n\t" \
            "bcc     1b" \
            : "=&r" (__result) \
            : "r" (__ptr), "rIL" (val) \
            : "r0","r3","ip","lr","cc","memory" ); \
        __result; })

And so on.

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 16:44 UTC (Tue) by endecotp (guest, #36428) [Link]

It would be great if gcc could make use of this when you use its __sync_* atomic builtins, or if glibc used it in its pthread_* implementations. I'm pretty sure that neither does at present, though I'd love to be corrected if they do!

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 19:39 UTC (Tue) by npitre (subscriber, #5680) [Link]

NPTL support for ARM in glibc certainly does. I developed the kernel part in collaboration with the person who did the glibc part.

As to gcc, I don't see any specific ARM support for the __sync_* primitives, not even for ARMv6+ which has native instructions that could be used for that purpose. However this should be easy to implement following the PA model. Incidentally, the file gcc/config/pa/linux-atomic.c contains this note:

/* Linux-specific atomic operations for PA Linux.
   Copyright (C) 2008 Free Software Foundation, Inc.
   Based on code contributed by CodeSourcery for ARM EABI Linux.
   Modifications for PA Linux by Helge Deller <deller@gmx.de>
[...]

So maybe that support does exist somewhere already?

User space atomic ops on ARMv5 and earlier

Posted Jan 13, 2009 22:18 UTC (Tue) by endecotp (guest, #36428) [Link]

I've just done a quick test and it looks like gcc emits a function call when you invoke __sync_* for ARM. So presumably it would be possible to write a small library of things with the right names that calls the kernel helper code. Then near-optimal portable code would be possible.

Maybe someone has already done this....

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 10:01 UTC (Fri) by gnb (subscriber, #5132) [Link]

Well, in fairness to ARM, they do regard it as dead. You can still licence
older cores, but they haven't announced a new core that does v5 since... 2002
I think. Why Intel and then Marvell stuck with it for the xscale I don't know
( my guess would be legal reasons). But then these days I'm not clear why
Marvell don't just drop the xscale processor from their SoC lines and replace
it with an ARM core of similar performance.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 10:48 UTC (Fri) by ajb (subscriber, #9694) [Link]

Because the Xscale license held (now) by Marvell is more permissive than the standard ARM licenses - they can make derived works.

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 11:33 UTC (Fri) by gnb (subscriber, #5132) [Link]

Yes, I realise that, I probably worded the question badly. What isn't clear
to me is that they get any technological advantage out of it any more. Time
was when strongarm, and subsequently the xscale, were clearly faster than the
cores ARM were selling, but is that still true (my impression is that it
isn't) and if not why keep putting R&D into an in-house implementation of an
antiquated version of the ARM ISA?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 13:37 UTC (Fri) by nhippi (subscriber, #34640) [Link]

Marvell mv78200 (a armv5t) is still very much the fastest ARM around (perhaps Qualcomm snapdragon beats it, but I'm not sure how available it is yet).

Also, if you read the fine article, they link to a (very insightful) Marvell .pdf which mentions that the next generation (to be announced 2009) is "fully armv6 and armv7 compatible".

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 14:45 UTC (Fri) by robert_s (subscriber, #42402) [Link]

"Marvell mv78200 (a armv5t) is still very much the fastest ARM around"

Faster than any of the OMAP3s?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 9, 2009 23:08 UTC (Fri) by endecotp (guest, #36428) [Link]

My recollection is the the mv78200 has two cores, each capable of issuing two instructions per cycle, running at a little over 1GHz. Can anyone quote comparable data for the OMAP?

ARM SoC launched with Linux support (Linux Devices)

Posted Jan 10, 2009 1:02 UTC (Sat) by drag (subscriber, #31333) [Link]

Well it's probably more comparing apples to oranges.

With the proprietary drivers for the PowerVR SGX video stuff (also used in Intel's Atom stuff when not using the desktop oriented 945g stuff) then Ti's OMAP3 platform can decode and playback 720p video over DVI interface in realtime. It can do GLSL and has effective 3D graphics. (it's used in the Pandora handheld gaming system which can play Quake 2 with steady 40fps.)

And also besides the PowerVR stuff the Cortex-A8 core (the OMAP3 is TI's baby, while the Cortex-A8 is at least licensed by Freescale also and freescale has it's own platform in the works. For example OMAP3x stuff so far only supports DDR1, while Freescale supports DDR2) supports full ARMv7, NEON instruction set and SIMD stuff.

The Cortex-A8 can do 1ghz, has 13 stage integer pipeline, 10 stage NEON, L2 cache ranging from 64K to 2M, runs using 300mA worth of power, and has some crypto acceleration besides other features. Blah blah.

You can find nice PDF's describing all of this with easy google'ng

In the future Cortex-A9 cores should be coming out and they are the multicore versions of the same architecture. I think they are designed, but I haven't seen anybody using them yet.

Both of these can be clocked at excess of 1ghz given proper fabrication and power requirements.

---------------------

However the MV stuff your talking about is optimized for network performance with the ability to do 4 ethernet ports and happy fun stuff like that. It's dual core is probably more optimized for traditional ARM integer functions and thus is probably much better at some tasks then the OMAP3 is.

---------------------

It would be interesting to see benchmarks to compare various ARM systems and help people decide with architecture is good for them.

The nicest thing about Ti is that they are very developer friendly. They provided the Beagleboard, which is a developer's board that is insanely cheap for the performance it offers. $150

http://beagleboard.org/hardware

For another 350 dollars you can get a nice mini projector from Ti to go with it. A entire Linux PC with with HVGA graphics on a 30 inch display you can fit in one pocket for around 600 bucks.

Then the Pandora handheld gaming thing is using it. I don't think they are selling yet, but they have released pre-production devices to testers/developers.

Also Gumstix is producing the OMAP3-based Overo for hackers for $150 for their flexible and tiny gumstix formfactor.

So all of this platform is very accessible to Linux hackers and hobbyists. If it wasn't for the proprietary 3D drivers then it would be almost perfect for low-power systems.

OMAP3...

Posted Jan 10, 2009 13:13 UTC (Sat) by HalfMoon (guest, #3211) [Link]

OMAP is probably best known for being energy-efficient, which XScale never did that well. The process technologies have really affected the hardware designs, which gets to be really significant with smaller geometries where leakage currents *really* hurt. OMAP has quite a few power domains on the chip, which can be individually switched off; and that's on top of selecting power-efficient transistors. That's critical if you're going to eke out a week's usage off teeny tiny batteries... and part of why it's not clocked as fast as the ARM core would allow.

Also, OMAP3 has four basic options. OMAP 3503 (the one in the entry level Gumstix Overo) has no video accelerators. The higher end chips can give you one or both of: the OpenGL engine; or a rather powerful C64x+ fixed point DSP (VLIW, four concurrent ALUs, many MMACs/sec). The Beagle board has one of the higher end chips (OMAP 3530), with all the available graphical goodies. (But not the "secure" version used in most cell phones.) If you need number crunching, use the DSP not the ARM.

You can compare the OMAP 3525 (ARM plus DSP, no OpenGL hardware) to the DaVinci 6446, which is marketed as a video processor; except the OMAP has more peripherals, a much faster ARM (Cortex-A8 vs arm926), and better power management.

OMAP3...

Posted Jan 13, 2009 2:28 UTC (Tue) by drag (subscriber, #31333) [Link]

Thanks for the info. It is very valuable.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds