>>In my experience, with some thought you can often get quite reasonable code by trying to implement the algorithm directly in terms of atomic swap (i.e. the one atomic operation that ARMv5 has), rather than by writing the code using e.g. atomic add and then trying to implement atomic add on top of swap. I think I had futex-based mutex and condition code that did this.
I'm going by memory for this but I seem to remember you are limited to a fixed number of atomic locks at compile time. Whether you use Peterson's algorithm or Lamport's Bakery algorithm.
Thinking about it now I don't see why you can't do a binary semaphore using atomic swap , something like the following pseudo code. I'd be happy if somebody told me I was wrong though since I could be a bit rusty on my atomic primitives.
binary_semaphore_lock()
{
thread_local = 1;
do
{
atomic_swap(thread_local,lock);
}
while(thread_local==1);
//If we get to here then local = 0 , and thus lock must've been equal to 0 before we swapped 1 into it.
}
binary_semaphore_unlock()
{
lock = 0;
}
If you used Peterson's or Lamport's it would depend upon the particular workload and environment but I wouldn't be surprised if the patching method in the kernel worked quicker in most cases. It'd depend upon number of locks and the ratio of lock/unlocks to kernel interruptions/pre-emptions though.
ARM SoC launched with Linux support (Linux Devices)
Posted Jan 13, 2009 9:41 UTC (Tue) by endecotp (guest, #36428)
[Link]
> Peterson's algorithm or Lamport's Bakery algorithm.
Those are the algorithms that you have to resort to when you don't even have atomic swap, as I recall. With ARMv5 you can use code like your example. I have had success with something like this:
bool locked;
void lock() {
while (atomic_swap(locked,true)) {
sched_yield();
}
}
void unlock() {
locked = false;
}
This works well IFF:
- The probability of contention is low.
- You have tens, not thousands, of threads.
- You care only about average delays, not worst case.
- You don't care about RT threads, priority inversion etc.
The really great thing about this code is that the fast (uncontended) path is just a couple of instructions as long as the compiler inlines it, making it maybe an order of magnitude faster than something like pthread_mutex_lock().
I was trying to recall how to extend this to use a futex() call instead of sched_yield(). It gets tricky....