
Making Linux safe for pthreads

The Linux kernel has long been criticized for its thread support. This criticism is surprising to some, since the Linux clone() system call provides a great deal of flexibility in the creation of threads that share resources with their parent process. But clone() is not enough to allow Linux to fully support the POSIX thread (pthreads) standard with good performance - especially for applications that create thousands of threads.

And such applications do exist. A lot of kernel hackers dismiss highly threaded applications as being poorly written - having more threads than processors on the system is almost always a loss from a performance point of view, and truly robust thread programming is difficult. But Linux must support what users want to do, or they will use a different system. This week has seen the culmination of quite a bit of work aimed at improving the kernel's basic thread support.

The push to improve thread support began some months ago with Rusty Russell's "Futex" (fast user-space mutex) patch. Futexes allow pthread mutexes and condition variables to be implemented so that a system call is required only when there is contention. This patch was merged in 2.5.7 and has been refined since then.
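To make the "system call only under contention" idea concrete, here is a minimal sketch of a lock built on the futex system call as documented in futex(2). It is not the code pthreads uses; the names sketch_mutex, futex_call, sketch_lock and sketch_unlock are invented for illustration, and the three-state scheme (0 = free, 1 = locked, 2 = locked with possible waiters) is one common way to arrange things, assumed here rather than taken from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock type: 0 = free, 1 = locked, 2 = locked, maybe waiters. */
typedef struct { atomic_int state; } sketch_mutex;

/* Thin wrapper around the raw futex system call (glibc has no wrapper). */
static long futex_call(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m)
{
    int seen = 0;

    /* Fast path: an uncontended lock is a single atomic operation. */
    if (atomic_compare_exchange_strong(&m->state, &seen, 1))
        return;

    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (seen != 2)
        seen = atomic_exchange(&m->state, 2);
    while (seen != 0) {
        /* Sleeps only if the word still holds 2; otherwise returns at once. */
        futex_call(&m->state, FUTEX_WAIT, 2);
        seen = atomic_exchange(&m->state, 2);
    }
}

static void sketch_unlock(sketch_mutex *m)
{
    int prev = atomic_exchange(&m->state, 0);

    assert(prev != 0);   /* unlocking a free lock is a caller bug */
    /* Fast path: nobody was waiting, so no system call is needed. */
    if (prev == 2)
        futex_call(&m->state, FUTEX_WAKE, 1);
}
```

A caller would declare `sketch_mutex m = { 0 };` and bracket its critical section with sketch_lock(&m) and sketch_unlock(&m). When only one thread touches the lock, neither call enters the kernel.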

More recently, Ingo Molnar has been working on thread support issues. His first thread-local storage (TLS) patch was posted on July 25; it was merged in 2.5.29 and is still being hacked upon. The purpose of TLS, of course, is to give each thread access to a region of memory which is not shared with all other threads. Ingo's patch, which is implemented only for the x86 architecture, supports TLS with the following changes:

  • Doing thread-local storage right on the x86 requires using the segment mechanism. The patch sets aside a few entries in the processor's global descriptor table (GDT) to implement the TLS segments. The most recent patch as of this writing (tls-2.5.31-D9) creates three segments: one for glibc (and, thus, pthreads), one for Wine, and one unassigned.

  • A new set_thread_area() system call allows library code to set up thread-local storage using one of the TLS segments.

  • At every context switch, the kernel copies the new process's TLS entries into the appropriate part of the GDT.

With these changes, each thread can have its own, transparent, local storage area. There was just one last complication: the x86 GDT was global and shared on SMP systems. So Ingo had to create a separate GDT for each processor, with the interesting result that context switches got a little faster.
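What thread-local storage looks like from the application side can be sketched as follows. This does not exercise set_thread_area() directly; it assumes a GCC- or Clang-style compiler with the `__thread` extension and a pthreads library that sets up the per-thread area behind the scenes. The names tls_counter, worker and run_tls_demo are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One private copy of this variable exists per thread. */
static __thread int tls_counter = 0;

static void *worker(void *arg)
{
    int *result = arg;

    for (int i = 0; i < 1000; i++)
        tls_counter++;          /* touches only this thread's copy */
    *result = tls_counter;
    return NULL;
}

/* Runs two workers; returns the calling thread's own counter afterwards. */
static int run_tls_demo(int *a, int *b)
{
    pthread_t t1, t2;
    int rc;

    tls_counter = 7;            /* the caller's private value */
    rc = pthread_create(&t1, NULL, worker, a);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, worker, b);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return tls_counter;         /* unchanged by the workers */
}
```

Each worker counts to 1000 in its own copy without any locking, and the caller's value of 7 survives, which is the "transparent" behavior described above.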

Next problem: what if you want to create lots of threads in a quick and safe manner? The classic Unix fork() system call has a problem in that the newly-created child process could exit before the process ID is ever returned to the parent; if the parent loses this race, it can be left in a position where it no longer knows what is going on with its children. This problem can be worked around, but the workaround involves more system calls, which slow down thread creation.

Ingo's solution comes in the form of a couple of new flags to the clone() system call. The pthread library can throw in CLONE_SETTID, which causes the process ID of the new thread to be written back to a variable in the parent's address space before the new thread begins running. There is also a CLONE_SETTLS flag which causes the equivalent of a set_thread_area() call to happen as well. The result is a robust way of creating new threads with a single system call.
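The effect of the write-back flag can be sketched with the clone() wrapper in glibc. One loud assumption: the sketch uses CLONE_PARENT_SETTID, the flag documented in clone(2) on released kernels, which appears to correspond to the CLONE_SETTID of the patch described here; it is not the patch's own interface. The helper names child_fn and spawn_and_check, and the 64KB stack size, are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define SKETCH_STACK_SIZE (64 * 1024)   /* arbitrary size for the demo */

static int child_fn(void *arg)
{
    (void)arg;
    return 0;   /* exit immediately: the worst case for the race */
}

/* Returns 1 if the child's ID was already recorded when clone() returned. */
static int spawn_and_check(void)
{
    char *stack = malloc(SKETCH_STACK_SIZE);
    pid_t recorded = 0;
    pid_t ret;
    int ok;

    assert(stack != NULL);
    /* The stack grows down on x86, so pass the top of the buffer. */
    ret = clone(child_fn, stack + SKETCH_STACK_SIZE,
                CLONE_PARENT_SETTID | SIGCHLD, NULL, &recorded);
    assert(ret > 0);

    /* The kernel stored the ID before returning, however fast the child died. */
    ok = (recorded == ret);

    waitpid(ret, NULL, 0);   /* reap the child */
    free(stack);
    return ok;
}
```

Because the ID lands in the parent's memory as part of the clone() call itself, a thread library can record the new thread in its own bookkeeping without an extra system call to close the race.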

Finally, the pthreads code has a couple of issues to deal with when threads die. The stack used by the thread must be deallocated - and the dying thread cannot do that itself. With enough system calls, pthreads handles that now, but thread exit should really be a lightweight event, and a system call-heavy solution defeats that purpose.

Much of the overhead can be eliminated if the thread library can be told about thread exit without the usual SIGCHLD signal - signals are expensive. The new pthreads code can do that with the futex mechanism - almost. It is still difficult to know, without a signal, when the thread has truly finished using its stack, so that said stack can be freed. If the stack gets freed before the thread is done with it, the result is a big mess and a new interest on the developer's part in Windows threading packages; this outcome needs to be avoided.

Ingo's first attempt to solve this problem was through the addition of an exit_free() system call, which would simply write a special value in the parent's address space to indicate that the stack could be freed. Linus, however, called this patch "too ugly to live." After some discussion, the solution that emerged was to add another clone() flag: CLONE_RELEASE_VM. If a thread is created with that flag, a word is set aside at the top of the thread's stack. When the thread releases its current virtual memory - by exiting, or by execing another program - that word is written with a flag value. The parent can see that value and know that the stack can be freed.
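The "parent watches a word, then frees the stack" pattern can be sketched on a current kernel. A loud assumption: CLONE_RELEASE_VM is the name used in the patch discussed here; the sketch instead uses CLONE_CHILD_CLEARTID from clone(2), which clears a caller-chosen word and issues a futex wake once the thread has let go of its memory. It illustrates the idea and is not the patch's interface. The names tid_word, thread_ran, thread_fn and run_and_reclaim_stack are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_STACK_SIZE (64 * 1024)   /* arbitrary size for the demo */

static volatile pid_t tid_word;    /* the kernel zeroes this at thread exit */
static volatile int thread_ran;

static int thread_fn(void *arg)
{
    (void)arg;
    /* No libc calls here: a raw clone() thread shares the parent's TLS. */
    thread_ran = 1;
    return 0;
}

/* Returns 1 if the thread ran and its stack was freed without any signal. */
static int run_and_reclaim_stack(void)
{
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
    void *stack = mmap(NULL, SKETCH_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    pid_t seen;
    int ret;

    assert(stack != MAP_FAILED);
    tid_word = 0;
    thread_ran = 0;

    /* tid_word is set to the new thread's ID now and cleared when it exits. */
    ret = clone(thread_fn, (char *)stack + SKETCH_STACK_SIZE, flags, NULL,
                &tid_word, NULL, &tid_word);
    assert(ret > 0);

    /* Sleep until the word reads zero; only then is the stack unused. */
    while ((seen = tid_word) != 0)
        syscall(SYS_futex, &tid_word, FUTEX_WAIT, seen, NULL, NULL, 0);

    return munmap(stack, SKETCH_STACK_SIZE) == 0 && thread_ran;
}
```

The parent never handles SIGCHLD and never polls: it blocks in a futex wait and wakes once the kernel has cleared the word, at which point unmapping the stack is safe.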

Finally, Ingo has posted yet another patch implementing the CLONE_DETACHED flag. If a thread is created with that flag, no signal is sent to the parent process when the thread exits. This solution is faster than having the parent simply ignore SIGCHLD, and also does not require the parent to do without notification for all of its children.

The other half of all this work, of course, is a new pthreads library that actually uses all of these new features. The code is in progress and will be part of a future glibc release. Then, maybe, people will stop complaining about thread support in Linux.



Making Linux safe for pthreads

Posted Aug 15, 2002 6:13 UTC (Thu) by IkeTo (subscriber, #2122) [Link]

I might be mistaken, but I really think that all this work cannot make it affordable for a single process to create, say, 1000 threads. My worry is that a thread is not cheap to the kernel at all. Each thread requires its own task_struct, which means one page of memory. Creating 1000 threads in a program means that 4M of space will be needed to hold these task_structs. And these pages are in the kernel, i.e., they cannot be put into swap space even if all 1000 threads are sleeping. The old wisdom is that to support many threads, one needs user-mode threads that map onto threads within the kernel, which is not what the pthread library provides. Of course, the kernel patches improve the efficiency of thread creation and destruction. But does that mean that creating 1000 kernel threads in a process is no longer really broken behaviour?

Making Linux safe for pthreads

Posted Aug 15, 2002 10:06 UTC (Thu) by stevelinton (guest, #3274) [Link]

OK, so 1000 threads consumed 4MB of unswappable kernel memory. How is this a problem? Maybe on some embedded systems, but even a palmtop usually has 64MB today, and a laptop at least 128MB. Anyway 1000 threads probably means a high-end server task, and no serious server today has less than 1GB of RAM.

Making Linux safe for pthreads

Posted Aug 16, 2002 20:41 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I think 1000 threads implies a high-end server only because the threads are so expensive. If they were cheaper, we could use that mechanism on old cheap computers for simple things too. There are LOTS of old machines with less than 64MB of memory that it would be nice to be able to use. (My primary machine, which I use for mail, web browsing, compiling, database, and other routine computing, has 40MB.)

There remains the irony that it takes 8K of state information and stack space to do with a Linux thread what would take only a few words to do in a non-thread alternative. Hence the idea that there must be some waste there that can be reclaimed.

In defense of thousands of threads

Posted Aug 15, 2002 16:13 UTC (Thu) by bhurt (guest, #3281) [Link]

The problem with user-space threads is that when one thread blocks, for whatever reason, all threads block. For library calls (read, write, etc.) this can be worked around with some difficulty (does Linux support asynchronous I/O yet?). For page faults, the only workaround is to spawn more threads than CPUs - that way, when one thread blocks due to a page fault, another thread can run.

If you assume a couple dozen CPUs, each needing a couple dozen threads to make sure there is always at least one thread that is runnable, you get into thousands of threads really quickly.

Brian

In defense of thousands of threads

Posted Aug 19, 2002 11:21 UTC (Mon) by shane (subscriber, #3335) [Link]

IBM is currently working on the Next-Generation POSIX Threads effort, which seems to combine the best of both worlds. It creates multiple kernel threads (by default based on the number of CPUs), but not one per user thread. In this way, the kernel doesn't waste time multitasking thousands of kernel threads, yet not all threads block if one is busy hogging the context in a computation loop, for instance.

Check out the home page:

http://www-124.ibm.com/pthreads/

BTW, it's based on GNU Pth.

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds