User: Password:
|
|
Subscribe / Log in / New account

Leading items

The state of KDE4

By Jake Edge
October 31, 2007

Kicking a software release out the door is always a chaotic time; the developers never feel like the software is ready while project management is trying to put a box around something that can be shipped. KDE 4, which is the first major K Desktop Environment release since KDE 3.5 in 2005, is currently in this stage. Questions are being asked about changing deadlines, beta versus alpha quality software, whether a stable release can be delivered on time, and so on. The questions reflect an underlying concern for the quality of the final release.

Part of the problem, it seems, is that earlier beta versions of KDE 4 did not get widespread use. There were enough fundamental problems that users, even those expecting a rough ride, were unable to get things going enough to start filing bugs. Torsten Rahn reported on some feedback he received at a show:

Almost all visitors who had been long-time KDE enthusiasts and were known for early feedback in the release cycle had completely failed to participate in our Beta program so far (and I know some of these people since about 8 years). Most of them had tried to build the KDE betas, failed to see a desktop / panel appearing and therefore assumed that the apps were not worth testing at the current stage. This is rather alarming given that during a late Beta cycle the targeted beta testers should have moved from the developers to the enthusiasts.

It is a difficult problem, a beta must be usable enough for people to work with it. But if it worked correctly all the time, it would already be released. For KDE 4, part of the difficulty is that the platform and underlying libraries are in pretty good shape, "release candidate" worthy, but the workspace (Plasma) and applications have lagged. Those are things that users see first, of course; fundamental bugs will turn them away from testing further.

[KDE 4 desktop]

One project that is helping get more beta testers is the KDE Four Live CD which is an openSUSE-based live distribution with KDE 4 as the desktop. Beta testers will not have to build everything from source, nor have to install anything on their machines. Screenshots from the Beta 4 release version of the live CD accompany this article.

There were some discussions on the kde-core-devel mailing list regarding whether to call things release candidates, betas, or alphas, with some advocating recognizing the quality of the code and sticking with a beta label. In response, Aaron Seigo makes a good point about the difference between software releases in the volunteer-driven free software world as opposed to paid developers in more traditional development shops:

if it isn't painfully obvious by now, people remain silent and aren't sufficiently motivated to start the last push work until these things happen. interesting human behaviour, circumvented in many corporate [environments] by having managers with unquestioned rights push developers manually. in open source, a useful tool are hard deadlines, names for releases and pushing out tarballs that are a notch in time.

The distinction is not quite as stark as Seigo paints it, "lines in the sand" are important tools for any project, volunteer or otherwise. It is the tension between getting it right and getting it into the hands of users that cause these conflicts as releases approach. There are an awful lot of "I can deal with that later" problems that come due. Determining which are critical and which can be deferred further is difficult and error-prone. It is much easier for developers to decide that they are all critical and the release needs to slip.

Sticking to previously established schedules can help by forcing things out the door, even if they aren't quite ready. Betas and release candidates are meant to be used, regardless of what they are called – they are also expected to have bugs. The flip side, of course, is that if a release is unusable, it will be largely ignored. This is what the early betas of KDE 4 seem to have run into.

[KDE 4 applications]

In a casual test drive of the Beta 4 release, there were plenty of problems to be found, but, by and large, it worked. It doesn't seem ready to run as a day-to-day desktop, yet, but testers and others interested should be able to assist in finding and tracking down bugs. The existence of a live CD should help, as will integration with existing installations; imagine being able to boot the live CD and have it work with the existing users, preferences, filesystems, etc. on the underlying system. If a showstopping bug comes along, reboot to the installed OS and file a bug. Perhaps KDE Four Live does some of this integration with an installed openSUSE, but it did not with the Kubuntu installed on the test machine.

Conflicts in the development process often come to a head when a release is imminent, KDE 4 is hardly alone in seeing these kinds of disagreements. Software is difficult to develop and is even harder to release, opinions will differ on the best approach. In general, though, everyone has an interest in the same end result – a solid, working final release – keeping that in mind can help defuse things. The release of KDE 4 is planned for December 2007, we look forward to seeing it.

Comments (12 posted)

GNOME and OOXML

By Jonathan Corbet
October 30, 2007
The OOXML document standard being pushed by Microsoft has caused a certain amount of stress within both the development and commercial sides of the free software community. In some quarters it is seen as the latest attempt by a monopolistic firm to co-opt free software and the move to more free file formats; they would like to limit our involvement to opposition to the adoption of OOXML as a standard. Others see it as an attempt by Microsoft to come to terms with the demand for more open formats and to promote, in its own special way, interoperability. Few people really think we need this particular format, but many feel that, given that it will exist, it might as well be documented and, to the extent possible, its development should be encouraged to go in relatively useful directions.

This debate returned to the foreground recently with the publication of this open letter to the GNOME Foundation on a site named, for better or worse, "Fanatic Attack." This letter begins:

It appears that the Gnome Foundation is participating in ECMA TC 45 regarding resolving comments and contradictions for DIS 29500. Given the technical shortcomings in the specification and the disregard for process that the backers of DIS 29500 have displayed during the process, Gnome's participation in this activity is to the detriment of interoperability among office suits [sic].

The letter is long and strongly-worded, but it is rather short on information about just what the GNOME Foundation's participation in this process actually is. It turns out that the letter's author never asked that question, but LWN did. One answer can be found in this response posted by GNOME Foundation board member Jeff Waugh:

While Jody Goldberg (Gnumeric maintainer) was at Novell, he had been doing rocking work on TC45-M to make sure OOXML didn't just slip through, under-specified and uninvestigated. When Jody left Novell, the GNOME Foundation joined TC45-M to support his participation, so he could continue to "keep the bastards honest". OOXML is better documented as a result of his participation.

Participation in ECMA and implementation of OOXML do not indicate support for it as an ISO standard. There are plenty of other organisations with similar "political expectations" as GNOME involved in TC45-M, most likely for many of the same reasons.

There is also an explanation from Jody Goldberg posted on the Foundation's mailing list:

OOX is a file format that is in use, and we will have to interact with it. The opportunity to improve the spec and have MS answer questions and clarify necessary details should not be wasted.

It's worth noting that Mr. Goldberg does support the standardization of the OOXML format.

This episode has inspired a certain amount of complaint on the Foundation mailing list. The problem is not the participation in the committee, which appears to be relatively uncontroversial there, but the fact that this particular controversy was not anticipated and addressed ahead of time. Had the Foundation issued a press release at the outset explaining what it was doing, it would not have to be engaging in a damage control effort now. As it is, said press release appears to be under construction, but it will likely be less effective than it could have been.

In any case, this response will not satisfy everybody. There appears to be a fundamental difference of opinion in the community over how we should deal with the OOXML effort. While nobody seems to really like this standard (OK, almost nobody), not everybody dislikes it in the same way. To some, OOXML is characterized by patent problems, extreme complexity, opaque binary blobs, and the questionable tactics of its corporate backer. For those people, any engagement with the standardization process other than outright opposition is an unacceptable compromise. They see no good that can come from recognizing this standard in any way when we already have a standardized open document format which needs support.

On the other hand, the truth of the matter is that this format exists and is in use. Free software will end up supporting this format, not (just) because certain companies want to sell services into corporate environments, but because interoperability has always been a high priority in our community. If, some day, we as a community decree that we are strong enough that we do not need to support formats we don't like, we will have lost something important.

One should not overlook another important component in this situation: the fact that OpenDocument is not the final answer to document formats. Instead, it seems that the level of criticism of this format is growing, and that development of document formats will have to continue into the future. We do not, in other words, have all the answers in this area.

So, assuming that we, as a community, do intend to interoperate with the OOXML format, it makes sense to take advantage of the opportunities presented by the standardization process to ensure that the format is (1) not completely irrational, and (2) documented as completely as it can be. Participation in the process at this level has the potential to save a lot of work and interoperability hassles in the coming years.

Once upon a time, the free software community would have had no influence over a major manufacturer's file formats. We have succeeded in changing the world to the point where such formats are expected to be open, and where our comments on those formats have to be taken seriously. To refuse to wield that influence would, in essence, be a decision to go back to the days of the early 1990's, when our thoughts were mostly confined to a few small mailing lists and went generally unheard. That would not be a step forward for our community.

That said, participation in groups like standards bodies should be done the way we do almost everything else: in full openness. The GNOME Foundation exists to represent the community of GNOME developers, many of whom, it seems, were unaware that the Foundation was representing them in this particular forum. What form this representation has taken, and what has been accomplished by it, is still somewhat unclear; this lack of transparency has made the recent flames possible. The GNOME Foundation board, presumably, knows what positions are being taken by its representative on the ECMA committee; it would behoove that board to be more active in communicating that information to the Foundation's members.

Comments (29 posted)

Memory part 6: More things programmers can do

October 31, 2007

This article was contributed by Ulrich Drepper

[Editor's note: this is part 6 of Ulrich Drepper's "What every programmer should know about memory"; this part contains the second half of section 6, covering the optimization of multi-threaded code. The first half of this section was published in part 5; please see part 1 for pointers to the other sections.]

6.4 Multi-Thread Optimizations

When it comes to multi-threading, there are three different aspects of cache use which are important:

  • Concurrency

  • Atomicity

  • Bandwidth

These aspects also apply to multi-process situations but, because multiple processes are (mostly) independent, it is not so easy to optimize for them. The possible multi-process optimizations are a subset of those available for the multi-thread scenario. So we will deal exclusively with the latter here.

Concurrency in this context refers to the memory effects a process experiences when running more than one thread at a time. A property of threads is that they all share the same address space and, therefore, can all access the same memory. In the ideal case, the memory regions used by the threads are distinct, in which case those threads are coupled only lightly (common input and/or output, for instance). If more than one thread uses the same data, coordination is needed; this is when atomicity comes into play. Finally, depending on the machine architecture, the available memory and inter-processor bus bandwidth available to the processors is limited. We will handle these three aspects separately in the following sections—although they are, of course, closely linked.

6.4.1 Concurrency Optimizations

Initially, in this section, we will discuss two separate issues which actually require contradictory optimizations. A multi-threaded application uses common data in some of its threads. Normal cache optimization calls for keeping data together so that the footprint of the application is small, thus maximizing the amount of memory which fits into the caches at any one time.

There is a problem with this approach, though: if multiple threads write to a memory location, the cache line must be in ‘E’ (exclusive) state in the L1d of each respective core. This means that a lot of RFO requests are sent, in the worst case one for each write access. So a normal write will be suddenly very expensive. If the same memory location is used, synchronization is needed (maybe through the use of atomic operations, which is handled in the next section). The problem is also visible, though, when all the threads are using different memory locations and are supposedly independent.

Figure 6.10: Concurrent Cache Line Access Overhead

Figure 6.10 shows the results of this “false sharing”. The test program (shown in Section 9.3) creates a number of threads which do nothing but increment a memory location (500 million times). The measured time is from the program start until the program finishes after waiting for the last thread. The threads are pinned to individual processors. The machine has four P4 processors. The blue values represent runs where the memory allocations assigned to each thread are on separate cache lines. The red part is the penalty occurred when the locations for the threads are moved to just one cache line.

The blue measurements (when using individual cache lines) match what one would expect. The program scales without penalty to many threads. Each processor keeps its cache line in its own L1d and there are no bandwidth issues since not much code or data has to be read (in fact, it is all cached). The measured slight increase is really system noise and probably some prefetching effects (the threads use sequential cache lines).

The measured overhead, computed by dividing the time needed when using one cache line versus a separate cache line for each thread, is 390%, 734%, and 1,147% respectively. These large numbers might be surprising at first sight but, when thinking about the cache interaction needed, it should be obvious. The cache line is pulled from one processor's cache just after it has finished writing to the cache line. All processors, except the one which has the cache line at any given moment, are delayed and cannot do anything. Each additional processor will just cause more delays.

It is clear from these measurements that this scenario must be avoided in programs. Given the huge penalty, this problem is, in many situations, obvious (profiling will show the code location, at least) but there is a pitfall with modern hardware. Figure 6.11 shows the equivalent measurements when running the code on a single processor, quad core machine (Intel Core 2 QX 6700). Even with this processor's two separate L2s the test case does not show any scalability issues. There is a slight overhead when using the same cache line more than once but it does not increase with the number of cores. {I cannot explain the lower number when all four cores are used but it is reproducible.} If more than one of these processors were used we would, of course, see results similar to those in Figure 6.10. Despite the increasing use of multi-core processors, many machines will continue to use multiple processors and, therefore, it is important to handle this scenario correctly, which might mean testing the code on real SMP machines.

Figure 6.11: Overhead, Quad Core

There is a very simple “fix” for the problem: put every variable on its own cache line. This is where the conflict with the previously mentioned optimization comes into play, specifically, the footprint of the application would increase a lot. This is not acceptable; it is therefore necessary to come up with a more intelligent solution.

What is needed is to identify which variables are used by only one thread at a time, those used by only one thread ever, and maybe those which are contested at times. Different solutions for each of these scenarios are possible and useful. The most basic criterion for the differentiation of variables is: are they ever written to and how often does this happen.

Variables which are never written to and those which are only initialized once are basically constants. Since RFO requests are only needed for write operations, constants can be shared in the cache (‘S’ state). So, these variables do not have to be treated specially; grouping them together is fine. If the programmer marks the variables correctly with const, the tool chain will move the variables away from the normal variables into the .rodata (read-only data) or .data.rel.ro (read-only after relocation) section {Sections, identified by their names are the atomic units containing code and data in an ELF file.} No other special action is required. If, for some reason, variables cannot be marked correctly with const, the programmer can influence their placement by assigning them to a special section.

When the linker constructs the final binary, it first appends the sections with the same name from all input files; those sections are then arranged in an order determined by the linker script. This means that, by moving all variables which are basically constant but are not marked as such into a special section, the programmer can group all of those variables together. There will not be a variable which is often written to between them. By aligning the first variable in that section appropriately, it is possible to guarantee that no false sharing happens. Assume this little example:

  int foo = 1;
  int bar __attribute__((section(".data.ro"))) = 2;
  int baz = 3;
  int xyzzy __attribute__((section(".data.ro"))) = 4;

If compiled, this input file defines four variables. The interesting part is that the variables foo and baz, and bar and xyzzy are grouped together respectively. Without the attribute definitions the compiler would allocate all four variables in the sequence in which they are defined in the source code the a section named .data. {This is not guaranteed by the ISO C standard but it is how gcc works.} With the code as-is the variables bar and xyzzy are placed in a section named .data.ro. The section name .data.ro is more or less arbitrary. A prefix of .data. guarantees that the GNU linker will place the section together with the other data sections.

The same technique can be applied to separate out variables which are mostly read but occasionally written. Simply choose a different section name. This separation seems to make sense in some cases like the Linux kernel.

If a variable is only ever used by one thread, there is another way to specify the variable. In this case it is possible and useful to use thread-local variables (see [mytls]). The C and C++ language in gcc allow variables to be defined as per-thread using the __thread keyword.

  int foo = 1;
  __thread int bar = 2;
  int baz = 3;
  __thread int xyzzy = 4;

The variables bar and xyzzy are not allocated in the normal data segment; instead each thread has its own separate area where such variables are stored. The variables can have static initializers. All thread-local variables are addressable by all other threads but, unless a thread passes a pointer to a thread-local variable to those other threads, there is no way the other threads can find that variable. Due to the variable being thread-local, false sharing is not a problem—unless the program artificially creates a problem. This solution is easy to set up (the compiler and linker do all the work), but it has its cost. When a thread is created, it has to spend some time on setting up the thread-local variables, which requires time and memory. In addition, addressing thread-local variables is usually more expensive than using global or automatic variables (see [mytls] for explanations of how the costs are minimized automatically, if possible).

One drawback of using thread-local storage (TLS) is that, if the use of the variable shifts over to another thread, the current value of the variable in the old thread is not available to new thread. Each thread's copy of the variable is distinct. Often this is not a problem at all and, if it is, the shift over to the new thread needs coordination, at which time the current value can be copied.

A second, bigger problem is possible waste of resources. If only one thread ever uses the variable at any one time, all threads have to pay a price in terms of memory. If a thread does not use any TLS variables, the lazy allocation of the TLS memory area prevents this from being a problem (except for TLS in the application itself). If a thread uses just one TLS variable in a DSO, the memory for all the other TLS variables in this object will be allocated, too. This could potentially add up if TLS variables are used on a large scale.

In general the best advice which can be given is

  1. Separate at least read-only (after initialization) and read-write variables. Maybe extend this separation to read-mostly variables as a third category.

  2. Group read-write variables which are used together into a structure. Using a structure is the only way to ensure the memory locations for all of those variables are close together in a way which is translated consistently by all gcc versions..

  3. Move read-write variables which are often written to by different threads onto their own cache line. This might mean adding padding at the end to fill a remainder of the cache line. If combined with step 2, this is often not really wasteful. Extending the example above, we might end up with code as follows (assuming bar and xyzzy are meant to be used together):

      int foo = 1;
      int baz = 3;
      struct {
        struct al1 {
          int bar;
          int xyzzy;
        };
        char pad[CLSIZE - sizeof(struct al1)];
      } rwstruct __attribute__((aligned(CLSIZE))) =
        { { .bar = 2, .xyzzy = 4 } };
    

    Some code changes are needed (references to bar have to be replaced with rwstruct.bar, likewise for xyzzy) but that is all. The compiler and linker do all the rest. {This code has to be compiled with -fms-extensions} on the command line.}

  4. If a variable is used by multiple threads, but every use is independent, move the variable into TLS.

6.4.2 Atomicity Optimizations

If multiple threads modify the same memory location concurrently, processors do not guarantee any specific result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.

To assure this does not happen, processors provide atomic operations. These atomic operations would, for instance, not read the old value until it is clear that the addition could be performed in a way that the addition to the memory location appears as atomic. In addition to waiting for other cores and processors, some processors even signal atomic operations for specific addresses to other devices on the motherboard. All this makes atomic operations slower.

Processor vendors decided to provide different sets of atomic operations. Early RISC processors, in line with the ‘R’ for reduced, provided very few atomic operations, sometimes only an atomic bit set and test. {HP Parisc still does not provide more…} At the other end of the spectrum, we have x86 and x86-64 which provide a large number of atomic operations. The generally available atomic operations can be categorized in four classes:

Bit Test
These operations set or clear a bit atomically and return a status indicating whether the bit was set before or not.

Load Lock/Store Conditional (LL/SC)
{Some people use “linked” instead of “lock”, it is all the same.}

These operations work as a pair where the special load instruction is used to start an transaction and the final store will only succeed if the location has not been modified in the meantime. The store operation indicates success or failure, so the program can repeat its efforts if necessary.

Compare-and-Swap (CAS)
This is a ternary operation which writes a value provided as a parameter into an address (the second parameter) only if the current value is the same as the third parameter value;

Atomic Arithmetic
These operations are only available on x86 and x86-64, which can perform arithmetic and logic operations on memory locations. These processors have support for non-atomic versions of these operations but RISC architectures do not. So it is no wonder that their availability is limited.

An architecture supports either the LL/SC or the CAS instruction, not both. Both approaches are basically equivalent; they allow the implementation of atomic arithmetic operations equally well, but CAS seems to be the preferred method these days. All other operations can be indirectly implemented using it. For instance, an atomic addition:

  int curval;
  int newval;
  do {
    curval = var;
    newval = curval + addend;
  } while (CAS(&var, curval, newval));

The result of the CAS call indicates whether the operation succeeded or not. If it returns failure (non-zero value), the loop is run again, the addition is performed, and the CAS call is tried again. This repeats until it is successful. Noteworthy about the code is that the address of the memory location has to be computed in two separate instructions. {The CAS opcode on x86 and x86-64 can avoid the load of the value in the second and later iterations but, on this platform, we can write the atomic addition in a simpler way, with a single addition opcode.} For LL/SC the code looks about the same.

  int curval;
  int newval;
  do {
    curval = LL(var);
    newval = curval + addend;
  } while (SC(var, newval));

Here we have to use a special load instruction (LL) and we do not have to pass the current value of the memory location to SC since the processor knows if the memory location has been modified in the meantime.

The big differentiators are x86 and x86-64 where we have the atomic operations and, here, it is important to select the proper atomic operation to achieve the best result. Figure 6.12 shows three different ways to implement an atomic increment operation.

    for (i = 0; i < N; ++i)
      __sync_add_and_fetch(&var,1);

1. Add and Read Result

    for (i = 0; i < N; ++i)
      __sync_fetch_and_add(&var,1);

2. Add and Return Old Value

    for (i = 0; i < N; ++i) {
      long v, n;
      do {
        v = var;
        n = v + 1;
      } while (!__sync_bool_compare_and_swap(&var, v, n));
    }

3. Atomic Replace with New Value

Figure 6.12: Atomic Increment in a Loop

All three produce different code on x86 and x86-64 while the code might be identical on other architectures. There are huge performance differences. The following table shows the execution time for 1 million increments by four concurrent threads. The code uses the built-in primitives of gcc (__sync_*).

1. Exchange Add2. Add Fetch3. CAS
0.23s0.21s 0.73s

The first two numbers are similar; we see that returning the old value is a little bit faster. The important piece of information is the highlighted field, the cost when using CAS. It is, unsurprisingly, a lot more expensive. There are several reasons for this: 1. there are two memory operations, 2. the CAS operation by itself is more complicated and requires even conditional operation, and 3. the whole operation has to be done in a loop in case two concurrent accesses cause a CAS call to fail.

Now a reader might ask a question: why would somebody use the complicated and longer code which utilizes CAS? The answer to this is: the complexity is usually hidden. As mentioned before, CAS is currently the unifying atomic operation across all interesting architectures. So some people think it is sufficient to define all atomic operations in terms of CAS. This makes programs simpler. But as the numbers show, the results can be everything but optimal. The memory handling overhead of the CAS solution is huge. The following illustrates the execution of just two threads, each on its own core.

Thread #1Thread #2var Cache State
v = var ‘E’ on Proc 1
n = v + 1v = var‘S’ on Proc 1+2
CAS(var)n = v + 1‘E’ on Proc 1
CAS(var)‘E’ on Proc 2

We see that, within this short period of execution, the cache line status changes at least three times; two of the changes are RFOs. Additionally, the second CAS will fail, so that thread has to repeat the whole operation. During that operation the same can happen again.

In contrast, when the atomic arithmetic operations are used, the processor can keep the load and store operations needed to perform the addition (or whatever) together. It can ensure that concurrently-issued cache line requests are blocked until the atomic operation is done. Each loop iteration in the example therefore results in, at most, one RFO cache request and nothing else.

What all this means is that it is crucial to define the machine abstraction at a level at which atomic arithmetic and logic operations can be utilized. CAS should not be universally used as the unification mechanism.

For most processors, the atomic operations are, by themselves, always atomic. One can avoid them only by providing completely separate code paths for the case when atomicity is not needed. This means more code, a conditional, and further jumps to direct execution appropriately.

For x86 and x86-64 the situation is different: the same instructions can be used in both atomic and non-atomic ways. To make them atomic, a special prefix for the instruction is used: the lock prefix. This opens the door for atomic operations to avoid the high costs if the atomicity requirement in a given situation is not needed. Generic code in libraries, for example, which always has to be thread-safe if needed, can benefit from this. No information is needed when writing the code, the decision can be made at runtime. The trick is to jump over the lock prefix. This trick applies to all the instructions which the x86 and x86-64 processor allow to prefix with lock.

      cmpl $0, multiple_threads
      je   1f
      lock
  1:  add  $1, some_var

If this assembler code appears cryptic, do not worry, it is simple. The first instruction checks whether a variable is zero or not. Nonzero in this case indicates that more than one thread is running. If the value is zero, the second instruction jumps to label 1. Otherwise, the next instruction is executed. This is the tricky part. If the je instruction does not jump, the add instruction is executed with the lock prefix. Otherwise it is executed without the lock prefix.

Adding a relatively expensive operation like a conditional jump (expensive in case the branch prediction fails) seems to be counter productive. Indeed it can be: if multiple threads are running most of the time, the performance is further decreased, especially if the branch prediction is not correct. But if there are many situations where only one thread is in use, the code is significantly faster. The alternative of using an if-then-else construct introduces an additional unconditional jump in both cases which can be slower. Given that an atomic operation costs on the order of 200 cycles, the cross-over point for using the trick (or the if-then-else block) is pretty low. This is definitely a technique to be kept in mind. Unfortunately this means gcc's __sync_* primitives cannot be used.

6.4.3 Bandwidth Considerations

When many threads are used, and they do not cause cache contention by using the same cache lines on different cores, there still are potential problems. Each processor has a maximum bandwidth to the memory which is shared by all cores and hyper-threads on that processor. Depending on the machine architecture (e.g., the one in Figure 2.1), multiple processors might share the same bus to memory or the Northbridge.

The processor cores themselves run at frequencies where, at full speed, even in perfect conditions, the connection to the memory cannot fulfill all load and store requests without waiting. Now, further divide the available bandwidth by the number of cores, hyper-threads, and processors sharing a connection to the Northbridge and suddenly parallelism becomes a big problem. Programs which are, in theory, very efficient may be limited by the memory bandwidth.

In Figure 3.32 we have seen that increasing the FSB speed of a processor can help a lot. This is why, with growing numbers of cores on a processor, we will also see an increase in the FSB speed. Still, this will never be enough if the program uses large working sets and it is sufficiently optimized. Programmers have to be prepared to recognize problems due to limited bandwidth.

The performance measurement counters of modern processors allow the observation of FSB contention. On Core 2 processors the NUS_BNR_DRV event counts the number of cycles a core has to wait because the bus is not ready. This indicates that the bus is highly used and loads from or stores to main memory take even longer than usual. The Core 2 processors support more events which can count specific bus actions like RFOs or the general FSB utilization. The latter might come in handy when investigating the possibility of scalability of an application during development. If the bus utilization rate is already close to 1.0 then the scalability opportunities are minimal.

If a bandwidth problem is recognized, there are several things which can be done. They are sometimes contradictory so some experimentation might be necessary. One solution is to buy faster computers, if there are some available. Getting more FSB speed, faster RAM modules, and possibly memory local to the processor, can—and probably will—help. It can cost a lot, though. If the program in question is only needed on one (or a few machines) the one-time expense for the hardware might cost less than reworking the program. In general, though, it is better to work on the program.

After optimizing the program itself to avoid cache misses, the only option left to achieve better bandwidth utilization is to place the threads better on the available cores. By default, the scheduler in the kernel will assign a thread to a processor according to its own policy. Moving a thread from one core to another is avoided when possible. The scheduler does not really know anything about the workload, though. It can gather information from cache misses etc but this is not much help in many situations.

Figure 6.13: Inefficient Scheduling

One situation which can cause big FSB usage is when two threads are scheduled on different processors (or cores which do not share a cache) and they use the same data set. Figure 6.13 shows such a situation. Core 1 and 3 access the same data (indicated by the same color for the access indicator and the memory area). Similarly core 2 and 4 access the same data. But the threads are scheduled on different processors. This means each data set has to be read twice from memory. This situation can be handled better.

Figure 6.14: Efficient Scheduling

In Figure 6.14 we see how it should ideally look like. Now the total cache size in use is reduced since now core 1 and 2 and core 3 and 4 work on the same data. The data sets have to be read from memory only once.

This is a simple example but, by extension, it applies to many situations. As mentioned before, the scheduler in the kernel has no insight into the use of data, so the programmer has to ensure that scheduling is done efficiently. There are not many kernel interfaces available to communicate this requirement. In fact, there is only one: defining thread affinity.

Thread affinity means assigning a thread to one or more cores. The scheduler will then choose among those cores (only) when deciding where to run the thread. Even if other cores are idle they will not be considered. This might sound like a disadvantage, but it is the price one has to pay. If too many threads exclusively run on a set of cores the remaining cores might mostly be idle and there is nothing one can do except change the affinity. By default threads can run on any core.

There are a number of interfaces to query and change the affinity of a thread:

#define _GNU_SOURCE
#include <sched.h>

int sched_setaffinity(pid_t pid, size_t size, const cpu_set_t *cpuset);
int sched_getaffinity(pid_t pid, size_t size, cpu_set_t *cpuset);

These two interfaces are meant to be used for single-threaded code. The pid argument specifies which process's affinity should be changed or determined. The caller obviously needs appropriate privileges to do this. The second and third parameter specify the bitmask for the cores. The first function requires the bitmask to be filled in so that it can set the affinity. The second fills in the bitmask with the scheduling information of the selected thread. The interfaces are declared in <sched.h>.

The cpu_set_t type is also defined in that header, along with a number of macros to manipulate and use objects of this type.

#define _GNU_SOURCE
#include <sched.h>

#define CPU_SETSIZE
#define CPU_SET(cpu, cpusetp)
#define CPU_CLR(cpu, cpusetp)
#define CPU_ZERO(cpusetp)
#define CPU_ISSET(cpu, cpusetp)
#define CPU_COUNT(cpusetp)

CPU_SETSIZE specifies how many CPUs can be represented in the data structure. The other three macros manipulate cpu_set_t objects. To initialize an object CPU_ZERO should be used; the other two macros should be used to select or deselect individual cores. CPU_ISSET tests whether a specific processor is part of the set. CPU_COUNT returns the number of cores selected in the set. The cpu_set_t type provide a reasonable default value for the upper limit on the number of CPUs. Over time it certainly will prove too small; at that point the type will be adjusted. This means programs always have to keep the size in mind. The above convenience macros implicitly handle the size according to the definition of cpu_set_t. If more dynamic size handling is needed an extended set of macros should be used:

#define _GNU_SOURCE
#include <sched.h>

#define CPU_SET_S(cpu, setsize, cpusetp)
#define CPU_CLR_S(cpu, setsize, cpusetp)
#define CPU_ZERO_S(setsize, cpusetp)
#define CPU_ISSET_S(cpu, setsize, cpusetp)
#define CPU_COUNT_S(setsize, cpusetp)

These interfaces take an additional parameter with the size. To be able to allocate dynamically sized CPU sets three macros are provided:

#define _GNU_SOURCE
#include <sched.h>

#define CPU_ALLOC_SIZE(count)
#define CPU_ALLOC(count)
#define CPU_FREE(cpuset)

The CPU_ALLOC_SIZE macro returns the number of bytes which have to be allocated for a cpu_set_t structure which can handle count CPUs. To allocate such a block the CPU_ALLOC macro can be used. The memory allocated this way should be freed with CPU_FREE. The functions will likely use malloc and free behind the scenes but this does not necessarily have to remain this way.

Finally, a number of operations on CPU set objects are defined:

#define _GNU_SOURCE
#include <sched.h>

#define CPU_EQUAL(cpuset1, cpuset2)
#define CPU_AND(destset, cpuset1, cpuset2)
#define CPU_OR(destset, cpuset1, cpuset2)
#define CPU_XOR(destset, cpuset1, cpuset2)
#define CPU_EQUAL_S(setsize, cpuset1, cpuset2)
#define CPU_AND_S(setsize, destset, cpuset1, cpuset2)
#define CPU_OR_S(setsize, destset, cpuset1, cpuset2)
#define CPU_XOR_S(setsize, destset, cpuset1, cpuset2)

These two sets of four macros can check two sets for equality and perform logical AND, OR, and XOR operations on sets. These operations come in handy when using some of the libNUMA functions (see Section 12).

A process can determine on which processor it is currently running using the sched_getcpu interface:

#define _GNU_SOURCE
#include <sched.h>
int sched_getcpu(void);

The result is the index of the CPU in the CPU set. Due to the nature of scheduling this number cannot always be 100% correct. The thread might have been moved to a different CPU between the time the result was returned and when the thread returns to userlevel. Programs always have to take this possibility of inaccuracy into account. More important is, in any case, the set of CPUs the thread is allowed to run on. This set can be retrieved using sched_getaffinity. The set is inherited by child threads and processes. Threads cannot rely on the set to be stable over the lifetime. The affinity mask can be set from the outside (see the pid parameter in the prototypes above); Linux also supports CPU hot-plugging which means CPUs can vanish from the system—and, therefore, also from the affinity CPU set.

In multi-threaded programs, the individual threads officially have no process ID as defined by POSIX and, therefore, the two functions above cannot be used. Instead <pthread.h> declares four different interfaces:

#define _GNU_SOURCE
#include <pthread.h>

int pthread_setaffinity_np(pthread_t th, size_t size,
                           const cpu_set_t *cpuset);
int pthread_getaffinity_np(pthread_t th, size_t size, cpu_set_t *cpuset);
int pthread_attr_setaffinity_np(pthread_attr_t *at,
                                size_t size, const cpu_set_t *cpuset);
int pthread_attr_getaffinity_np(pthread_attr_t *at, size_t size, 
                                cpu_set_t *cpuset);

The first two interfaces are basically equivalent to the two we have already seen, except that they take a thread handle in the first parameter instead of a process ID. This allows addressing individual threads in a process. It also means that these interfaces cannot be used from another process, they are strictly for intra-process use. The third and fourth interfaces use a thread attribute. These attributes are used when creating a new thread. By setting the attribute, a thread can be scheduled from the start on a specific set of CPUs. Selecting the target processors this early—instead of after the thread already started—can be of advantage on many different levels, including (and especially) memory allocation (see NUMA in Section 6.5).

Speaking of NUMA, the affinity interfaces play a big role in NUMA programming, too. We will come back to that case shortly.

So far, we have talked about the case where the working set of two threads overlaps such that having both threads on the same core makes sense. The opposite can be true, too. If two threads work on separate data sets, having them scheduled on the same core can be a problem. Both threads fight for the same cache, thereby reducing each others effective use of the cache. Second, both data sets have to be loaded into the same cache; in effect this increases the amount of data that has to be loaded and, therefore, the available bandwidth is cut in half.

The solution in this case is to set the affinity of the threads so that they cannot be scheduled on the same core. This is the opposite from the previous situation, so it is important to understand the situation one tries to optimize before making any changes.

Optimizing for cache sharing to optimize bandwidth is in reality an aspect of NUMA programming which is covered in the next section. One only has to extend the notion of “memory” to the caches. This will become ever more important once the number of levels of cache increases. For this reason, the solution to multi-core scheduling is available in the NUMA support library. See the code samples in Section 12 for ways to determine the affinity masks without hardcoding system details or diving into the depth of the /sys filesystem.

6.5 NUMA Programming

For NUMA programming everything said so far about cache optimizations applies as well. The differences only start below that level. NUMA introduces different costs when accessing different parts of the address space. With uniform memory access we can optimize to minimize page faults (see Section 7.5) but that is about it. All pages are created equal.

NUMA changes this. Access costs can depend on the page which is accessed. Differing access costs also increase the importance of optimizing for memory page locality. NUMA is inevitable for most SMP machines since both Intel with CSI (for x86,x86-64, and IA-64) and AMD (for Opteron) use it. With an increasing number of cores per processor we are likely to see a sharp reduction of SMP systems being used (at least outside data centers and offices of people with terribly high CPU usage requirements). Most home machines will be fine with just one processor and hence no NUMA issues. But this a) does not mean programmers can ignore NUMA and b) it does not mean there are not related issues.

If one thinks about generalizations to NUMA one quickly realizes the concept extends to processor caches as well. Two threads on cores using the same cache will collaborate faster than threads on cores not sharing a cache. This is not a fabricated case:

  • early dual-core processors had no L2 sharing.

  • Intel's Core 2 QX 6700 and QX 6800 quad core chips, for instance, have two separate L2 caches.

  • as speculated early, with more cores on a chip and the desire to unify caches, we will have more levels of caches.

Caches form their own hierarchy, and placement of threads on cores becomes important for sharing (or not) of caches. This is not very different from the problems NUMA is facing and, therefore, the two concepts can be unified. Even people only interested in non-SMP machines should therefore read this section.

In Section 5.3 we have seen that the Linux kernel provides a lot of information which is useful—and needed—in NUMA programming. Collecting this information is not that easy, though. The currently available NUMA library on Linux is wholly inadequate for this purpose. A much more suitable version is currently under construction by the author.

The existing NUMA library, libnuma, part of the numactl package, provides no access to system architecture information. It is only a wrapper around the available system calls together with some convenience interfaces for commonly used operations. The system calls available on Linux today are:

mbind
Select binding for specified memory pages to nodes.

set_mempolicy
Set the default memory binding policy.

get_mempolicy
Get the default memory binding policy.

migrate_pages
Migrate all pages of a process on a given set of nodes to a different set of nodes.

move_pages
Move selected pages to given node or request node information about pages.

These interfaces are declared in <numaif.h> which comes along with the libnuma library. Before we go into more details we have to understand the concept of memory policies.

6.5.1 Memory Policy

The idea behind defining a memory policy is to allow existing code to work reasonably well in a NUMA environment without major modifications. The policy is inherited by child processes, which makes it possible to use the numactl tool. This tool can be used to, among other things, start a program with a given policy.

The Linux kernel supports the following policies:

MPOL_BIND
Memory is allocated only from the given set of nodes. If this is not possible allocation fails.

MPOL_PREFERRED
Memory is preferably allocated from the given set of nodes. If this fails memory from other nodes is considered.

MPOL_INTERLEAVE
Memory is allocated equally from the specified nodes. The node is selected either by the offset in the virtual memory region for VMA-based policies, or through a free-running counter for task-based policies.

MPOL_DEFAULT
Choose the allocation based on the default for the region.

This list seems to recursively define policies. This is half true. In fact, memory policies form a hierarchy (see Figure 6.15).

Figure 6.15: Memory Policy Hierarchy

If an address is covered by a VMA policy then this policy is used. A special kind of policy is used for shared memory segments. If no policy for the specific address is present, the task's policy is used. If this is also not present the system's default policy is used.

The system default is to allocate memory local to the thread requesting the memory. No task and VMA policies are provided by default. For a process with multiple threads the local node is the “home” node, the one which first ran the process. The system calls mentioned above can be used to select different policies.

6.5.2 Specifying Policies

The set_mempolicy call can be used to set the task policy for the current thread (task in kernel-speak). Only the current thread is affected, not the entire process.

#include <numaif.h>

long set_mempolicy(int mode, 
                   unsigned long *nodemask, 
		   unsigned long maxnode);

The mode parameter must be one of the MPOL_* constants introduced in the previous section. The nodemask parameter specifies the memory nodes to use and maxnode is the number of nodes (i.e., bits) in nodemask. If MPOL_DEFAULT is used the nodemask parameter is ignored. If a null pointer is passed as nodemask for MPOL_PREFERRED the local node is selected. Otherwise MPOL_PREFERRED uses the lowest node number with the corresponding bit set in nodemask.

Setting a policy does not have any effect on already-allocated memory. Pages are not automatically migrated; only future allocations are affected. Note the difference between memory allocation and address space reservation: an address space region established using mmap is usually not automatically allocated. The first read or write operation on the memory region will allocate the appropriate page. If the policy changes between accesses to different pages of the same address space region, or if the policy allows allocation of memory from different nodes, a seemingly uniform address space region might be scattered across many memory nodes.

6.5.3 Swapping and Policies

If physical memory runs out, the system has to drop clean pages and save dirty pages to swap. The Linux swap implementation discards node information when it writes pages to swap. That means when the page is reused and paged in the node which is used will be chosen from scratch. The policies for the thread will likely cause a node which is close to the executing processors to be chosen, but the node might be different from the one used before.

This changing association means that the node association cannot be stored by a program as a property of the page. The association can change over time. For pages which are shared with other processes this can also happen because a process asks for it (see the discussion of mbind below). The kernel by itself can migrate pages if one node runs out of space while other nodes still have free space.

Any node association the user-level code learns about can therefore be true for only a short time. It is more of a hint than absolute information. Whenever accurate knowledge is required the get_mempolicy interface should be used (see Section 6.5.5).

6.5.4 VMA Policy

To set the VMA policy for an address range a different interface has to be used:

#include <numaif.h>

long mbind(void *start, unsigned long len,
           int mode,
           unsigned long *nodemask,
           unsigned long maxnode,
           unsigned flags);

This interface registers a new VMA policy for the address range [start, start + len). Since memory handling operates on pages the start address must be page-aligned. The len value is rounded up to the next page size.

The mode parameter specifies, again, the policy; the values must be chosen from the list in Section 6.5.1. As with set_mempolicy, the nodemask parameter is only used for some policies. Its handling is identical.

The semantics of the mbind interface depends on the value of the flags parameter. By default, if flags is zero, the system call sets the VMA policy for the address range. Existing mappings are not affected. If this is not sufficient there are currently three flags to modify this behavior; they can be selected individually or together:

MPOL_MF_STRICT
The call to mbind will fail if not all pages are on the nodes specified by nodemask. In case this flag is used together with MPOL_MF_MOVE and/or MPOL_MF_MOVEALL the call will fail if any page cannot be moved.

MPOL_MF_MOVE
The kernel will try to move any page in the address range allocated on a node not in the set specified by nodemask. By default, only pages used exclusively by the current process's page tables are moved.

MPOL_MF_MOVEALL
Like MPOL_MF_MOVE but the kernel will try to move all pages, not just those used by the current process's page tables alone. This has system-wide implications since it influences the memory access of other processes—which are possibly not owned by the same user—as well. Therefore MPOL_MF_MOVEALL is a privileged operation (CAP_NICE capability is needed).

Note that support for MPOL_MF_MOVE and MPOL_MF_MOVEALL was added only in the 2.6.16 Linux kernel.

Calling mbind without any flags is most useful when the policy for a newly reserved address range has to be specified before any pages are actually allocated.

void *p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_ANON, -1, 0);
if (p != MAP_FAILED)
  mbind(p, len, mode, nodemask, maxnode, 0);

This code sequence reserve an address space range of len bytes and specifies that the policy mode referencing the memory nodes in nodemask should be used. Unless the MAP_POPULATE flag is used with mmap, no memory will have been allocated by the time of the mbind call and, therefore, the new policy applies to all pages in that address space region.

The MPOL_MF_STRICT flag alone can be used to determine whether any page in the address range described by the start and len parameters to mbind is allocated on nodes other than those specified by nodemask. No allocated pages are changed. If all pages are allocated on the specified nodes, the VMA policy for the address space region will be changed according to mode.

Sometimes the rebalancing of memory is needed, in which case it might be necessary to move pages allocated on one node to another node. Calling mbind with MPOL_MF_MOVE set makes a best effort to achieve that. Only pages which are solely referenced by the process's page table tree are considered for moving. There can be multiple users in the form of threads or other processes which share that part of the page table tree. It is not possible to affect other processes which happen to map the same data. These pages do not share the page table entries.

If both MPOL_MF_STRICT and MPOL_MF_MOVE are passed to mbind the kernel will try to move all pages which are not allocated on the specified nodes. If this is not possible the call will fail. Such a call might be useful to determine whether there is a node (or set of nodes) which can house all the pages. Several combinations can be tried in succession until a suitable node is found.

The use of MPOL_MF_MOVEALL is harder to justify unless running the current process is the main purpose of the computer. The reason is that even pages that appear in multiple page tables are moved. That can easily affect other processes in a negative way. This operation should thus be used with caution.

6.5.5 Querying Node Information

The get_mempolicy interface can be used to query a variety of facts about the state of NUMA for a given address.

#include <numaif.h>
long get_mempolicy(int *policy,
             const unsigned long *nmask,
             unsigned long maxnode,
             void *addr, int flags);

When get_mempolicy is called without a flag set in flags, the information about the policy for address addr is stored in the word pointed to by policy and in the bitmask for the nodes pointed to by nmask. If addr falls into an address space region for which a VMA policy has been specified, information about that policy is returned. Otherwise information about the task policy or, if necessary, system default policy will be returned.

If the MPOL_F_NODE flag is set in flags, and the policy governing addr is MPOL_INTERLEAVE, the value stored in the word pointed to by policy is the index of the node on which the next allocation is going to happen. This information can potentially be used to set the affinity of a thread which is going to work on the newly-allocated memory. This might be a less costly way to achieve proximity, especially if the thread has yet to be created.

The MPOL_F_ADDR flag can be used to retrieve yet another completely different data item. If this flag is used, the value stored in the word pointed to by policy is the index of the memory node on which the memory for the page containing addr has been allocated. This information can be used to make decisions about possible page migration, to decide which thread could work on the memory location most efficiently, and many more things.

The CPU—and therefore memory node—a thread is using is much more volatile than its memory allocations. Memory pages are, without explicit requests, only moved in extreme circumstances. A thread can be assigned to another CPU as the result of rebalancing the CPU loads. Information about the current CPU and node might therefore be short-lived. The scheduler will try to keep the thread on the same CPU, and possibly even on the same core, to minimize performance losses due to cold caches. This means it is useful to look at the current CPU and node information; one only must avoid assuming the association will not change.

libNUMA provides two interfaces to query the node information for a given virtual address space range:

#include <libNUMA.h>

int NUMA_mem_get_node_idx(void *addr);
int NUMA_mem_get_node_mask(void *addr,
                           size_t size,
                           size_t __destsize,
                           memnode_set_t *dest);

NUMA_mem_get_node_mask sets in dest the bits for all memory nodes on which the pages in the range [addr, addr+size) are (or would be) allocated, according to the governing policy. NUMA_mem_get_node only looks at the address addr and returns the index of the memory node on which this address is (or would be) allocated. These interfaces are simpler to use than get_mempolicy and probably should be preferred.

The CPU currently used by a thread can be queried using sched_getcpu (see Section 6.4.3). Using this information, a program can determine the memory node(s) which are local to the CPU using the NUMA_cpu_to_memnode interface from libNUMA:

#include <libNUMA.h>

int NUMA_cpu_to_memnode(size_t cpusetsize,
                        const cpu_set_t *cpuset,
                        size_t memnodesize,
                        memnode_set_t *
                        memnodeset);

A call to this function will set (in the memory node set pointed to by the fourth parameter) all the bits corresponding to memory nodes which are local to any of the CPUs in the set pointed to by the second parameter. Just like CPU information itself, this information is only correct until the configuration of the machine changes (for instance, CPUs get removed and added).

The bits in the memnode_set_t objects can be used in calls to the low-level functions like get_mempolicy. It is more convenient to use the other functions in libNUMA. The reverse mapping is available through:

#include <libNUMA.h>

int NUMA_memnode_to_cpu(size_t memnodesize,
                        const memnode_set_t *
                        memnodeset,
                        size_t cpusetsize,
                        cpu_set_t *cpuset);

The bits set in the resulting cpuset are those of the CPUs local to any of the memory nodes with corresponding bits set in memnodeset. For both interfaces, the programmer has to be aware that the information can change over time (especially with CPU hot-plugging). In many situations, a single bit is set in the input bit set, but it is also meaningful, for instance, to pass the entire set of CPUs retrieved by a call to sched_getaffinity to NUMA_cpu_to_memnode to determine which are the memory nodes the thread ever can have direct access to.

6.5.6 CPU and Node Sets

Adjusting code for SMP and NUMA environments by changing the code to use the interfaces described so far might be prohibitively expensive (or impossible) if the sources are not available. Additionally, the system administrator might want to impose restrictions on the resources a user and/or process can use. For these situations the Linux kernel supports so-called CPU sets. The name is a bit misleading since memory nodes are also covered. They also have nothing to do with the cpu_set_t data type.

The interface to CPU sets is, at the moment, a special filesystem. It is usually not mounted (so far at least). This can be changed with

     mount -t cpuset none /dev/cpuset

Of course the mount point /dev/cpuset must exist. The content of this directory is a description of the default (root) CPU set. It comprises initially all CPUs and all memory nodes. The cpus file in that directory shows the CPUs in the CPU set, the mems file the memory nodes, the tasks file the processes.

To create a new CPU set one simply creates a new directory somewhere in the hierarchy. The new CPU set will inherit all settings from the parent. Then the CPUs and memory nodes for new CPU set can be changed by writing the new values into the cpus and mems pseudo files in the new directory.

If a process belongs to a CPU set, the settings for the CPUs and memory nodes are used as masks for the affinity and memory policy bitmasks. That means the program cannot select any CPU in the affinity mask which is not in the cpus file for the CPU set the process is using (i.e., where it is listed in the tasks file). Similarly for the node masks for the memory policy and the mems file.

The program will not experience any errors unless the bitmasks are empty after the masking, so CPU sets are an almost-invisible means to control program execution. This method is especially efficient on large machines with lots of CPUs and/or memory nodes. Moving a process into a new CPU set is as simple as writing the process ID into the tasks file of the appropriate CPU set.

The directories for the CPU sets contain a number of other files which can be used to specify details like behavior under memory pressure and exclusive access to CPUs and memory nodes. The interested reader is referred to the file Documentation/cpusets.txt in the kernel source tree.

6.5.7 Explicit NUMA Optimizations

All the local memory and affinity rules cannot help out if all threads on all the nodes need access to the same memory regions. It is, of course, possible to simply restrict the number of threads to a number supportable by the processors which are directly connected to the memory node. This does not take advantage of SMP NUMA machines, though, and is therefore not a real option.

If the data in question is read-only there is a simple solution: replication. Each node can get its own copy of the data so that no inter-node accesses are necessary. Code to do this can look like this:

void *local_data(void) {
  static void *data[NNODES];
  int node =
    NUMA_memnode_self_current_idx();
  if (node == -1)
    /* Cannot get node, pick one.  */
    node = 0;
  if (data[node] == NULL)
    data[node] = allocate_data();
  return data[node];
}
void worker(void) {
  void *data = local_data();
  for (...)
    compute using data
}

In this code the function worker prepares by getting a pointer to the local copy of the data by a call to local_data. Then it proceeds with the loop, which uses this pointer. The local_data function keeps a list of the already allocated copies of the data around. Each system has a limited number of memory nodes, so the size of the array with the pointers to the per-node memory copies is limited in size. The NUMA_memnode_system_count function from libNUMA returns this number. If the pointer for the current node, as determined by the NUMA_memnode_self_current_idx call, is not yet known a new copy is allocated.

It is important to realize that nothing terrible happens if the threads get scheduled onto another CPU connected to a different memory node after the sched_getcpu system call. It just means that the accesses using the data variable in worker access memory on another memory node. This slows the program down until data is computed anew, but that is all. The kernel will always avoid gratuitous rebalancing of the per-CPU run queues. If such a transfer happens it is usually for a good reason and will not happen again for the near future.

Things are more complicated when the memory area in question is writable. Simple duplication will not work in this case. Depending on the exact situation there might a number of possible solutions.

For instance, if the writable memory region is used to accumulate results, it might be possible to first create a separate region for each memory node in which the results are accumulated. Then, when this work is done, all the per-node memory regions are combined to get the total result. This technique can work even if the work never really stops, but intermediate results are needed. The requirement for this approach is that the accumulation of a result is stateless, i.e., it does not depend on the previously collected results.

It will always be better, though, to have direct access to the writable memory region. If the number of accesses to the memory region is substantial, it might be a good idea to force the kernel to migrate the memory pages in question to the local node. If the number of accesses is really high, and the writes on different nodes do not happen concurrently, this could help. But be aware that the kernel cannot perform miracles: the page migration is a copy operation and as such it is not cheap. This cost has to be amortized.

6.5.8 Utilizing All Bandwidth

The numbers in Figure 5.4 show that access to remote memory when the caches are ineffective is not measurably slower than access to local memory. This means a program could possibly save bandwidth to the local memory by writing data it does not have to read again into memory attached to another processor. The bandwidth of the connection to the DRAM modules and the bandwidth of the interconnects are mostly independent, so parallel use could improve overall performance.

Whether this is really possible depends on many factors. One really has to be sure that caches are ineffective since otherwise the slowdown related to remote accesses is measurable. Another big problem is whether the remote node has any needs for its own memory bandwidth. This possibility must be examined in detail before the approach is taken. In theory, using all the bandwidth available to a processor can have positive effects. A family 10h Opteron processor can be directly connected to up to four other processors. Utilizing all that additional bandwidth, perhaps coupled with appropriate prefetches (especially prefetchw) could lead to improvements if the rest of the system plays along.

Comments (13 posted)

Page editor: Jonathan Corbet
Next page: Security>>


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds