
Leading items

The future of AppArmor

By Jake Edge
October 17, 2007

Late last month, Novell laid off the development team for the AppArmor security tool. AppArmor is widely deployed by SUSE Linux users to restrict programs from accessing things that they shouldn't. Novell intends to keep shipping AppArmor, while two other distributions are adding support for it, which makes this move a bit puzzling. Reasons are hard to come by when a "reduction in force" (a common euphemism for layoff) happens, but Novell did clearly indicate that they had no plans to stop using AppArmor as the "core security technology in SUSE Linux Enterprise."

When a project team is laid off, it is common for the team to lose interest in the project – go off to find other things to do – but that does not appear to be the case here. Some of the laid-off team members have formed Mercenary Linux to do AppArmor consulting. They intend to work with Novell and others to guide AppArmor through the kernel submission process, with the goal of getting merged into the mainline. There are some hurdles to clear before that can happen – if it does – but AppArmor does not have the look of a project being abandoned, at least yet.

AppArmor was originally a proprietary program, which Novell acquired in 2005 when they bought Immunix, the company that developed it. In January 2006, Novell released it under the GPL and in April of that year, submitted it as a patch for inclusion in the kernel. The reaction was rather unfavorable, with the main issue being the reliance on paths, rather than information stored in the filesystem inode, to determine security policy. The main advantage cited by AppArmor proponents is that it is much easier to understand and manage compared to SELinux, its main competitor in the Linux security module arena.

AppArmor is included in SUSE Linux and has become popular, so much so that both Mandriva and Ubuntu are shipping it in their next releases. Because of that, Crispin Cowan, founder of Immunix and former AppArmor team lead at Novell, guesses that "by early 2008 a majority of all Linux users will have AppArmor running on their desktop."

After letting the developers go, Novell has no plans to stop shipping AppArmor, according to Kevan Barney, senior public relations manager:

We remain committed to AppArmor as our application security solution inside SUSE Linux Enterprise. We have no plans to change to SELinux or another alternative technology, although we always reserve the right to evaluate market conditions to provide the maximum value to our customers.

AppArmor is shifting to an open source development model, where Novell will still be participating as part of the community. As Barney puts it:

[...] we partner with the community to provide a part of the innovation and testing efforts, which we complement with our own focused efforts and investments. Novell will continue its maintenance of the core kernel code and will continue in our efforts to move this upstream. We will also invest in key new features as driven by market need.

Cowan agrees that the project is moving away from a one-company model: "AppArmor is becoming a truly distributed open source project, and Mercenary Linux hopes to be the hub of that community." He and the other former team members who formed Mercenary Linux are poised to assist with AppArmor development:

We have an ongoing commitment to the community that we will work to fulfill - distribution vendors needing integration help, consulting firms looking for even better management tools, and bug fixes for the distributions that AppArmor is deployed in.

Both Novell and Mercenary will be pushing to get AppArmor into the kernel, with another patch submission from Novell expected soon. The impediments to getting those patches accepted are outlined by Cowan:

The barriers to acceptance are both technical and political. Technical is "the way you want to do something conflicts with the way I want to do something" and political is "... and mine is more important than yours" :-) An unfortunate resolution to that is a slugfest of whose really is more important, and an adroit solution is to find a way to achieve both that doesn't conflict. Developers at Novell and Mercenary are working on that latter path.

AppArmor provides some amount of protection against programs trying to access files or perform actions that they shouldn't. Just how much protection it provides is the subject of much debate. There are valid concerns that it papers over the complexities of securing Linux, providing a false sense of security, but it would appear that there is a clear path for it to be included in the kernel. After Linus Torvalds's recent pronouncement that the Linux Security Modules API would stay in the kernel, one potential barrier to AppArmor acceptance has fallen.

It remains to be seen if Novell, Mercenary, and the AppArmor community can work with the kernel hackers to resolve some outstanding issues. The path-based architecture of AppArmor, while contentious, is not likely to keep it out of the kernel. It has been a year and a half since the first submission, though; it will require a concerted effort to work through the process. With three distributions shipping it and minimal impact on those who do not enable it, it seems pretty unlikely that it will stay out forever.


A visit from the trolls

By Jonathan Corbet
October 15, 2007

We have been hearing the warnings for years: sooner or later, software patents were destined to be used against free software. When dire warnings are repeated over a long period of time, it can become easy to shrug them off and assume that nothing will ever really come of them. But complacency does not make the problem go away. And now we have, in the form of a lawsuit filed against Red Hat and Novell by IP Innovation LLC, a reminder that the software patent threat is real.

Three patents are named in the brief complaint [PDF]:

  • #5,072,412, "User interface with multiple workspaces for sharing display system objects". Filed in March, 1987.

  • #5,533,183, which has the same title. Filed in February, 1995.

  • #5,394,521, again with the same title. Filed in May, 1993.

As might be imagined, the three patents all read about the same. Those who are not afraid of patentese can get a feel for what has been patented by reading the first claim of #5,072,412 - one of the claims alleged to be violated by the defendants:

A system comprising:
  • a display;

  • first and second workspace data structures relating respectively to first and second workspaces that can be presented on the display; each of the first and second workspaces including a respective set of display objects; each of the display objects being perceptible as a distinct, coherent set of display features; the display objects of each respective set being perceptible as having spatial positions relative to each other when the respective workspace is presented on the display;

  • display object means for generating first and second display objects; the first workspace data structure being linked to the display object means so that the first display object is in the respective set of display objects of the first workspace; the second workspace data structure being linked to the display object means so that the second display object is in the respective set of display objects of the second workspace; and

  • control means for accessing the first workspace data structure to cause the display to present the first workspace including the first display object; the control means further being for accessing the second workspace data structure to cause the display to present the second workspace including the second display object; the display object means generating the first and second display objects so that the second display object is perceptible as the same tool as the first display object when the second workspace is presented after the first workspace.

This claim seems like a fairly straightforward description of a window manager which provides multiple virtual desktops. It does not take a whole lot of imagination to extend this reading to describe the behavior of two windows on the same desktop. Finding software within a Linux system which can be said to infringe upon these patents is probably not all that hard to do. Eliminating all code which could be said to infringe, instead, could be difficult indeed. (Bear in mind, though, that your editor is fortunate enough not to be a patent attorney; anybody needing a definitive interpretation of this patent should consult people who know what they are talking about).

The defense against this attack will require either (1) the location of sufficient prior art to invalidate the patents, or (2) an argument that, by the allegedly tightened definition of "obviousness" in the U.S., the technology patented is not sufficiently innovative. Red Hat and Novell have not shared their defensive strategy with the world, and they are unlikely to do so in the near future. We will almost certainly have to wait and see how they answer the charges in court.

As an alternative, the two companies could pay the troll in exchange for an agreement allowing the patented technology to be used in GPL-licensed software. Assuming an agreement could be reached, this approach would solve the immediate problem. But it would also encourage every other patent troll out there to head to court in search of a turn at the trough. It would be far better to defeat this attack if at all possible. Regardless of how this case plays out, though, we can be sure that it will not be the last. There is no shortage of software patents in the U.S. and no shortage of lawyers willing to turn them into lawsuits. The system encourages this sort of litigation.

For this reason, your editor feels that the current focus on finding links between this suit and Microsoft is misplaced. It may well be that Microsoft is lurking in the shadows somewhere, directing the entire operation. Your editor has no way of knowing. But there are a couple of things which should be kept in mind when trying to make that connection.

The first is that Microsoft's presence is in no way necessary to explain this series of events. Patent trolls are not in short supply, and neither are patent infringement lawsuits. It was a certainty that one of these trolls was going to turn its attention to free software companies sooner or later. IP Innovation, owned by well-known patent troll Acacia, is no stranger to this sort of litigation; it could have easily decided on this course of action on its own.

Second, it's not clear that this attack, at this time, is in Microsoft's interest. For all the talk of the safety provided by Novell's purchase of a patent non-license from Microsoft, Novell, too, has been sued. No users have been sued, but, should the plaintiffs decide to target Linux users, Novell's customers will be just as exposed as Red Hat's customers. Any other company which might be considering the purchase of a "covenant not to sue" from Microsoft need only look at this case to see that the covenant has not solved the problem: the company which bought the covenant is in the same position as the company which refused to do so. This attack can also only serve to clarify the problems with software patents in parts of the world which do not currently allow software to be patented.

In other words, this lawsuit has driven home the fact that, with regard to software patents in the U.S., Microsoft is not the problem. Microsoft's own experience on the receiving end of patent infringement lawsuits should also make that clear. Whether or not Microsoft is behind this suit, the real problem is the current software patent regime in the U.S. and the litigation-friendly environment which supports it. If Microsoft were to vanish tomorrow, the threat would not be reduced in any appreciable way. So putting the focus on Microsoft is a mistake; we have a much bigger problem than that.


Memory part 4: NUMA support

October 17, 2007

This article was contributed by Ulrich Drepper

[Editor's note: welcome to part 4 of Ulrich Drepper's "What every programmer should know about memory"; this section discusses the particular challenges associated with non-uniform memory access (NUMA) systems. Those who have not read part 1, part 2, and part 3 may wish to do so now. As always, please send typo reports and the like to lwn@lwn.net rather than posting them as comments here.]

5 NUMA Support

In Section 2 we saw that, on some machines, the cost of access to specific regions of physical memory differs depending on where the access originated. This type of hardware requires special care from the OS and the applications. We will start with a few details of NUMA hardware, then we will cover some of the support the Linux kernel provides for NUMA.

5.1 NUMA Hardware

Non-uniform memory architectures are becoming more and more common. In the simplest form of NUMA, a processor can have local memory (see Figure 2.3) which is cheaper to access than memory local to other processors. The difference in cost for this type of NUMA system is not high, i.e., the NUMA factor (the ratio of remote to local access cost) is low.

NUMA is also—and especially—used in big machines. We have described the problems of having many processors access the same memory. For commodity hardware all processors would share the same Northbridge (ignoring the AMD Opteron NUMA nodes for now, they have their own problems). This makes the Northbridge a severe bottleneck since all memory traffic is routed through it. Big machines can, of course, use custom hardware in place of the Northbridge but, unless the memory chips used have multiple ports—i.e. they can be used from multiple busses—there still is a bottleneck. Multiport RAM is complicated and expensive to build and support and, therefore, it is hardly ever used.

The next step up in complexity is the model AMD uses where an interconnect mechanism (HyperTransport in AMD's case, technology they licensed from Digital) provides access for processors which are not directly connected to the RAM. The size of the structures which can be formed this way is limited unless one wants to increase the diameter (i.e., the maximum distance between any two nodes) arbitrarily.

Figure 5.1: Hypercubes

An efficient topology for the nodes is the hypercube, which limits the number of nodes to 2^C where C is the number of interconnect interfaces each node has. Hypercubes have the smallest diameter for all systems with 2^n CPUs. Figure 5.1 shows the first three hypercubes. Each hypercube has a diameter of C which is the absolute minimum. AMD's first-generation Opteron processors have three HyperTransport links per processor. At least one of the processors has to have a Southbridge attached to one link, meaning, currently, that a hypercube with C=2 can be implemented directly and efficiently. The next generation is announced to have four links, at which point C=3 hypercubes will be possible.
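The hop count between two hypercube nodes is simply the number of bit positions in which their labels differ; a tiny illustrative C snippet (assembled for this text, not part of the original article) makes the diameter argument concrete:

    /* Illustrative only: hypercube nodes carry C-bit labels; two
     * nodes are directly connected iff their labels differ in
     * exactly one bit, so the number of hops between two nodes is
     * the number of differing bits. */
    #include <stdio.h>

    static int hops(unsigned a, unsigned b)
    {
        unsigned diff = a ^ b;
        int n = 0;
        while (diff != 0) {        /* count the set bits */
            n += diff & 1;
            diff >>= 1;
        }
        return n;
    }

    int main(void)
    {
        /* C=3: eight nodes; the diameter is 3, reached e.g.
           between nodes 0 (binary 000) and 7 (binary 111). */
        printf("hops(0,7) = %d\n", hops(0, 7));   /* prints 3 */
        printf("hops(2,6) = %d\n", hops(2, 6));   /* prints 1 */
        return 0;
    }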

This does not mean, though, that larger accumulations of processors cannot be supported. There are companies which have developed crossbars allowing larger sets of processors to be used (e.g., Newisys's Horus). But these crossbars increase the NUMA factor and they stop being effective at a certain number of processors.

The next step up means connecting groups of CPUs and implementing a shared memory for all of them. All such systems need specialized hardware and are by no means commodity systems. Such designs exist at several levels of complexity. A system which is still quite close to a commodity machine is IBM x445 and similar machines. They can be bought as ordinary 4U, 8-way machines with x86 and x86-64 processors. Two (at some point up to four) of these machines can then be connected to work as a single machine with shared memory. The interconnect used introduces a significant NUMA factor which the OS, as well as applications, must take into account.

At the other end of the spectrum, machines like SGI's Altix are designed specifically to be interconnected. SGI's NUMAlink interconnect fabric is very fast and has a low latency; both of these are requirements for high-performance computing (HPC), especially when Message Passing Interfaces (MPI) are used. The drawback is, of course, that such sophistication and specialization is very expensive. They make a reasonably low NUMA factor possible but with the number of CPUs these machines can have (several thousands) and the limited capacity of the interconnects, the NUMA factor is actually dynamic and can reach unacceptable levels depending on the workload.

More commonly used are solutions where clusters of commodity machines are connected using high-speed networking. But these are not NUMA machines; they do not implement a shared address space and therefore do not fall into any category which is discussed here.

5.2 OS Support for NUMA

To support NUMA machines, the OS has to take the distributed nature of the memory into account. For instance, if a process is run on a given processor, the physical RAM assigned to the process's address space should come from local memory. Otherwise each instruction has to access remote memory for code and data. There are special cases to be taken into account which are only present in NUMA machines. The text segment of DSOs is normally present exactly once in a machine's physical RAM. But if the DSO is used by processes and threads on all CPUs (for instance, the basic runtime libraries like libc) this means that all but a few processors have to have remote accesses. The OS ideally would “mirror” such DSOs into each processor's physical RAM and use local copies. This is an optimization, not a requirement, and generally hard to implement. It might not be supported or only in a limited fashion.

To avoid making the situation worse, the OS should not migrate a process or thread from one node to another. The OS should already try to avoid migrating processes on normal multi-processor machines because migrating from one processor to another means the cache content is lost. If load distribution requires migrating a process or thread off of a processor, the OS can usually pick an arbitrary new processor which has sufficient capacity left. In NUMA environments the selection of the new processor is a bit more limited. The newly selected processor should not have higher access costs to the memory the process is using than the old processor; this restricts the list of targets. If no free processor matching that criterion is available, the OS has no choice but to migrate to a processor where memory access is more expensive.

In this situation there are two possible ways forward. First, one can hope the situation is temporary and that the process can be migrated back to a better-suited processor. Alternatively, the OS can migrate the process's memory to physical pages which are closer to the newly-used processor. This is quite an expensive operation. Possibly huge amounts of memory have to be copied, albeit not necessarily in one step. While this is happening the process, at least briefly, has to be stopped so that modifications to the old pages are correctly migrated. There is a whole list of other requirements for page migration to be efficient and fast. In short, the OS should avoid it unless it is really necessary.
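Linux does expose this operation to user space through the migrate_pages(2) system call (added in 2.6.16; a wrapper is declared in libnuma's <numaif.h>, link with -lnuma). A hedged sketch, with arbitrary example node numbers that are not from the article:

    /* A sketch, not the article's code: ask the kernel to move the
     * calling process's pages from node 0 to node 1.  The call
     * returns the number of pages that could not be moved, or -1
     * on error. */
    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long from = 1UL << 0;   /* pages currently on node 0 */
        unsigned long to   = 1UL << 1;   /* ... should move to node 1 */

        /* pid 0 means the calling process; the second argument is
           the number of bits in the node masks. */
        if (migrate_pages(0, sizeof(from) * 8, &from, &to) < 0)
            perror("migrate_pages");
        return 0;
    }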

Generally, it cannot be assumed that all processes on a NUMA machine use the same amount of memory such that, with the distribution of processes across the processors, memory usage is also equally distributed. In fact, unless the applications running on the machines are very specific (common in the HPC world, but not outside of it), memory use will be very unequal. Some applications will use vast amounts of memory, others hardly any. This will, sooner or later, lead to problems if memory is always allocated local to the processor where the request originated. The system will eventually run out of memory local to nodes running large processes.

In response to these severe problems, memory is, by default, not allocated exclusively on the local node. To utilize all the system's memory the default strategy is to stripe the memory. This guarantees equal use of all the memory of the system. As a side effect, it becomes possible to freely migrate processes between processors since, on average, the access cost to all the memory used does not change. For small NUMA factors, striping is acceptable but still not optimal (see data in Section 5.4).

This is a pessimization which helps the system avoid severe problems and makes it more predictable under normal operation. But it does decrease overall system performance, in some situations significantly. This is why Linux allows the memory allocation rules to be selected by each process. A process can select a different strategy for itself and its children. We will introduce the interfaces which can be used for this in Section 6.
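As a small taste of those interfaces, here is a minimal sketch (assembled for this text, not taken from the article) using set_mempolicy(2) from libnuma's <numaif.h>; the chosen node mask is an arbitrary example. Build with -lnuma:

    /* A minimal sketch: switch the calling process to interleaved
     * allocation over nodes 0 and 1.  Children inherit the policy
     * across fork(). */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          sizeof(nodemask) * 8) != 0) {
            perror("set_mempolicy");
            return EXIT_FAILURE;
        }

        /* Pages backing this buffer are now distributed round-robin
           over the selected nodes as they are first touched. */
        size_t sz = 64 * 1024 * 1024;
        char *buf = malloc(sz);
        if (buf != NULL)
            memset(buf, 0, sz);
        free(buf);
        return EXIT_SUCCESS;
    }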

5.3 Published Information

The kernel publishes, through the sys pseudo file system (sysfs), information about the processor caches below

    /sys/devices/system/cpu/cpu*/cache

In Section 6.2.1 we will see interfaces which can be used to query the size of the various caches. What is important here is the topology of the caches. The directories above contain subdirectories (named index*) which list information about the various caches the CPU possesses. The files type, level, and shared_cpu_map are the important files in these directories as far as the topology is concerned. For an Intel Core 2 QX6700 the information looks as in Table 5.1.

             type         level  shared_cpu_map
cpu0 index0  Data           1      00000001
     index1  Instruction    1      00000001
     index2  Unified        2      00000003
cpu1 index0  Data           1      00000002
     index1  Instruction    1      00000002
     index2  Unified        2      00000003
cpu2 index0  Data           1      00000004
     index1  Instruction    1      00000004
     index2  Unified        2      0000000c
cpu3 index0  Data           1      00000008
     index1  Instruction    1      00000008
     index2  Unified        2      0000000c

Table 5.1: sysfs Information for Core 2 CPU Caches

What this data means is as follows:

  • Each core {The knowledge that cpu0 to cpu3 are cores comes from another place that will be explained shortly.} has three caches: L1i, L1d, L2.

  • The L1d and L1i caches are not shared with anybody—each core has its own set of caches. This is indicated by the bitmap in shared_cpu_map having only one set bit.

  • The L2 cache on cpu0 and cpu1 is shared, as is the L2 on cpu2 and cpu3.

If the CPU had more cache levels, there would be more index* directories.
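A short sketch of how a program might walk these directories (illustrative only; the directory and file names are those documented above, the rest is an assumption of this text):

    /* Print type, level, and shared_cpu_map for every cache of
     * cpu0, stopping at the first index* directory that does not
     * exist. */
    #include <stdio.h>

    static void print_file(const char *path)
    {
        char buf[128];
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return;
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("%s: %s", path, buf);   /* buf keeps its newline */
        fclose(f);
    }

    int main(void)
    {
        const char *base = "/sys/devices/system/cpu/cpu0/cache";
        const char *files[] = { "type", "level", "shared_cpu_map" };
        char path[256];

        for (int idx = 0; ; ++idx) {
            /* Probe for the next cache level's directory. */
            snprintf(path, sizeof(path), "%s/index%d/type", base, idx);
            FILE *probe = fopen(path, "r");
            if (probe == NULL)
                break;                     /* no more caches */
            fclose(probe);

            for (int i = 0; i < 3; ++i) {
                snprintf(path, sizeof(path), "%s/index%d/%s",
                         base, idx, files[i]);
                print_file(path);
            }
        }
        return 0;
    }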

For a four-socket, dual-core Opteron machine the cache information looks like Table 5.2:

             type         level  shared_cpu_map
cpu0 index0  Data           1      00000001
     index1  Instruction    1      00000001
     index2  Unified        2      00000001
cpu1 index0  Data           1      00000002
     index1  Instruction    1      00000002
     index2  Unified        2      00000002
cpu2 index0  Data           1      00000004
     index1  Instruction    1      00000004
     index2  Unified        2      00000004
cpu3 index0  Data           1      00000008
     index1  Instruction    1      00000008
     index2  Unified        2      00000008
cpu4 index0  Data           1      00000010
     index1  Instruction    1      00000010
     index2  Unified        2      00000010
cpu5 index0  Data           1      00000020
     index1  Instruction    1      00000020
     index2  Unified        2      00000020
cpu6 index0  Data           1      00000040
     index1  Instruction    1      00000040
     index2  Unified        2      00000040
cpu7 index0  Data           1      00000080
     index1  Instruction    1      00000080
     index2  Unified        2      00000080

Table 5.2: sysfs Information for Opteron CPU Caches

As can be seen these processors also have three caches: L1i, L1d, L2. None of the cores shares any level of cache. The interesting part for this system is the processor topology. Without this additional information one cannot make sense of the cache data. The sys file system exposes this information in the files below

    /sys/devices/system/cpu/cpu*/topology

Table 5.3 shows the interesting files in this hierarchy for the SMP Opteron machine.

      physical_package_id  core_id  core_siblings  thread_siblings
cpu0           0              0       00000003        00000001
cpu1           0              1       00000003        00000002
cpu2           1              0       0000000c        00000004
cpu3           1              1       0000000c        00000008
cpu4           2              0       00000030        00000010
cpu5           2              1       00000030        00000020
cpu6           3              0       000000c0        00000040
cpu7           3              1       000000c0        00000080

Table 5.3: sysfs Information for Opteron CPU Topology

Taking Table 5.2 and Table 5.3 together we can see that no CPU has hyper-threads (the thread_siblings bitmaps have one bit set), that the system in fact has four processors (physical_package_id 0 to 3), that each processor has two cores, and that none of the cores share any cache. This corresponds exactly to earlier Opterons.

What is completely missing in the data provided so far is information about the nature of NUMA on this machine. Any SMP Opteron machine is a NUMA machine. For this data we have to look at yet another part of the sys file system which exists on NUMA machines, namely in the hierarchy below

    /sys/devices/system/node

This directory contains a subdirectory for every NUMA node on the system. In the node-specific directories there are a number of files. The important files and their content for the Opteron machine described in the previous two tables are shown in Table 5.4.

       cpumap    distance
node0  00000003  10 20 20 20
node1  0000000c  20 10 20 20
node2  00000030  20 20 10 20
node3  000000c0  20 20 20 10

Table 5.4: sysfs Information for Opteron Nodes

This information ties all the rest together; now we have a complete picture of the architecture of the machine. We already know that the machine has four processors. Each processor constitutes its own node, as can be seen by the bits set in the value in the cpumap file in the node* directories. The distance files in those directories contain a set of values, one for each node, which represent the cost of memory accesses at the respective nodes. In this example all local memory accesses have the cost 10 and all remote accesses to any other node have the cost 20. {This is, by the way, incorrect. The ACPI information is apparently wrong since, although the processors used have three coherent HyperTransport links, at least one processor must be connected to a Southbridge. At least one pair of nodes must therefore have a larger distance.} This means that, even though the processors are organized as a two-dimensional hypercube (see Figure 5.1), accesses between processors which are not directly connected are no more expensive. The relative values of the costs should be usable as an estimate of the actual difference in access times. The accuracy of all this information is another question.
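The same matrix can also be read programmatically; a minimal sketch (illustrative, not from the article) using libnuma's numa_distance(), which reports the sysfs values shown in Table 5.4. Build with -lnuma:

    /* Print the node distance matrix; 10 means local access. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available\n");
            return 1;
        }

        int max = numa_max_node();
        for (int from = 0; from <= max; ++from) {
            for (int to = 0; to <= max; ++to)
                printf("%3d ", numa_distance(from, to));
            printf("\n");
        }
        return 0;
    }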

5.4 Remote Access Costs

The distance is relevant, though. In [amdccnuma] AMD documents the NUMA cost of a four-socket machine. For write operations the numbers are shown in Figure 5.3.

Figure 5.3: Read/Write Performance with Multiple Nodes

Writes are slower than reads; this is no surprise. The interesting parts are the costs of the 1- and 2-hop cases. The two 1-hop cases actually have slightly different costs; see [amdccnuma] for the details. The fact we need to remember from this chart is that 2-hop reads and writes are 30% and 49% (respectively) slower than 0-hop reads. 2-hop writes are 32% slower than 0-hop writes, and 17% slower than 1-hop writes. The relative position of processor and memory nodes can make a big difference. The next generation of processors from AMD will feature four coherent HyperTransport links per processor. In that case a four-socket machine would have a diameter of one. With eight sockets the same problem returns, with a vengeance, since the diameter of a hypercube with eight nodes is three.

All this information is available, but it is cumbersome to use. In Section 6.5 we will see an interface which makes accessing and using this information easier.

The last piece of information the system provides is in the status of a process itself. It is possible to determine how the memory-mapped files, Copy-On-Write (COW) pages, and anonymous memory are distributed over the nodes in the system. {Copy-On-Write is a method often used in OS implementations when a memory page has one user at first and then has to be copied to allow independent users. In many situations the copying is unnecessary, either entirely or at first, in which case it makes sense to copy only when either user modifies the memory. The operating system intercepts the write operation, duplicates the memory page, and then allows the write instruction to proceed.} Each process has a file /proc/PID/numa_maps, where PID is the ID of the process, as shown in Figure 5.2.

00400000 default file=/bin/cat mapped=3 N3=3
00504000 default file=/bin/cat anon=1 dirty=1 mapped=2 N3=2
00506000 default heap anon=3 dirty=3 active=0 N3=3
38a9000000 default file=/lib64/ld-2.4.so mapped=22 mapmax=47 N1=22
38a9119000 default file=/lib64/ld-2.4.so anon=1 dirty=1 N3=1
38a911a000 default file=/lib64/ld-2.4.so anon=1 dirty=1 N3=1
38a9200000 default file=/lib64/libc-2.4.so mapped=53 mapmax=52 N1=51 N2=2
38a933f000 default file=/lib64/libc-2.4.so
38a943f000 default file=/lib64/libc-2.4.so anon=1 dirty=1 mapped=3 mapmax=32 N1=2 N3=1
38a9443000 default file=/lib64/libc-2.4.so anon=1 dirty=1 N3=1
38a9444000 default anon=4 dirty=4 active=0 N3=4
2b2bbcdce000 default anon=1 dirty=1 N3=1
2b2bbcde4000 default anon=2 dirty=2 N3=2
2b2bbcde6000 default file=/usr/lib/locale/locale-archive mapped=11 mapmax=8 N0=11
7fffedcc7000 default stack anon=2 dirty=2 N3=2

Figure 5.2: Content of /proc/PID/numa_maps

The important information in the file is the values for N0 to N3, which indicate the number of pages allocated for the memory area on nodes 0 to 3. It is a good guess that the program was executed on a core on node 3. The program itself and the dirtied pages are allocated on that node. Read-only mappings, such as the first mapping for ld-2.4.so and libc-2.4.so as well as the shared file locale-archive are allocated on other nodes.
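A process can also inspect its own map; a trivial illustrative reader (not from the article):

    /* Dump the NUMA map of the calling process to stdout. */
    #include <stdio.h>

    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/self/numa_maps", "r");
        if (f == NULL) {
            perror("/proc/self/numa_maps");
            return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL)
            fputs(line, stdout);
        fclose(f);
        return 0;
    }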

As we have seen in Figure 5.3 the read performance across nodes falls by 9% and 30% respectively for 1- and 2-hop reads. For execution, such reads are needed and, if the L2 cache is missed, each cache line incurs these additional costs. All the costs measured for large workloads beyond the size of the cache would have to be increased by 9%/30% if the memory is remote to the processor.

Figure 5.4: Operating on Remote Memory

To see the effects in the real world we can measure the bandwidth as in Section 3.5.1 but this time with the memory being on a remote node, one hop away. The result of this test when compared with the data for using local memory can be seen in Figure 5.4. The numbers have a few big spikes in both directions which are the result of a problem of measuring multi-threaded code and can be ignored. The important information in this graph is that read operations are always 20% slower. This is significantly slower than the 9% in Figure 5.3, which is, most likely, not a number for uninterrupted read/write operations and might refer to older processor revisions. Only AMD knows.

For working set sizes which fit into the caches, the performance of write and copy operations is also 20% slower. For working sets exceeding the size of the caches, the write performance is not measurably slower than the operation on the local node. The speed of the interconnect is fast enough to keep up with the memory. The dominating factor is the time spent waiting on the main memory.

