By Jake Edge
October 10, 2007
Linux is embedded in a huge number of different devices, but the
high-profile gadgets – the Nokia N800 and OpenMoko phone for example
– get all the attention. Those gadgets use Linux as a selling point
and a differentiator, trying to appeal to developers and power users who
can customize the devices to do exactly what they want. Meanwhile, the
vast majority of embedded Linux devices are running it quietly. For those
systems, Linux provides a stable platform, with support for a large number
of architectures and devices, that just works.
Consumer electronics makers are turning to Linux in a big way.
Sony provides an eye-opening list of
their devices that run Linux, which covers a
wide spectrum of the products that Sony makes; things like digital cameras,
video cameras, televisions, audio gear, and professional video equipment.
Other companies also have Linux web pages, none quite like the one that
Sony provides, and the Consumer
Electronics Linux Forum (CELF) has started gathering links to those
sites on their embedded Linux wiki.
Linux provides a number of advantages for embedded developers; the most
obvious is its price. Commercial embedded operating systems typically
charge royalties on a per-device basis, which can be a significant factor,
especially for low-end devices. For high-end devices, especially professional
grade equipment, the cost of the OS is much less of an issue, but the Linux
feature set and hardware support still make it an attractive choice.
One of the advantages that Linux provides is virtual memory – commercial
embedded operating systems often rely on a flat memory layout.
Virtual memory has a number of useful benefits: process isolation, shared code segments, as well
as the ability to overcommit memory. Tim Bird, Sony kernel hacker and
architecture group chair for CELF, puts it this way:
[...] one of the important reasons for using Linux
is its support for virtual memory. There is one product where
the amount of RAM allocated for an application is 10MB, but the
application couldn't fit in this. By using Linux, and
over-committing memory, the OS, libs and the application and
its data were able to fit into the RAM budget. Obviously,
Linux itself and the support programs and libs (primarily
busybox and glibc) must be as trim as possible in order to
make this work. So Sony is always concerned about kernel
and program size.
Another area of importance to the embedded world is the variety of hardware
supported by Linux. Not only is there support for many of the CPU
architectures used in embedded devices, but there are drivers for all
kinds of peripherals that might be used in the design. If a driver does not
exist for the specific peripheral device being used, there is probably a driver for something
similar that can be used as a template for a new driver.
Alternatively, the Linux Driver Project – which we covered last week – is
willing to write the driver given some information about how the device
operates.
Support for multiple network protocols, different bus architectures, and
various peripheral connections (USB, Firewire, etc.) is also very important, depending on the
intent of the device. Because the Linux community is large, and growing,
it can support many more options than the commercial
embedded OS vendors can. In addition, as the development staff at a
consumer electronics company come up to speed on Linux, it will make sense
to use it in more devices. This will lead to more supported devices, CPUs,
bus architectures, network protocols, and so on.
The added features of Linux do come with a cost, which Bird refers to:
larger kernels and libraries. The linux-tiny project, an effort
to reduce the size of the kernel (and so-called kernel bloat) for embedded and
small systems, has been pretty well received by kernel hackers.
It is a testament to the Linux developers that
it already runs on everything from mainframes to mobile phones; it is a
rare OS indeed that will make the effort to scale over that wide a range,
with the inherent tensions between the needs of the various
constituencies.
It probably isn't possible to get a shell prompt on your television or
video camera, as these devices aren't meant to be user serviceable, at
least at the OS level. User modifications or updates to the code may well be
difficult or impossible as well. For devices that are not connected to the
internet, which is presumably the case for the majority of them, this is
probably a non-issue. Presumably the device manufacturers have ways of
upgrading if bugs or security flaws are discovered, but those would be
handled by service centers or the like. Those who are opposed to the
"Tivo-ization" of the kernel will be less than pleased, but most kernel
hackers seem willing to have it used that way, so long as the code is made
available.
We will be seeing Linux in more places as time goes on as it is proving a
robust solution for all kinds of applications. Perhaps you stare at Linux
all day, hacking on code, working on graphics, or running a word processor;
it is quite possible that you may also be
staring at Linux at night, inside the television in your living room.
October 10, 2007
This article was contributed by Glyn Moody
Chris DiBona started using free
software on his home PC when he was at college, and for the same reason
that Linus wrote Linux: because he couldn't get on the machines in the
computer lab. Later, DiBona joined VA Linux, which sold open source-based
hardware systems, and ended up as one of the editors on Slashdot. After an
ill-fated attempt to start a game company, he joined Google in August 2004,
and is Open Source Programs Manager there. He explains to Glyn Moody why
open source is good for Google's business – and its soul.
What platform was Google running on when it started at Stanford?
We have this
picture that circulates, Google's first quote datacenter unquote: it's a jokey thing - you see things like disc drives housed in Legos. That was mostly Linux, but it also had a couple of Sun workstations – it was Stanford, so Suns were everywhere. But by the time that we became a real company it was pretty much Linux through and through for the operating system.
Was that a conscious decision, or did it just happen?
We were all engineers, and it was a very natural thing to say: well, this stuff just works, let's use it. It wasn't so much a religious as a practical decision.
To what extent did that early use of GNU/Linux drive the uptake of other open source programs?
I think that it was really healthy because when we consider creating new services and products we're not afraid of considering the open source options. Open source gets exactly the kind of treatment that it deserves here. What's funny about Google, though, is it's not open source versus proprietary, so much as it's open source versus let's create it ourselves.
What's the advantage for Google in choosing open source rather than writing it yourself?
The thing about open source, it's kind of like it is yours. You can do all the things that you can do to your own code to open source: you can ship it, you don't have to pay any money or anything, you can fix any bugs, you can have a new feature. These all sound like trivial things, but you can do them all without getting permission, without having to check with anybody, without having to go to your legal team. Once the code's in your company, people are going to be able to use it like their own. And that's incredibly powerful.
Considering that Google does an insane amount of software development, if we had to have some of the restrictions that heavily proprietary [code] would present us, we couldn't develop at the speed that we do.
What open source does Google use?
We've been real clear that we use Linux, and that we use MySQL. We actually don't use the Apache web server very much at all, although when we buy companies, usually they're running Apache, so there's this transition period where they're running Apache. But the real strength for open source software at Google is actually not anything beyond Linux and MySQL so much as the libraries themselves.
The OpenSSL stuff is super important; as a company, the last thing you want to be doing is creating cryptography software. It saves us a lot of work, but more importantly, it's mature code. It's a very bad idea to use new cryptography code, because it hasn't been proven, it hasn't been attacked.
What were you tasked with doing when you joined?
I was hired at Google because they decided they needed an open source person, quote unquote, and they didn't really know what that meant yet. They wanted somebody to look after their use of open source, and also, uniquely, they wanted somebody whose job it was to ensure that open source itself was healthy. We broke that down into three main parts: helping out people who were making code; creating new people to create open source code; and releasing open source code ourselves.
The bread and butter of the [open source] group [is that] we make sure that Google continues to use open source in a very healthy way, in line with the licenses, and [that] we're able to interact with the outside world and open source developers very efficiently. There are literally hundreds of Google engineers who are patching and sending in bug reports, sending in bug fixes, all the time.
Where did the idea of taking on key hackers like Andrew Morton, Jeremy Allison and Guido van Rossum come from?
I think it was organic more than anything else. Google has been very public in the fact that we have three primary languages, and that's C++, Java and Python. So as part of that we try to bring on staff people who are the world leaders in those projects - Josh Bloch and Neal Gafter for Java, Guido on Python, Ian Lance Taylor and Matt Austern. We do that because having those people on staff, those projects can continue to move forward, and that's good for us; and also our use of the projects informs the directions sometimes where these projects can go.
So, seeing Linux in an environment like Google informs the direction of Linux in a lot of ways, because you get to see it in an extremely high-load, high-availability environment that you don't really see that often, and you see it on commodity hardware here. So that's really good for the outside world that Andrew [Morton] gets to see that, and that Andrew can really code whatever he wants.
What's the deal for them in terms of time to work on their projects?
Contracts per employee are very different. What we've done in general in the company is that any Googler can use their 20% time on open source software, and we administer that work here in our group. Andrew Morton and Jeremy Allison are really interesting because they came in with a pre-established duty to their software, and we want to make sure that the code they release is considered above board and beyond reproach. So folks like that, we go a little bit extra, and we say: OK, what do you actually need to be able to continue to patch and be comfortable working here?
What it comes down to is we have incredible policies in place to allow Googlers to patch out, and release code, and that serves everybody in the company, and not just the stars – the open source stars I should say – we're all stars.
How did the Summer of Code scheme come about?
It's funny. Greg Stein and I had presented
code.google.com to the [Google] founders, and Larry Page said: Listen, I want you to do me a favor. I want you to go and find all the computer science students who aren't coding over the summer, and I want you to make sure they code. Would you do that for me? I was kind of dumbstruck. I was like, Well, I'll try. And so I came up with Summer of Code as a way of doing it.
I knew that I personally couldn't interact with that many people, but I knew that the open source movement was very good at dealing with lots of people and code. So I contacted some of my friends, on different projects, and we spun it up.
And how has that evolved?
This year we had a sync-up period, so once a student's application was accepted, there was actually a long period of time, almost two months, where they could get used to their organization, and sort of understand how they do things, how they can commit code. And that was incredibly useful – not because they had extra time to code, but just because this community-bonding period is really important.
It also made the midterm judging period more efficient. So with Summer of Code, we've got initial application acceptance, a midterm and a final [judgment], where the organizations say if the student is actually doing what they said they were going to do, or if they just dropped off the face of the earth. It made those judgments way clearer. It didn't really change the failure rate all that much, it just made it faster.
What's the success rate?
I think we failed 19% of our students this year, so 81% success.
What does this scheme achieve for Google?
We really do take advantage of literally many hundreds of libraries, and hundreds of projects, and so we get a direct benefit that code is being written for projects that are important to us. One of the things that does happen is open source developers come here, and sometimes they stop working on open source. And we see it as being incumbent on us to replace those that we take. Also it's good to take students to show them what the real world of computer science is like.
Another side benefit is that because of Summer of Code, Google now knows all the people working on all these software projects, on which it depends. That's incredibly useful to us. Every once in a while we'll come out with a new API and there'll be some projects in the open source world that might be useful in either using that API or being a customer. You can just call them up and say, Hey guys, it's Google, we're your pal, and let them just check it out. And then there's a small benefit to recruiting, too.
It would be hard to justify the $5 million that we spent this year for any one of them, but you put them all together and it seems to make a lot of sense. Also, we do want to give back – we do take a lot of code.
What about for the projects – what are the advantages for them?
There's two things that happen. There's the code that they get, often very, very useful; there's the students themselves - they get new developers from this. Even though a lot of students go back to school and you don't hear from them again, a bunch of them stick around and keep on coding.
But more importantly, if you look at organizations that have done Summer of Code, something happens to them, they become suddenly really professional in bringing people into their project. And this is a huge benefit for them, because these projects are able to be healthy and to grow.
One of the most significant open source releases from Google is Gears. Could you say a little about what it is, and why you decided to open source it?
Gears is an open source browser extension that enables developers to build web applications that can work offline. We knew that we could just release a plugin and make it good for our apps, but with open source other people can use it and feel safe to use it, and know that people can't just abandon the technology, because they have it too. It's a very powerful message that a lot of people don't always get, but it's a big deal. A lot of people have software toolkits now, but what it comes down to is ours is really popular because everyone has it, the same reason why we use open source code.
Another side of the Gears thing that's really important is that Google can't port it to more than, say, three or four browsers. But there are enthusiast communities who want that stuff, or have an alternative browser that isn't supported directly by us, and they can do it themselves.
Which licenses do you prefer to use?
We generally release under the Apache license. We've also released under BSD, GPL, LGPL, CPL, MPL.
How do you choose?
Usually just whatever is best for the software - where the code is going, who we want to have use the code. For instance, we released our signaling specification for voice calls in Google Talk, and we wanted that to be able to be used by both GAIM and by Adium on a Macintosh, and so that was one of the few times when we actually did a dual license, which was BSD and LGPL. That was also released as a Jabber enhancement proposal. We try not to be too religious about this stuff, and that's one of the reasons why we pick Apache: it is very friendly for both open source and for proprietary software.
There's been some discussion about whether Google should give back more of the code derived from open source that it uses internally to power its web services: what's your view?
A lot of people assume we use code in ways that we don't actually. We are releasing a ton of patches into the Linux kernel, so I'm really happy we're way beyond what we need to do there, and that's also true of things like OpenSSL. I have to tell you I can't get that worked up about this issue, I wish I could. We're doing so much more than the licenses have required now for so many years, that this argument kind of falls on deaf ears.
What people are really saying is: Why don't Google release more code? And I think that's a valid thing to say. I think it's kind of important they want to see Google release more code, and I agree with them.
So what open source code will we see coming from Google in the future?
I know that we're getting into it in a bigger way - I'm not at liberty to talk about it. There is more coming. But even if we just did what we've got today, which is enable Google engineers to release code whenever practical, that would be a pretty great and healthy thing.
What do you see as the long-term importance of open source for Google?
I think that open source has a very important role in our culture. As we grow the company, I want to make sure that the ideals that make Google so very appealing to me, continue, because I think it keeps Google in a lot of ways good, and healthy, and pure. I know that sounds really strange, and idealistic, but that's OK. So the more open source that we use in the company, the more open source people that we attract, and the more open source code that we release, I think that it's really good for Google spiritually.
Glyn Moody writes about open source at opendotdotdot.
October 9, 2007
This article was contributed by Ulrich Drepper
[
Editor's note: this is the third installment of Ulrich Drepper's "What
every programmer should know about memory" document; this section talks
about virtual memory, and TLB performance in particular. Those who have
not read part 1 and part 2 may wish to do so
now. As always, please send typo reports and the like to lwn@lwn.net
rather than posting them as comments here.]
4 Virtual Memory
The virtual memory subsystem of a processor implements the virtual
address spaces provided to each process. This makes each process
think it is alone in the system. The advantages of virtual
memory are described in detail elsewhere so they will not be
repeated here. Instead this section concentrates on the actual
implementation details of the virtual memory subsystem and the
associated costs.
A virtual address space is implemented by the Memory Management Unit
(MMU) of the CPU. The OS has to fill out the page table data
structures, but most CPUs do the rest of the work themselves. This is
actually a pretty complicated mechanism; the best way to understand it
is to introduce the data structures used to describe the virtual
address space.
The input to the address translation performed by the MMU is a virtual
address. There are usually few—if any—restrictions on its value.
Virtual addresses are 32-bit values on 32-bit systems, and 64-bit values on
64-bit systems.
On some systems, for instance x86 and x86-64, the addresses
used actually involve another level of indirection: these architectures use
segments which simply cause an offset to be added to every logical
address. We can ignore this part of address generation; it is
trivial and not something that programmers have to care about with respect to
performance of memory handling. {Segment limits on x86 are
performance-relevant but that is another story.}
4.1 Simplest Address Translation
The interesting part is the translation of the virtual address to a
physical address. The MMU can remap addresses on a page-by-page
basis. Just as when addressing cache lines, the virtual address is
split into distinct parts. These parts are used to index into various
tables which are used in the construction of the final physical
address. For the simplest model we have only one level of tables.
Figure 4.1: 1-Level Address Translation
Figure 4.1 shows how the different parts of the virtual
address are used. A top part is used to select an entry in a Page
Directory; each entry in that directory can be individually set by the OS.
The page directory
entry determines the address of a physical memory page; more than one
entry in the page directory can point to the same physical address.
The complete physical address of the memory cell is determined by combining the
page address from the page directory with the low bits from the
virtual address. The page directory entry also contains some
additional information about the page such as access permissions.
The data structure for the page directory is stored in memory. The
OS has to allocate contiguous physical memory and store the base
address of this memory region in a special register. The appropriate
bits of the virtual address are then used as an index into the page
directory, which is actually an array of directory entries.
For a concrete example, this is the layout used for 4MB pages on x86
machines. The Offset part of the virtual address is 22 bits in size,
enough to address every byte in a 4MB page. The remaining 10 bits of
the virtual address select one of the 1024 entries in the page
directory. Each entry contains a 10 bit base address of a 4MB page
which is combined with the offset to form a complete 32 bit address.
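The 4MB-page layout just described can be sketched in a few lines of Python. This is a simplified model, not real hardware behavior: the directory is assumed to be a plain list whose entries hold only the 10-bit page base (actual x86 entries also carry permission and status bits), and the names `translate` and `directory` are made up for illustration.

```python
OFFSET_BITS = 22                  # 4MB page: 2**22 bytes
INDEX_BITS = 32 - OFFSET_BITS     # the remaining 10 bits select the entry
ENTRIES = 1 << INDEX_BITS         # 1024 page directory entries


def translate(page_directory, vaddr):
    """Translate a 32-bit virtual address with a one-level page directory.

    Each entry is None (not mapped) or the 10-bit base of a 4MB page.
    """
    index = vaddr >> OFFSET_BITS                # top 10 bits
    offset = vaddr & ((1 << OFFSET_BITS) - 1)   # low 22 bits
    base = page_directory[index]
    if base is None:
        raise LookupError(f"no mapping for virtual address {vaddr:#010x}")
    # Combine the page base with the offset to form the physical address.
    return (base << OFFSET_BITS) | offset


# Hypothetical mapping: virtual 4MB page 1 -> physical 4MB page 3.
directory = [None] * ENTRIES
directory[1] = 3
print(hex(translate(directory, 0x00401234)))
```

The virtual address 0x00401234 has directory index 1 and offset 0x1234, so with the mapping above it lands in physical page 3 at the same offset.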
4.2 Multi-Level Page Tables
4MB pages are not the norm; they would waste a lot of memory since
many operations an OS has to perform require alignment to memory
pages. With 4kB pages (the norm on 32-bit machines and, still, often
on 64-bit machines), the Offset part of the virtual address is
only 12 bits in size. This leaves 20 bits as the selector of the page
directory. A table with 2^20 entries is not practical. Even if
each entry were only 4 bytes the table would be 4MB in size. With
each process potentially having its own distinct page directory much
of the physical memory of the system would be tied up for these page
directories.
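The arithmetic behind this claim is easy to reproduce. The sketch below uses the figures from the text (32-bit addresses, 4kB pages, 4-byte entries); the process count passed to `flat_table_memory` is an arbitrary assumption chosen only to show how quickly the cost adds up.

```python
ADDRESS_BITS = 32
PAGE_SIZE = 4 * 1024      # 4kB pages
ENTRY_SIZE = 4            # bytes per directory entry

offset_bits = PAGE_SIZE.bit_length() - 1    # 12 bits address a byte in a page
index_bits = ADDRESS_BITS - offset_bits     # 20 bits left for the selector
entries = 1 << index_bits                   # 2**20 entries
table_bytes = entries * ENTRY_SIZE          # size of one flat page directory


def flat_table_memory(processes):
    """Memory tied up if every process has its own flat page directory."""
    return processes * table_bytes


print(table_bytes // 2**20, "MB per process")
print(flat_table_memory(100) // 2**20, "MB for 100 processes (assumed count)")
```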
The solution is to use multiple levels of page tables. These can then
represent a sparse huge page directory where regions which are not
actually used do not require allocated memory. The representation
is therefore much more compact, making it possible to have the page
tables for many processes in memory without impacting performance
too much.
Today the most complicated page table structures comprise four levels.
Figure 4.2 shows the schematics of such an implementation.
Figure 4.2: 4-Level Address Translation
The virtual address is, in this example, split into at least five parts.
Four of these parts are indexes into the various directories. The
level 4 directory is referenced using a special-purpose register in
the CPU. The content of the level 4 to level 2 directories is a
reference to the next lower level directory. If a directory entry is marked
empty it obviously need not point to any lower directory. This way the page
table tree can be sparse and compact. The entries of the level 1
directory are, just like in Figure 4.1, partial physical
addresses, plus auxiliary data like access permissions.
To determine the physical address corresponding to a virtual address
the processor first determines the address of the highest level
directory. This address is usually stored in a register. Then the
CPU takes the index part of the virtual address corresponding to this
directory and uses that index to pick the appropriate entry. This entry is the
address of the next directory, which is indexed using the next part of
the virtual address. This process continues until it reaches the level 1 directory,
at which point the value of the directory entry is the high part of the
physical address. The physical address is completed by adding the
page offset bits from the virtual address. This process is called
page tree walking. Some processors (like x86 and x86-64) perform this
operation in hardware, others need assistance from the OS.
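The walk can be modeled in software to make the steps concrete. The following is a toy sketch, not how any OS or CPU stores its tables: directories are represented as Python dictionaries so that unused regions cost nothing, the index width of 9 bits matches the 512-entry x86-64 directories mentioned in this section, and `split`, `map_page`, and `walk` are illustrative names.

```python
OFFSET_BITS = 12                  # 4kB pages
INDEX_BITS = 9                    # 512 entries per directory
LEVELS = 4
INDEX_MASK = (1 << INDEX_BITS) - 1


def split(vaddr):
    """Return ([level 4, level 3, level 2, level 1 index], page offset)."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indexes = []
    for level in range(LEVELS, 0, -1):
        shift = OFFSET_BITS + INDEX_BITS * (level - 1)
        indexes.append((vaddr >> shift) & INDEX_MASK)
    return indexes, offset


def map_page(root, vaddr, frame):
    """Install a mapping, creating lower directories only when needed."""
    indexes, _ = split(vaddr)
    directory = root
    for idx in indexes[:-1]:
        directory = directory.setdefault(idx, {})
    directory[indexes[-1]] = frame   # level 1 entry: physical page number


def walk(root, vaddr):
    """Follow one entry per level, then append the page offset."""
    indexes, offset = split(vaddr)
    entry = root
    for idx in indexes:
        entry = entry.get(idx)
        if entry is None:            # empty entry: nothing mapped here
            raise LookupError(f"page fault at {vaddr:#x}")
    return (entry << OFFSET_BITS) | offset


root = {}                            # stands in for the level 4 directory
map_page(root, 0x7F0000001000, 0x42)
print(hex(walk(root, 0x7F0000001ABC)))
```

Each loop iteration in `walk` depends on the result of the previous one, which is why the lookups cannot be done in parallel.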
Each process running on the system might need its own page table
tree. It is possible to partially share trees but this is rather the
exception. It is therefore good for performance and scalability if the
memory needed by the page table trees is as small as possible. The
ideal case for this is to place the used memory close together in
the virtual address space; the actual physical addresses used do not
matter. A small program might get by with using just one directory at
each of levels 2, 3,
and 4 and a few level 1 directories. On x86-64 with 4kB
pages and 512 entries per directory this allows the addressing of 2MB with a
total of 4 directories (one for each level). 1GB of contiguous memory
can be addressed with one directory for levels 2 to 4 and 512
directories for level 1.
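These numbers follow directly from the page size and the directory fan-out. The sketch below recomputes them, under the simplifying assumption that the region starts at an address aligned well enough that it does not straddle an extra directory at any level; the function names are illustrative.

```python
PAGE_SIZE = 4 * 1024      # 4kB pages
ENTRIES = 512             # entries per directory on x86-64


def coverage(level):
    """Bytes addressable through a single directory at the given level."""
    return PAGE_SIZE * ENTRIES ** level


def directories_for_contiguous(size):
    """Directories needed at levels 1..4 for `size` contiguous bytes.

    Assumes a suitably aligned start address.
    """
    counts = []
    for level in range(1, 5):
        counts.append(-(-size // coverage(level)))   # ceiling division
    return counts


print(coverage(1) // 2**20, "MB per level 1 directory")
print(directories_for_contiguous(2 * 2**20))   # 2MB region
print(directories_for_contiguous(2**30))       # 1GB region
```

The lists are ordered from level 1 to level 4, so the 1GB case shows 512 level 1 directories and one directory at each higher level.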
Assuming all memory can be allocated contiguously is too simplistic,
though. For flexibility reasons the stack and the heap area of a
process are, in most cases, allocated at pretty much opposite ends of
the address space. This allows either area to grow as much as
possible if needed. This means that there are most likely two
level 2 directories needed and correspondingly more lower level
directories.
But even this does not always match current practice. For security
reasons the various parts of an executable (code, data, heap, stack,
DSOs, aka shared libraries) are mapped at randomized addresses
[nonselsec]. The randomization extends
to the relative position of the various parts; that implies
that the various memory regions in use in a process are
widespread throughout the virtual address space. By applying some
limits to the number of bits of the address which are randomized the
range can be restricted, but it certainly, in most cases, will not allow
a process to run with just one or two directories for levels 2 and 3.
If performance is really much more important than security,
randomization can be turned off. The OS will then usually at least
load all DSOs contiguously in virtual memory.
4.3 Optimizing Page Table Access
All the data structures for the page tables are kept in the main
memory; this is where the OS constructs and updates the tables. Upon
creation of a process or a change of a page table the CPU is notified.
The page tables are used to resolve every virtual address into a
physical address using the page table walk described above. More to the
point: at least one directory for each level is used in the process of
resolving a virtual address. This requires up to four memory accesses (for
a single access by the running process)
which is slow. It is possible to treat these directory table
entries as normal data and cache them in L1d, L2, etc., but this would
still be far too slow.
From the earliest days of virtual memory, CPU designers have used a different
optimization. A simple computation can show that only keeping the
directory table entries in the L1d and higher cache would lead to horrible performance.
Each absolute address computation would require a number
of L1d accesses corresponding to the page table depth. These accesses
cannot be parallelized since they depend on the previous lookup's
result. This alone would, on a machine with four page table levels,
require at the very least 12 cycles. Add to that the probability of
an L1d miss and the result is nothing the instruction pipeline can
hide. The additional L1d accesses also steal precious bandwidth to
the cache.
So, instead of just caching the directory table entries, the
complete computation of the address of the physical page is cached.
For the same reason that code and data caches work, such a cached
address computation is effective. Since the page offset part of the
virtual address does not play any part in the computation of the
physical page address, only the rest of the virtual address is used as the tag
for the cache. Depending on the page size this means hundreds or
thousands of instructions or data objects share the same tag and
therefore same physical address prefix.
The cache into which the computed values are stored is called the
Translation Look-Aside Buffer (TLB). It is usually a small cache
since it has to be extremely fast. Modern CPUs provide multi-level TLB caches,
just as for the other caches; the higher-level
caches are larger and slower. The small size of the L1TLB is often
made up for by making the cache fully associative, with an LRU
eviction policy. Recently, this cache has been growing in size and, in
the process, was changed to be set associative.
As a result, it might not be the oldest entry which gets evicted
and replaced whenever a new entry has to be added.
As noted above, the tag used to access the TLB is a part of the virtual
address. If the tag has a match in the cache, the final physical
address is computed by adding the page offset from the virtual address
to the cached value. This is a very fast process; it has to be since the
physical address must be available for every instruction using
absolute addresses and, in some cases, for L2 look-ups which use the
physical address as the key.
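A small software model may help make the lookup concrete. It is only a sketch of the idea: a fully associative cache with LRU eviction, tagged by the virtual page number, which falls back to a caller-supplied `walk` function when the tag is not present. The class name, the capacity, and the mapping used in the example are all assumptions made for illustration.

```python
from collections import OrderedDict

OFFSET_BITS = 12   # 4kB pages


class TLB:
    """Toy model of a fully associative TLB with LRU eviction."""

    def __init__(self, capacity, walk):
        self.capacity = capacity
        self.walk = walk               # virtual page number -> physical page
        self.entries = OrderedDict()   # tag -> cached physical page number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        tag = vaddr >> OFFSET_BITS     # the page offset is not part of the tag
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        if tag in self.entries:
            self.hits += 1
            self.entries.move_to_end(tag)          # mark as recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict the oldest entry
            self.entries[tag] = self.walk(tag)     # costly page table walk
        return (self.entries[tag] << OFFSET_BITS) | offset


# Hypothetical mapping: physical page = virtual page + 0x100.
tlb = TLB(capacity=2, walk=lambda vpn: vpn + 0x100)
for address in (0x1000, 0x1FF8, 0x2000, 0x3000, 0x1000):
    tlb.translate(address)
print(tlb.hits, tlb.misses)
```

The second access shares a page with the first and therefore hits; the final access misses again because the two-entry cache has evicted that page in the meantime.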
If the TLB lookup misses, the processor has to perform a page table
walk; this can be quite costly.
Prefetching code or data through software or hardware could implicitly
prefetch entries for the TLB if the address is on another page.
This cannot be allowed for hardware prefetching because the
hardware could initiate page table walks that are invalid.
Programmers therefore cannot rely on hardware prefetching to prefetch
TLB entries. It has to be done explicitly using prefetch
instructions.
TLBs, just like data and instruction caches, can appear in multiple
levels. Just as for the data cache, the TLB usually appears in two
flavors: an instruction TLB (ITLB) and a data TLB (DTLB). Higher-level
TLBs such as the L2TLB are usually unified, as is the case with the
other caches.
4.3.1 Caveats Of Using A TLB
The TLB is a processor-core global resource. All threads and
processes executed on the processor core use the same TLB. Since the
translation of virtual to physical addresses depends on which page table
tree is installed, the CPU cannot blindly reuse the cached entries if the
page table is changed. Each process has a different page table tree
(but not the threads in the same process) as does the kernel and the
VMM (hypervisor) if present. It is also possible that the address space layout of a process
changes. There are two ways to deal with this problem:
- The TLB is flushed whenever the page table tree is changed.
- The tags for the TLB entries are extended to additionally and
uniquely identify the page table tree they refer to.
In the first case the TLB is flushed whenever a context switch is
performed. Since, in most OSes, a switch from one thread/process to
another one requires executing some kernel code, TLB flushes are
restricted to entering and leaving the kernel address space. On
virtualized systems it also happens when the kernel has to call the
VMM and on the way back.
If the kernel and/or VMM does not have to use virtual
addresses, or can reuse the same virtual addresses as the process or
kernel which made the system/VMM call, the TLB only has to be
flushed if, upon leaving the
kernel or VMM, the processor resumes execution of a different
process or kernel.
Flushing the TLB is effective but expensive. When executing a system
call, for instance, the kernel code might be restricted to a few
thousand instructions which touch, perhaps, a handful of new pages (or
one huge page, as is the case for Linux on some architectures). This
work would replace only as many TLB entries as pages are touched. For
Intel's Core2 architecture with its 128 ITLB and 256 DTLB entries,
a full flush would mean that more than 100 and 200 entries (respectively)
would be flushed
unnecessarily. When the system call returns to the same process, all
those flushed TLB entries could be used again, but they will be gone. The
same is true for often-used code in the kernel or VMM. On
each entry into the kernel the TLB has to be filled from scratch even
though the page tables for the kernel and VMM usually do not
change and, therefore, TLB entries could, in theory, be preserved for a
very long time. This also explains why the TLB caches in today's
processors are not bigger: programs most likely will not run long
enough to fill all these entries.
This fact, of course, did not escape the CPU architects. One possibility
to optimize the TLB flushes is to invalidate TLB entries
individually. For instance, if the kernel code and data
fall into a specific address range, only the pages falling into this
address range have to be evicted from the TLB. This only requires
comparing tags and, therefore, is not very expensive.
This method is also useful in case a part of the address space is
changed, for instance, through a call to munmap.
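A minimal sketch of such a selective invalidation follows, again using a dictionary keyed by virtual page number as a stand-in for the TLB. The function name invalidate_range and the 4kB page size are assumptions for illustration; real hardware offers its own instructions for this.

```python
PAGE_SHIFT = 12   # assumption: 4kB pages

def invalidate_range(tlb, start, length):
    """Drop only the entries whose page overlaps [start, start + length).

    `length` must be greater than zero. Returns the number of entries evicted.
    """
    first = start >> PAGE_SHIFT
    last = (start + length - 1) >> PAGE_SHIFT
    # Only a tag comparison is needed to pick the victims.
    victims = [tag for tag in tlb if first <= tag <= last]
    for tag in victims:
        del tlb[tag]
    return len(victims)
```

Entries for pages outside the range survive, which is the whole point compared to a full flush.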
A much better solution is to extend the tag used for the TLB access. If,
in addition to the part of the virtual address, a unique identifier
for each page table tree (i.e., a process's address space) is added,
the TLB does not have to be completely flushed at all. The kernel,
VMM, and the individual processes all can have unique
identifiers. The only issue with this scheme is that the number of
bits available for the TLB tag is severely limited, while the number
of address spaces is not. This means some identifier reuse is necessary.
When this happens the TLB has to be partially flushed (if this is
possible). All entries with the reused identifier must be flushed
but this is, hopefully, a much smaller set.
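The extended tag can be modeled by making the address space identifier part of the lookup key. The class below is a hypothetical illustration (the names TaggedTLB and flush_asid are not taken from any real interface); it shows that two address spaces can cache translations for the same virtual page side by side and that reusing an identifier only requires a partial flush.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (ASID, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, asid, vpn, frame):
        self.entries[(asid, vpn)] = frame

    def lookup(self, asid, vpn):
        # Returns None on a miss; entries of other address spaces never match.
        return self.entries.get((asid, vpn))

    def flush_asid(self, asid):
        """Partial flush, needed when an identifier is about to be reused."""
        stale = [key for key in self.entries if key[0] == asid]
        for key in stale:
            del self.entries[key]
        return len(stale)
```

Switching address spaces in this model means nothing more than using a different identifier for subsequent lookups; no entry has to be discarded.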
This extended TLB tagging is of general advantage when multiple
processes are running on the system. If the memory use (and hence
TLB entry use) of each of the runnable processes is limited, there is
a good chance the most recently used TLB entries for a process are
still in the TLB when it gets scheduled again. But there are two
additional advantages:
- Special address spaces, such as those used by the kernel and VMM, are
often only entered for a short time; afterward control is often
returned to the address space which initiated the call. Without tags,
two TLB flushes are performed. With tags the calling address
space's cached translations are preserved and, since the kernel and
VMM address spaces do not often change TLB entries at all, the
translations from previous system calls, etc. can still be used.
- When switching between two threads of the same process no TLB
flush is necessary at all. Without extended TLB tags the entry into
the kernel destroys the first thread's TLB entries, though.
Some processors have, for some time, implemented these extended tags. AMD
introduced a 1-bit tag extension with the Pacifica virtualization
extensions. This 1-bit Address Space ID (ASID) is, in the context of
virtualization, used to distinguish the VMM's address space from
that of the guest domains. This allows the OS to avoid flushing the
guest's TLB entries every time the VMM is entered (for
instance, to handle a page fault) or the VMM's TLB entries when
control returns to the guest. The architecture will allow the
use of more bits in the future. Other mainstream processors will
likely follow suit and support this feature.
4.3.2 Influencing TLB Performance
There are a couple of factors which influence TLB performance. The
first is the size of the pages. Obviously, the larger a page is, the
more instructions or data objects will fit into it. So a larger page size reduces the
overall number of address translations which are needed, meaning that
fewer entries in the TLB cache are needed. Most architectures allow
the use of multiple different page sizes; some sizes can be used
concurrently. For instance, the x86/x86-64 processors have a normal
page size of 4kB but they can also use 4MB and 2MB pages
respectively. IA-64 and PowerPC allow sizes like 64kB as the base
page size.
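The effect of the page size on the number of TLB entries is simple arithmetic. The helper below is a sketch under the assumption that the working set is contiguous and page-aligned, which is the best case; the function name is made up for this example.

```python
KB, MB = 1024, 1024 * 1024

def tlb_entries_needed(working_set_bytes, page_size):
    """Minimum number of pages, and thus TLB entries, covering the working set."""
    return -(-working_set_bytes // page_size)    # ceiling division

# A 64MB working set with the two x86-64 page sizes mentioned in the text:
small = tlb_entries_needed(64 * MB, 4 * KB)      # 16384 entries
large = tlb_entries_needed(64 * MB, 2 * MB)      # 32 entries
```

With 4kB pages the working set needs far more entries than any TLB mentioned in this section provides, while 32 entries for 2MB pages is a much more realistic number. Processors may provide fewer TLB entries for large pages than for small ones, so the comparison is only indicative.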
The use of large page sizes brings some problems with it, though.
The memory regions used for the large pages must be contiguous in
physical memory. If the unit size for the administration of physical
memory is raised to the size of the virtual memory pages, the amount of
wasted memory will grow. All kinds of memory operations (like
loading executables) require alignment to page boundaries. This means
that, on average, each mapping wastes half the page size in physical
memory. This waste can easily add up; it thus puts an upper limit on the
reasonable unit size for physical memory allocation.
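To get a feeling for how quickly this adds up, the expected waste can be estimated as half a page per mapping. The number of mappings used here (100) is an arbitrary value chosen for illustration, not a measurement.

```python
def average_waste(mappings, page_size):
    """Expected bytes lost to alignment: half a page per mapping on average."""
    return mappings * page_size // 2

waste_4k = average_waste(100, 4 * 1024)            # 204800 bytes, 200kB
waste_2m = average_waste(100, 2 * 1024 * 1024)     # 104857600 bytes, 100MB
```

The same 100 mappings which cost about 200kB with a 4kB unit size would cost about 100MB with a 2MB unit size.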
It is certainly not practical to increase the unit size to 2MB to
accommodate large pages on x86-64. This is just too large a size.
But this in turn means that each large page has to be composed of
many smaller pages. And these small pages have to be contiguous in
physical memory. Allocating 2MB of contiguous physical memory
with a unit page size of 4kB can be challenging. It requires finding
a free area with 512 contiguous pages. This can be extremely difficult (or impossible)
after the system runs for a while and physical memory becomes
fragmented.
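Why fragmentation is such a problem can be seen with a small model of the physical page allocator. The list of booleans and the function find_contiguous are assumptions for illustration; a real allocator uses different data structures and additionally requires the run to be aligned to the large page size, which this sketch ignores.

```python
def find_contiguous(free_map, count):
    """Return the index of the first run of `count` free pages, or None."""
    run_start, run_len = 0, 0
    for i, is_free in enumerate(free_map):
        if is_free:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == count:
                return run_start
        else:
            run_len = 0            # a used page breaks the run
    return None

fresh = [True] * 2048              # freshly booted: everything is free
fragmented = [True, False] * 1024  # half of memory free, but no two pages adjacent
```

In the fragmented case 1024 pages (4MB with 4kB pages) are free, yet find_contiguous(fragmented, 512) finds nothing, while the same request on the fresh map succeeds immediately.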
On Linux it is therefore necessary to pre-allocate these big pages at
system start time using the special hugetlbfs filesystem. A
fixed number of physical pages are reserved for exclusive use as big
virtual pages. This ties down resources which might not always be
used. It also is a limited pool; increasing it normally means
restarting the system. Still, huge pages are the way to go in
situations where performance is at a premium, resources are plentiful, and
cumbersome setup is not a big deterrent. Database servers are an
example.
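Before relying on huge pages a program or administrator may want to check how large the reserved pool is. On Linux the counters are reported in /proc/meminfo; the sketch below parses text in that format. The function name is hypothetical and the sample values are invented for the example.

```python
def parse_hugepage_info(meminfo_text):
    """Extract the huge page counters from /proc/meminfo-style text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])   # drop the unit, if any
    return info

# Invented sample; on a real system use open("/proc/meminfo").read().
sample = """MemTotal:        2048000 kB
HugePages_Total:      20
HugePages_Free:       18
Hugepagesize:       2048 kB
"""
pool = parse_hugepage_info(sample)
```

For the sample above, pool["HugePages_Free"] multiplied by pool["Hugepagesize"] gives the amount of memory (in kB) still available in the pool.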
Increasing the minimum virtual page size (as opposed to optional big
pages) has its problems, too. Memory mapping operations (loading
applications, for example) must conform to these page sizes. No smaller mappings
are possible. The locations of the various parts of an executable have,
for most architectures, a fixed relationship. If the page size is
increased beyond what has been taken into account when the executable
or DSO was built, the load operation cannot be performed. It is
important to keep this limitation in mind. Figure 4.3 shows
how the alignment requirements of an ELF binary can be determined. It
is encoded in the ELF program header.
$ eu-readelf -l /bin/ls
Program Headers:
  Type   Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
...
  LOAD   0x000000 0x0000000000400000 0x0000000000400000 0x0132ac 0x0132ac R E 0x200000
  LOAD   0x0132b0 0x00000000006132b0 0x00000000006132b0 0x001a71 0x001a71 RW  0x200000
...
Figure 4.3: ELF Program Header Indicating Alignment Requirements
In this example, an x86-64 binary,
the value is 0x200000 = 2,097,152 = 2MB which corresponds to the
maximum page size supported by the processor.
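The same information can be read programmatically. The following sketch extracts the Align column (the p_align field) of all LOAD segments; it only handles little-endian 64-bit ELF images, the field offsets are those of the ELF64 file and program headers, and the function name is made up for this example.

```python
import struct

PT_LOAD = 1   # program header type of loadable segments

def load_alignments(elf_bytes):
    """Return p_align of every PT_LOAD segment of a little-endian ELF64 image."""
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    # ELF64 header: e_phoff at offset 0x20, e_phentsize and e_phnum at 0x36.
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 0x36)
    aligns = []
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Program header: p_type at offset 0, p_align at offset 48.
        (p_type,) = struct.unpack_from("<I", elf_bytes, base)
        (p_align,) = struct.unpack_from("<Q", elf_bytes, base + 48)
        if p_type == PT_LOAD:
            aligns.append(p_align)
    return aligns
```

A call like load_alignments(open("/bin/ls", "rb").read()) would, for the binary shown in Figure 4.3, be expected to report 0x200000 for both LOAD segments.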
There is a second effect of using larger page sizes: the number of
levels of the page table tree is reduced. Since the part of the
virtual address corresponding to the page offset increases, there are
not that many bits left which need to be handled through page
directories. This means that, in case of a TLB miss, the amount of work
which has to be done is reduced.
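The reduction in levels follows directly from the bit counts. The sketch below assumes the x86-64 layout of 48-bit virtual addresses with 9 bits resolved per directory level; these numbers are not given in the text above and other architectures differ.

```python
def directory_levels(va_bits, page_size, bits_per_level=9):
    """Number of page table levels needed for the bits above the page offset."""
    offset_bits = page_size.bit_length() - 1     # log2, page_size is a power of two
    remaining = va_bits - offset_bits
    return -(-remaining // bits_per_level)       # ceiling division

levels_4k = directory_levels(48, 4 * 1024)           # 36 bits left: 4 levels
levels_2m = directory_levels(48, 2 * 1024 * 1024)    # 27 bits left: 3 levels
```

With 2MB pages one complete level of the tree, and therefore one memory access per page table walk, is saved.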
Beyond using large page sizes, it is possible to reduce the number of
TLB entries needed by moving data which is used at the same time to fewer
pages. This is similar to some optimizations for cache use we talked
about above. Only now the alignment required is large. Given that the
number of TLB entries is quite small this can be an important
optimization.
4.4 Impact Of Virtualization
Virtualization of OS images will become more and more prevalent;
this means another layer of memory handling is added to the picture.
Virtualization of processes (basically jails) or OS containers does not
fall into this category since only one OS is involved. Technologies
like Xen or KVM enable—with or without help from the processor—the
execution of independent OS images. In these situations there is one
piece of software alone which directly controls access to the physical
memory.
Figure 4.4: Xen Virtualization Model
In the case of Xen (see Figure 4.4) the Xen VMM is that piece of
software. The VMM does
not implement many of the other hardware controls itself, though.
Unlike VMMs on other, earlier systems (and the first release of the
Xen VMM) the hardware outside of memory and processors is controlled
by the privileged Dom0 domain. Currently, this is basically the same
kernel as the unprivileged DomU kernels and, as far as memory handling
is concerned, they do not differ. Important here is that the VMM hands
out physical memory to the Dom0 and DomU kernels which, themselves, then
implement the usual memory handling as if they were
running directly on a processor.
To implement the separation of the domains which is required for the
virtualization to be complete, the memory handling in the Dom0 and DomU
kernels does not have unrestricted access to physical
memory. The VMM does not hand out memory by giving
out individual physical pages and letting the guest OSes handle the addressing;
this would not provide any protection against faulty or rogue guest
domains. Instead, the VMM creates its own page table tree for each
guest domain and hands out memory using these data structures. The
good thing is that access to the administrative information of the
page table tree can be controlled. If the code does not have
appropriate privileges it cannot do anything.
This access control is exploited in the virtualization Xen provides,
regardless of whether para- or hardware (aka full) virtualization is
used. The guest domains construct their page table trees for each
process in a way which is intentionally quite similar for para-
and hardware virtualization. Whenever the guest OS modifies its page
tables the VMM is invoked. The VMM then uses the updated information in
the guest domain to update its own shadow page tables. These are the
page tables which are actually used by the hardware. Obviously, this
process is quite expensive: each modification of the page table tree
requires an invocation of the VMM. While changes to the memory
mapping are not cheap without virtualization they become even more
expensive now.
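The cost structure of shadow page tables can be illustrated with a toy model. The class below is purely hypothetical and has nothing to do with the actual Xen code; it only shows that every page table update made by the guest results in one VMM invocation, which translates the guest's idea of a physical frame into a real one.

```python
class ShadowingVMM:
    """Toy VMM that mirrors guest page table updates into shadow tables."""

    def __init__(self, guest_to_host):
        self.guest_to_host = guest_to_host   # guest frame -> real frame
        self.shadow = {}                     # guest virtual page -> real frame
        self.invocations = 0                 # number of expensive VMM entries

    def guest_set_mapping(self, vpn, guest_frame):
        # Every guest page table write enters the VMM ...
        self.invocations += 1
        # ... which updates the table the hardware actually uses.
        self.shadow[vpn] = self.guest_to_host[guest_frame]
```

Mapping N pages in the guest costs N invocations in this model, which is why operations such as setting up a large mapping become much more expensive under virtualization.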
The additional costs can be really large, considering that the changes
from the guest OS to the VMM and back themselves are already quite expensive.
This is why the processors are starting to have additional functionality to avoid the
creation of shadow page tables. This is good not only because of
speed concerns but it also reduces memory consumption by the VMM.
Intel has Extended Page Tables (EPTs) and AMD has Nested Page
Tables (NPTs). Basically both technologies have the page tables of
the guest OSes produce virtual physical addresses. These addresses
must then be further translated, using the per-domain EPT/NPT trees, into
actual physical addresses. This will allow memory handling at almost
the speed of the no-virtualization case since most VMM entries for
memory handling are removed. It also reduces the memory use
of the VMM since now only one page table tree for each domain (as
opposed to process) has to be maintained.
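The two-stage translation can be sketched as two consecutive table lookups. As before, dictionaries stand in for the page table trees, 4kB pages are assumed, and the function name is made up; the real hardware walks multi-level trees for both stages.

```python
PAGE_SHIFT = 12   # assumption: 4kB pages

def nested_translate(guest_pt, nested_pt, vaddr):
    """Guest virtual -> guest "physical" -> real physical address."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    guest_frame = guest_pt[vaddr >> PAGE_SHIFT]   # maintained by the guest OS
    host_frame = nested_pt[guest_frame]           # per-domain EPT/NPT tree (VMM)
    return (host_frame << PAGE_SHIFT) | offset
```

The guest can change guest_pt without involving the VMM at all; only nested_pt, of which there is one per domain, is under the VMM's control.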
The results of the additional address translation steps are also
stored in the TLB. That means the TLB does not store the virtual
physical address but, instead, the complete result of the lookup. It
was already explained that AMD's Pacifica extension introduced the ASID
to avoid TLB flushes on each entry. The number of bits for the ASID is
one in the initial release of the processor extensions; this is just
enough to differentiate VMM and guest OS. Intel has virtual processor
IDs (VPIDs) which serve the same purpose, only there are more of them.
But the VPID is fixed for each guest domain and therefore it
cannot be used to mark separate processes and avoid TLB flushes at
that level, either.
The amount of work needed for each address space modification is one
problem with virtualized OSes. There is another problem inherent in
VMM-based virtualization, though: there is no way around
having two layers of memory handling. But memory handling is hard
(especially when taking complications like NUMA into account, see
Section 5). The Xen approach of using a separate VMM makes optimal (or
even good) handling hard since all the complications of a memory management
implementation, including trivial things like discovery of memory
regions, must be duplicated in the VMM. The OSes have fully-fledged
and optimized implementations; one really wants to avoid
duplicating them.
Figure 4.5: KVM Virtualization Model
This is why carrying the VMM/Dom0 model to its conclusion is such an
attractive alternative. Figure 4.5 shows how the KVM Linux
kernel extensions try to solve the problem. There is no
separate VMM running directly on the hardware and controlling all the
guests; instead, a normal Linux kernel takes over this functionality.
This means the complete and sophisticated memory handling
functionality in the Linux kernel is used to manage the memory of the
system. Guest domains run alongside the normal user-level
processes in what the creators call guest mode. The
virtualization functionality, para- or full virtualization, is
controlled by another user-level process, the KVM VMM. This is just
another process which happens to control a guest domain using the
special KVM device the kernel implements.
The benefit of this model over the separate VMM of the Xen model is
that, even though there are still two memory handlers at work when
guest OSes are used, there only needs to be one implementation, that
in the Linux kernel. It is not necessary to duplicate the same
functionality in another piece of code like the Xen VMM. This leads
to less work, fewer bugs, and, perhaps, less friction where the two
memory handlers touch since the memory handler in a Linux guest makes
the same assumptions as the memory handler in the outer Linux kernel
which runs on the bare hardware.
Overall, programmers must be aware that, with virtualization used, the
cost of memory operations is even higher than without virtualization.
Any optimization which reduces this work will pay off even more in
virtualized environments. Processor designers will, over time, reduce
the difference more and more through technologies like EPT and NPT but
it will never completely go away.
Page editor: Jonathan Corbet