July 25, 2006
This article was contributed by John Richard Moser
Everyone wants their computer to start faster. Distribution maintainers
are thus always looking for things to decrease boot time and program load
time, using mechanisms such as readahead, boot reordering, even parallel
init to more efficiently distribute CPU and disk I/O and get the user to
the login screen faster.
One such tool used to decrease boot time and program load time is prelink [PDF], developed by Jakub Jelínek
at Red Hat and discussed in an
earlier LWN article. Prelinking is not the only way to improve start-up times, however; there
are also a number of existing and developing optimizations the GNU linker
can apply to give substantial gains. Besides existing GNU linker hash
table optimizations, Michael
Meeks has done substantial works on three
separate optimizations in the GNU linker in a quest to make OpenOffice.org load
quickly [PDF].
1.0 The -Wl,-O1 Linker Options
In typical operation, symbols are stored in hash tables in ELF binaries;
these hash tables are kept small, and symbols that fall into the same
bucket—hash to the same value—are compared by a simple string
comparison. Unfortunately, symbols in the same bucket with the same prefix
will need a long string comparison, which can be slow; if there are lots of
symbols in the same bucket, then the dynamic linker has to string compare
with each of them until it finds a match (Fig. 1.0-1).
|
Fig. 1.0-1: A hash table of symbols in a library. The bucket
0x00b1 contains three symbols; the red characters will have to be tested
against the symbol name before the symbol 'foodragon' is resolved.
|
It was recommended in a paper [PDF] by Ulrich Drepper to utilize a
GNU linker optimization that focuses more on producing short hash chains
than a small hash table size. Although the hash table may grow by a few
kilobytes, the shortened length of hash chains means that symbol look-ups
do not have to perform as many string comparisons. Further, the chances of
symbols sharing long prefixes landing in the same bucket are greatly
reduced, making walking the chains faster (Fig. 1.0-2).
|
Fig. 1.0-2: A hash table of symbols in a library. This table is optimized, so symbols hash to buckets containing shorter hash chains; this reduces the number of string comparisons done.
The table is a bit bigger, though.
|
This linker optimization can be activated by passing -Wl,-O1 to
gcc at link time. This flag can also be passed during any compilation;
however, optimizing hash tables can be slow, and there are no benefits to
doing this when creating object files that are just going to be linked into
a main executable or shared object.
Currently this linker optimization is used
by GNOME in its official builds. It has also been discussed on the Gentoo forums; and is
used by
the Ubuntu distribution, to name a few. An old but good technique.
2.0 -Bdirect Linking
Michael Meeks also proposed an
optimization
to GNU binutils and glibc which functions similar to direct binding in
Solaris. By passing -Bdirect at link time, the build process
can cause many symbols to
be directly linked, allowing the dynamic linker to severely decrease the
search space during lookup.
Libraries have unresolved symbols when they use functions or global
variables from other libraries; these symbols are resolved during dynamic
linking. The dynamic linker locates every unresolved symbol each library
needs by searching the symbol table of each loaded library in order from
first loaded to last loaded. This can become quite time consuming,
especially in cases such as GNOME or OpenOffice.org where 50-150 libraries
are loaded and up to 200,000 symbols must be resolved.
|
Fig. 2.0-1: Normal binding
is done by checking through all libraries until the symbol is found.
|
Direct binding shortens the path the linker has to take to resolve a symbol
by drawing it straight from whatever ELF object needs the symbol to the ELF
object containing the symbol. A section is added to the ELF header which
associates each symbol with an entry in the DT_NEEDED table—the list
of libraries needed by the ELF object. The linker uses this information to
go straight to the library containing the symbol and search only there,
instead of searching through each library sequentially.
|
Fig. 2.0-2: Direct binding
is done by checking the library that the symbol table says the binding is located in. This gives a much shorter path for any symbol lookup; three are shown above.
|
Meeks' -Bdirect linking method operates slightly differently from
Solaris direct binding. With the Solaris linker, direct binding binds all
symbols directly to the relevant libraries. With Michael Meeks'
implementation of -Bdirect linking, vague symbols are detected and
linked as normal. This is particularly important with C++, as vague C++
symbols are fairly common. It is not possible to guarantee that a
vague symbol is linked to a specific library, so direct binding of vague
symbols is not technically feasible.
-Bdirect binding can, to a degree, work with shared objects that are loaded with dlopen() rather than run-time linking; this is reported to greatly aid with OpenOffice.org and KDE start-up times. This is interesting because the performance improvements per library are comparable to but not quite as substantial as prelink, which does not affect libraries loaded in this manner.
The current implementation unfortunately has a few rough edges. Some areas could be better optimized; and there is an issue with certain libraries and programs breaking due to direct binding. Sometimes multiple libraries supply the same symbol, and the first loaded is the one used; this is called interposing, and is
the reason why Meeks' -Bdirect linking patch does not optimize vague symbols. When direct binding is used, the symbol is chosen by the GNU linker at link time rather than by the dynamic linker at run time. This can be both a good and a bad thing, depending on the situation.
In most cases direct binding is a good thing, as it guarantees that libraries dependent on other libraries will always find the proper symbols, even if multiple libraries are loaded with the same symbols. Unfortunately, sometimes this very functionality also changes the way existing programs link, which just happen to work fine based on using the symbol from the first library supplying it.
This is usually disastrous, as the symbol now used is typically a
variable or function meant for a different purpose; the program no
longer does what the programmer intended, and likely ends up simply
crashing.
Not everybody believes that the use of interposing is the right thing to
do. Still, getting things like this working would really require finding a
work-around that does not break compatibility with existing code. Michael
Meeks suggested during e-mail discussion a link-map extension to handle
direct binding per symbol:
I would instead say: writing code that explicitely relies on
interposing is not sensible. However - to make this work all
that is needed is to implement an extension to link-maps (as
done in the Solaris linker) to allow certain symbols to be
marked as interposers (and hence not direct linked). That's
most useful for (eg.) in glibc's use of pthreads - done by
interposing.
This would allow direct binding to be used with libraries that currently
fail with it, by only doing a long symbol search for vague symbols.
Further, Meeks' finterpose tool can be
used to find interposers between libraries; this not only shows where
-Bdirect binding may fail, but also has exposed a few bugs in
software such as GTK+
and gstreamer.
The unfortunately large number of interposers presents a slight roadblock,
but should not stop Meeks' -Bdirect linking from eventually
becoming fully functional.
This optimization has an uncertain future, however. Relevant glibc code has so far been
rejected
by Ulrich Drepper. Despite this, -Bdirect linking has already
found uses in Gentoo (with a portage overlay), Pardus Linux, and
Solaris/OpenSolaris.
The patch has also been committed to OpenSUSE, although it is uncertain if it will be used to build upcoming OpenSUSE releases.
3.0 dynsort
Another optimization posted by Michael Meeks adds the dynsort keyword to the GNU
linker. By passing -zdynsort at link time, the .dynsym
and .dynstr sections as well as relocations are all sorted by ELF
hash and by the position where they land in the hash bucket.
When symbols are looked up in an ELF object, a hash table has to be
searched. Hash collisions are effectively stored as a linked list, which
then has to be walked to find the appropriate entry. With
dynsort, the symbols that have to
be examined while walking a bucket are all adjacent to each other. This
reduces the number of L1 and L2 cache misses, allowing the CPU to utilize
its facilities much more efficiently during dynamic linking.
The dynsort optimization has gotten a good bit of attention since
it was
unveiled. Meeks is working on improving it in various ways, such as moving
data for undefined symbols away from the rest of the data. It was this
patch that got Meeks
offered a branch in binutils cvs, although as of this writing one
hasn't been set up. It can be
used in Gentoo with an overlay, but may soon
be available everywhere via the --hash-style option.
4.0 Precomputed Hash Values
A few weeks after the dynsort patch, Meeks posted another optimization the binutils mailing list. This one adds pre-computed ELF hash values to the elf header when the -hashvals switch is given. Normally the dynamic linker has to compute a
hash value of a symbol and then use that to find the index in the symbol tables
of the other libraries. The -hashvals optimization pre-computes these,
removing a series of cache misses and mathematical computations from the process.
The visible effect of the -hashvals switch is a large speedup in symbol look-up; Meeks measured dlopen() on libsvx as requiring 40% less time, and in some places a 51% reduction was realized. This is due not only to avoiding the calculation of symbol hash values; but also to much more efficient L2 cache utilization.
Only a single 32-bit value is accessed for each symbol, rather than a long string; and locality is increased, since only the .hashvals section is accessed instead of both .dynsym and .dynstr, which reduces L2 cache misses.
This patch has a very favorable future; it almost immediately gained support for inclusion in upstream binutils. Several months later, a rewrite done by Ulrich Drepper and Jakub Jelínek was posted, implementing sorting and precomputed hashvals via --hash-style. And of course, Gentoo users have had access to it for a while with various portage overlays.
Conclusion
There are a lot of optimizations possible to reduce dynamic linking time, creating substantial gains in application load time. Future versions of the GNU linker and glibc may allow distributions to boot and load programs much faster, getting users to the desktop and on their applications in substantially less time.
(
Log in to post comments)