June 10, 2009
This article was contributed by Goldwyn Rodrigues
Reducing the memory footprint of a binary is important for improving
performance. Poke-a-hole (pahole) and other binary object file
analysis programs developed by Arnaldo Carvalho de Melo help in
analyzing object files to find inefficiencies such as holes in
data structures, or functions declared inline that end up
un-inlined in the object code.
Poke-a-hole
Poke-a-hole (pahole) is an object-file analysis tool that reports the size
of data structures and the holes created when the compiler aligns the data
elements to the word size of the CPU. Consider a simple
data structure:
struct sample {
char c[2];
long l;
int i;
void *p;
short s;
};
Adding the size of individual elements of the structure, the expected size
of the sample data structure is:
2*1 (char) + 4 (long) + 4 (int) + 4 (pointer) + 2 (short) = 16 bytes
Compiling this on a 32-bit architecture (ILP32, or Int-Long-Pointer 32
bits) reveals that the size is actually
20 bytes. The additional
bytes are inserted by the compiler to align the data elements
to the word size of the CPU. In this case, two bytes of padding are added after
char c[2], and another two bytes are added after
short s. Compiling the
same program on a 64-bit machine (LP64, or Long-Pointer 64 bits) results
in struct sample occupying 40 bytes. In this case, six bytes are added
after char c[2], four bytes after int i, and
six bytes after short s.
Pahole was developed to track down such holes
created by the compiler's word-size alignment. To analyze object files,
the source must be compiled with the debugging flag "-g". In the
kernel, this is enabled by CONFIG_DEBUG_INFO, or "Kernel Hacking >
Compile the kernel with debug info".
Analyzing the object file generated by the program with struct sample
on an i386 machine results in:
i386$ pahole sizes.o
struct sample {
char c[2]; /* 0 2 */
/* XXX 2 bytes hole, try to pack */
long int l; /* 4 4 */
int i; /* 8 4 */
void * p; /* 12 4 */
short int s; /* 16 2 */
/* size: 20, cachelines: 1, members: 5 */
/* sum members: 16, holes: 1, sum holes: 2 */
/* padding: 2 */
/* last cacheline: 20 bytes */
};
Each data element of the structure has two numbers listed in C-style
comments. The first number represents the offset of the data element from
the start of the structure and the second number represents the size in
bytes. At the end of the structure, pahole summarizes the details of the
size and the holes in the structure.
Similarly, analyzing the object file generated by the program with
struct sample on an x86_64 machine results in:
x86_64$ pahole sizes.o
struct sample {
char c[2]; /* 0 2 */
/* XXX 6 bytes hole, try to pack */
long int l; /* 8 8 */
int i; /* 16 4 */
/* XXX 4 bytes hole, try to pack */
void * p; /* 24 8 */
short int s; /* 32 2 */
/* size: 40, cachelines: 1, members: 5 */
/* sum members: 24, holes: 2, sum holes: 10 */
/* padding: 6 */
/* last cacheline: 40 bytes */
};
Notice that a new hole appears after int i, which was not
present in the object compiled for the 32-bit machine. Source
code developed
on i386 but compiled on x86_64 may waste more space because of
such alignment problems: long and pointer grew to
eight bytes wide while int remained four bytes. Neglecting to reorganize
data structures is a common mistake developers make when porting
applications from i386 to x86_64. The result is a larger memory
footprint than expected. A larger data structure leads
to more cacheline reads than necessary and hence lower
performance.
Pahole can suggest a more compact layout by
reorganizing the data elements of the structure with
the --reorganize option. It also accepts an optional
--show_reorg_steps to show the steps taken to compress the data
structure.
x86_64$ pahole --show_reorg_steps --reorganize -C sample sizes.o
/* Moving 'i' from after 'l' to after 'c' */
struct sample {
char c[2]; /* 0 2 */
/* XXX 2 bytes hole, try to pack */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
short int s; /* 24 2 */
/* size: 32, cachelines: 1, members: 5 */
/* sum members: 24, holes: 1, sum holes: 2 */
/* padding: 6 */
/* last cacheline: 32 bytes */
}
/* Moving 's' from after 'p' to after 'c' */
struct sample {
char c[2]; /* 0 2 */
short int s; /* 2 2 */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
/* size: 24, cachelines: 1, members: 5 */
/* last cacheline: 24 bytes */
}
/* Final reorganized struct: */
struct sample {
char c[2]; /* 0 2 */
short int s; /* 2 2 */
int i; /* 4 4 */
long int l; /* 8 8 */
void * p; /* 16 8 */
/* size: 24, cachelines: 1, members: 5 */
/* last cacheline: 24 bytes */
}; /* saved 16 bytes! */
The --reorganize algorithm tries to compact the structure by moving
data elements from the end of the struct to fill holes. It also
attempts to reclaim the padding at the end of the struct. Pahole demotes
bit fields to a smaller basic type when the type being used has
more bits than required by the element in the bit field. For example,
int flag:1 will be demoted to char.
Being over-zealous in compacting a data structure can sometimes reduce
performance. A write to one data element may invalidate, in other CPUs'
caches, the cacheline from which other data elements are being read. So, some
structures use ____cacheline_aligned to force data elements to start
at the beginning of a fresh cacheline. An example of a structure
that uses ____cacheline_aligned, from drivers/net/e100.c, is:
struct nic {
/* Begin: frequently used values: keep adjacent for cache
* effect */
u32 msg_enable ____cacheline_aligned;
struct net_device *netdev;
struct pci_dev *pdev;
struct rx *rxs ____cacheline_aligned;
struct rx *rx_to_use;
struct rx *rx_to_clean;
struct rfd blank_rfd;
enum ru_state ru_running;
spinlock_t cb_lock ____cacheline_aligned;
spinlock_t cmd_lock;
<output snipped>
Analyzing the nic structure using pahole shows holes just before
each cacheline boundary, after the data elements preceding rxs and cb_lock.
x86_64$ pahole -C nic /space/kernels/linux-2.6/drivers/net/e100.o
struct nic {
u32 msg_enable; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct net_device * netdev; /* 8 8 */
struct pci_dev * pdev; /* 16 8 */
/* XXX 40 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct rx * rxs; /* 64 8 */
struct rx * rx_to_use; /* 72 8 */
struct rx * rx_to_clean; /* 80 8 */
struct rfd blank_rfd; /* 88 16 */
enum ru_state ru_running; /* 104 4 */
/* XXX 20 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
spinlock_t cb_lock; /* 128 4 */
spinlock_t cmd_lock; /* 132 4 */
<output snipped>
Besides finding holes, pahole can be used to find the data field sitting
at a particular offset from the start of the data structure. Pahole
can also list the sizes of all the data structures:
x86_64$ pahole --sizes linux-2.6/vmlinux | sort -k3 -nr | head -5
tty_struct 1328 10
vc_data 432 9
request_queue 2272 8
net_device 1536 8
mddev_s 792 8
The first field represents data structure name, the second represents
the current size of the data structure and the final field represents
the number of holes present in the structure.
Similarly, to get a summary of the data structures that can be
packed to save space:
x86_64$ pahole --packable sizes.o
sample 40 24 16
The first field represents the data structure, the second represents
the current size, the third represents the packed size and the fourth
field represents the total number of bytes saved by packing the holes.
Pfunct
The pfunct tool shows function-related aspects of the object code. It
can show the number of goto labels used, the number of
parameters to the functions, the size of the functions, etc. Its most
popular use, however, is finding functions declared inline but
not inlined, or functions not declared inline that are
eventually inlined. The compiler tends to optimize functions by
inlining or uninlining them depending on their size.
x86_64$ pfunct --cc_inlined linux-2.6/vmlinux | tail -5
run_init_process
do_initcalls
zap_identity_mappings
clear_bss
copy_bootdata
The compiler may also choose to uninline functions which have been
specifically declared inline. This may happen for several
reasons, such as recursive functions, for which inlining would cause
infinite regress. pfunct --cc_uninlined shows functions which are
declared inline but have been uninlined by the compiler. Such functions are
good
candidates for a second look, or for removing the inline declaration altogether.
Fortunately, pfunct --cc_uninlined on vmlinux (only) did not list
any functions.
Debug Info
The utilities rely on the debug_info section of the object file, which is
generated when the source code is compiled with the debug option. These
utilities understand the DWARF standard and the Compact
C-Type Format (CTF), which are common debugging file formats used by
most compilers. GCC uses the DWARF format.
The debugging data is organized under the debug_info section of ELF
(Executable and Linkable Format) in the form of tags with values,
representing items such as variables and function parameters, placed in
a hierarchical nested format. To read the raw information, you may use
readelf provided by binutils, or eu-readelf provided by elfutils.
Most distributions do not compile their packages with
debuginfo because it tends to make the binaries pretty big. Instead,
they ship this information in separate debuginfo packages, which
can be analyzed with these
tools or gdb.
The utilities discussed in this article were initially developed to
analyze kernel object files. However, they are not limited to kernel
object files and can be used with any userspace program that generates
debug information. The source code of the pahole utilities is maintained at
git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git
More information about pahole and other utilities to analyze debug
object files can be found in the PDF about 7
dwarves.