Our bloat problem
Posted Aug 4, 2005 2:15 UTC (Thu) by hp (guest, #5220)
In reply to: Our bloat problem by jg
Parent article: Our bloat problem
I don't know how to get very useful numbers out of top myself - what one would want for a bloat metric is "malloc'd RAM unique to this process" or something, perhaps "plus the size of each in-use shared page divided by number of apps currently sharing it," perhaps "plus resources allocated on the X server side on behalf of this app." Instead top has your choice of arcane numbers that aren't too useful. What you want is a number that will go down when genuine improvements are made, go up when things are genuinely made worse, and show in a fair way how each app contributes to the total.
memprof is pretty good for getting better numbers, when it hasn't been busted by yet another ABI change (it mucks around in internals that aren't in the public ABI). But it only works for a single process.
Maybe this is similar to the boot speed problem, where Owen's suggestion to create a good visualization resulted in http://www.bootchart.org/ which in turn led to lots of truly useful enhancements.
Anyone want to make a "bloat chart"?
If it included X resources and mallocs and accounted for shared libs somehow that would be pretty good, though it still wouldn't be handling other daemons that allocate resources on behalf of apps (e.g. gconfd uses more memory as more apps ask it to load more settings).
There are some real limits on bloat improvements, though. Things like terminal scrollback buffer, emails, web pages, icons, background images are going to be big, and they're going to be bigger the more you have of them.
(In using a "bloat chart" you'd have to be careful to canonicalize this stuff, e.g. always set the same scrollback buffer when comparing two terminals, always have the same RAM cache size when comparing two browsers, that type of thing.)
Posted Aug 4, 2005 5:07 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (4 responses)
Before we can really improve things, we need better measurement tools. Remember how much faster the boot times got after some guy did bootchart?
Back in the early 80s, VMS had a "graphical" (termcap-style) tool that would show a real-time animation of every page in memory, tracing it to the corresponding executable or shared library or kernel. The modern extension would show memory-mapped files, cached files, etc. as well. If we could just have those pictures, people's attention would quickly focus on the hot spots.
Posted Aug 4, 2005 7:13 UTC (Thu)
by freddyh (guest, #21133)
[Link]
FreddyH
Posted Aug 4, 2005 12:45 UTC (Thu)
by MathFox (guest, #6104)
[Link] (2 responses)
$ cat /proc/self/maps
This shows a quite lean program, in order of the lines:
(each page on this machine is 4 kb)
Most of the apparent bloat for cat (850 pages * 4 k = 3.4 Mb) is in localisation and the shared C library. All but 16 pages (64 k) are shared with other processes.
Who writes a nice graphical frontend to /proc/*/maps?
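A non-graphical version of such a frontend only has to parse a few columns. As a rough, purely illustrative sketch (not an existing tool), the following walks /proc/self/maps and tallies how much address space is file-backed, anonymous and writable; change the path to /proc/<pid>/maps to look at another process:

#include <stdio.h>

/* Tally how much address space falls into file-backed, anonymous and
   writable mappings, based on /proc/self/maps. */
int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    unsigned long file_backed = 0, anon = 0, writable = 0;

    if (!f) {
        perror("/proc/self/maps");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        char perms[8], path[256] = "";

        /* each line: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255s",
                   &start, &end, perms, path) < 3)
            continue;
        if (path[0] == '/')
            file_backed += end - start;
        else
            anon += end - start;
        if (perms[1] == 'w')
            writable += end - start;
    }
    fclose(f);
    printf("file-backed %lu kB, anonymous %lu kB, writable %lu kB\n",
           file_backed / 1024, anon / 1024, writable / 1024);
    return 0;
}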
Posted Aug 4, 2005 14:59 UTC (Thu)
by Yorick (guest, #19241)
[Link] (1 responses)
It is also not possible to find the set of actually shared pages from a mapped file - just because
two processes both map a common shared object, it does not mean they both use all pages from
it, or that they use the same pages. It is quite common to see entire DSOs being mapped without
being used by the process at all.
We definitely need better memory usage metrics from the kernel, or bloat like this will be difficult
to profile.
Posted Aug 6, 2005 14:00 UTC (Sat)
by sandmann (subscriber, #473)
[Link]
http://www.daimi.au.dk/~sandmann/freon.c
You have to be root to run it.
Unfortunately a similar thing can't be done for malloc()d memory because mincore doesn't work for anonymous pages.
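As a stripped-down sketch of the mincore() approach (an illustration only, not sandmann's actual freon.c, which presumably inspects other processes' mappings and hence needs root), the following maps a single file read-only and counts how many of its pages are currently resident:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Map a file read-only and ask mincore() which of its pages are
   currently resident in memory. */
int main(int argc, char **argv)
{
    long page = sysconf(_SC_PAGESIZE);
    struct stat st;
    size_t pages, i, resident = 0;
    unsigned char *vec;
    void *map;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }
    pages = (st.st_size + page - 1) / page;
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(pages);
    if (map == MAP_FAILED || !vec || mincore(map, st.st_size, vec) < 0) {
        perror("mincore");
        return 1;
    }
    for (i = 0; i < pages; i++)
        if (vec[i] & 1)          /* low bit set: page is resident */
            resident++;
    printf("%s: %zu of %zu pages resident\n", argv[1], resident, pages);
    return 0;
}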
Posted Aug 4, 2005 7:15 UTC (Thu)
by eru (subscriber, #2753)
[Link] (9 responses)
Actually this "payload data" is often minuscule. Take those
terminal scrollback buffers. Assuming each line contains 60 characters
on the average (probably an over-estimate) and you have them in a
linked list with 8 bytes for links to the previous and next line,
storing 1000 lines needs just 66.4 Kb. Where does the rest of the
21 Mb of gnome-terminal go?
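For concreteness, the linked-list layout being assumed here might look like the following sketch; the 8 bytes of links and the 60-character average are the assumptions from the paragraph above, not measurements of any real terminal:

#include <stdlib.h>
#include <string.h>

/* One scrollback line: two pointers of links plus only as many bytes as
   the line actually holds.  On a 32-bit machine the links cost 8 bytes,
   so 1000 lines averaging 60 characters come to roughly
   1000 * (8 + 60) = 68,000 bytes, i.e. about 66 Kb. */
struct scrollback_line {
    struct scrollback_line *prev;
    struct scrollback_line *next;
    char text[];        /* the visible characters, no padding to 80 columns */
};

static struct scrollback_line *line_append(struct scrollback_line *last,
                                           const char *s)
{
    size_t n = strlen(s) + 1;
    struct scrollback_line *l = malloc(sizeof *l + n);

    if (!l)
        return NULL;
    l->prev = last;
    l->next = NULL;
    if (last)
        last->next = l;
    memcpy(l->text, s, n);
    return l;
}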
Similarly in emails, a single piece of mail might typically need
of the order of 10 Kb for one textual message.
Images and sound files are of course inevitably large, but most
applications don't deal with them.
Posted Aug 4, 2005 8:25 UTC (Thu)
by davidw (guest, #947)
[Link] (4 responses)
Posted Aug 4, 2005 14:33 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link] (3 responses)
Posted Aug 12, 2005 13:45 UTC (Fri)
by ringerc (subscriber, #3071)
[Link] (2 responses)
That said - it's only double. For text, that's not a big deal, and really doesn't explain the extreme memory footprints we're seeing.
Posted Aug 13, 2005 2:53 UTC (Sat)
by hp (guest, #5220)
[Link]
UTF-8 has the huge advantage that ASCII is a subset of it, which is why everyone uses it for UNIX.
Posted Aug 20, 2005 6:24 UTC (Sat)
by miallen (guest, #10195)
[Link]
I donno about that. First, it is a rare thing that you would say "I want 6 *characters*". The only case that I can actually think of would be if you were printing characters in a terminal which has a fixed number of positions for characters. In this case UCS-2 is easier to use but even then I'm not convinced it's actually faster. If you're using Cyrillic, yeah, it will probably be faster, but if it's 90% ASCII I would have to test that. Consider that UTF-8 occupies almost half the space of UCS-2 and that CPU cache misses account for a LOT of overhead. If you have large collections of strings, like from say a big XML file, the CPU will do a lot more waiting for data with UCS-2 as opposed to UTF-8.
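As an illustration of the operation being argued about, a sketch of "how many bytes do the first n characters occupy" over UTF-8 might look like this; continuation bytes always have the form 10xxxxxx, so only lead bytes need to be counted:

#include <stddef.h>

/* Return how many bytes the first n characters of a UTF-8 buffer occupy.
   UTF-8 continuation bytes all look like 10xxxxxx, so we only count the
   bytes that do not. */
static size_t utf8_prefix_bytes(const char *s, size_t len, size_t n)
{
    size_t i = 0, chars = 0;

    while (i < len && chars < n) {
        i++;                                    /* the lead byte */
        while (i < len && ((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                                /* its continuation bytes */
        chars++;
    }
    return i;
}
/* With UCS-2 the equivalent is simply n * 2 bytes - at the cost of the
   larger buffers (and cache footprint) discussed above. */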
In truth the encoding of strings is an ant compared to the elephant of data structures and algorithms. If you design your code well and adapt interfaces so that modules can be reused you can improve the efficiency of your code much more than petty compiler options, changing character encodings, etc.
Posted Aug 4, 2005 13:35 UTC (Thu)
by elanthis (guest, #6227)
[Link] (3 responses)
So if you have an 80 character wide display with 100 lines of scrollback, and we assume something like 8 bytes per character (4 for character data, 4 for attributes and padding) we get 8*80*100 = 640000. And that's just 100 lines. Assuming you get rid of any extraneous padding (using one of several tricks), you might be able to cut down to 6 bytes per character, resulting in 6*80*100 = 480000. Almost half a megabyte for 100 lines of scrollback.
More features require more memory. If you want a terminal that supports features that many people *need* these days, you just have to suck it up and accept the fact that it'll take more memory. If you can't handle that, go find a really old version of xterm limited to ASCII characters without 256-color support and then you might see a nice reduction in memory usage. The default will never revert to such a terminal, however, because it flat out can't support the workload of many people today, if for no other reason than the requirement for UNICODE display.
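For comparison with the linked-list sketch earlier in the thread, the flat layout described here might look roughly like the following; the field widths are simply the ones from the example above, not those of any particular terminal:

#include <stdint.h>
#include <stdlib.h>

/* A flat screen/scrollback buffer: one fixed-size cell per column,
   blanks included, with the character code and its attributes side by
   side.  The compiler will normally pad this cell to 8 bytes; packing
   it down to 6 is one of the "tricks" mentioned above. */
struct cell {
    uint32_t ch;        /* Unicode code point */
    uint16_t attrs;     /* colour, bold, underline, ... */
};

struct scrollback {
    int cols, lines;
    struct cell *cells;         /* cols * lines cells, blanks and all */
};

static int scrollback_init(struct scrollback *sb, int cols, int lines)
{
    sb->cols = cols;
    sb->lines = lines;
    sb->cells = calloc((size_t)cols * lines, sizeof *sb->cells);
    return sb->cells ? 0 : -1;
}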
Posted Aug 4, 2005 13:44 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link]
Posted Aug 4, 2005 16:21 UTC (Thu)
by eru (subscriber, #2753)
[Link]
But that is a very wasteful implementation choice. There are several other
ways of doing it (like the linked list I proposed) that are not much more
complex to program. I forgot about attributes in my original post, but they,
too can easily be represented in ways that average much less than 4 bytes
per character. And as another poster pointed out, you can store Unicode
with less than 4 bytes per character. In today's computers the CPU is
so much faster than the memory that it may not pay to optimize data
structures for fast access at the cost of increased size.
I think this difference illustrates a major reason for the bloat problem:
using naive data structures and code without sufficient thought for
efficiency. Maybe OK for prototypes, but not after that. I am not advocating
cramming data into few bits in complex ways (as used to be common in the
days of 8-bit microcomputers), but simply avoid wasting storage
whenever it can be easily done. Like, don't store boolean flags
or known-to-be small numbers in full-size ints, allocate space
for just the useful data (like in the scroll-back case), don't
replicate data redundantly.
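As a small illustration of the kind of easy win being suggested, compare a struct of full-size ints with the same information packed into bit-fields (the field names are invented for the example):

#include <stdio.h>

/* The same five fields stored naively and packed.  On a typical 32-bit
   build the first struct is 20 bytes, the second one 4. */
struct options_fat {
    int visible;            /* really a boolean */
    int selected;
    int dirty;
    int indent_level;       /* known to be < 16 */
    int tab_width;          /* known to be < 256 */
};

struct options_lean {
    unsigned visible      : 1;
    unsigned selected     : 1;
    unsigned dirty        : 1;
    unsigned indent_level : 4;
    unsigned tab_width    : 8;
};

int main(void)
{
    printf("fat: %zu bytes, lean: %zu bytes\n",
           sizeof(struct options_fat), sizeof(struct options_lean));
    return 0;
}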
I wonder if the well-known GNU coding guidelines (see Info node "standards" in Emacs
installations) may be partly to blame for bloat problems in free software... To quote:
If a program typically uses just a few meg of memory, don't bother
making any effort to reduce memory usage. For example, if it is
impractical for other reasons to operate on files more than a few meg
long, it is reasonable to read entire input files into core to operate
on them.
Right, but what about when you have lots of programs open at the same
time, each using "just a few meg of memory"? (I recognize Stallman
wrote that before GUIs became common on *nix systems).
Posted Aug 4, 2005 19:29 UTC (Thu)
by berntsen (guest, #4650)
[Link]
/\/
Posted Aug 7, 2005 20:56 UTC (Sun)
by oak (guest, #2786)
[Link] (3 responses)
> I don't know how to get very useful numbers out of top myself - what one would want for a bloat metric is "malloc'd RAM unique to this process" or something,
Malloc => heap. If the program has written to the allocated heap page,
it's private. I don't see why a program would allocate memory without
writing to it, so in practice all of the heap can be considered private to
the process.
You can already see heap usage from /proc and with Valgrind you can
actually get a graph of where it goes.
perhaps "plus the size of each in-use shared page divided by number
of apps currently sharing it,"
During his Guadec 2005 speech, Robert Love mentioned a kernel patch
which produces information about how much memory is private (dirty
= allocated heap that has been written to, shared library relocation
tables etc.) to a process. He promised to add a link to it on the
Gnome memory reduction page.
perhaps "plus resources allocated on the X server side on behalf of
this app."
XresTop tells you this. Some programs can push amazing amounts of memory
to the Xserver (the huge number shown in 'top' for the Xserver comes
from memory-mapping the framebuffer, though, I think).
Posted Aug 7, 2005 21:59 UTC (Sun)
by hp (guest, #5220)
[Link] (1 responses)
Aggregating this stuff into a snapshot of the whole system at a point in time would let you really point fingers in terms of bloat and figure out where to concentrate efforts.
It's not easy enough now, which is why people just use "top" and its misleading numbers.
Even better of course would be to take multiple snapshots over time allowing observation of what happens during specific operations such as log in, load a web page, click on the panel menu, etc.
A tool like this would probably be pretty handy for keeping an eye on production servers, as well.
Posted Aug 18, 2005 20:10 UTC (Thu)
by oak (guest, #2786)
[Link]
Posted Aug 11, 2005 12:24 UTC (Thu)
by anton (subscriber, #25547)
[Link]
So if you want to know the real memory usage, counting private anonymous mappings is not good enough.
One of the more embarrassing problems is that we don't really have tools that can give us an accurate picture of the problem. Everyone will tell you that "top" doesn't give an accurate picture, and everyone is right. The difficulty is that there is at present no tool that really shows what is going on.
Our measurement problem
I do agree that there is not really a tool that *easily* shows the memory problem. Atkins however can teach you exactly what your application is doing, including its memory usage.
Unfortunately you'll have to dig in yourself, and the first time will cost you quite some time.
Well "cat /proc/<pid>/maps" allready gives you a lot of information:Our measurement problem
08048000-0804c000 r-xp 00000000 21:06 32615 /bin/cat
0804c000-0804d000 rw-p 00003000 21:06 32615 /bin/cat
0804d000-0804e000 rwxp 00000000 00:00 0
40000000-40014000 r-xp 00000000 21:06 23031 /lib/ld-2.3.2.so
40014000-40015000 rw-p 00014000 21:06 23031 /lib/ld-2.3.2.so
40027000-40156000 r-xp 00000000 21:06 23037 /lib/libc.so.6
40156000-4015a000 rw-p 0012f000 21:06 23037 /lib/libc.so.6
4015a000-4015e000 rw-p 00000000 00:00 0
4015e000-4035e000 r--p 00000000 21:06 23404 /usr/lib/locale/locale-archive
bfffe000-c0000000 rwxp fffff000 00:00 0
3 pages of application code (potentially sharable)
1 page of initialised data (theoretically sharable)
1 page of uninitialised data (not sharable)
21 pages for the dynamic loader code (shared)
1 page of dynamic loader initialised data (potentially shared)
303 pages of shared C library code (shared)
4 pages of C library data (may be shared or private)
4 pages with private data
512 pages of shared data (locale-archive)
and finally 2 pages of stack. (unshared)
It is not possible from /proc/$pid/maps to find out how many pages of a mapped file have been
privately modified (.data sections etc). This would be a useful addition, but because of the lack of
forethought it can be difficult to add it to this pseudo-file without breaking existing programs.
The kernel actually has a system call - mincore - that tells you which pages of a mapped file are in memory. I wrote a small program a long time ago that uses it to print more details about memory usage:
Our bloat problem
8 byte characters are becoming a thing of the past, in user-facing applications... Still though, your point is taken - 66K multiplied by a factor of 4 still isn't that much.
> 8 byte characters are becoming a thing of the past, in user-facing applications...
I think you meant 8-bit, but isn't that what UTF-8 is about?
Why not just use that throughout the system?
Footprint may grow, but simplicity is often worth the fight.
R,
C
Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8. With UTF-8, to take the first 6 characters of a buffer you must decode the UTF-8 data (you don't know if each character is one, two, three, or four bytes long). With UCS-2, you just return the first 12 bytes of the buffer.
Unicode doesn't fit in 16 bits anymore; most apps using 16-bit encodings would be using UTF-16, which has the same variable-length properties as UTF-8. If you pretend each-16-bits-is-one-character then either you're using a broken encoding that can't handle all of Unicode, or you're using UTF-16 in a buggy way. To have one-array-element-is-one-character you have to use a 32-bit encoding.
> Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8.
8 byte characters?
That isn't generally how a terminal scrollback buffer works, however. You generally work with blocks of memory, so even blank areas on lines are filled in with data. That's required due to how terminals work in regard to character attributes. Which also brings up the point that you have more than just character data per cell, you also have attribute data. And then let's get to the fact that in 2005, people use more than just ASCII, and you actually can't use only a byte per character, but have to use something like 4 bytes per character in order to store UNICODE characters.
You don't need 4 bytes per character for Unicode in most places. A brief examination of the unicode xterm shows that, as expected, it doesn't actually store everything as 32-bit ultra-wide characters. Most strings can be stored as UTF-8, a few places might deal with the actual code point and have a 32-bit integer temporarily, but certainly not huge strings of them.
> That isn't generally how a terminal scrollback buffer works, however. You generally work with blocks of memory, so even blank areas on lines are filled in with data.
Effect of Implementation Choices
Memory Usage
============
If people malloc like you multiply, I see where the bloat is coming from - you have a factor of 10 wrong ;-)
Our bloat problem
Of course you can figure this out painstakingly. What you can't do though is get all that info aggregated for 1) each process and 2) a current snapshot of all processes, in an easy way.
> Aggregating this stuff into a snapshot of the whole system at a point in
> time would let you really point fingers in terms of bloat and figure out
> where to concentrate efforts.
If the patch Robert mentioned is the same which results I have seen
(post-processed), you can already get that information from Linux kernel
by patching it a bit. The results I saw included machine wide statistics
and process specific stats of how many pages reserved for the process were
ro/rw/dirty, how much each linked library accounts for the process memory
usage (relocation table sizes etc., not how much each library allocates
heap for the process[1]). This is quite useful for system overview,
whereas valgrind/Massif/XResTop/top tell enough of individual
applications.
[1] that you can get just with a simple malloc wrapper that gets stack
traces for each alloc and some data post-processing heuristics to decide
which item in the stack trace to assign the guilt to. The hard part is
actually the post-processing, deciding who in e.g. the allocation chain of
App->GtkLabel->Pango->Xft2->FontConfig->Freetype should be blamed for the
total of the allocations done at any point in the chain as you don't know
whether the reason for the total allocations is valid or not without
looking at the code...
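A minimal sketch of such a wrapper - written here as an explicit xmalloc() the application would call rather than an LD_PRELOAD interposer, with the "post-processing" reduced to dumping each backtrace to stderr - could be as small as this:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate and record a backtrace for the allocation.  A real tool
   would log to a file and post-process the traces; here they just go
   to stderr.  Compile with -rdynamic to get symbol names. */
static void *xmalloc(size_t size)
{
    void *frames[16];
    void *p = malloc(size);
    int depth = backtrace(frames, 16);

    fprintf(stderr, "alloc %zu bytes at %p from:\n", size, p);
    backtrace_symbols_fd(frames, depth, 2);    /* fd 2 = stderr */
    return p;
}

int main(void)
{
    char *p = xmalloc(1000);

    free(p);
    return 0;
}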
Best would be an interactive allocation browser similar to Kcachegrind
with which one could also view the source along with accumulated
allocation percentages.
Our bloat problem
> I don't see why a program would allocate memory without writing to it
I do this when I need contiguous memory, but don't know how much. Then I allocate lots of space; unused memory is cheap. The result looks like this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
anton 17100 0.0 0.2 6300 1196 pts/0 S+ 14:16 0:00 gforth
The large VSZ is caused mainly by unused allocated memory.
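A toy program showing the effect (the sizes are arbitrary, chosen only for the example):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Reserve a large contiguous region up front but touch only the first
   page.  VSZ grows by the whole 64 Mb; RSS only by what was written. */
int main(void)
{
    size_t reserve = 64 * 1024 * 1024;
    char *buf = malloc(reserve);
    size_t i;

    if (!buf)
        return 1;
    for (i = 0; i < 4096; i++)      /* touch only the first page */
        buf[i] = 0;
    printf("pid %d: compare VSZ and RSS in ps while this sleeps\n",
           (int)getpid());
    sleep(60);
    free(buf);
    return 0;
}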