
Our bloat problem

Posted Aug 4, 2005 11:28 UTC (Thu) by dmantione (guest, #4640)
Parent article: Our bloat problem

Some sources of bloat on a Linux system:

  • Unicode - While this technology has been great for making software usable by people in the Far East, it comes at a great cost.
    • String operations become more expensive. Processing a UTF-8 string is usually n times more expensive than processing an ASCII string, because byte offsets no longer correspond to character offsets (see the indexing sketch after this list).
    • Conversion tables need to be in memory. Most applications load their own conversion tables, which means they end up in memory n times. These tables are also loaded on startup, decreasing application startup times.
    • Because so many more characters are possible, many more glyphs have to be loaded into memory.
  • Java, Mono, Python, Perl, Tcl - These programming languages require runtime environments. The runtime environment needs to be loaded, can be slow itself and, most importantly, can use quite a bit of memory. It becomes especially bad if multiple runtime environments get loaded on one desktop. Script languages can be good for scripts, but are bad for the desktop. The popularity of Java and Mono is probably a bad thing regarding the bloat on our machines.
  • C++ - Even the de facto standard C++ is sometimes a problem. Especially when templates are used, C++ compilers emit large amounts of code behind the programmer's back. This can cause huge libraries and executables.
  • Shared libraries - Many programmers are under the impression that use of shared libraries is free. WRONG. They need to be loaded and resolved, and even if you only use part of them, a large amount of code is executed within them before you know it. Some libraries have become quite a bit larger than a low-bloat system requires:
    • Libc is just a C runtime library but has unfortunately grown to several megabytes.
    • Qt is over 8 megabytes, and is often used just to display a window with a few buttons on it.
    • Also don't forget the small ones: libpng is over 200kb on my system, just to interpret some metadata grouped into fourcc chunks. That excludes decompression, which is in libz (better, at 80kb, but I remember PKZIP being 30kb on my MS-DOS system). I'm sure the Commodore Amiga was able to read IFF files (on which PNG is based) in less code.
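
To make the indexing point concrete, here is a minimal sketch (C++; the helper is hypothetical, not anything from the post) of why finding the i-th character of a UTF-8 string is an O(n) scan, while in ASCII it is a single array access:

    #include <cstdio>
    #include <cstddef>

    // Return a pointer to the i-th character (not byte) of a UTF-8
    // string, or NULL if the string has fewer characters. We must walk
    // the bytes and skip continuation bytes (10xxxxxx), so random
    // access costs O(n); in ASCII it would simply be &s[i].
    static const char *utf8_index(const char *s, std::size_t i) {
        for (; *s != '\0'; ++s) {
            if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
                if (i == 0)          // start byte of the i-th character
                    return s;
                --i;
            }
        }
        return NULL;
    }

    int main() {
        const char *s = "na\xC3\xAFve";     // "naïve": 6 bytes, 5 characters
        const char *third = utf8_index(s, 2);
        if (third)
            std::printf("3rd character starts at byte %td\n", third - s);
        return 0;
    }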



Our bloat problem

Posted Aug 4, 2005 17:54 UTC (Thu) by error27 (subscriber, #8346)

The Perl runtime takes less than 2M; Python takes slightly over 2M. I doubt you have tested Mono or Tcl. I'm willing to believe that some JVMs take a lot of RAM.

Our bloat problem

Posted Aug 4, 2005 19:15 UTC (Thu) by man_ls (guest, #15091)

I'm willing to believe that some JVMs take a lot of RAM.

That is probably why the most important efforts in Java on Linux (e.g. Red Hat's) are improvements to gcj: compiled Java, to do away with the JVM. Well, to be honest, free Java is also a strong motivation.

gcj-compiled Java shared libraries still carry a lot of bloat and take up many megabytes of RAM, but then again I'm not sure it is actually used.

Our bloat problem

Posted Aug 5, 2005 9:37 UTC (Fri) by dmantione (guest, #4640)

2M for a script. Try measuring again with a real app, say Mandrake's urpmi, and you'll see that the Perl overhead starts to cost.

I'm coding a lot in Tcl, by the way, but Pascal is my main language, as I'm a developer for Free Pascal. Incidentally, Free Pascal is a very good tool for fighting bloat.

Our bloat problem

Posted Aug 4, 2005 19:32 UTC (Thu) by henning (guest, #13406)

The problem with Qt (one big lib for small tasks) has been identified, and is resolved in Qt 4. Now you load only the parts you need: the GUI library, or the network-specific library, for example.

Our bloat problem

Posted Aug 5, 2005 19:56 UTC (Fri) by roelofs (guest, #2599)

Also don't forget the small ones: libpng is over 200kb on my system, just to interpret some metadata grouped into fourcc chunks. That excludes decompression,

That's a rather simplistic view. libpng supports 19 pixel formats, depth-scaling, a moderately complex 2D interlacing scheme, alpha compositing, CRC verification, and various other transformations and ancillary bits. It also includes row-filtering, which is a fundamental component of the compression codec. I won't defend everything that's gone into libpng, but it's highly misleading to refer to all of it as bloat. If libpng (or a higher-level library built on top of it) didn't include it, all of your PNG-supporting applications would have to.
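
To give a sense of scale, here is a minimal sketch of the classic libpng read path, used just to pull out the image header; the calls follow libpng's standard png_create_read_struct/png_read_info API, but the program itself is only an illustration:

    #include <cstdio>
    #include <csetjmp>
    #include <png.h>   // libpng; link with -lpng

    int main(int argc, char **argv) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s file.png\n", argv[0]);
            return 1;
        }
        FILE *fp = std::fopen(argv[1], "rb");
        if (!fp) {
            std::perror(argv[1]);
            return 1;
        }

        png_structp png = png_create_read_struct(PNG_LIBPNG_VER_STRING,
                                                 NULL, NULL, NULL);
        png_infop info = png_create_info_struct(png);
        if (!png || !info)
            return 1;
        if (setjmp(png_jmpbuf(png)))   // libpng reports errors via longjmp()
            return 1;

        png_init_io(png, fp);
        png_read_info(png, info);      // signature check, CRCs, IHDR + ancillary chunks

        png_uint_32 w, h;
        int depth, color;
        png_get_IHDR(png, info, &w, &h, &depth, &color, NULL, NULL, NULL);
        std::printf("%lu x %lu, %d-bit, color type %d\n",
                    (unsigned long)w, (unsigned long)h, depth, color);

        png_destroy_read_struct(&png, &info, NULL);
        std::fclose(fp);
        return 0;
    }

Even this header-only read exercises the signature check, chunk dispatch and CRC machinery; the 19 pixel formats and the transformations sit behind the same entry points.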

which is in libz (better, at 80kb, but I remember PKZIP being 30kb on my MS-DOS system).

Your memory is slightly faulty there. PKZIP 2.04g was 42166 bytes, and if all you care about is compressing your files, I can do far better with my 3712-byte trunc utility--and at 100% compression, too! But if you'd actually like to decompress them again someday, you'd better add PKUNZIP 2.04g, which rang in at 29378 bytes. IOW, the PKWARE codec was 71544 bytes (plus a number of other standalone utilities), while my copy of libz.so.1.2.3 is 71004 bytes--and 1.2.2 was 66908 bytes. Keep in mind also that significant parts of PKZIP and PKUNZIP were written in assembler, which, though capable of producing much smaller binaries, is generally not considered by most of us to be the most productive or maintainable of programming languages. (And, btw, libpng includes additional MMX assembler code for decoding row filters and for expanding interlace passes.)

I'm sure the Commodore Amiga was able to read IFF files (on which PNG is based) in less code.

PNG was not based on IFF. The gross structure may have been suggested by someone familiar with IFF, but IFF itself was considered and rejected as a basis for PNG.

Greg

Our bloat problem

Posted Aug 6, 2005 11:22 UTC (Sat) by dmantione (guest, #4640)

In other words, it is bloat. All that people want from libpng is to read and write PNG files, and I doubt it is being used for more than that in the majority of situations.

I know the differences between IFF and PNG. I'd say PNG is even easier to read than IFF.

libraries

Posted Aug 12, 2005 14:03 UTC (Fri) by ringerc (subscriber, #3071)

Actually, as far as I know a well-written lib won't have much non-read-only static data (so it can be shared efficiently), and should incur only a very small memory overhead for any unused portions. If I recall correctly, unused parts of the library aren't even read from disk.

There are many things to complain about with shared libraries, but their on-disk size is not one of them unless you're building embedded systems. If you are, you can build a cut-down version of most libraries quite easily.

Our bloat problem

Posted Aug 6, 2005 22:46 UTC (Sat) by nix (subscriber, #2304)

I'm sorry, but much of this post is just plain wrong.
String operations become more expensive. Processing a UTF-8 string is usually n times more expensive than processing an ASCII string, because byte offsets no longer correspond to character offsets.
Actually, only some operations (random access, basically) become more expensive. Random access inside strings is rare, and apps that do it a lot (like Emacs and vim) have had to face this problem for years.

We can't make the world speak English, no matter how much we might like to. Half the world's population speaks languages that require Unicode.

Conversion tables need to be in memory. Most applications load their own conversion tables, which means they end up in memory n times. These tables are also loaded on startup, decreasing application startup times.
I think you meant `increasing' there. :) But yes, this is a potential problem. It would be nice if there were some Unicode daemon and corresponding library (that talked to the daemon) which could cache such things... but then again, the extra context switches to talk to it might just slow things right back down again.
Java, Mono, Python, Perl, Tcl - These programming languages require runtime environments. The runtime environment needs to be loaded, can be slow itself and, most importantly, can use quite a bit of memory. It becomes especially bad if multiple runtime environments get loaded on one desktop. Script languages can be good for scripts, but are bad for the desktop. The popularity of Java and Mono is probably a bad thing regarding the bloat on our machines.
Actually, Python, Perl and Tcl have very small runtime environments (especially by comparison with the rampaging monster which is Sun's JRE). The problem with these languages is that their data representations, by explicit design decision, trade off size for speed. With the ever-widening gulf between L1 cache and RAM speeds, maybe some of these tradeoffs need to be revisited.
C++ - Even the de facto standard C++ is sometimes a problem. Especially when templates are used, C++ compilers emit large amounts of code behind the programmer's back. This can cause huge libraries and executables.
Now that's just wrong. The use of templates in C++ only leads to huge code sizes if you don't know what you're doing: and you can write crap code in any language.
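
For illustration, a sketch of the usual discipline (hypothetical code, not from the comment): keep the type-independent work in one ordinary function, so each template instantiation stays thin instead of duplicating the whole body per type:

    #include <cstdio>
    #include <vector>

    // Naive: every element type T stamps out its own copy of the
    // entire loop, so N types mean N copies of this code.
    template <typename T>
    void print_all_naive(const std::vector<T> &v) {
        for (const T &x : v)
            std::printf("%ld\n", static_cast<long>(x));
    }

    // Leaner: the shared work lives in one non-template function and
    // the template shrinks to a thin shim per type.
    void print_long(long x) { std::printf("%ld\n", x); }

    template <typename T>
    void print_all(const std::vector<T> &v) {
        for (const T &x : v)
            print_long(static_cast<long>(x));
    }

    int main() {
        std::vector<int>   vi{1, 2, 3};
        std::vector<short> vs{4, 5};
        print_all_naive(vi);  // full instantiation for int
        print_all(vs);        // thin instantiation for short
        return 0;
    }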

The size problem with C++ at present is the (un-prelinkable) relocations, two per virtual method table entry... this can end up *huge*, even more so in apps like OOo which can't be effectively prelinked because (for reasons beyond my comprehension) they dlopen() everything, and so most of it goes unprelinked.

There was a paper recently on reducing OOo memory consumption which suggested radical changes to the C++ ABI. Sorry, but that isn't going to fly :)

Shared libraries - Many programmers are under the impression that use of shared libraries is free. WRONG. They need to be loaded and resolved, and even if you only use part of them, a large amount of code is executed within them before you know it.
Wrong. Most shared libraries contain no, or very few, constructors, and so no code is executed within them until you call a function in them. (Now many libraries do a lot of initialization work when you call that function, but that'd also be true if the library were statically linked...)
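
This is easy to verify: with GCC, load-time code has to be marked explicitly, and most libraries never do so. A toy library (entirely hypothetical) showing the difference:

    // ctor.cpp - build with: g++ -fPIC -shared ctor.cpp -o libctor.so
    #include <cstdio>

    // Runs when the library is loaded (process startup or dlopen()).
    // A library without such constructors executes nothing at load time.
    __attribute__((constructor))
    static void on_load() {
        std::puts("libctor.so: constructor ran at load time");
    }

    // Runs only if and when somebody actually calls it.
    extern "C" void do_work() {
        std::puts("libctor.so: do_work() called");
    }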

The dynamic loader has to do more work the more libraries are loaded, but ld-linux.so has been optimized really rather hard. :)

Oh, and it doesn't have to load and resolve things immediately: read Ulrich Drepper's paper on DSO loading. PLT relocations (i.e. the vast majority, that correspond to callable functions, rather than those which correspond to data) are normally processed lazily, incurring a CPU time and memory hit only when the function is (first) called.
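
The same laziness can be seen from user code: with RTLD_LAZY the loader defers the library's function relocations until first use, and dlsym() pays the lookup cost only for the one symbol we ask for. A minimal sketch (libm chosen arbitrarily):

    #include <cstdio>
    #include <dlfcn.h>   // dlopen()/dlsym(); link with -ldl

    int main() {
        // RTLD_LAZY: function relocations inside libm are resolved on
        // first call rather than up front at load time.
        void *h = dlopen("libm.so.6", RTLD_LAZY);
        if (!h) {
            std::fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        typedef double (*cos_fn)(double);
        cos_fn my_cos = reinterpret_cast<cos_fn>(dlsym(h, "cos"));
        if (my_cos)
            std::printf("cos(0) = %f\n", my_cos(0.0));

        dlclose(h);
        return 0;
    }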

Libc is just a C runtime library but has unfortunately grown to several megabytes.
That's because it also implements all of POSIX that isn't implemented by the kernel, and passes the rest through to the kernel. Oh, and it also supports every app built against it since glibc2 was released, which means that old interfaces must be retained (even the bugs in them must be retained!)

This necessarily costs memory, but most of it won't be paged in (and thus won't cost anything) unless you run apps that need it.

You do rather seem to have forgotten that binaries and shared libraries are demand-paged!
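
Demand paging is observable with mincore(2). A sketch (the fallback path is just a guess, so pass any library on your system; note that "before" can be nonzero if another process already has the file in the page cache):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Count how many pages of a mapping are resident in RAM right now.
    static size_t resident_pages(void *addr, size_t len) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        static unsigned char vec[65536];       // one byte per page
        if (npages > sizeof vec || mincore(addr, len, vec) != 0)
            return 0;
        size_t n = 0;
        for (size_t i = 0; i < npages; ++i)
            n += vec[i] & 1;
        return n;
    }

    int main(int argc, char **argv) {
        const char *path = argc > 1 ? argv[1] : "/lib/libz.so.1";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { std::perror(path); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }
        size_t len = (size_t)st.st_size;

        // mmap() only reserves address space; no data is read yet.
        char *map = (char *)mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { std::perror("mmap"); return 1; }

        std::printf("resident before access: %zu pages\n",
                    resident_pages(map, len));
        volatile char sink = map[0];           // fault in a single page
        (void)sink;
        std::printf("resident after one read: %zu pages\n",
                    resident_pages(map, len));

        munmap(map, len);
        close(fd);
        return 0;
    }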

