Our bloat problem

Posted Aug 4, 2005 7:15 UTC (Thu) by eru (subscriber, #2753)
In reply to: Our bloat problem by hp
Parent article: Our bloat problem

There are some real limits on bloat improvements, though. Things like terminal scrollback buffer, emails, web pages, icons, background images are going to be big, and they're going to be bigger the more you have of them.

Actually this "payload data" is often minuscule. Take those terminal scrollback buffers. Assuming each line contains 60 characters on average (probably an overestimate) and you keep the lines in a linked list with 8 bytes per line for links to the previous and next line, storing 1000 lines needs just 66.4 Kb (1000 * (60 + 8) = 68000 bytes). Where does the rest of the 21 Mb of gnome-terminal go?
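As a rough check of that estimate, here is a minimal Python sketch of the arithmetic. The names are illustrative only, and it assumes one byte per character, 8 bytes of link overhead per line, and 1 Kb = 1024 bytes.

```python
# Back-of-the-envelope size of a linked-list scrollback buffer.
# All names are hypothetical; the figures are the ones from the estimate above.
LINES = 1000
AVG_CHARS_PER_LINE = 60    # assumes one byte per character
LINK_BYTES_PER_LINE = 8    # links to the previous and next line


def scrollback_bytes(lines, chars_per_line, link_bytes):
    """Total bytes for `lines` lines, each with text plus list links."""
    return lines * (chars_per_line + link_bytes)


total = scrollback_bytes(LINES, AVG_CHARS_PER_LINE, LINK_BYTES_PER_LINE)
total_kb = total / 1024
print(f"{total} bytes, about {total_kb:.1f} Kb")
```

Allocator overhead per line is ignored here, so a real implementation would come out somewhat larger, but still nowhere near megabytes.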

Similarly in emails, a single piece of mail might typically need on the order of 10 Kb for one textual message.

Images and sound files are of course inevitably large, but most applications don't deal with them.

8 byte characters?

Posted Aug 4, 2005 8:25 UTC (Thu) by davidw (guest, #947) [Link] (4 responses)

8 byte characters are becoming a thing of the past, in user-facing applications... Still though, your point is taken - 66K multiplied by a factor of 4 still isn't that much.

8 byte characters?

Posted Aug 4, 2005 14:33 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link] (3 responses)

> 8 byte characters are becoming a thing of the past, in user-facing applications...
I think you meant 8-bit, but isn't that what UTF-8 is about?
Why not just use that throughout the system?
Footprint may grow, but simplicity is often worth the fight.
R,
C

8 byte characters?

Posted Aug 12, 2005 13:45 UTC (Fri) by ringerc (subscriber, #3071) [Link] (2 responses)

Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8. With UTF-8, to take the first 6 characters of a buffer you must decode the UTF-8 data (you don't know whether each character is one, two, three, or four bytes long). With UCS-2, you just return the first 12 bytes of the buffer.
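To make the difference concrete, here is a hedged Python sketch of both approaches. The function names are made up for illustration, the input is assumed to be valid UTF-8, and the fixed-width version is only correct for text that stays within the 16-bit range.

```python
def utf8_prefix(buf: bytes, n: int) -> bytes:
    """Return the bytes holding the first n code points of valid UTF-8 data."""
    count = 0
    for i, b in enumerate(buf):
        # Bytes of the form 10xxxxxx continue a character; any other byte
        # starts a new one, so the buffer has to be scanned to find the cut.
        if b & 0xC0 != 0x80:
            if count == n:
                return buf[:i]
            count += 1
    return buf


def ucs2_prefix(buf: bytes, n: int) -> bytes:
    """Fixed width: two bytes per character, so the cut is a plain slice."""
    return buf[:2 * n]


text = "na\u00efve caf\u00e9"          # contains two non-ASCII characters
u8 = text.encode("utf-8")
u16 = text.encode("utf-16-le")         # same bytes as UCS-2 for this text

first6_utf8 = utf8_prefix(u8, 6)
first6_ucs2 = ucs2_prefix(u16, 6)
print(len(first6_utf8), len(first6_ucs2))
```

The UTF-8 scan is linear in the length of the prefix, while the fixed-width slice is constant time; whether that matters in practice depends on how often code really needs "the first n characters".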

That said - it's only double. For text, that's not a big deal, and really doesn't explain the extreme memory footprints we're seeing.

8 byte characters?

Posted Aug 13, 2005 2:53 UTC (Sat) by hp (guest, #5220) [Link]

Unicode doesn't fit in 16 bits anymore; most apps using 16-bit encodings would be using UTF-16, which has the same variable-length properties as UTF-8. If you pretend each-16-bits-is-one-character then either you're using a broken encoding that can't handle all of Unicode, or you're using UTF-16 in a buggy way. To have one-array-element-is-one-character you have to use a 32-bit encoding.

UTF-8 has the huge advantage that ASCII is a subset of it, which is why everyone uses it for UNIX.
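Both points can be checked with a short Python sketch. The helper names are hypothetical; the only facts relied on are the standard encodings themselves.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    """Number of 32-bit code units needed to encode s as UTF-32."""
    return len(s.encode("utf-32-le")) // 4


bmp_char = "\u00e9"          # U+00E9, fits in 16 bits
astral_char = "\U0001D11E"   # U+1D11E, outside the 16-bit range

# One character, but two UTF-16 code units (a surrogate pair).
print(utf16_units(bmp_char), utf16_units(astral_char))
# UTF-32 is the encoding where one array element is one character.
print(utf32_units(bmp_char), utf32_units(astral_char))

# ASCII text is byte-for-byte identical when encoded as UTF-8.
command = "ls -l /tmp"
ascii_is_subset = command.encode("ascii") == command.encode("utf-8")
print(ascii_is_subset)
```

Code that indexes a UTF-16 buffer as if every element were a character would split the surrogate pair in the second case.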

8 byte characters?

Posted Aug 20, 2005 6:24 UTC (Sat) by miallen (guest, #10195) [Link]

Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8.

I dunno about that. First, it is rare that you would say "I want 6 *characters*". The only case I can actually think of is printing characters in a terminal, which has a fixed number of positions for characters. In this case UCS-2 is easier to use, but even then I'm not convinced it's actually faster. If you're using Cyrillic, yeah, it will probably be faster, but if it's 90% ASCII I would have to test that. Consider that for mostly-ASCII text UTF-8 occupies almost half the space of UCS-2, and that CPU cache misses account for a LOT of overhead. If you have large collections of strings, say from a big XML file, the CPU will spend a lot more time waiting for data with UCS-2 than with UTF-8.
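The space side of that argument is easy to measure. Below is a small Python sketch with made-up sample strings; it only compares encoded sizes and says nothing about actual speed, which would need a real benchmark.

```python
def encoded_sizes(s: str):
    """Return (utf8_bytes, utf16_bytes) for the string s."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


# Hypothetical sample strings, chosen only for illustration.
ascii_text = "The quick brown fox"
cyrillic_text = "\u0411\u044b\u0441\u0442\u0440\u0430\u044f \u043b\u0438\u0441\u0430"

for label, sample in (("ascii", ascii_text), ("cyrillic", cyrillic_text)):
    u8, u16 = encoded_sizes(sample)
    print(f"{label}: utf-8={u8} bytes, 16-bit={u16} bytes")
```

For pure ASCII the 16-bit form is exactly twice the size; for Cyrillic the two are nearly equal, since each Cyrillic letter takes two bytes in either encoding.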

In truth the encoding of strings is an ant compared to the elephant of data structures and algorithms. If you design your code well and adapt interfaces so that modules can be reused you can improve the efficiency of your code much more than petty compiler options, changing character encodings, etc.

Our bloat problem

Posted Aug 4, 2005 13:35 UTC (Thu) by elanthis (guest, #6227) [Link] (3 responses)

That isn't generally how a terminal scrollback buffer works, however. You generally work with blocks of memory, so even blank areas on lines are filled in with data. That's required because of how terminals handle character attributes. Which also brings up the point that you have more than just character data per cell; you also have attribute data. And then let's get to the fact that in 2005 people use more than just ASCII, so you can't use only a byte per character, but have to use something like 4 bytes per character in order to store Unicode characters.

So if you have an 80 character wide display with 100 lines of scrollback, and we assume something like 8 bytes per character (4 for character data, 4 for attributes and padding) we get 8*80*100 = 640000. And that's just 100 lines. Assuming you get rid of any extraneous padding (using one of several tricks), you might be able to cut down to 6 bytes per character, resulting in 6*80*100 = 480000. Almost half a megabyte for 100 lines of scrollback.

More features require more memory. If you want a terminal that supports features that many people *need* these days, you just have to suck it up and accept the fact that it'll take more memory. If you can't handle that, go find a really old version of xterm limited to ASCII characters without 256-color support and then you might see a nice reduction in memory usage. The default will never revert to such a terminal, however, because it flat out can't support the workload of many people today, if for no other reason than the requirement for Unicode display.

Our bloat problem

Posted Aug 4, 2005 13:44 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

You don't need 4 bytes per character for Unicode in most places. A brief examination of the Unicode xterm shows that, as expected, it doesn't actually store everything as 32-bit ultra-wide characters. Most strings can be stored as UTF-8; a few places might deal with the actual code point and hold a 32-bit integer temporarily, but certainly not huge strings of them.

Effect of Implementation Choices

Posted Aug 4, 2005 16:21 UTC (Thu) by eru (subscriber, #2753) [Link]

That isn't generally how a terminal scrollback buffer works, however. You generally work with blocks of memory, so even blank areas on lines are filled in with data.

But that is a very wasteful implementation choice. There are several other ways of doing it (like the linked list I proposed) that are not much more complex to program. I forgot about attributes in my original post, but they, too, can easily be represented in ways that average much less than 4 bytes per character. And as another poster pointed out, you can store Unicode with less than 4 bytes per character. In today's computers the CPU is so much faster than the memory that it may not pay to optimize data structures for fast access at the cost of increased size.
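One possible way to do that, sketched in Python purely as an illustration (it is not how any particular terminal stores its scrollback): keep only the used part of each line as UTF-8 and run-length encode the attributes. The attribute values and function names are hypothetical, and the sketch assumes trailing blanks carry no attributes worth keeping.

```python
def rle_encode(attrs):
    """Collapse a per-cell attribute list into [attribute, run_length] pairs."""
    runs = []
    for a in attrs:
        if runs and runs[-1][0] == a:
            runs[-1][1] += 1
        else:
            runs.append([a, 1])
    return runs


def rle_decode(runs):
    """Expand [attribute, run_length] pairs back into a per-cell list."""
    cells = []
    for a, n in runs:
        cells.extend([a] * n)
    return cells


def store_line(text, attrs):
    """Keep only the non-blank prefix of a line plus its attribute runs."""
    trimmed = text.rstrip(" ")   # assumption: trailing blanks need no storage
    return trimmed.encode("utf-8"), rle_encode(attrs[:len(trimmed)])


# An 80-column line: 6 bold cells (attribute 1), the rest plain (attribute 0).
line = "error: file not found".ljust(80)
attrs = [1] * 6 + [0] * 74
data, runs = store_line(line, attrs)
print(len(data), runs)
```

Since attributes usually change only a few times per line, the runs stay short, and the average cost per character ends up well under 4 bytes.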

I think this difference illustrates a major reason for the bloat problem: using naive data structures and code without sufficient thought for efficiency. Maybe OK for prototypes, but not after that. I am not advocating cramming data into a few bits in complex ways (as used to be common in the days of 8-bit microcomputers), but simply avoiding wasted storage whenever that can be done easily. For example: don't store boolean flags or known-to-be-small numbers in full-size ints, allocate space for just the useful data (as in the scrollback case), and don't replicate data redundantly.
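As a small illustration of the flags point, here is a Python sketch using the standard `struct` module to compare four flags stored as four 32-bit ints against the same flags packed into one byte. The flag names and helpers are hypothetical.

```python
import struct

# Hypothetical attribute flags, one bit each.
BOLD, UNDERLINE, BLINK, REVERSE = 1, 2, 4, 8


def pack_flags(bold, underline, blink, reverse):
    """Store four booleans in a single byte."""
    value = ((BOLD if bold else 0) | (UNDERLINE if underline else 0)
             | (BLINK if blink else 0) | (REVERSE if reverse else 0))
    return struct.pack("<B", value)


def unpack_flags(data):
    """Recover the four booleans from the packed byte."""
    (value,) = struct.unpack("<B", data)
    return (bool(value & BOLD), bool(value & UNDERLINE),
            bool(value & BLINK), bool(value & REVERSE))


as_ints = struct.calcsize("<4i")   # four 32-bit ints
packed = pack_flags(True, False, False, True)
print(as_ints, len(packed), unpack_flags(packed))
```

A single mask test per flag is hardly "complex cramming", and it cuts the storage for these fields by a factor of 16.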

I wonder if the well-known GNU coding guidelines (see Info node "standards" in Emacs installations) may be partly to blame for bloat problems in free software... To quote:

Memory Usage
============
If a program typically uses just a few meg of memory, don't bother making any effort to reduce memory usage. For example, if it is impractical for other reasons to operate on files more than a few meg long, it is reasonable to read entire input files into core to operate on them.

Right, but what about when you have lots of programs open at the same time, each using "just a few meg of memory"? (I recognize Stallman wrote that before GUIs became common on *nix systems.)

Our bloat problem

Posted Aug 4, 2005 19:29 UTC (Thu) by berntsen (guest, #4650) [Link]

If people malloc like you multiply, I see where the bloat is coming from: you're off by a factor of 10 ;-)

/\/
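
For reference, the products from the 80-column, 100-line scrollback estimate earlier in the thread can be checked with a few lines of Python. This is only a sketch of the arithmetic; the names are illustrative.

```python
COLUMNS = 80
LINES = 100


def grid_bytes(bytes_per_cell, columns=COLUMNS, lines=LINES):
    """Bytes needed for a fully allocated grid of character cells."""
    return bytes_per_cell * columns * lines


padded = grid_bytes(8)   # 4 bytes character data + 4 bytes attributes/padding
packed = grid_bytes(6)   # same grid with the padding squeezed out
print(padded, packed)
```

That is 64000 and 48000 bytes, so 100 lines of fully allocated scrollback cost tens of kilobytes rather than half a megabyte.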