8 byte characters?

Posted Aug 4, 2005 14:33 UTC (Thu) by smitty_one_each (subscriber, #28989)
In reply to: 8 byte characters? by davidw
Parent article: Our bloat problem

> 8 byte characters are becoming a thing of the past, in user-facing applications...
I think you meant 8-bit, but isn't that what UTF-8 is about?
Why not just use that throughout the system?
Footprint may grow, but simplicity is often worth the fight.
R,
C



8 byte characters?

Posted Aug 12, 2005 13:45 UTC (Fri) by ringerc (subscriber, #3071) (2 responses)

Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8. With UTF-8, to take the first 6 characters of a buffer you must decode the UTF-8 data (you don't know whether each character is one, two, three, or four bytes long). With UCS-2, you just return the first 12 bytes of the buffer.

That said - it's only double. For text, that's not a big deal, and really doesn't explain the extreme memory footprints we're seeing.
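
As a minimal sketch of that difference (mine, not part of the comment; the function names are made up for illustration), taking the first n characters of a UCS-2 buffer is pure arithmetic, while with UTF-8 you have to walk the bytes and count lead bytes:

    #include <stddef.h>

    /* UCS-2: every character is exactly 2 bytes, so the prefix length
     * is simple arithmetic. */
    static size_t ucs2_prefix_bytes(size_t nchars)
    {
        return nchars * 2;
    }

    /* UTF-8: walk the buffer, counting lead bytes (anything that is not
     * a 10xxxxxx continuation byte), and stop at the start of character
     * nchars + 1.  Returns the byte length of the first nchars characters. */
    static size_t utf8_prefix_bytes(const char *s, size_t len, size_t nchars)
    {
        size_t i = 0, chars = 0;

        while (i < len) {
            if (((unsigned char)s[i] & 0xC0) != 0x80) {  /* lead byte */
                if (chars == nchars)
                    break;          /* start of character nchars + 1 */
                chars++;
            }
            i++;
        }
        return i;
    }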

8 byte characters?

Posted Aug 13, 2005 2:53 UTC (Sat) by hp (guest, #5220)

Unicode doesn't fit in 16 bits anymore; most apps using 16-bit encodings would be using UTF-16, which has the same variable-length properties as UTF-8. If you pretend each-16-bits-is-one-character then either you're using a broken encoding that can't handle all of Unicode, or you're using UTF-16 in a buggy way. To have one-array-element-is-one-character you have to use a 32-bit encoding.

UTF-8 has the huge advantage that ASCII is a subset of it, which is why everyone uses it for UNIX.
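
To illustrate the surrogate-pair point (a sketch of mine, not part of the comment), a code point above U+FFFF takes two 16-bit units in UTF-16, so indexing 16-bit units does not give you characters; only a 32-bit encoding has that property:

    #include <stdio.h>

    int main(void)
    {
        unsigned cp = 0x1D11E;              /* MUSICAL SYMBOL G CLEF, above U+FFFF */
        unsigned v  = cp - 0x10000;
        unsigned hi = 0xD800 | (v >> 10);   /* high (lead) surrogate  */
        unsigned lo = 0xDC00 | (v & 0x3FF); /* low (trail) surrogate  */

        printf("U+%05X in UTF-16: 0x%04X 0x%04X (two units, one character)\n", cp, hi, lo);
        printf("U+%05X in UTF-32: 0x%08X (one unit, one character)\n", cp, cp);
        return 0;
    }

For U+1D11E this prints the pair 0xD834 0xDD1E for UTF-16.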

8 byte characters?

Posted Aug 20, 2005 6:24 UTC (Sat) by miallen (guest, #10195)

> Many apps use UCS-2 internally, because it's *MUCH* faster to work with for many things than UTF-8.

I dunno about that. First, it is a rare thing that you would say "I want 6 *characters*". The only case I can actually think of would be printing characters in a terminal, which has a fixed number of character positions. In that case UCS-2 is easier to use, but even then I'm not convinced it's actually faster. If you're using Cyrillic, yeah, it will probably be faster, but if it's 90% ASCII I would have to test that. Consider that UTF-8 occupies almost half the space of UCS-2 and that CPU cache misses account for a LOT of overhead. If you have large collections of strings, like from, say, a big XML file, the CPU will spend a lot more time waiting for data with UCS-2 than with UTF-8.
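
To put rough numbers on that space argument (my sketch, not miallen's; the sample string is invented), pure-ASCII text costs one byte per character in UTF-8 and two in UCS-2:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* ASCII-only sample, as in typical XML markup. */
        const char *s = "<item id=\"42\">hello world</item>";
        size_t chars = strlen(s);   /* ASCII: one character per byte */

        printf("characters: %zu\n", chars);
        printf("UTF-8 size: %zu bytes\n", chars);      /* 1 byte per ASCII char */
        printf("UCS-2 size: %zu bytes\n", chars * 2);  /* always 2 bytes per char */
        return 0;
    }

Twice the bytes means roughly twice as many cache lines touched for the same text.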

In truth, the encoding of strings is an ant compared to the elephant of data structures and algorithms. If you design your code well and adapt interfaces so that modules can be reused, you can improve the efficiency of your code far more than with petty compiler options, changes of character encoding, etc.

