
Of bytes and encoded strings

Posted Jan 23, 2014 3:49 UTC (Thu) by Tara_Li (subscriber, #26706)
In reply to: Of bytes and encoded strings by HelloWorld
Parent article: Of bytes and encoded strings

> And the fact that people *still* don't understand that bytes and text are
> **utterly different things** and instead bullshit about “pragmatism” and
> “power” (like this issue has anything to do with that) makes me sick.

If Unicode had stayed a reasonable project, and not turned into - it appears - a full-fledged programming language capable of hosting exploits, then the difference between bytes and text would be much less of a problem. I *LIKED* the proposal for a simple system which just expanded a character from a one-byte entity to a two-byte one. Now we've got three or four different systems that all have to be checked for, since trusting the encoding information presented to us by outside sources is a fool's mistake. Too many linguistics academics and typographers got into the whole mess of defining the character set.



Of bytes and encoded strings

Posted Jan 23, 2014 4:26 UTC (Thu) by dlang (subscriber, #313) [Link]

well, the problem is that turning each character into a 2 byte entity just doesn't work

it won't hold all the characters

for the common case (ASCII compatible characters), it's wasteful enough of space to actually cause a performance impact

after several tries they actually got it right with UTF-8: it devolves to plain ASCII for ASCII text, but can represent the other characters as needed. Its only drawback is that you can't know how long the string will be when it's displayed without walking the entire string (but then again, unless you use a fixed-width font, you can't know this whatever the encoding, so it's not really a significant issue)
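
for instance (Python 3):

>>> 'hello'.encode('utf-8')       # ASCII text: exactly the same bytes as ASCII
b'hello'
>>> len('héllo')                  # five code points...
5
>>> len('héllo'.encode('utf-8'))  # ...but six bytes, since é takes two
6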

Of bytes and encoded strings

Posted Jan 23, 2014 7:24 UTC (Thu) by khim (subscriber, #9252) [Link]

> but then again, unless you use a fixed-width font, you can't know this whatever the encoding, so it's not really a significant issue

Even in that case you can't know it. There are halfwidth and fullwidth forms, including halfwidth and fullwidth Latin characters! And these were not invented by the Unicode Consortium.

Of bytes and encoded strings

Posted Jan 23, 2014 10:36 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

Accents and other diacritics can be decomposed to separate characters, too.
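
For example, with Python's unicodedata module, one precomposed é (U+00E9) decomposes to a plain e followed by a combining acute accent:

>>> import unicodedata
>>> [hex(ord(c)) for c in unicodedata.normalize('NFD', 'é')]
['0x65', '0x301']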

Of bytes and encoded strings

Posted Jan 23, 2014 14:17 UTC (Thu) by paulj (subscriber, #341) [Link]

You can know the length of the string, in terms of some unit of visible-character widths, you just need to use a better/more abstract data-structure than a byte-array + terminator. ;)

Of bytes and encoded strings

Posted Jan 24, 2014 3:37 UTC (Fri) by dlang (subscriber, #313) [Link]

> You can know the length of the string, in terms of some unit of visible-character widths, you just need to use a better/more abstract data-structure than a byte-array + terminator. ;)

My point was that unless you use a fixed-width font (which is _very_ unusual outside of programmers' editors nowadays), there is no "unit of visible-character widths" that's meaningful.

Of bytes and encoded strings

Posted Jan 25, 2014 8:21 UTC (Sat) by alankila (guest, #47141) [Link]

The fact that UTF-8 doesn't provide efficient access at the character level is arguably another downside. In the hands of a naive programmer, iterating a string character by character can have surprising quadratic behavior. The variable-length representation also makes things more difficult with databases: is char(10) 10 bytes or 10 characters? Some databases choose 10 bytes, which means difficulties at the UI level; others just throw away the advantages of a fixed-length representation and make good on the promise to store 10 characters, even if that takes up to 40 bytes.
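
A sketch of the quadratic pattern (Python, with a hypothetical codepoint_at helper working on a raw UTF-8 buffer; Python's own str type does not work this way):

def codepoint_at(buf, n):
    """Return the nth code point of a UTF-8 byte string.
    Naive: scans from the start, so each lookup is O(n)."""
    count = 0
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count == n:
                # lead byte tells us the sequence width
                width = 4 if b >= 0xF0 else 3 if b >= 0xE0 else 2 if b >= 0xC0 else 1
                return buf[i:i + width].decode('utf-8')
            count += 1
    raise IndexError(n)

buf = 'naïve café'.encode('utf-8')
# A loop like "for n in range(10): codepoint_at(buf, n)" rescans the
# buffer from the start on every iteration: quadratic overall.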

Of bytes and encoded strings

Posted Jan 25, 2014 10:33 UTC (Sat) by lsl (subscriber, #86508) [Link]

> The fact that UTF-8 doesn't provide efficient access at the character level is arguably another downside.

There's no practical way around it. Using UCS-4 gives you random access to code points. The problem with that is that the user of your application probably has a totally different conception of a character. When you actually care to get the nth user-perceived character, being able to index the nth code point doesn't buy you much.

In the (I'd say much more common) case where this ability is not that important, you lose all the nice properties of UTF-8.

There are definitely places where dealing with strings of code points is the appropriate thing to do. It's just that UTF-8 encoded strings are the more useful thing most of the time.
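
For example (Python 3, where str indexes by code point):

>>> s = 'cafe\u0301'   # "café", with the accent as a combining mark
>>> len(s)             # five code points, four user-perceived characters
5
>>> hex(ord(s[4]))     # indexing the "last character" yields only the accent
'0x301'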

Of bytes and encoded strings

Posted Jan 25, 2014 21:50 UTC (Sat) by HelloWorld (guest, #56129) [Link]

> There's no practical way around it. Using UCS-4 gives you random access to code points. The problem with that is that the user of your application probably has a totally different conception of a character.

What problems are there other than combining characters? Those can probably be dealt with by normalizing strings to use precombined characters. I don't understand the need for combining characters anyway, but I'd like to...

Of bytes and encoded strings

Posted Jan 26, 2014 2:43 UTC (Sun) by foom (subscriber, #14868) [Link]

No, you can't actually precombine. Not all possible combinations exist, nor can they.

Look up "grapheme".
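
Python's unicodedata shows the limit: NFC composes only pairs that actually have a precomposed code point.

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', 'e\u0301'))  # é exists precomposed
1
>>> len(unicodedata.normalize('NFC', 'q\u0301'))  # q-acute does not
2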

Of bytes and encoded strings

Posted Jan 26, 2014 10:18 UTC (Sun) by mpr22 (subscriber, #60784) [Link]

Sometimes, people want to use Unicode to write languages whose orthography has not been blessed by a national standards committee; it is not the job of a bunch of rich industrialized people to tell the rest of the world what combinations of diacriticals they're allowed to use and which letters they're allowed to use them on.

Of bytes and encoded strings

Posted Jan 27, 2014 10:51 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

At least as importantly, the set of precomposed characters is now "locked". The committee has committed to not adding further such characters to the standard because it causes problems (e.g. for Apple). If tomorrow the whole world takes to using some composition that was previously not included in any standards document (say, snowman acute), then too bad: the _only_ way to write it in Unicode is by composing the relevant elements together. Which is still better than those previous systems that couldn't write it at all.

Of bytes and encoded strings

Posted Jan 27, 2014 14:41 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

What problem does it cause? The need to update fonts?

Of bytes and encoded strings

Posted Jan 27, 2014 16:32 UTC (Mon) by cladisch (✭ supporter ✭, #50193) [Link]

If the software were to know about the new character, it could fake it in an old font by combining ☃ and ◌́ manually into ☃́.

However, if you want to do any processing on the text (such as searching for snowmen while ignoring accents), nothing can help you if the software does not know the new character.
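
For combinations the software does know, accent-insensitive matching is straightforward: decompose, then drop the combining marks (a rough Python sketch):

>>> import unicodedata
>>> def strip_marks(s):
...     return ''.join(c for c in unicodedata.normalize('NFD', s)
...                    if not unicodedata.combining(c))
...
>>> '\u2603' in strip_marks('\u2603\u0301')  # find ☃ even when written ☃́
True

A hypothetical new precomposed snowman-acute would defeat this, since unicodedata would not know how to decompose it.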

Of bytes and encoded strings

Posted Jan 27, 2014 16:35 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

OK, that makes sense. I was wondering why the callout to Apple specifically, but I guess it was just an example of an entity that wouldn't be happy with it.

Of bytes and encoded strings

Posted Jan 27, 2014 17:16 UTC (Mon) by njwhite (guest, #51848) [Link]

Apple have also apparently had good character combining code for a long time, so whereas other companies might prefer precomposed characters because their combined ones look crap, Apple would just prefer combining characters.

Of bytes and encoded strings

Posted Jan 31, 2014 12:06 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Apple's APIs are explicitly defined in terms of a canonicalisation strategy that requires knowing the entire set of (de)compositions. The caller is not obliged to care about canonicalisation, but the implementation promises that it does: behind the scenes it normalises supplied parameters where necessary. This is particularly important for file naming. For any particular string in Unicode, Apple's normalisation results in just one possible code point sequence; the other possibilities are forbidden and must not exist on disk.

The Linux kernel does not care whether you have files with names that are canonically equivalent in Unicode, so long as each name's byte sequence is distinct. This is convenient for programmers, but it means that you, the user, may be faced with a situation in which you know precisely the name of the file, yet you cannot (without trial and error) specify that name to Linux, because you need to guess which of the potentially many possible Unicode "spellings" was selected. Suppose the file is named café - was that last character U+00E9 or U+0065 U+0301? Unicode says they're equivalent, but Linux just treats every name as an uncomprehended series of bytes, so calling open() with the wrong "spelling" will fail, whereas on OS X it would always succeed.

In an OS that's case-insensitive, the Apple choice here makes sense: you're already carrying around huge case-conversion tables, so why not do full normalisation while you're at it? So what if your filesystem is now a tiny bit slower; the vast majority of your customers will never notice. But it does mean that new precomposed characters in Unicode are a big problem, because either you can never support the new Unicode release, or the set of filenames permitted to exist on disk changes from one release to another, which is a nightmare.
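
The difference is easy to demonstrate (Python 3 on a typical Linux filesystem; the file name is illustrative):

>>> import os
>>> nfc, nfd = 'caf\u00e9', 'cafe\u0301'  # U+00E9 vs U+0065 U+0301
>>> nfc == nfd            # canonically equivalent, but different code points
False
>>> open(nfd, 'w').close()
>>> os.path.exists(nfc)   # Linux: different bytes mean a different name
False

On HFS+ the name would be normalised on creation and on lookup, so either spelling would find the file.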

Of bytes and encoded strings

Posted Feb 2, 2014 23:09 UTC (Sun) by sdalley (subscriber, #18550) [Link]

And of course there's the fun and games you can have with hyphens (breaking or non-breaking), hyphen-minus, en-dash, em-dash, and quotation-dash, with an Armenian hyphen thrown in for good measure: http://www.cs.tut.fi/~jkorpela/dashes.html

I puzzled for ages over why the file name of an attachment, copy-pasted from the mail window, didn't overwrite the identical-looking name I'd typed into the "Save File" dialog. The copy-paste had copied a "real" hyphen (breaking or non-breaking, I can't recall), while the typing produced the old-fashioned ersatz ASCII hyphen. They differed in appearance by about one pixel.
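
Python can enumerate the lookalikes:

>>> import unicodedata
>>> for c in '\u002d\u2010\u2011\u2013\u2014\u2015':
...     print(hex(ord(c)), unicodedata.name(c))
...
0x2d HYPHEN-MINUS
0x2010 HYPHEN
0x2011 NON-BREAKING HYPHEN
0x2013 EN DASH
0x2014 EM DASH
0x2015 HORIZONTAL BAR

(U+2015 is the quotation dash; the Armenian hyphen is U+058A.)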

Of bytes and encoded strings

Posted Jan 25, 2014 11:05 UTC (Sat) by khim (subscriber, #9252) [Link]

What does this have to do with anything?

Consider:

$ python2
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len('𝔸𝔹ℂ')
11

$ python3
Python 3.2.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len('𝔸𝔹ℂ')
5

Why is it such a big deal when the length of a three-character string is eleven, but “everything is fine in the world” when its length is five?

P.S. Yes, I know that PEP 393 solved this problem for Python. But since the complaints above include “some databases” and, presumably, other programs and libraries, it's still a problem in practice. The difference is that with UTF-8 you hit the problem pretty soon, while with UTF-16 bugs survive until angry customers start asking for your head.
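
With PEP 393 (Python 3.3 and later), the same session finally answers three:

$ python3
>>> len('𝔸𝔹ℂ')
3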

Re: with UTF-16 bugs survive till angry customers start asking for your head

Posted Jan 31, 2014 7:45 UTC (Fri) by ldo (guest, #40946) [Link]

UTF-16 is a bug. It is used only by certain 1990s-vintage systems that prematurely standardized on Unicode before it was mature (*cough* Sun *cough* Microsoft).

Of bytes and encoded strings

Posted Jan 23, 2014 12:07 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

It was never simple. I mean, it looks simple to someone already versed in ASCII text systems, so long as you don't mind being hopelessly wrong for everything outside basic Latin in a US-centric locale. So for a lot of English-only users, and indeed for much of Western Europe, it's sort-of fine; they already tolerate such nonsense elsewhere.

But yeah, not simple. You don't get to blame anybody still alive because the lack of simplicity is present in the writing systems which are centuries old, and the whole _point_ is to represent existing writing systems. If you don't need the existing writing systems then the whole ball of wax is irrelevant, starting with ASCII.

Sometimes people say "But everything works for Latin, clearly these other writing systems are garbage". There is a tiny speck of truth here; perhaps the Chinese writing system (note: not the topolects) will die out in the next century or so because it is so much clumsier than an alphabet (Chinese academics vary both in the extent to which they think this likely and in whether it's desirable). But mostly this is false, because it misses how incredibly distorted existing text APIs are as a result of Latin. Case shifting and insensitivity? Wacky assumptions about character width? The idea of "punctuation"? Several distinct yet largely interchangeable kinds of "white space"? Blame Latin.

Of bytes and encoded strings

Posted Jan 23, 2014 12:20 UTC (Thu) by anselm (subscriber, #2796) [Link]

> Blame Latin.

It could be worse. It could be Arabic. Or Khmer.

Of bytes and encoded strings

Posted Jan 23, 2014 21:15 UTC (Thu) by hummassa (subscriber, #307) [Link]

I understand that the Khmer alphabet is actually an alphasyllabary, but why would Arabic be worse?

Of bytes and encoded strings

Posted Jan 23, 2014 21:29 UTC (Thu) by anselm (subscriber, #2796) [Link]

Different letter shapes depending on where in the word you are (think »ſ on steroids«)? No proper vowels? Makes Latin script look almost straightforward by comparison :^)

Of bytes and encoded strings

Posted Jan 25, 2014 18:23 UTC (Sat) by jengelh (subscriber, #33263) [Link]

Store it as splines, that'll "fix" it :D

Re: Different letter shapes depending on where in the word

Posted Jan 31, 2014 7:46 UTC (Fri) by ldo (guest, #40946) [Link]

Those different letter shapes are a rendering issue, not an encoding issue. Treating them as an encoding issue is a recipe for madness, and is avoided by all sensible text-handling systems.

Of bytes and encoded strings

Posted Jan 23, 2014 13:11 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Punctuation is older than the Latin alphabet, and the notion of case distinctions appears to have developed in the Greek alphabet at about the same time and for much the same reasons as it did in the Latin alphabet.

Of bytes and encoded strings

Posted Jan 23, 2014 21:42 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Sure, sorry - it wasn't my intention to say that Latin is responsible for introducing these crazy requirements _to the world_, but only that Latin set the requirements for early computer systems.

And further I mean Latin as-in the modern writing system derived from ancient Latin and used by many European languages today, not the actual ancient Roman writing system or the dead language.

