Of bytes and encoded strings

Posted Jan 25, 2014 21:50 UTC (Sat) by HelloWorld (guest, #56129)
In reply to: Of bytes and encoded strings by lsl
Parent article: Of bytes and encoded strings

> There's no practical way around it. Using UCS-4 gives you random access to code points. The problem with that is that the user of your application probably has a totally different conception of a character.
What problems are there other than combining characters? Those can probably be dealt with by normalizing strings to use precombined characters. I don't understand the need for combining characters anyway, but I'd like to...

Of bytes and encoded strings

Posted Jan 26, 2014 2:43 UTC (Sun) by foom (subscriber, #14868) [Link]

No, you can't actually precombine. Not all possible combinations exist, nor can they.

Look up "grapheme".

Of bytes and encoded strings

Posted Jan 26, 2014 10:18 UTC (Sun) by mpr22 (subscriber, #60784) [Link]

Sometimes, people want to use Unicode to write languages whose orthography has not been blessed by a national standards committee; it is not the job of a bunch of rich industrialized people to tell the rest of the world what combinations of diacriticals they're allowed to use and which letters they're allowed to use them on.

Of bytes and encoded strings

Posted Jan 27, 2014 10:51 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

At least as importantly, the set of precomposed characters is now "locked". The committee has committed to not adding further such characters to the standard because it causes problems (e.g. for Apple). If tomorrow the whole world takes to using some composition that was previously not included in any standards document (say, snowman acute) then too bad, the _only_ way to write it in Unicode is by composing the relevant elements together which is still better than those previous systems that couldn't write it at all.
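A minimal Python sketch of that point, assuming the standard `unicodedata` module and its NFC normalisation (the helper name `nfc_length` is made up for illustration): "e" plus a combining acute collapses to one precomposed code point, whereas snowman plus a combining acute has no precomposed form and stays as two.

```python
import unicodedata

def nfc_length(text: str) -> int:
    """Number of code points left after NFC (composing) normalisation."""
    return len(unicodedata.normalize("NFC", text))

e_acute = "e\u0301"              # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
snowman_acute = "\u2603\u0301"   # SNOWMAN + COMBINING ACUTE ACCENT

# A precomposed U+00E9 exists, so NFC folds the pair into one code point.
print(nfc_length(e_acute))

# No precomposed "snowman acute" exists, so the pair survives NFC unchanged.
print(nfc_length(snowman_acute))
```

So normalising to precomposed forms cannot, in general, give one code point per user-perceived character.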

Of bytes and encoded strings

Posted Jan 27, 2014 14:41 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

What problem does it cause? The need to update fonts?

Of bytes and encoded strings

Posted Jan 27, 2014 16:32 UTC (Mon) by cladisch (✭ supporter ✭, #50193) [Link]

If the software were to know about the new character, it could fake it in an old font by combining ☃ and ◌́ manually into ☃́.

However, if you want to do any processing on the text (such as searching for snowmen while ignoring accents), nothing can help you if the software does not know the new character.
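As a rough illustration of "searching for snowmen while ignoring accents", here is a Python sketch (the function names are hypothetical). It decomposes the text with NFD and drops code points with a non-zero canonical combining class. It only works because the decomposition tables shipped with the interpreter know every precomposed character in the input, which is exactly the dependency described above.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop code points with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def contains_ignoring_accents(haystack: str, needle: str) -> bool:
    """Substring search that ignores combining accents on both sides."""
    return strip_marks(needle) in strip_marks(haystack)

text = "I built a \u2603\u0301 and then went to the caf\u00e9."
print(contains_ignoring_accents(text, "\u2603"))  # plain snowman
print(contains_ignoring_accents(text, "cafe"))    # unaccented spelling
```

A precomposed character added in a Unicode release newer than the library's tables would pass through `strip_marks` untouched, and the search would miss it.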

Of bytes and encoded strings

Posted Jan 27, 2014 16:35 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

OK, that makes sense. I was wondering why the callout to Apple specifically, but I guess it was just an example of an entity that wouldn't be happy with it.

Of bytes and encoded strings

Posted Jan 27, 2014 17:16 UTC (Mon) by njwhite (guest, #51848) [Link]

Apple have also apparently had good character combining code for a long time, so whereas other companies might prefer precomposed characters because their combined ones look crap, Apple would just prefer combining characters.

Of bytes and encoded strings

Posted Jan 31, 2014 12:06 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Apple's APIs are explicitly defined in terms of a canonicalisation strategy that requires knowing the entire set of (de)compositions. The caller is not obliged to care about canonicalisation, but the implementation promises that it cares: behind the scenes it normalises supplied parameters where necessary. This is particularly important for file naming. For any particular string in Unicode, Apple's normalisation results in just one possible codepoint sequence; the other possibilities are forbidden and must not exist on disk.

The Linux kernel does not care whether you have files with names that are canonically equivalent in Unicode, so long as the names' byte sequences are distinct. This is convenient for programmers, but it means that you, the user, may be faced with a situation in which you know precisely the name of the file, but you cannot (without trial and error) specify that name to Linux because you need to guess which of potentially many possible "spellings" in Unicode was selected. Suppose the file is named café - was that last character U+00E9 or U+0065 U+0301? Unicode says they're equivalent, but Linux just treats all the names as a series of bytes without interpreting them, so calling open() with the wrong "spelling" will fail, whereas on OS X it would always succeed.
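To make the café case concrete, here is a small Python sketch. The helper `find_equivalent_names` is hypothetical, and comparing NFC forms is just one reasonable choice. It shows that the two spellings encode to different UTF-8 bytes, then works around the guessing game by listing the directory and comparing names after normalisation.

```python
import os
import unicodedata

precomposed = "caf\u00e9"   # ends in U+00E9
decomposed = "cafe\u0301"   # ends in U+0065 U+0301

# Different byte sequences, so a byte-oriented filesystem sees two names.
print(precomposed.encode("utf-8"))  # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'cafe\xcc\x81'

def find_equivalent_names(directory: str, wanted: str) -> list[str]:
    """Return the on-disk names that are canonically equivalent to `wanted`."""
    target = unicodedata.normalize("NFC", wanted)
    return [
        name
        for name in os.listdir(directory)
        if unicodedata.normalize("NFC", name) == target
    ]
```

On a filesystem that stores names as raw bytes, the list can contain more than one entry, since both spellings may exist side by side.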

In an OS that's case insensitive the Apple choice here makes sense: you're already carrying around huge case conversion tables, so why not do full normalisation while you're at it? So what if your filesystem is now a tiny bit slower; the vast majority of your customers will never notice. But it does mean that new precomposed characters in Unicode are a big problem, because either you can never support the new Unicode release, or the set of filenames permitted to exist on disk changes from one release to another, which is a nightmare.

Of bytes and encoded strings

Posted Feb 2, 2014 23:09 UTC (Sun) by sdalley (subscriber, #18550) [Link]

And of course there's the fun and games you can have with hyphens (breaking or non-breaking), hyphen-minus, en-dash, em-dash, quotation-dash, and an Armenian hyphen for good measure.

I was puzzling for ages over why the file name of an attachment that I'd copy-pasted from the mail window didn't overwrite the same name I'd typed into the "Save File" dialog. The copy-paste copied a "real" hyphen (breaking or non-breaking, I can't recall); the typing produced the old-fashioned ersatz ASCII hyphen. They differed in appearance by about one pixel.
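One way to spot this kind of look-alike is to print the code points. The Python sketch below uses a made-up `describe` helper and example file names; taking U+2015 HORIZONTAL BAR as the "quotation dash" is an assumption. It also shows that even NFKC normalisation does not fold U+2010 HYPHEN into the ASCII hyphen-minus.

```python
import unicodedata

# Hyphen-minus, hyphen, non-breaking hyphen, en dash, em dash,
# horizontal bar (assumed "quotation dash"), Armenian hyphen.
LOOKALIKES = "\u002d\u2010\u2011\u2013\u2014\u2015\u058a"

def describe(text: str) -> list[str]:
    """List each code point in `text` with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

for line in describe(LOOKALIKES):
    print(line)

typed = "report-2014.txt"        # ASCII hyphen-minus from the keyboard
pasted = "report\u20102014.txt"  # U+2010 HYPHEN from a copy-paste

# The names differ, and compatibility normalisation does not unify them.
print(typed == pasted)
print(unicodedata.normalize("NFKC", typed) == unicodedata.normalize("NFKC", pasted))
```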

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds