
Of bytes and encoded strings

Posted Jan 27, 2014 10:51 UTC (Mon) by tialaramex (subscriber, #21167)
In reply to: Of bytes and encoded strings by mpr22
Parent article: Of bytes and encoded strings

At least as importantly, the set of precomposed characters is now "locked". The committee has committed to not adding further such characters to the standard, because doing so causes problems (e.g. for Apple). If tomorrow the whole world takes to using some composition that was not previously included in any standards document (say, snowman acute), then too bad: the _only_ way to write it in Unicode is by composing the relevant elements together. That is still better than the previous systems, which couldn't write it at all.
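To illustrate the point, here is a minimal Python sketch using the standard `unicodedata` module. It assumes nothing beyond the Unicode tables shipped with the interpreter: "e" plus a combining acute has a precomposed form that NFC will produce, whereas snowman plus a combining acute has none and stays as two code points.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT (U+0301) has a precomposed equivalent,
# U+00E9, so NFC collapses the pair into a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# SNOWMAN (U+2603) + COMBINING ACUTE ACCENT has no precomposed form,
# so NFC leaves the two code points as they are.
snowman_acute = unicodedata.normalize("NFC", "\u2603\u0301")

print([f"U+{ord(c):04X}" for c in e_acute])
print([f"U+{ord(c):04X}" for c in snowman_acute])
```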


Of bytes and encoded strings

Posted Jan 27, 2014 14:41 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

What problem does it cause? The need to update fonts?

Of bytes and encoded strings

Posted Jan 27, 2014 16:32 UTC (Mon) by cladisch (✭ supporter ✭, #50193) [Link]

If the software were to know about the new character, it could fake it in an old font by combining ☃ and ◌́ manually into ☃́.

However, if you want to do any processing on the text (such as searching for snowmen while ignoring accents), nothing can help you if the software does not know the new character.
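As a rough sketch of that kind of processing (in Python, with `strip_marks` and `contains_ignoring_accents` being names made up for this example), an accent-insensitive search can decompose the text and drop the combining marks. It only works because the library's tables know how each precomposed character decomposes; a precomposed character added after those tables were built would pass through untouched.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose text (NFD) and drop every character in a Mark category."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed
        if not unicodedata.category(c).startswith("M")
    )

def contains_ignoring_accents(haystack: str, needle: str) -> bool:
    """Substring search that ignores combining marks on both sides."""
    return strip_marks(needle) in strip_marks(haystack)

# Both spellings of "café" reduce to the same unaccented string.
print(strip_marks("caf\u00e9") == strip_marks("cafe\u0301"))

# A snowman with an acute accent is still found when searching for a snowman.
print(contains_ignoring_accents("a \u2603\u0301 in the garden", "\u2603"))
```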

Of bytes and encoded strings

Posted Jan 27, 2014 16:35 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

OK, that makes sense. I was wondering why the callout to Apple specifically, but I guess it was just an example of an entity that wouldn't be happy with it.

Of bytes and encoded strings

Posted Jan 27, 2014 17:16 UTC (Mon) by njwhite (guest, #51848) [Link]

Apple have also apparently had good character combining code for a long time, so whereas other companies might prefer precomposed characters because their combined ones look crap, Apple would just prefer combining characters.

Of bytes and encoded strings

Posted Jan 31, 2014 12:06 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Apple's APIs are explicitly defined in terms of a canonicalisation strategy that requires knowing the entire set of (de)compositions. The caller is not obliged to care about canonicalisation, but the implementation promises that it cares: behind the scenes it normalises supplied parameters where necessary. This is particularly important for file naming. For any particular string in Unicode, Apple's normalisation results in just one possible codepoint sequence; the other possibilities are forbidden and must not exist on disk.

The Linux kernel does not care whether you have files with names that are canonically equivalent in Unicode, so long as the names' byte sequences are distinct. This is convenient for programmers, but it means that you, the user, may be faced with a situation in which you know precisely the name of the file, but you cannot (without trial and error) specify that name to Linux because you need to guess which of potentially many possible "spellings" in Unicode was selected. Suppose the file is named café - was that last character U+00E9 or U+0065 U+0301? Unicode says they're equivalent, but Linux just treats every name as a series of bytes without comprehending it, and so calling open() with the wrong "spelling" will fail, whereas on OS X it would always succeed.
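The two "spellings" of café can be compared directly. This is only a sketch in Python and it assumes the names are encoded as UTF-8 before being handed to the kernel; it touches no files, it just shows that the byte sequences differ while the normalised forms agree.

```python
import unicodedata

precomposed = "caf\u00e9"    # é as the single code point U+00E9
decomposed = "cafe\u0301"    # e (U+0065) followed by U+0301

# What a byte-oriented filesystem would actually compare.
precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
same_bytes = precomposed_bytes == decomposed_bytes

# What a normalising layer would compare instead.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(precomposed_bytes, decomposed_bytes)
print(same_bytes, same_after_nfc)
```

An application that wants the OS X behaviour on Linux would have to normalise names itself, consistently, before every call that takes a path.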

In an OS that's case insensitive, the Apple choice here makes sense: you're already carrying around huge case conversion tables, so why not do full normalisation while you're at it? So what if your filesystem is now a tiny bit slower? The vast majority of your customers will never notice. But it does mean that new precomposed characters in Unicode are a big problem, because either you can never support the new Unicode release, or the set of filenames permitted to exist on disk changes from one release to another, which is a nightmare.

Of bytes and encoded strings

Posted Feb 2, 2014 23:09 UTC (Sun) by sdalley (subscriber, #18550) [Link]

And of course there's the fun and games you can have with hyphens (breaking or non-breaking), hyphen-minus, en-dash, and em-dash, quotation-dash, and an Armenian hyphen for good measure.

I was puzzling for ages over why an attachment's file name that I'd copy-pasted from the mail window didn't overwrite the same name I'd typed into the "Save File" dialog. The copy-paste copied a "real" hyphen (breaking or non-breaking, I can't recall), while the typing produced the old-fashioned ersatz ASCII hyphen. They differed in appearance by about one pixel.
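A quick way to spot this sort of thing is to list the dash-like code points in each name. The Python sketch below uses made-up file names and a helper, `describe_dashes`, invented for this example. It picks out characters in the "Pd" (dash punctuation) general category, and also shows that NFKC normalisation does not fold U+2010 HYPHEN into the ASCII hyphen-minus, so normalising would not have helped here.

```python
import unicodedata

def describe_dashes(name: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each dash-punctuation character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c))
        for c in name
        if unicodedata.category(c) == "Pd"
    ]

typed = "report-final.pdf"         # ASCII hyphen-minus from the keyboard
pasted = "report\u2010final.pdf"   # U+2010 HYPHEN, as copy-paste might give

print(describe_dashes(typed))
print(describe_dashes(pasted))

# Compatibility normalisation leaves the two names distinct.
still_different = unicodedata.normalize("NFKC", pasted) != typed
print(still_different)
```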

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds