GNU ed 1.6 released
Posted Jan 3, 2012 0:35 UTC (Tue) by geuder (subscriber, #62854)
In reply to: GNU ed 1.6 released by andrel
Parent article: GNU ed 1.6 released
UTF-8 is not a character set at all, but a variable-length encoding of a 16-bit or 32-bit character set.
8-bit cleanliness was a big issue for most Europeans who wanted to write anything correctly in their mother tongue on a computer in the early 90s. Most editors could only handle 7-bit character sets without quirks.
8-bit character sets were an intermediate step in the 90s. 8 bits per character are enough for most of the bigger European languages (but no single 8-bit character set covers all of them).
But 8 bits don't help the Asians. They need at least 16 bits per character, and because different sets were needed in Europe, it wasn't ideal even there.
I think the majority of software in use has been basically 16-bit Unicode for many years now. Windows, Symbian, and Java use true 16-bit wide characters, while all Linux distributions I have used default to the UTF-8 encoding. The nice thing about UTF-8 is that you can't even tell the difference from old ASCII as long as you stick to 7-bit ASCII characters, because their encoding is identical: 8 bits with the most significant bit set to 0.
Whether ed supports UTF-8 is not said in the announcement. IMHO 8-bit cleanliness means support for 8-bit character sets, i.e. not stripping away or clearing the most significant bit. UTF-8 is more than this: the editor must also be able to handle the variable-length encoding.
But whether or not it can be used for writing texts in the European languages I use regularly, I don't see a reason why I personally would do so using ed. As long as all American programmers remember every day that the world is not 7-bit, and that not even all ASCII characters are reachable on every keyboard without a modifier key, I'm happy. The original question shows that there is work to do, so excuse my long comment.
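A minimal illustration of that encoding point (assuming a UTF-8 locale; the string is arbitrary): dumping the bytes of a short mixed string shows the ASCII bytes unchanged, while the non-ASCII character becomes a multi-byte sequence with the high bit set in every byte.

    #include <stdio.h>

    int main(void)
    {
        /* "ab" is plain ASCII; U+00E4 (a-umlaut) becomes the two bytes
           0xC3 0xA4 in UTF-8. Every non-ASCII byte has its high bit set. */
        const unsigned char s[] = "ab\xC3\xA4";

        for (const unsigned char *p = s; *p; p++)
            printf("0x%02X (high bit %d)\n", *p, *p >> 7);
        /* prints: 0x61 (0), 0x62 (0), 0xC3 (1), 0xA4 (1) */
        return 0;
    }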
Posted Jan 3, 2012 0:56 UTC (Tue) by Karellen (subscriber, #67644)
Anyway, whatever that Wikipedia article means, it's kind of a red herring, as "ed" being 8-bit clean means that it can handle both 8-bit character sets (e.g. ISO-8859-*) and 8-bit character encodings (e.g. UTF-8).
Posted Jan 3, 2012 19:39 UTC (Tue) by blitzkrieg3 (guest, #57873)
This is not true...
Posted Jan 3, 2012 20:01 UTC (Tue) by khim (subscriber, #9252)
UTF-8 is not just a run-of-the-mill variable-length encoding. Ken Thompson modified IBM's original proposal to make sure most algorithms which treat strings as sequences of 8-bit characters remain usable with UTF-8. This means that yes, you can easily use UTF-8 with programs like GNU ed or GNU M4 which know absolutely nothing about UTF-8 but correctly support 8-bit characters in strings.
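A sketch of why that works (hypothetical path string, plain C): every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-oriented search for an ASCII delimiter can never match in the middle of a non-ASCII character.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "naïve/path", with U+00EF encoded as the bytes 0xC3 0xAF.
           The searches below operate on raw bytes and know nothing
           about UTF-8. */
        const char *s = "na\xC3\xAFve/path";

        /* strchr can't false-match inside the 0xC3 0xAF sequence,
           because '/' (0x2F) has its high bit clear while every byte
           of a multi-byte UTF-8 sequence has it set. */
        const char *slash = strchr(s, '/');
        printf("separator at byte offset %td\n", slash - s);

        /* strstr with an ASCII needle is safe for the same reason. */
        printf("substring found: %s\n", strstr(s, "path") ? "yes" : "no");
        return 0;
    }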
Posted Jan 3, 2012 3:28 UTC (Tue) by wahern (subscriber, #37304)
For the time being people have low expectations. But political and technical movements like Simplified Chinese will eventually hit substantial cultural barriers, and the pushback will require that software handle locales which didn't adapt to western syntax. That will mean following the Unicode rules to a T. To follow the Unicode rules you have to use an API for even simple things like "character" iteration, unless the programming language supports the proper semantic text operations, as Perl 6 can over graphemes using its neat NFG hack. Scripts like Thai have no mandatory punctuation, so again you need accessors with a complex built-in rule base to detect, e.g., end-of-sentence. There's no hacking in this kind of support after the fact; it has to be baked into the code.
APIs like ICU are huge, but in many cases they can make code clearer. Unfortunately ICU doesn't get used much, because the rule tables are so gargantuan that virtual memory explodes (though most of that is mmap'd straight from disk), and because programmers are still beholden to their notion of low-level C-like character strings.
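For what it's worth, a minimal sketch of that kind of accessor, using ICU's C break-iterator API (UBRK_CHARACTER iterates grapheme cluster boundaries; link against ICU; error handling trimmed):

    #include <stdio.h>
    #include <unicode/ubrk.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        /* "n" followed by U+0308 COMBINING DIAERESIS: two UTF-16 code
           units, but one user-perceived character (grapheme cluster). */
        const UChar text[] = { 0x006E, 0x0308, 0 };

        UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, 2, &status);
        if (U_FAILURE(status)) return 1;

        /* Count boundaries after the start: one per grapheme cluster. */
        ubrk_first(bi);
        int graphemes = 0;
        while (ubrk_next(bi) != UBRK_DONE)
            graphemes++;

        printf("2 code units, %d grapheme cluster(s)\n", graphemes);  /* 1 */
        ubrk_close(bi);
        return 0;
    }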
In 10-20 years we are going to see a surge in demand for I18N and L10N programmers to refactor all the crap hacks that came out of the 1990s, heralded by Microsoft's and Sun's half-hearted adoption of UTF-16.
Posted Jan 3, 2012 3:34 UTC (Tue) by mjg59 (subscriber, #23239)
"UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value"
So UTF-32 isn't variable length. The sudden rise in the use of emoji and other non-BMP characters means that code which ignores the variable-length nature of UTF-16 is already broken in real-world cases in non-CJK markets, too.
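Concretely, the surrogate-pair arithmetic (U+1F600, the grinning-face emoji, used as an example non-BMP code point): anything above U+FFFF splits into two UTF-16 code units, so naive code-unit counts overstate nothing but "characters".

    #include <stdio.h>

    int main(void)
    {
        unsigned cp = 0x1F600;              /* non-BMP code point */
        unsigned v  = cp - 0x10000;         /* 20 bits to distribute */
        unsigned hi = 0xD800 + (v >> 10);   /* high (lead) surrogate */
        unsigned lo = 0xDC00 + (v & 0x3FF); /* low (trail) surrogate */

        /* One "character", but two UTF-16 code units: 0xD83D 0xDE00. */
        printf("U+%04X -> 0x%04X 0x%04X\n", cp, hi, lo);
        return 0;
    }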
Posted Jan 3, 2012 4:02 UTC (Tue) by wahern (subscriber, #37304)
Question: do all combining sequences have precomposed equivalents? I think all the Latin ones do, but what about other scripts?
Posted Jan 3, 2012 4:04 UTC (Tue) by wahern (subscriber, #37304)
Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char APIs?
A: Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code-points (UTF-32). Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.
(Source: http://unicode.org/faq/utf_bom.html)
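To make the FAQ's "generate a sequence of characters" case concrete, a minimal sketch using ICU's C API (assuming ICU is installed; error handling trimmed): German ß (U+00DF) uppercases to the two-letter sequence "SS", so a code-point-in, code-point-out toupper() cannot return the right answer.

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UChar src[] = { 0x00DF, 0 };  /* "ß", one code unit */
        UChar dest[8];

        /* One input code unit uppercases to two output code units. */
        int32_t n = u_strToUpper(dest, 8, src, 1, "de", &status);
        if (U_FAILURE(status)) return 1;
        printf("uppercased length: %d\n", n);  /* prints 2 */
        return 0;
    }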
Posted Jan 3, 2012 12:02 UTC (Tue) by mpr22 (subscriber, #60784)
Even if you ignore IPA, not all Latin-alphabet combining sequences used in the orthography of natural languages have precomposed code points. For example, as far as I know there is still no precomposed code point for n̈ - and yes, this does have a use other than correctly representing the name of a certain fictional heavy metal band.
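For anyone who wants to check a particular sequence, a minimal sketch using ICU's unorm2 API (assuming a reasonably recent ICU): NFC composes e + U+0301 into the precomposed é, but leaves n + U+0308 as two code points, because no precomposed form exists.

    #include <stdio.h>
    #include <unicode/unorm2.h>

    /* Return the length of the NFC form of a short UTF-16 string. */
    static int32_t nfc_len(const UNormalizer2 *nfc, const UChar *s, int32_t len)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar out[8];
        return unorm2_normalize(nfc, s, len, out, 8, &status);
    }

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
        if (U_FAILURE(status)) return 1;

        const UChar e_acute[] = { 0x0065, 0x0301 };  /* e + combining acute */
        const UChar n_diaer[] = { 0x006E, 0x0308 };  /* n + combining diaeresis */

        /* é has a precomposed form (U+00E9), so NFC shrinks it to 1 unit;
           n̈ has none, so it stays at 2. */
        printf("e+U+0301 -> %d, n+U+0308 -> %d\n",
               nfc_len(nfc, e_acute, 2), nfc_len(nfc, n_diaer, 2));
        return 0;
    }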
Posted Jan 3, 2012 5:56 UTC (Tue) by ssmith32 (subscriber, #72404)
Yet somehow, this is what always happens :D
Posted Jan 3, 2012 12:55 UTC (Tue) by sorpigal (guest, #36106)
There, fixed it for ya.
Posted Jan 3, 2012 23:52 UTC (Tue) by cmccabe (guest, #60281)
UTF-8 works great for what I need. My only wish is that it had been invented sooner, so that people didn't come up with N+1 subtly defective, backwards-incompatible "wide character" solutions.
If I were performing fancy operations on text, I would probably do it in a higher-level language with built-in Unicode support. At that point the encoding should be a non-issue (right?), because the high-level language abstracts it away.
Posted Jan 4, 2012 2:38 UTC (Wed) by tialaramex (subscriber, #21167)
In practice, I can't think of any languages like that. Many of them are built by people who at best assumed other writing systems are just like Latin except with differently shaped squiggles. They often mandate that "text" means "UTF-16 strings" and then blunder into all sorts of problems with filenames, URLs, streams of bytes some idiot stashed in a "text" field on a database, and other things that definitely aren't UTF-16 strings. There may be built-in assumptions about writing direction, the meaning of "character" (a very, very tricky issue) and so on.
As a rule of thumb, if the language claims to be "high level" and yet it has a "character" data type that's distinct from a string, or that can be treated meaningfully as an integer, or it has the same data type for binary data and text, then either they're yanking your chain or they had no idea about Unicode. C has the excuse that Unicode literally didn't exist back then. Languages like Python will have to provide their own excuses.
Some more bad signs:
• Mentions of the "length" of a string that don't either include or point at a multi-paragraph discussion of what "length" means in this context (see the sketch at the end of this comment).
• Discussion of collation or "sorting" strings that doesn't mention locale.
• A string equality operator or comparison method that doesn't come with a multi-paragraph discussion of Unicode equivalence.
Of course a lot of this stuff can be /fixed/ in theory. But fixes after the fact are often messy. They can involve things like deprecated methods on core objects, parallel APIs replacing every mention of "character" with "string", or even inventing another "Unicode string" type and then going around replacing all the other APIs in the system with Unicode-friendly ones, leaving maintenance programmers to handle the debris.
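To make the "length" point above concrete, a plain-C sketch (the string is just an example): the same text has at least three defensible lengths.

    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 code points by skipping continuation bytes (10xxxxxx). */
    static size_t codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        /* "é" written as e + U+0301 (0xCC 0x81 in UTF-8): 3 bytes,
           2 code points, 1 user-perceived character (grapheme). */
        const char *s = "e\xCC\x81";
        printf("bytes: %zu, code points: %zu, graphemes: 1\n",
               strlen(s), codepoints(s));
        return 0;
    }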
Posted Jan 4, 2012 16:33 UTC (Wed) by cmccabe (guest, #60281)
Posted Jan 3, 2012 13:15 UTC (Tue) by bjartur (guest, #67801)
