GNU ed 1.6 released
Posted Jan 3, 2012 0:35 UTC (Tue) by geuder (subscriber, #62854)
In reply to: GNU ed 1.6 released by andrel
Parent article: GNU ed 1.6 released
UTF-8 is not a character set at all, but a variable-length encoding of a 16-bit or 32-bit character set.
8-bit cleanliness was a big issue for most Europeans who wanted to write anything correctly in their mother tongue on a computer in the early 90s. Most editors could only handle 7-bit character sets without quirks.
8-bit character sets were an intermediate step in the 90s. 8 bits per character are enough for most of the bigger European languages (but no single 8-bit character set covers all of them).
But 8 bits don't help the Asians. They need at least 16 bits per character, and because different sets were needed in Europe, it wasn't ideal even there.
I think the majority of software in use has been basically 16-bit Unicode for many years now. Windows, Symbian, and Java use true 16-bit wide characters, while all Linux distributions I have used default to the UTF-8 encoding. The nice thing about UTF-8 is that you can't even tell the difference from old ASCII as long as you stick to 7-bit ASCII characters, because their encoding is identical: 8 bits with the most significant bit set to 0.
Whether ed supports UTF-8 is not said in the announcement. IMHO 8-bit cleanliness means support for 8-bit character sets, i.e. not stripping away or clearing the most significant bit. UTF-8 is more than this: the editor must also be able to handle the variable-length encoding.
But whether or not it can be used for writing texts in the European languages I use regularly, I don't see a reason why I personally would do so using ed. As long as all American programmers remember every day that the world is not 7-bit, and that not even all ASCII characters are reachable on every keyboard without a modifier key, I'm happy. The original question shows that there is work to do, so excuse my long comment.
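A minimal illustration of that encoding point (assuming a UTF-8 locale; the string is arbitrary): dumping the bytes of a short mixed string shows the ASCII bytes unchanged, while the non-ASCII character becomes a multi-byte sequence with the high bit set in every byte.

    #include <stdio.h>

    int main(void)
    {
        /* "ab" is plain ASCII; U+00E4 (a-umlaut) becomes the two bytes
           0xC3 0xA4 in UTF-8. Every non-ASCII byte has its high bit set. */
        const unsigned char s[] = "ab\xC3\xA4";

        for (const unsigned char *p = s; *p; p++)
            printf("0x%02X (high bit %d)\n", *p, *p >> 7);
        /* prints: 0x61 (0), 0x62 (0), 0xC3 (1), 0xA4 (1) */
        return 0;
    }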
Posted Jan 3, 2012 0:56 UTC (Tue) by Karellen (subscriber, #67644)
Anyway, whatever that Wikipedia article means, it's kind of a red herring, as "ed" being 8-bit clean means that it can handle both 8-bit character sets (e.g. ISO-8859-*) and 8-bit character encodings (e.g. UTF-8).
Posted Jan 3, 2012 19:39 UTC (Tue) by blitzkrieg3 (guest, #57873)
This is not true...
Posted Jan 3, 2012 20:01 UTC (Tue) by khim (subscriber, #9252)
UTF-8 is not just a run-of-the-mill variable-length encoding. Ken Thompson modified IBM's original proposal to make sure most algorithms which treat strings as sequences of 8-bit characters remain usable with UTF-8. This means that yes, you can easily use UTF-8 with programs like GNU ed or GNU M4 which know absolutely nothing about UTF-8 but correctly support 8-bit characters in strings.
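A sketch of why that works (hypothetical path string, plain C): every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-oriented search for an ASCII delimiter can never match in the middle of a non-ASCII character.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "naïve/path", with U+00EF encoded as the bytes 0xC3 0xAF.
           The searches below operate on raw bytes and know nothing
           about UTF-8. */
        const char *s = "na\xC3\xAFve/path";

        /* strchr can't false-match inside the 0xC3 0xAF sequence,
           because '/' (0x2F) has its high bit clear while every byte
           of a multi-byte UTF-8 sequence has it set. */
        const char *slash = strchr(s, '/');
        printf("separator at byte offset %td\n", slash - s);

        /* strstr with an ASCII needle is safe for the same reason. */
        printf("substring found: %s\n", strstr(s, "path") ? "yes" : "no");
        return 0;
    }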
Posted Jan 3, 2012 3:28 UTC (Tue) by wahern (subscriber, #37304)
For the time being people have low expectations. But political and technical movements like Simplified Chinese will eventually hit substantial cultural barriers, and the pushback will require that software handle locales which didn't adapt to western syntax. That will mean following the Unicode rules to a T. To follow the Unicode rules you have to use an API for even simple things like "character" iteration, unless the programming language supports the proper semantic text operations, as Perl 6 can over graphemes using its neat NFG hack. Scripts like Thai have no mandatory punctuation, so again you need accessors with a complex built-in rule base to detect, e.g., end-of-sentence. There's no hacking in this kind of support after the fact; it has to be baked into the code.
APIs like ICU are huge, but in many cases they can make code clearer. Unfortunately ICU doesn't get used much, because the rule tables are so gargantuan that virtual memory explodes (though most of that is mmap'd straight from disk), and because programmers are still beholden to their notion of low-level C-like character strings.
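For what it's worth, a minimal sketch of that kind of accessor, using ICU's C break-iterator API (UBRK_CHARACTER iterates grapheme cluster boundaries; link against ICU; error handling trimmed):

    #include <stdio.h>
    #include <unicode/ubrk.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        /* "n" followed by U+0308 COMBINING DIAERESIS: two UTF-16 code
           units, but one user-perceived character (grapheme cluster). */
        const UChar text[] = { 0x006E, 0x0308, 0 };

        UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, 2, &status);
        if (U_FAILURE(status)) return 1;

        /* Count boundaries after the start: one per grapheme cluster. */
        ubrk_first(bi);
        int graphemes = 0;
        while (ubrk_next(bi) != UBRK_DONE)
            graphemes++;

        printf("2 code units, %d grapheme cluster(s)\n", graphemes);  /* 1 */
        ubrk_close(bi);
        return 0;
    }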
In 10-20 years we are going to see a surge in demand for I18N and L10N programmers to refactor all the crap hacks that came out of the 1990s, heralded by Microsoft's and Sun's half-hearted adoption of UTF-16.
Posted Jan 3, 2012 3:34 UTC (Tue) by mjg59 (subscriber, #23239)
"UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value"
So UTF-32 isn't variable length. The sudden rise in the use of emoji and other non-BMP characters means that code which ignores the variable-length nature of UTF-16 is already broken in real-world cases in non-CJK markets, too.
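Concretely, the surrogate-pair arithmetic (U+1F600, the grinning-face emoji, used as an example non-BMP code point): anything above U+FFFF splits into two UTF-16 code units, so naive code-unit counts overstate nothing but "characters".

    #include <stdio.h>

    int main(void)
    {
        unsigned cp = 0x1F600;              /* non-BMP code point */
        unsigned v  = cp - 0x10000;         /* 20 bits to distribute */
        unsigned hi = 0xD800 + (v >> 10);   /* high (lead) surrogate */
        unsigned lo = 0xDC00 + (v & 0x3FF); /* low (trail) surrogate */

        /* One "character", but two UTF-16 code units: 0xD83D 0xDE00. */
        printf("U+%04X -> 0x%04X 0x%04X\n", cp, hi, lo);
        return 0;
    }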
Posted Jan 3, 2012 4:02 UTC (Tue) by wahern (subscriber, #37304)
Question: do all combining sequences have precomposed equivalents? I think all the Latin ones do, but what about other scripts?
Posted Jan 3, 2012 4:04 UTC (Tue) by wahern (subscriber, #37304)
Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char APIs?
A: Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code-points (UTF-32). Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.
(Source: http://unicode.org/faq/utf_bom.html)
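To make the FAQ's "generate a sequence of characters" case concrete, a minimal sketch using ICU's C API (assuming ICU is installed; error handling trimmed): German ß (U+00DF) uppercases to the two-letter sequence "SS", so a code-point-in, code-point-out toupper() cannot return the right answer.

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UChar src[] = { 0x00DF, 0 };  /* "ß", one code unit */
        UChar dest[8];

        /* One input code unit uppercases to two output code units. */
        int32_t n = u_strToUpper(dest, 8, src, 1, "de", &status);
        if (U_FAILURE(status)) return 1;
        printf("uppercased length: %d\n", n);  /* prints 2 */
        return 0;
    }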
Posted Jan 3, 2012 12:02 UTC (Tue) by mpr22 (subscriber, #60784)
Even if you ignore IPA, not all Latin-alphabet combining sequences used in the orthography of natural languages have precomposed code points. For example, as far as I know there is still no precomposed code point for n̈ - and yes, this does have a use other than correctly representing the name of a certain fictional heavy metal band.
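For anyone who wants to check a particular sequence, a minimal sketch using ICU's unorm2 API (assuming a reasonably recent ICU): NFC composes e + U+0301 into the precomposed é, but leaves n + U+0308 as two code points, because no precomposed form exists.

    #include <stdio.h>
    #include <unicode/unorm2.h>

    /* Return the length of the NFC form of a short UTF-16 string. */
    static int32_t nfc_len(const UNormalizer2 *nfc, const UChar *s, int32_t len)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar out[8];
        return unorm2_normalize(nfc, s, len, out, 8, &status);
    }

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
        if (U_FAILURE(status)) return 1;

        const UChar e_acute[] = { 0x0065, 0x0301 };  /* e + combining acute */
        const UChar n_diaer[] = { 0x006E, 0x0308 };  /* n + combining diaeresis */

        /* é has a precomposed form (U+00E9), so NFC shrinks it to 1 unit;
           n̈ has none, so it stays at 2. */
        printf("e+U+0301 -> %d, n+U+0308 -> %d\n",
               nfc_len(nfc, e_acute, 2), nfc_len(nfc, n_diaer, 2));
        return 0;
    }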
Posted Jan 3, 2012 5:56 UTC (Tue) by ssmith32 (subscriber, #72404)
Yet somehow, this is what always happens :D
Posted Jan 3, 2012 12:55 UTC (Tue) by sorpigal (guest, #36106)
There, fixed it for ya.
Posted Jan 3, 2012 23:52 UTC (Tue) by cmccabe (guest, #60281)
UTF-8 works great for what I need. My only wish is that it had been invented sooner, so that people didn't come up with N+1 subtly defective, backwards-incompatible "wide character" solutions.
If I were performing fancy operations on text, I would probably do it in a higher-level language with built-in Unicode support. At that point the encoding should be a non-issue (right?), because the high-level language abstracts it away.
Posted Jan 4, 2012 2:38 UTC (Wed) by tialaramex (subscriber, #21167)
In practice, I can't think of any languages like that. Many of them are built by people who at best assumed other writing systems are just like Latin except with differently shaped squiggles. They often mandate that "text" means "UTF-16 strings" and then blunder into all sorts of problems with filenames, URLs, streams of bytes some idiot stashed in a "text" field on a database, and other things that definitely aren't UTF-16 strings. There may be built-in assumptions about writing direction, the meaning of "character" (a very, very tricky issue) and so on.
As a rule of thumb, if the language claims to be "high level" and yet it has a "character" data type that's distinct from a string, or that can be treated meaningfully as an integer, or it has the same data type for binary data and text, then either they're yanking your chain or they had no idea about Unicode. C has the excuse that Unicode literally didn't exist back then. Languages like Python will have to provide their own excuses.
Some more bad signs:
• Mentions of the "length" of a string that don't either include or point at a multi-paragraph discussion of what "length" means in this context (see the sketch at the end of this comment).
• Discussion of collation or "sorting" strings that doesn't mention locale.
• A string equality operator or comparison method that doesn't come with a multi-paragraph discussion of Unicode equivalence.
Of course a lot of this stuff can be /fixed/ in theory. But fixes after the fact are often messy. They can involve things like deprecated methods on core objects, parallel APIs replacing every mention of "character" with "string", or even inventing another "Unicode string" type and then going around replacing all the other APIs in the system with Unicode-friendly ones, leaving maintenance programmers to handle the debris.
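To make the "length" point above concrete, a plain-C sketch (the string is just an example): the same text has at least three defensible lengths.

    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 code points by skipping continuation bytes (10xxxxxx). */
    static size_t codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        /* "é" written as e + U+0301 (0xCC 0x81 in UTF-8): 3 bytes,
           2 code points, 1 user-perceived character (grapheme). */
        const char *s = "e\xCC\x81";
        printf("bytes: %zu, code points: %zu, graphemes: 1\n",
               strlen(s), codepoints(s));
        return 0;
    }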
Posted Jan 4, 2012 16:33 UTC (Wed) by cmccabe (guest, #60281)
Posted Jan 3, 2012 13:15 UTC (Tue) by bjartur (guest, #67801)
