Filesystems and case-insensitivity

Posted Nov 28, 2018 16:11 UTC (Wed) by willy (subscriber, #9762)
Parent article: Filesystems and case-insensitivity

Chinese characters use a 3-byte encoding in UTF-8, not 4. The CJK Unified Ideographs block runs from U+4E00 to U+9FFF.

There are extension blocks in the U+20000 range which need 4 bytes, but my understanding is that those are rare characters (the most common 27,000 characters are below U+FFFF).

The language groups worst affected by UTF-8 were the Cyrillic and Greek users, who now need two bytes for every letter. But I don't see what better choice there was.
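For anyone who wants to check these byte counts, here is a minimal Python sketch. The helper name `utf8_len` and the sample characters are just illustrative choices, not anything from the article.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative sample: one character per script discussed above.
samples = {
    "Latin A": "A",
    "Cyrillic я": "я",
    "Greek α": "α",
    "CJK U+4E00": "\u4e00",
    "CJK U+9FFF": "\u9fff",
    "CJK Extension B U+20000": "\U00020000",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s)")
```

Everything up to U+007F takes one byte, U+0080 to U+07FF takes two (which covers Greek and Cyrillic), U+0800 to U+FFFF takes three, and anything above that takes four.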

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:17 UTC (Thu) by andrewsh (subscriber, #71043)

There wasn’t, since before UTF-8 there were at least three popular one-byte encodings with totally incompatible character layouts.

Historically, DOS systems had three encodings for Russian, of which two were almost never used: one of them wasn’t compatible with line/block graphics characters (so no Norton Commander for its users), and another was developed only after the third one had gained widespread usage. That remaining one, which Microsoft branded as CP866, itself existed in multiple versions that differed in which characters were present (e.g. ё vs ± or Ў vs ÷).

Next, Windows came with this wonderful distinction between ‘ANSI’, ‘OEM’ and ‘Unicode’. ‘OEM’, of course, was CP866 for Russian-localised Windows but something else in other versions, and ‘ANSI’ was CP1251 (again, in Russian Windows only). So historically a lot of documents have been created in those two encodings. Most importantly, most ZIP archives had file names encoded in CP866, but some of them in CP1251.
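To make the ZIP file name problem concrete, here is a rough Python sketch using the standard `cp866` and `cp1251` codecs. The file name is made up for illustration, and this only shows the byte-level mismatch, not how any particular archiver behaves.

```python
name = "Привет.txt"  # made-up file name, for illustration only

raw_oem = name.encode("cp866")    # bytes under the 'OEM' code page
raw_ansi = name.encode("cp1251")  # bytes under the 'ANSI' code page

# Same text, same length, but different bytes for every Cyrillic letter.
print(raw_oem.hex())
print(raw_ansi.hex())

# Decoding with the wrong code page yields garbage instead of the
# original name; nothing in the bytes says which code page was used.
wrong = raw_oem.decode("cp1251", errors="replace")
print(wrong)

# Only the matching code page recovers the name.
right = raw_oem.decode("cp866")
print(right)
```

Since both are one-byte encodings, a reader of the archive has to guess the code page from context or try both.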

Since ‘ANSI’ was the default in Windows XP and that is still being used in lots of places, such documents are still being produced as we speak™.

Oh, and the third encoding, which an absolute minority still uses, is KOI8-R/U. Only die-hard Unixoids use it these days, because it allows them to strip bit 7 and still be able to read text in Russian. That was a deliberate design decision when the charset was developed: most letters sit at the position of their phonetic Latin equivalents in inverse case, but with bit 7 set (A → а, B → б, C → ц, but, for example, Q → я and V → ж). That encoding was traditionally used on Unix and Linux systems.
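The bit-7 trick is easy to try out with Python's `koi8_r` codec. This is only a small sketch; the helper name `strip_bit7` is my own.

```python
def strip_bit7(text: str) -> str:
    """Encode text as KOI8-R, clear the high bit of every byte,
    and read the result back as plain ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")

# The letter pairs mentioned above: lowercase Cyrillic lands on
# uppercase Latin, hence "inverse case".
for ch in "абцяж":
    print(ch, "->", strip_bit7(ch))

# A whole word stays roughly readable after losing the high bit.
print(strip_bit7("привет"))
```

Uppercase Cyrillic comes out as lowercase Latin in the same way, so text that went through a 7-bit channel has flipped case but is still legible to a Russian reader.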