
Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:41 UTC (Mon) by wtarreau (subscriber, #51152)
In reply to: Recent improvements in GCC diagnostics by dvdeug
Parent article: Recent improvements in GCC diagnostics

> You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.

Yes, but that was already well known. Those of us coming from the DOS world were used to 1-for-1 character replacement; I even got used to reading an "é" when an "Ä" appeared on screen. The problem with UTF-8 is its variable size, which breaks when facing unexpected sequences, particularly on rollback: it was decided that the encoding was probably robust enough to support backspace directly instead of buffering the input. As a result the Linux console itself is broken. Just boot with init=/bin/sh, set your locale to latin1, type "é" then backspace, and discover how you eat the prompt. I reported this more than ten years ago and was told "we know, but it would be difficult to do better"...
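
To see the size mismatch concretely, here is a quick Python sketch (just an illustration of the byte counts involved, not the console code itself):

    s = "é"
    print(list(s.encode("utf-8")))    # [195, 169] -- two bytes on the wire
    print(list(s.encode("latin-1")))  # [233]      -- one byte expected

    # Read as Latin-1, the two UTF-8 bytes become two separate glyphs,
    # so a single backspace erases only the second one:
    print(s.encode("utf-8").decode("latin-1"))  # 'Ã©'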

> Do you want to view changelogs on Debian?

I don't, but there are far fewer problems reading UTF-8 on an ISO terminal than the other way around: at worst I get a few characters I don't care about, and that's all. That is much better than invisible characters stuck in the middle of nowhere, the invisible non-breaking space that some people mistakenly insert with alt+space and that breaks their command lines, RTL text that makes the cursor go wild when editing a line, and so on.
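
For example, here is a small sketch of the NBSP problem, using Python's shlex as a stand-in for shell word splitting:

    import shlex

    # Shells split words on ASCII whitespace; U+00A0 (no-break space)
    # is not ASCII whitespace, so it glues the arguments together.
    good = "ls -l"        # ordinary space
    bad = "ls\u00a0-l"    # alt+space NBSP, visually identical

    print(shlex.split(good))  # ['ls', '-l']
    print(shlex.split(bad))   # ['ls\xa0-l'] -- a single unknown command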

Don't get me wrong, I do understand that other languages need more bits to store their characters. What I don't like is the huge abuse of replacing standard characters with new ones that bring no value, or even with emoji (since when does a character need to contain colors other than the font's?).



Recent improvements in GCC diagnostics

Posted Oct 16, 2023 12:36 UTC (Mon) by mathstuf (subscriber, #69389)

> What I don't like is the huge abuse of replacing standard characters with new ones that bring no value, or even with emoji (since when does a character need to contain colors other than the font's?).

Since people want to be able to express themselves in the ways their culture has made common. Unicode is far more descriptive than prescriptive, and that's for the best IMNSHO. IRC had :) and the like; with more pixels available, people obviously want to do more. I'm not the greatest fan of emoji, but it is far better than slinging raw images around.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 14:32 UTC (Mon) by dvdeug (guest, #10998)

> The problem with UTF-8 is its variable size, which breaks when facing unexpected sequences,

That is Unix's responsibility; had Microsoft had its way, we'd be using UTF-16.

> I do understand that other languages need more bits to store their characters. What I don't like is the huge abuse of ...

That's a cop-out. None of the complaints above has anything to do with emoji. They all stem from the inevitable problems of needing more bits and of mixing right-to-left with left-to-right languages. No solution could have done much better in that respect. Either we have a constant-length code of 16 or 32 bits, or we have a variable-length code like UTF-8, or we have a codepage-switching mechanism (every codepage mechanism that has supported CJK has also been variable-length; a single-byte codepage-switching mechanism would be horribly inefficient for Chinese).
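
To be concrete about the variable-length case: the length of a UTF-8 sequence is fully determined by its lead byte. A small sketch of that rule (my own illustration, not taken from any particular implementation):

    def utf8_seq_len(lead: int) -> int:
        """Length of a UTF-8 sequence, derived from its lead byte."""
        if lead < 0x80:
            return 1        # 0xxxxxxx: plain ASCII
        if lead < 0xC0:
            raise ValueError("continuation byte cannot start a sequence")
        if lead < 0xE0:
            return 2        # 110xxxxx
        if lead < 0xF0:
            return 3        # 1110xxxx
        if lead < 0xF8:
            return 4        # 11110xxx
        raise ValueError("invalid lead byte")

    for ch in ("A", "é", "中", "😀"):
        b = ch.encode("utf-8")
        assert utf8_seq_len(b[0]) == len(b)   # 1, 2, 3 and 4 bytes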

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 15:38 UTC (Mon) by rschroev (subscriber, #4164)

UTF-16 is also variable-size. Its predecessor UCS-2 was fixed-size, but it soon became clear that two bytes simply aren't enough. Microsoft's (and Java's, I believe) bet on UCS-2 as a way to avoid variable size didn't pay off; they have to deal with it just as Unix does.
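
To illustrate (just a sketch of my own): any code point above U+FFFF takes two 16-bit units in UTF-16, a surrogate pair:

    bmp = "é"      # U+00E9, inside the Basic Multilingual Plane
    emoji = "😀"   # U+1F600, outside it

    print(len(bmp.encode("utf-16-le")) // 2)    # 1 unit
    print(len(emoji.encode("utf-16-le")) // 2)  # 2 units: surrogate pair
    print(emoji.encode("utf-16-be").hex())      # 'd83dde00' -- high + low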

Even with fixed-length UTF-32, glyphs are often composed of multiple code points.
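
For instance (again just a sketch), the same visible "é" can be one precomposed code point or a base letter plus a combining accent, so counting UTF-32 units still doesn't count glyphs:

    import unicodedata

    composed = "\u00e9"      # é as one precomposed code point
    decomposed = "e\u0301"   # é as 'e' + combining acute accent

    print(len(composed), len(decomposed))   # 1 vs 2 code points, same glyph
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True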

None of this is the responsibility of Unix. It's just the consequence of the complexity of human language.

