Recent improvements in GCC diagnostics [LWN.net]

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 16:33 UTC (Sat) by mb (subscriber, #50428) [Link] (21 responses)

>I continue to use iso-8859-1

And that is exactly why UTF-8 text is broken on your machine.

>you constantly get your terminal mangled with invisible bytes that break code sequences,
>backspace that sometimes fails to remove offending bytes or even eats the prompt

That used to happen all the time back in the bad old days, when everybody configured some other iso-xxx encoding for their machine and applications.
Since everybody uses UTF-8 these problems are completely gone.

ASCII or any other country-specific encoding is only usable if you only have US American texts on your system. As soon as you receive text from somebody else, it immediately breaks.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 20:21 UTC (Sat) by dave_malcolm (subscriber, #15013) [Link] (4 responses)

I added an option to control which Unicode characters GCC will use for these diagrams.

It's possible to select pure ASCII with -fdiagnostics-text-art-charset=ascii ; here's the example I posted earlier, but specifying ASCII output.

If there are some worthwhile heuristics for sniffing the terminal connection to affect the default, that might be worth considering; we already have some logic for deciding whether to emit SGR codes for embedding URLs so maybe we should do similar for the text-art character set?

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 5:26 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (2 responses)

Well, the vast majority of programs I'm seeing just respect the configured locale and adapt to it (via the LANG variable). Even gcc seems to do it when printing warnings: for example, it will use plain quotes instead of some other form of brackets to quote text, so you might very well have the info there already. Just test the program with LANG=C; you might already have the info you need internally.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 23:12 UTC (Mon) by dave_malcolm (subscriber, #15013) [Link] (1 responses)

FWIW I've now added a special-case so that GCC will default to pure ASCII for such diagrams if LANG=C is in the environment. The patch is here.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 23:56 UTC (Mon) by ABCD (subscriber, #53650) [Link]

Shouldn't you be looking at the LC_CTYPE and LC_ALL variables as well if you are looking at LANG (as LC_* overrides LANG and LC_ALL overrides everything)? Additionally, I believe that it is expected that LANG=POSIX and LANG=C should behave identically.
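To make the precedence concrete, here is a minimal sketch of how the effective LC_CTYPE setting could be resolved from the three variables. The function names `effective_ctype` and `is_c_or_posix` are hypothetical helpers for illustration, not anything in GCC; the fallback to "C" when nothing is set is an assumption (POSIX leaves that default implementation-defined).

```c
#include <stddef.h>
#include <string.h>

/* Pick the value that governs LC_CTYPE, in POSIX precedence order:
   LC_ALL first, then LC_CTYPE, then LANG.  Unset (NULL) or empty
   values are skipped.  The values are passed in as parameters so the
   logic can be exercised without modifying the environment.  */
const char *effective_ctype(const char *lc_all, const char *lc_ctype,
                            const char *lang)
{
    const char *candidates[] = { lc_all, lc_ctype, lang };
    for (size_t i = 0; i < 3; i++)
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    return "C";  /* assumed default when nothing is set */
}

/* "C" and "POSIX" name the same locale and are treated identically.  */
int is_c_or_posix(const char *name)
{
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}
```

In a real program the arguments would come from getenv("LC_ALL"), getenv("LC_CTYPE") and getenv("LANG").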

Looking further into this, it appears that perhaps the best answer would be to do something like this:

#include <langinfo.h>
#include <string.h>   /* for strcmp */

/* ... */

  /* nl_langinfo reports the codeset of the locale chosen via setlocale,
     so this assumes setlocale has already been called.  */
  const char *charset = nl_langinfo (CODESET);
  /* If the current locale's charset is ASCII, don't assume that the terminal supports anything else.  */
  if (!strcmp (charset, "ANSI_X3.4-1968"))
    text_art_charset = DIAGNOSTICS_TEXT_ART_CHARSET_ASCII;
  diagnostics_text_art_charset_init (context, text_art_charset);

Another option might be to test the charset for UTF-8 explicitly, instead of assuming anything that isn't ANSI_X3.4-1968 can support the line drawing characters.
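A rough sketch of that second option, testing for UTF-8 explicitly rather than excluding ASCII. The names `codeset_is_utf8` and `locale_is_utf8` are hypothetical; accepting the "utf8" spelling as well as "UTF-8" is an assumption about what different C libraries might report.

```c
#include <langinfo.h>
#include <strings.h>

/* Nonzero if the codeset name denotes UTF-8.  The comparison is
   case-insensitive and also accepts the hyphen-less spelling.  */
int codeset_is_utf8(const char *codeset)
{
    return strcasecmp(codeset, "UTF-8") == 0
           || strcasecmp(codeset, "UTF8") == 0;
}

/* Query the current locale.  The caller is expected to have called
   setlocale (LC_ALL, "") or similar beforehand; otherwise the
   program is still in the "C" locale.  */
int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With this approach, ISO-8859-1 and other non-UTF-8 locales would fall back to ASCII art as well, instead of being treated as able to display the line-drawing characters.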

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 23:17 UTC (Sun) by ermo (subscriber, #86690) [Link]

In my very humble opinion, you are to be commended for taking on-board the feedback re. character sets and various TERM scenarios and, in response, deciding to create a range of options that will most likely satisfy most use-cases from serial lines to full blown modern fully unicode capable virtual terminal emulators.

The only suggestion I have is that you might want to specifically test TERM=linux in a Linux virtual console with a LANG=en_US.UTF-8 locale enabled and a few different fonts (e.g. latarcyrheb-(size) and Terminus ter-v(size)n), to ensure that the conservative -fdiagnostics-text-art-charset=unicode option works as you intend in that scenario.

I believe you can see for yourself which box-art characters are enabled ootb in a linux virtual console in UTF-8 mode for a given console font by invoking `showconsolefont` (part of the kbd package). The outcome may surprise you, and not necessarily in a good way.

Thanks again for engaging. I look forward to being able to take advantage of this new functionality in the future.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 5:21 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (15 responses)

> Since everybody uses UTF-8 these problems are completely gone.

Yeah, that's what I had been told repeatedly. Due to this, last time I installed a new distro on my machine, I adopted it and rolled back one week later. Too much pain. You just need to have a program you're debugging that accidentally prints an 8-bit byte at the end of stdout to have a garbled terminal. Ditto whenever you grep for something in any of the many text files you wrote in the last 30 years and it prints an accented character. This encoding is viral, it only works when 100% of the contents you work with already works and encourages you to convert all your data (including historic ones) and to reinstall all your systems all at once, otherwise you put garbage everywhere. I have way less trouble in iso, occasionally switching to utf-8 for the rare annoying applications that require it to display eye-candy stuff than doing the opposite!

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 7:08 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> I have way less trouble in iso, occasionally switching to utf-8 for the rare annoying applications that require it to display eye-candy stuff than doing the opposite!

You can just spend a couple of hours to set up everything to utf-8 _once_, and it'll keep working forever. Old files can be converted on the as-needed basis. And if they are pure ASCII, then no conversion is even necessary.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:32 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (5 responses)

Yeah, sure, connecting to the myriad of remote machines I have access to, suddenly switching them all at once, pissing off other users and sometimes their owners, not to mention the numerous ones which do not have that crappy option. And files, I'm sorry, but no. I'm certainly not going to replace all my files' contents, old source code, e-mails etc.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 9:13 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Honestly, I have no idea how people manage to dig themselves into such a hole. At this point, you really need to go out of your way to NOT use utf-8 on remote hosts. I haven't seen a single case of a distro that does NOT default to it.

(Also, one-byte encodings suck. I can tell you that as a survivor of KOI-8, CP-1251, CP-855, and the good old GOST standard encoding.)

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 15:43 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

> I haven't seen a single case of a distro that does NOT default to it.

That's exactly why you don't have this problem in the first place.

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 16:41 UTC (Tue) by zdzichu (subscriber, #17118) [Link] (2 responses)

So what distros did you see?

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 19:16 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

I get the distinct impression that modern distros "work fine" for a suitable value of "work".

The problem appears to be that wtarreau is looking after a LOT of boxes, of assorted ages, many of which predate universal unicode.

And which - for whatever reason - he does not have the ability, or authority, to upgrade.

Cue one unholy mess.

Cheers,
Wol

Recent improvements in GCC diagnostics

Posted Oct 18, 2023 14:46 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

Exactly ;-)

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 9:44 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (2 responses)

> You just need to have a program you're debugging that accidentally prints an 8-bit byte at the end of stdout to have a garbled terminal.

That terminal's UTF-8 mode is seriously defective.

There's a well-established norm (print � – Unicode code point U+FFFD REPLACEMENT CHARACTER – and carry on) for how terminals should behave in that situation, and I would very much not describe the result as "garbling" the terminal.
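As an illustration of that norm, here is a minimal sketch of a filter that passes valid UTF-8 through and substitutes U+FFFD for anything else. The names `utf8_valid_len` and `utf8_sanitize` are hypothetical, and replacing one byte at a time is a simplifying assumption; real terminals may group invalid bytes differently.

```c
#include <stddef.h>
#include <string.h>

/* Length of the valid UTF-8 sequence starting at s (n bytes are
   available), or 0 if the bytes are invalid or truncated.  */
size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    unsigned long cp, min;
    size_t len;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;              /* stray continuation byte or 0xF8..0xFF */

    if (n < len)
        return 0;                                   /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values past U+10FFFF.  */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Copy in[0..n) to out, replacing each invalid byte with U+FFFD
   (encoded as EF BF BD).  out must have room for 3*n + 1 bytes.
   Returns the number of bytes written, not counting the final NUL.  */
size_t utf8_sanitize(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t len = utf8_valid_len(in + i, n - i);
        if (len == 0) {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        }
    }
    out[o] = '\0';
    return o;
}
```

A lone Latin-1 byte such as 0xE9 at the end of the output becomes a single visible replacement character, and the bytes that follow are decoded normally.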

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 20:28 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

> That terminal's UTF-8 mode is seriously defective.

Quite possibly. But this also highlights a useful rule of thumb: Plain text usually isn't.

The vast majority of terminals and terminal emulators in actual use today do not render plain text. They render rich text, using in-band signalling with an ANSI standard set of escape codes, plus a huge variety of non-standard extensions. Those extensions are (poorly) managed by terminfo(5) and the TERM environment variable, which have been subjected to exactly the same problem as the browser User-Agent string (except with xterm instead of Mozilla/5.0). SSH is an especially bad pain point, because the *remote* host's terminfo is consulted rather than the local host (meaning that you cannot synchronize the installation of a new terminal emulator with the installation of its terminfo files, unless you do a simultaneous installation on all machines everywhere that you might possibly want to log into). If I had to guess, I would suggest that this might have nothing whatsoever to do with text encoding, and everything to do with one of those terrible mechanisms malfunctioning in some ridiculous way.

I mean, either that, or it's a terminal from the 90's that still thinks "Unicode" means "UCS-2." But I would like to believe that wtarreau is competent enough to avoid using such a monstrosity after "adopting" UTF-8.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 20:39 UTC (Mon) by Wol (subscriber, #4433) [Link]

The trouble is too many developers band-aid their own paper cut, rather than asking what is the real problem and fixing that.

Cheers,
Wol

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 14:49 UTC (Sun) by dvdeug (guest, #10998) [Link] (4 responses)

> Ditto whenever you grep for something in any of the many text files you wrote in the last 30 years and it prints an accentuated character

Which accented character? You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.

And that's pretty idiosyncratic. Do you want to view changelogs on Debian? They're UTF-8 encoded, and have the original script names of Arab and Japanese developers, among others. You can't trust that any text file that comes from anywhere will be encoded in ISO-8859-1.

> it only works when 100% of the contents you work with already works

One can make that complaint about just about any character set that's larger than 8-bit; even some large 8-bit character sets, like CP1252 and, worse, VISCII (which puts characters in C0 slots), will break stuff that expects ISO-8859-1. The character sets that protect C0 and C1 space and use one byte per character, with no combining characters, work fairly well together, even if the result may be illegible. But that's not feasible for many, and can still leave people the puzzle of figuring out what character set is supposed to be used to interpret the text.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:41 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (3 responses)

> You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.

Yes but this was already well known. All of us coming from the DOS world were used to seeing 1-for-1 replacement. I was even used to reading a "é" when it was written "Ä" on screen. The problem with UTF-8 is the variable size that breaks when facing unexpected sequences, particularly the rollback since it was decided that it was probably robust enough to support backspace instead of storing it into a buffer. As a result the linux terminal itself is broken. Just boot on a console with init=/bin/sh, set your locale to latin1, press "é" then backspace and discover how you eat the prompt. I mentioned this 10+ years ago already and was told "we know but it would be difficult to do better"...
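For reference, here is a sketch of what erasing one character from a UTF-8 line buffer involves. It is only an illustration of the variable-size issue, not the kernel's tty code; `utf8_backspace` is a hypothetical name, and it removes one code point, which is not always one screen column.

```c
#include <stddef.h>

/* Given len bytes of UTF-8 in buf, return the length that remains
   after erasing the last code point.  Continuation bytes have the
   form 10xxxxxx, so step back over them until a lead byte (or an
   ASCII byte) has been removed.  */
size_t utf8_backspace(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return 0;
    len--;                                      /* drop the final byte */
    while (len > 0 && (buf[len] & 0xC0) == 0x80)
        len--;                                  /* it was a continuation byte */
    return len;
}
```

If the editor and the terminal disagree on the encoding, the number of bytes erased and the number of cells erased on screen no longer match, which is how a prompt can get eaten.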

> Do you want to view changelogs on Debian?

I don't, but there are way fewer problems reading UTF-8 on ISO than the opposite, because at worst I get a few chars I don't care about and that's all, which is much better than invisible chars remaining stuck in the middle of nowhere, the invisible non-breakable space that some mistakenly insert in their command lines using alt+space and that breaks those command lines, RTL stuff that makes your cursor go wild when editing a line etc.

Don't get me wrong, I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by replacing standard chars with new ones that don't bring any value, or even emojis (since when a character needs to contain other colors than the font ones?).

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 12:36 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> I just don't like the huge abuse that's being made by replacing standard chars with new ones that don't bring any value, or even emojis (since when a character needs to contain other colors than the font ones?).

Since people want to be able to express themselves in ways that culture has made common. Unicode is way more descriptive than prescriptive and that's for the best IMNSHO. IRC had :) and whatnot. With more pixels available, people would obviously want to do more too. I'm not the greatest fan of emoji, but it is far better than slinging raw images around.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 14:32 UTC (Mon) by dvdeug (guest, #10998) [Link] (1 responses)

> The problem with UTF-8 is the variable size that breaks when facing unexpected sequences,

Which is Unix's responsibility; had Microsoft had their way, we'd be using UTF-16.

> I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by ...

That's a cop-out. None of the complaints above have anything to do with emoji. They all have to do with the inevitable problems of having more bits and of having both right-to-left and left-to-right languages. There's nothing any solution could have done much better in that sense. Either we have a constant length code of 16 or 32 bits, or we have a variable length code like UTF-8, or we have a codepage switching mechanism (all of those that have supported CJK have also been variable length; a single byte codepage switching mechanism would be horribly inefficient for Chinese).

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 15:38 UTC (Mon) by rschroev (subscriber, #4164) [Link]

UTF-16 is also variable size. Its predecessor UCS-2 was fixed size, but it soon became clear that two bytes simply isn't enough. Microsoft's (and Java's, I believe) attempt to commit to UCS-2 in order to avoid variable size didn't pay off; they have to deal with it just as Unix does.
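To illustrate why UTF-16 is variable size, here is a small sketch of the encoding step: code points up to U+FFFF take one 16-bit unit, everything above takes a surrogate pair. The function name `utf16_encode` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-16 into out.  Returns the number of
   16-bit units written (1 or 2), or 0 if cp cannot be encoded
   (a surrogate value or anything above U+10FFFF).  */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in one unit, as in UCS-2 */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+00E9 (é) is the single unit 0x00E9, while U+1F937 encodes as the pair 0xD83E 0xDD37.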

Even with the fixed-length UTF-32 there is the fact that glyphs are often composed of multiple code points.

None of this is the responsibility of Unix. It's just the consequence of the complexity of human language.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 20:53 UTC (Sun) by atai (subscriber, #10977) [Link]

But if you are a developer, your code may be used by Chinese users, and if you use free software/open source code, you may compile code written by Chinese developers.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 8:45 UTC (Mon) by geert (subscriber, #98403) [Link] (1 responses)

> I continue to use iso-8859-1

Oops, you forgot to upgrade to iso-8859-15 when trading in your FRF for EUR ;-)

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 15:46 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

Yeah, I don't care; I might have seen that char once, maybe twice ever in a terminal. Usually that char is displayed on a Windows OS, for example at the end of an invoice. I don't need it ;-)

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 11:13 UTC (Thu) by jezuch (subscriber, #52988) [Link] (2 responses)

I guess it's because of this attitude that when I order anything from Amazon to my flat at Orężna street, I get mojibake like every time. Look, we used to have 5 different encoding standards, but we worked it out 20 years ago. No problems since then.

Except when dealing with American retailers 🤷‍♂️

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 14:07 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

But no worries, if you live at an address containing an apostrophe you still get double encoding issues in all sorts of places, lol. Occasionally a site rejects it. I don't expect that one to ever go away.

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 14:38 UTC (Thu) by Wol (subscriber, #4433) [Link]

Or you live in a town like Scunthorpe ...

Cheers,
Wol