Recent improvements in GCC diagnostics
Posted Oct 14, 2023 15:54 UTC (Sat) by wtarreau (subscriber, #51152)
In reply to: Recent improvements in GCC diagnostics by vadim
Parent article: Recent improvements in GCC diagnostics
Posted Oct 14, 2023 16:33 UTC (Sat)
by mb (subscriber, #50428)
[Link] (21 responses)
And that is exactly why UTF-8 text is broken on your machine.
>you constantly get your terminal mangled with invisible bytes that break code sequences,
That used to happen all the time back in the bad old days, when everybody configured some other iso-xxx encoding for their machine and applications.
ASCII or any other country-specific encoding is only usable if you only have US-American texts on your system. As soon as you receive text from somebody else, it immediately breaks.
Posted Oct 14, 2023 20:21 UTC (Sat)
by dave_malcolm (subscriber, #15013)
[Link] (4 responses)
It's possible to select pure ASCII with -fdiagnostics-text-art-charset=ascii; here's the example I posted earlier, but specifying ASCII output.
If there are some worthwhile heuristics for sniffing the terminal connection to affect the default, that might be worth considering; we already have some logic for deciding whether to emit SGR codes for embedding URLs, so maybe we should do similar for the text-art character set?
Posted Oct 15, 2023 5:26 UTC (Sun)
by wtarreau (subscriber, #51152)
[Link] (2 responses)
Posted Oct 16, 2023 23:12 UTC (Mon)
by dave_malcolm (subscriber, #15013)
[Link] (1 responses)
Posted Oct 16, 2023 23:56 UTC (Mon)
by ABCD (subscriber, #53650)
[Link]
Shouldn't you also be looking at the LC_CTYPE and LC_ALL variables if you are looking at LANG (as LC_* overrides LANG, and LC_ALL overrides everything)? Additionally, I believe that it is expected that LANG=POSIX and LANG=C should behave identically.
Looking further into this, it appears that perhaps the best answer would be to ask the C library for the current locale's charset with nl_langinfo (CODESET) and fall back to ASCII when it reports ANSI_X3.4-1968, rather than inspecting LANG directly.
Another option might be to test the charset for UTF-8 explicitly, instead of assuming anything that isn't ANSI_X3.4-1968 can support the line drawing characters.
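To make the precedence described above concrete, here is a minimal sketch in C. The helper names effective_ctype_locale and is_c_locale are hypothetical and are not taken from the GCC patch; the fallback to "C" when nothing is set is also an assumption, since POSIX leaves that default to the implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pick the locale name that governs LC_CTYPE from
   the values of LC_ALL, LC_CTYPE and LANG (pass NULL for an unset
   variable).  An empty value is treated like an unset one.  */
static const char *
effective_ctype_locale (const char *lc_all, const char *lc_ctype,
                        const char *lang)
{
  if (lc_all && *lc_all)
    return lc_all;      /* LC_ALL overrides everything.  */
  if (lc_ctype && *lc_ctype)
    return lc_ctype;    /* LC_CTYPE overrides LANG.  */
  if (lang && *lang)
    return lang;
  return "C";           /* Assumption: treat "nothing set" as "C".  */
}

/* "C" and "POSIX" name the same locale, so treat them identically.  */
static int
is_c_locale (const char *name)
{
  return strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0;
}
```

A caller would typically pass getenv ("LC_ALL"), getenv ("LC_CTYPE") and getenv ("LANG"). Note that a name such as C.UTF-8 is deliberately not matched by is_c_locale, since it does promise UTF-8.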
Posted Oct 15, 2023 23:17 UTC (Sun)
by ermo (subscriber, #86690)
[Link]
The only suggestion I have is that you might want to specifically test TERM=linux in a linux virtual console with a LANG=en_US.UTF-8 locale enabled and a few different fonts (e.g. latarcyrheb-(size) and Terminus ter-v(size)n), to ensure that the conservative -fdiagnostics-text-art-charset=unicode option works like you intend it to in that scenario.
I believe you can see for yourself which box-art characters are enabled ootb in a linux virtual console in UTF-8 mode for a given console font by invoking `showconsolefont` (part of the kbd package). The outcome may surprise you, and not necessarily in a good way.
Thanks again for engaging. I look forward to being able to take advantage of this new functionality in the future.
Posted Oct 15, 2023 5:21 UTC (Sun)
by wtarreau (subscriber, #51152)
[Link] (15 responses)
Yeah, that's what I had been told repeatedly. Due to this, last time I installed a new distro on my machine, I adopted it and rolled back one week later. Too much pain. You just need to have a program you're debugging that accidentally prints an 8-bit byte at the end of stdout to have a garbled terminal. Ditto whenever you grep for something in any of the many text files you wrote in the last 30 years and it prints an accented character. This encoding is viral, it only works when 100% of the contents you work with already works and encourages you to convert all your data (including historic ones) and to reinstall all your systems all at once, otherwise you put garbage everywhere. I have way less trouble in iso, occasionally switching to utf-8 for the rare annoying applications that require it to display eye-candy stuff, than doing the opposite!
Posted Oct 15, 2023 7:08 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
You can just spend a couple of hours to set up everything to utf-8 _once_, and it'll keep working forever. Old files can be converted on an as-needed basis. And if they are pure ASCII, then no conversion is even necessary.
Posted Oct 16, 2023 6:32 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (5 responses)
Posted Oct 16, 2023 9:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
(Also, one-byte encodings suck. I can say that as a survivor of KOI-8, CP-1251, CP-855, and the good old GOST standard encoding).
Posted Oct 17, 2023 15:43 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
That's exactly why you don't have this problem in the first place.
Posted Oct 17, 2023 16:41 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (2 responses)
Posted Oct 17, 2023 19:16 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
The problem appears to be that wtarreau is looking after a LOT of boxes, of assorted ages, many of which predate universal unicode.
And which - for whatever reason - he does not have the ability, or authority, to upgrade.
Cue one unholy mess.
Cheers,
Posted Oct 18, 2023 14:46 UTC (Wed)
by wtarreau (subscriber, #51152)
[Link]
Posted Oct 15, 2023 9:44 UTC (Sun)
by mpr22 (subscriber, #60784)
[Link] (2 responses)
That terminal's UTF-8 mode is seriously defective. There's a well-established norm (print � – Unicode code point U+FFFD REPLACEMENT CHARACTER – and carry on) for how terminals should behave in that situation, and I would very much not describe the result as "garbling" the terminal.
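As an illustration of the "print U+FFFD and carry on" norm, here is a minimal sketch in C of a filter that copies well-formed UTF-8 through unchanged and substitutes U+FFFD for each offending byte. The function names are hypothetical, and replacing one byte at a time is a simplifying assumption; real terminals may group the bytes of a broken sequence differently.

```c
#include <stddef.h>
#include <string.h>

/* Length of the well-formed UTF-8 sequence at s (n bytes available),
   or 0 if the bytes there are not well-formed UTF-8.  */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  unsigned long cp, min;
  size_t len, i;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
  else
    return 0;                   /* Stray continuation or invalid lead byte.  */
  if (n < len)
    return 0;                   /* Truncated sequence.  */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* Not a continuation byte.  */
      cp = (cp << 6) | (s[i] & 0x3F);
    }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;                   /* Overlong, out of range, or surrogate.  */
  return len;
}

/* Copy in[0..in_len) to out, replacing every offending byte with
   U+FFFD (bytes EF BF BD).  out needs room for 3 * in_len + 1 bytes.
   Returns the number of bytes written, excluding the trailing NUL.  */
static size_t
utf8_sanitize (const char *in, size_t in_len, char *out)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t i = 0, o = 0;

  while (i < in_len)
    {
      size_t len = utf8_seq_len (s + i, in_len - i);
      if (len == 0)
        {
          memcpy (out + o, "\xEF\xBF\xBD", 3);
          o += 3;
          i += 1;               /* Skip one bad byte and carry on.  */
        }
      else
        {
          memcpy (out + o, s + i, len);
          o += len;
          i += len;
        }
    }
  out[o] = '\0';
  return o;
}
```

With this policy, a stray Latin-1 "é" (the single byte 0xE9) in otherwise valid output becomes one visible replacement character and the bytes around it are left alone.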
Posted Oct 16, 2023 20:28 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Quite possibly. But this also highlights a useful rule of thumb: Plain text usually isn't.
The vast majority of terminals and terminal emulators in actual use today do not render plain text. They render rich text, using in-band signalling with an ANSI standard set of escape codes, plus a huge variety of non-standard extensions. Those extensions are (poorly) managed by terminfo(5) and the TERM environment variable, which have been subjected to exactly the same problem as the browser User-Agent string (except with xterm instead of Mozilla/5.0). SSH is an especially bad pain point, because the *remote* host's terminfo is consulted rather than the local host's (meaning that you cannot synchronize the installation of a new terminal emulator with the installation of its terminfo files, unless you do a simultaneous installation on all machines everywhere that you might possibly want to log into). If I had to guess, I would suggest that this might have nothing whatsoever to do with text encoding, and everything to do with one of those terrible mechanisms malfunctioning in some ridiculous way.
I mean, either that, or it's a terminal from the 90's that still thinks "Unicode" means "UCS-2." But I would like to believe that wtarreau is competent enough to avoid using such a monstrosity after "adopting" UTF-8.
Posted Oct 16, 2023 20:39 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Oct 15, 2023 14:49 UTC (Sun)
by dvdeug (guest, #10998)
[Link] (4 responses)
Which accented character? You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.
And that's pretty idiosyncratic. Do you want to view changelogs on Debian? They're UTF-8 encoded, and have the original script names of Arab and Japanese developers, among others. You can't trust that any text file that comes from anywhere will be encoded in ISO-8859-1.
> it only works when 100% of the contents you work with already works
One can make that complaint about just about any character set that's larger than 8-bit; even some large 8-bit character sets, like CP1252 and, worse, VISCII (which puts characters in C0 slots), will break stuff that expects ISO-8859-1. The character sets that protect the C0 and C1 space and use one byte per character, with no combining characters, work fairly well together, even if the result may be illegible. But that's not feasible for many, and can still leave people the puzzle of figuring out what character set is supposed to be used to interpret the text.
Posted Oct 16, 2023 6:41 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
Yes, but this was already well known. All of us coming from the DOS world were used to seeing 1-for-1 replacement. I was even used to reading an "é" when it was written "Ä" on screen. The problem with UTF-8
> Do you want to view changelogs on Debian?
I don't, but there are far fewer problems reading UTF-8 on ISO than the opposite, because at worst I get a few chars I don't care about and that's all, which is much better than invisible chars remaining stuck in the middle of nowhere, the invisible non-breaking space that some mistakenly insert in their command lines using alt+space and that breaks those command lines, RTL stuff that makes your cursor go wild when editing a line, etc.
Don't get me wrong, I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by replacing standard chars with new ones that don't bring any value, or even emojis (since when does a character need to contain other colors than the font ones?).
Posted Oct 16, 2023 12:36 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Since people want to be able to express themselves in ways that culture has made common. Unicode is way more descriptive than prescriptive and that's for the best IMNSHO. IRC had :) and whatnot. With more pixels available, people would obviously want to do more too. I'm not the greatest fan of emoji, but it is far better than slinging raw images around.
Posted Oct 16, 2023 14:32 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (1 responses)
Which is Unix's responsibility; had Microsoft had their way, we'd be using UTF-16.
> I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by ...
That's a cop-out. None of the complaints above have anything to do with emoji. They all have to do with the inevitable problems of having more bits and of having both right-to-left and left-to-right languages. There's nothing any solution could have done much better in that sense. Either we have a constant-length code of 16 or 32 bits, or we have a variable-length code like UTF-8, or we have a codepage switching mechanism (all of those that have supported CJK have also been variable length; a single-byte codepage switching mechanism would be horribly inefficient for Chinese).
Posted Oct 16, 2023 15:38 UTC (Mon)
by rschroev (subscriber, #4164)
[Link]
Even with the fixed-length UTF-32 there is the fact that glyphs are often composed of multiple code points.
None of this is the responsibility of Unix. It's just the consequence of the complexity of human language.
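A small sketch in C of the point about composed glyphs: the same user-visible "é" can be one code point or two, so even a fixed-width encoding such as UTF-32 would need two units for the second form. The function and variable names here are illustrative only, and the counter assumes its input is already well-formed UTF-8.

```c
#include <stddef.h>

/* Count the code points in a well-formed, NUL-terminated UTF-8 string
   by counting every byte that is not a continuation byte (10xxxxxx).  */
static size_t
utf8_code_points (const char *s)
{
  size_t n = 0;

  for (; *s; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* One glyph on screen, two different encodings of it.  */
static const char precomposed[] = "\xC3\xA9";   /* U+00E9 */
static const char decomposed[] = "e\xCC\x81";   /* U+0065 U+0301 */
```

Here precomposed is two bytes and one code point, while decomposed is three bytes and two code points, yet both are expected to render as a single character cell.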
Posted Oct 15, 2023 20:53 UTC (Sun)
by atai (subscriber, #10977)
[Link]
Posted Oct 16, 2023 8:45 UTC (Mon)
by geert (subscriber, #98403)
[Link] (1 responses)
Oops, you forgot to upgrade to iso-8859-15 when trading in your FRF for EUR ;-)
Posted Oct 17, 2023 15:46 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link]
Posted Oct 19, 2023 11:13 UTC (Thu)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Except when dealing with American retailers 🤷‍♂️
Posted Oct 19, 2023 14:07 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (1 responses)
Posted Oct 19, 2023 14:38 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Cheers,
Recent improvements in GCC diagnostics
>backspace that sometimes fails to remove offending bytes or even eats the prompt
Since everybody uses UTF-8, these problems are completely gone.
I added an option to control what Unicode characters GCC will use for these diagrams.
Recent improvements in GCC diagnostics
LANG=C, you might already have the info you need internally.
FWIW I've now added a special-case so that GCC will default to pure ASCII for such diagrams if LANG=C is in the environment.
The patch is here.
Recent improvements in GCC diagnostics
#include <langinfo.h>
#include <string.h>
/* ... */
const char *charset = nl_langinfo (CODESET);
/* If the current locale's charset is ASCII, don't assume that the
   terminal supports anything else.  */
if (!strcmp (charset, "ANSI_X3.4-1968"))
  text_art_charset = DIAGNOSTICS_TEXT_ART_CHARSET_ASCII;
diagnostics_text_art_charset_init (context, text_art_charset);
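For the other option raised in this thread, testing for UTF-8 explicitly, a minimal self-contained sketch in C might look like the following. The helper names are hypothetical and this is not what the GCC patch does; accepting the spelling "UTF8" alongside "UTF-8" is an assumption about what some C libraries report.

```c
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Return nonzero if the codeset name denotes UTF-8.  The comparison is
   case-insensitive, and "UTF8" is accepted as an assumed variant.  */
static int
codeset_is_utf8 (const char *codeset)
{
  return strcasecmp (codeset, "UTF-8") == 0
         || strcasecmp (codeset, "UTF8") == 0;
}

/* Return nonzero if the locale selected by the environment uses UTF-8.
   Note that this changes the process-wide LC_CTYPE locale.  If
   setlocale fails, the locale stays "C" and the answer is "no".  */
static int
locale_codeset_is_utf8 (void)
{
  setlocale (LC_CTYPE, "");
  return codeset_is_utf8 (nl_langinfo (CODESET));
}
```

Compared with checking for ANSI_X3.4-1968 only, this errs on the side of ASCII output for ISO-8859-x and other legacy locales, which cannot be assumed to display the line drawing characters.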
Wol
> You just need to have a program you're debugging that accidentally prints an 8-bit byte at the end of stdout to have a garbled terminal.
Wol
is the variable size that breaks when facing unexpected sequences, particularly the rollback since it was decided that it was probably robust enough to support backspace instead of storing it into a buffer. As a result the linux terminal itself is broken. Just boot on a console with init=/bin/sh, set your locale to latin1, press "é" then backspace and discover how you eat the prompt. I mentioned this 10+ years ago already and was told "we know but it would be difficult to do better"...
Wol