Recent improvements in GCC diagnostics

By Jonathan Corbet
October 13, 2023

The primary job of a compiler is to translate source code into a binary form that can be run by a computer. Increasingly, though, developers want more from their tools, compilers included. Since the compiler must understand the code it is being asked to translate, it is in a good position to provide information about how that code will execute — and where things might go wrong. At the 2023 GNU Tools Cauldron, David Malcolm talked about recent work to improve the diagnostic output from the GCC compiler.

Much of the talk was dedicated to improvements in the ASCII-art output created by the compiler's static analyzer. In the existing GCC 13 release, the compiler is able to quote source code, underline and label source ranges, and provide hints for improving the code. All of this output is created by a module called pretty-print.cc, which has a lot of nice capabilities but which is proving increasingly hard to extend. It does not create two-dimensional layouts well, is not good with non-ASCII text, and its colorization support falls short.

This module tries to explain potential code problems found by the analyzer using text, and "sort-of succeeds", he said. But it is lacking spatial information that would be helpful for developers. If the compiler is complaining about a potential out-of-bounds access, which direction is this access going? Is it before or after the valid area, or perhaps overlapping with it? To illustrate this point, Malcolm showed this example (taken from his slides):

This output describes a potential buffer overflow and provides useful information, but it still may not be enough for the developer to visualize what is really going on. So GCC 14 adds a diagram:

More complex situations can be illustrated as well; see the slides for other examples. There will also be better diagrams for string operations that show, when possible, the actual string literal involved, and that handle UTF-8 strings.

All of these pictures are the result of a new text-art module that can do everything provided by pretty-print.cc and quite a bit more. It handles two-dimensional layouts and the full Unicode character set. It has support for color and other text attributes, including "blink" — though he requested that the audience not actually use that feature. It is "round-trippable", meaning that its output can be parsed back into a two-dimensional buffer; this feature will be useful for future diagrams, he said. As a demonstration of what text-art can do, he put up the output from the "most useless GCC plugin ever" — a chessboard.

There is, naturally, still work to be done. One project is a new #pragma operation to have GCC draw the in-memory layout of a structure so that developers can see how the individual fields will be packed. Another is to provide output in the SVG format, though he confided that he is not sure about how useful that capability will be. "Crude prototypes" of both features exist, he said.
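The proposed #pragma was only a prototype at the time of the talk, so its syntax is not shown here. As a rough sketch of the information such a diagram would convey, field offsets and padding can already be inspected by hand with offsetof and sizeof. The structure `example` and the two helper functions below are hypothetical names chosen for illustration, and the actual offsets depend on the target ABI.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical structure used only for illustration. */
struct example {
    char tag;   /* 1 byte, usually followed by padding */
    int count;  /* typically aligned to sizeof(int) */
    char flag;  /* 1 byte, usually followed by tail padding */
};

/* Number of bytes in the structure that belong to no field. */
size_t example_padding_bytes(void)
{
    return sizeof(struct example)
           - (sizeof(char) + sizeof(int) + sizeof(char));
}

/* Print each field's offset, then the total size and padding. */
void print_example_layout(void)
{
    printf("tag:   offset %zu\n", offsetof(struct example, tag));
    printf("count: offset %zu\n", offsetof(struct example, count));
    printf("flag:  offset %zu\n", offsetof(struct example, flag));
    printf("total: %zu bytes, %zu of them padding\n",
           sizeof(struct example), example_padding_bytes());
}
```

Calling print_example_layout() on a given target shows where the compiler inserted padding, which is the kind of information a layout diagram would present visually.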

Moving on to the GCC static analyzer, Malcolm talked about some new features for analyzing C string operations. He implemented a new warning for operations that might be passed an unterminated string, but then took it back out and created a more flexible module that is able to scan for an expected null byte. It can, for example, check format strings for proper null termination, and is able to detect uninitialized bytes in strings as well.
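As a run-time analogue of that check, the sketch below scans a fixed-size buffer for a null byte. It is not the analyzer's implementation, which reasons about the code statically; the function name `has_terminator` is an assumption made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the first `size` bytes of `buf` contain a null byte,
   meaning that `buf` holds a terminated C string. */
bool has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}
```

A buffer such as `char bad[3] = {'a', 'b', 'c'}` fails this test, and passing it to strlen() would read past the end of the array; that is the class of bug the analyzer tries to report at compile time.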

He has added an understanding of the semantics of a number of standard string functions — strcat(), strcpy(), strlen(), and the like. The analyzer is now able to detect operations that will overrun a string buffer, though it only works with fixed-size strings at the moment. More advanced analysis is in the works for the future. There is also a check for overlapping strings passed to strcat(); he said that he wanted to use the restrict keyword to indicate where such checks make sense, but "nobody really understands what restrict does". So, for now, the checker just looks for overlaps in situations where that is not allowed.
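To make "overlap" concrete, here is a minimal run-time sketch of the condition, assuming a flat address space where pointers can be compared as integers. The helper names `ranges_overlap` and `strcat_args_overlap` are hypothetical; the analyzer performs its check statically and does not use code like this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return true if [a, a + a_len) and [b, b + b_len) share at least one
   byte. Addresses are compared as integers (assumes a flat address space). */
bool ranges_overlap(const void *a, size_t a_len, const void *b, size_t b_len)
{
    uintptr_t a_start = (uintptr_t)a;
    uintptr_t b_start = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return false;
    return a_start < b_start + b_len && b_start < a_start + a_len;
}

/* Return true if strcat(dest, src) would read and write overlapping
   memory. Both arguments must be terminated strings; nothing is modified. */
bool strcat_args_overlap(const char *dest, const char *src)
{
    size_t src_len = strlen(src) + 1;          /* bytes read, with terminator */
    size_t dest_len = strlen(dest) + src_len;  /* bytes spanned after the append */

    return ranges_overlap(dest, dest_len, src, src_len);
}
```

For example, with `char buf[32] = "hello"`, the call strcat(buf, buf + 2) has overlapping arguments, which the C standard leaves undefined.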

Future plans, he said, include implementing a new function attribute to indicate the need for a null-terminated string as a parameter. The visualizations for the diagnostics produced by the analyzer can always use improvement. He would also like to add an understanding of the semantics of more standard-library functions so that their usage can be checked.

The analyzer currently only works with C code; adding the ability to handle C++ is a desired feature. Basic C++ handling does exist now, but it is unsupported: "don't use it". The biggest problem, he said, is that it has no concept of exceptions and is badly confused by code using them, but there are a number of other problems as well. There has been a Google Summer of Code student (Benjamin Priour) working on C++, focusing only on the no-exceptions case for now. The goal is to be able to use the analyzer on GCC itself (GCC has moved to C++, but does not use exceptions). A test suite has been added, and much of the analyzer code has been made able to work with either language. The handling of the C++ new keyword has been improved. There is still a lot to be done, though.

Another project Priour has worked on, also with regard to C++, is improving output when an error is detected deeply within a nested set of system header files. In such cases, a simple mistake can generate pages of output. A new compiler option, the concisely named -fno-analyzer-show-events-in-system-headers, makes all that output go away.

Despite these improvements, Malcolm said, an attempt to use the analyzer with non-trivial C++ code "will still emit nonsense".

Within the analyzer code itself, a new integration-testing suite has been established. Every analyzer patch is tested by building a whole set of projects, including coreutils, Doom, Git, the kernel, QEMU, and several others. The warnings emitted are captured and compared against a baseline to look for regressions (or improvements). The analyzer is now able to use the alloc_size function attribute to check accesses to objects returned by functions. Another feature that might make it into the GCC 14 release is a warning for potential infinite loops. This check is not ready yet; it generates false positives and runs in O(n²) time, neither of which is ideal.
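For readers unfamiliar with alloc_size, the following sketch shows how the attribute is attached to an allocator wrapper. The wrapper name `buffer_alloc` is hypothetical; the `malloc` and `alloc_size` attributes themselves are documented GCC function attributes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator wrapper. alloc_size(1) tells GCC that the
   returned pointer refers to an object whose size is given by
   parameter 1, so accesses through it can be checked against that size. */
__attribute__((malloc, alloc_size(1)))
void *buffer_alloc(size_t size)
{
    return malloc(size);
}
```

With this annotation, a caller that allocates 16 bytes through buffer_alloc() and then writes to index 16 gives the analyzer enough information to recognize the access as out of bounds, even though the size is only known through the wrapper's argument.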

Malcolm concluded with a longer-term goal: improving the handling of errors related to C++ templates. A simple typo in the wrong place can end up generating pages of useless error information. There are various groups trying to figure out what information is actually useful in such situations. The real problem, he said, is that the compiler is still stuck in the 1970s and the batch-mode interaction style that was established then. For more complex errors there really needs to be a more interactive way for developers to explore the situation.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to this event.]


Recent improvements in GCC diagnostics

Posted Oct 13, 2023 15:55 UTC (Fri) by vadim (subscriber, #35271) [Link] (42 responses)

I just have to wonder:

Why is ASCII art still a thing at this point in time? First, UTF-8 supports much nicer output. Second, isn't it about time we just had proper graphical output in the terminal? Some terminals like kitty even support it already, but the functionality is sadly rarely used.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 16:19 UTC (Fri) by dave_malcolm (subscriber, #15013) [Link] (9 responses)

To be perhaps overly pedantic, the output is not "ASCII" art, as I (optionally) make use of unicode box-drawing characters, rounded corners, and U+26A0 WARNING SIGN.

FWIW I have a crude implementation of SVG output for this in one of my working copies, which *might* make it into GCC 14, but it's not quite clear to me where the SVG images would go.

What I'd really like is if a terminal supported a "disclosure widget": the ability to hierarchically wrap parts of the output in a way that the user can interactively drill down into the output. That could help a lot with the C++ template issue.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 16:33 UTC (Fri) by dave_malcolm (subscriber, #15013) [Link] (4 responses)

...though re-reading my slides I see that I used the term "ASCII art" myself; oh well.

FWIW, if you want to experiment with the new diagnostics, the slides are all screenshots from Compiler Explorer; an example visualization of a buffer overflow can be seen at:
https://godbolt.org/z/9Y5qscE5Y

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 18:24 UTC (Fri) by ermo (subscriber, #86690) [Link] (3 responses)

Have you considered deliberately limiting the character usage to glyphs that are part of CP-437, such that e.g. the linux console (TERM=linux) configured with a unicode locale and matching font, can always be expected to show the relevant glyphs?

This might potentially make a difference to someone finding themselves in the unfortunate situation of not having access to anything but said linux console...

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 23:37 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (2 responses)

That seems an extraordinarily contrived situation where someone would be limited to the physical local console without X/Wayland/etc... when debugging a compile error?

As soon as you're SSHing into a box or have a full GUI, sticking to CP437 is entirely detrimental and serves no purpose. Expecting actual UTF-8 support, especially when asking to draw diagrams, is a fair request at this point IMHO, even bash prompts in many distros use them now and it's pretty much required for internationalization support of foreign languages.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 15:25 UTC (Sat) by ermo (subscriber, #86690) [Link]

The particular situation I reference is the actual situation you could find yourself in when bootstrapping a new architecture in a VM.

Having at least a fallback for a TERM=linux set of glyphs might be useful here.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 17:49 UTC (Sat) by ballombe (subscriber, #9523) [Link]

> That seems an extraordinarily contrived situation where someone would be limited to the physical local console without X/Wayland/etc... when debugging a compile error?

This is my situation every day I work. I do all my development on the Linux console with vim.
It is pointless for a C compiler to require X/Wayland.
Anyway as far as I am concerned the shorter the error messages are the better, I prefer to read my code.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 20:22 UTC (Fri) by iabervon (subscriber, #722) [Link] (1 responses)

When I compile C++ these days, it's generally on a build farm with output being captured in logs I can inspect afterwards (or during the build, if I want). Second place is on a VM that I've connected to with ssh, where I'm doing it inside a screen session.

For similar sorts of operation, I've found that it's really easy and common to lose the console output as I start to try to understand the issue it's reporting. It can also be annoying when it gives a helpful explanation of the code, but only if it finds something wrong.

Maybe it would be useful to have an option of producing the analysis as a standalone HTML document and turning warnings and errors in the console output into short summaries with links? That could also include enough machine-readable information that an IDE displaying GCC's analysis can connect annotated text to the place where the user is editing the code as well as to the right identifiers in the IDE's refactoring tools.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 9:44 UTC (Mon) by Tobu (subscriber, #24111) [Link]

Browsers can render HTML as it is being streamed, by the way. With some care this could be a log format.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 0:20 UTC (Sat) by Per_Bothner (subscriber, #7375) [Link]

"What I'd really like is if a terminal supported a "disclosure widget": the ability to hierarchically wrap parts of the output in a way that the user can interactively drill down into the output."

DomTerm has escape sequences for "fold (show/hide) buttons" along with delimiting sections of the output they apply to. DomTerm also supports "dynamic pretty-printing" (in the Lisp sense): you can add escape sequences to indicate logical grouping and optional line-breaks. Both folding and pretty-printing work on existing (old) output - even if you resize the window width after the application emitting the output has finished.

Here is a "dynamic screenshot": it's an html log of the output combined with some JavaScript that implements folding and pretty-printing. Try clicking the triangles and/or re-sizing the window. You probably want much of the output to initially be in the hidden state, until the "show" button is clicked.

I'm currently working on changes to xterm.js (used by vscode, Jupyter, and other projects) to enable this kind of "non-traditional" functionality, possibly using "addons". (Currently, the DomTerm application can use the xterm.js terminal emulator, but without most of the DomTerm extensions. I'm hoping that switching to xterm.js long-term will allow for wider dissemination of this kind of functionality that goes beyond the traditional terminal.)

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 18:19 UTC (Sun) by njs (subscriber, #40338) [Link]

These days, the right answer to richer interactive output is probably to have a way to output an HTML report to a file the user can open in their browser of choice. Then it can embed svg, disclosure drop downs, or whatever you come up with in the future without needing to change the cli or negotiate exotic tty extensions.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 16:19 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

The improved example does use UTF-8.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 14:28 UTC (Sat) by jengelh (guest, #33263) [Link]

UTF is just a transformation (as the acronym says), I think you mean Unicode.

The usual deterrent is when you are trying to align text in some form. strlen("\x{200b}") = ?, strlen("\x{00e9}") = ?, strlen("\x{0065}\x{0301}") = ?. Though I've written strlen, the issue has nothing to do with programming language/runtime, it applies everywhere. And something like printf("%-16s", "anything with Unicode") will break if it is not using e.g. ICU to lookup the visual character widths. But people don't want to deal with icu or ncurses or anything like it, but rather work on their program, so they just emit ASCII and call it a day.
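A small sketch makes the three question marks concrete, assuming the strings are encoded as UTF-8. The helper `utf8_code_points` is a hypothetical name; it counts code points by skipping continuation bytes and assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a valid UTF-8 string by counting every
   byte that is not a continuation byte (those have the form 10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With UTF-8 input, strlen() returns 3 for U+200B (one code point, zero columns wide), 2 for U+00E9 (one code point, one column), and 3 for U+0065 U+0301 (two code points, one column). Neither the byte count nor the code-point count gives the display width, which is why alignment needs a width table such as the one behind wcwidth() or ICU.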

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 15:54 UTC (Sat) by wtarreau (subscriber, #51152) [Link] (28 responses)

FWIW I strongly prefer pure ASCII. UTF-8 pisses me off beyond imagination. When you have to deal with many machines and SSH into plenty that do not support it, you constantly get your terminal mangled with invisible bytes that break code sequences, backspace that sometimes fails to remove offending bytes or even eats the prompt etc. I continue to use iso-8859-1 because it remains 1:1 between bytes and chars and does work fine for me everywhere. I have no use of Chinese characters, emojis nor RTL sequences in my terminals and don't want to be bothered by some of these.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 16:33 UTC (Sat) by mb (subscriber, #50428) [Link] (21 responses)

>I continue to use iso-8859-1

And that is exactly why UTF-8 text is broken on your machine.

>you constantly get your terminal mangled with invisible bytes that break code sequences,
>backspace that sometimes fails to remove offending bytes or even eats the prompt

That used to happen all the time back in the bad old days where everybody configured some other iso-xxx encoding for their machine and application.
Since everybody uses UTF-8 these problems are completely gone.

ASCII or any other country-specific encoding is only usable if you only have US American texts on your system. As soon as you receive text from somebody else, it immediately breaks.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 20:21 UTC (Sat) by dave_malcolm (subscriber, #15013) [Link] (4 responses)

I added an option to control what unicode characters GCC will use for these diagrams.

It's possible to select pure ASCII with -fdiagnostics-text-art-charset=ascii ; here's the example I posted earlier, but specifying ASCII output.

If there are some worthwhile heuristics for sniffing the terminal connection to affect the default, that might be worth considering; we already have some logic for deciding whether to emit SGR codes for embedding URLs so maybe we should do similar for the text-art character set?

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 5:26 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (2 responses)

Well, the vast majority of programs I'm seeing just respect the configured locale and adapt to it (via the LANG variable). Even gcc seems to do it when printing warnings, as it will use quotes instead of some other forms of brackets for example to quote text, so you might very well have the info there already. Just test the program with LANG=C, you might already have the info you need internally.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 23:12 UTC (Mon) by dave_malcolm (subscriber, #15013) [Link] (1 responses)

FWIW I've now added a special-case so that GCC will default to pure ASCII for such diagrams if LANG=C is in the environment. The patch is here.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 23:56 UTC (Mon) by ABCD (subscriber, #53650) [Link]

Shouldn't you also be looking at the LC_CTYPE and LC_ALL variables as well if you are looking at LANG (as LC_* overrides LANG and LC_ALL overrides everything)? Additionally, I believe that it is expected that LANG=POSIX and LANG=C should behave identically.

Looking further into this, it appears that perhaps the best answer would be to do something like this:

#include <langinfo.h>
#include <string.h>

/* ... */

  const char *charset = nl_langinfo (CODESET);
  /* If the current locale's charset is ASCII, don't assume that the terminal supports anything else.  */
  if (!strcmp (charset, "ANSI_X3.4-1968"))
    text_art_charset = DIAGNOSTICS_TEXT_ART_CHARSET_ASCII;
  diagnostics_text_art_charset_init (context, text_art_charset);

Another option might be to test the charset for UTF-8 explicitly, instead of assuming anything that isn't ANSI_X3.4-1968 can support the line drawing characters.
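That second option could look roughly like the standalone sketch below. It is not GCC code: the function names `codeset_is_utf8` and `locale_supports_unicode_art` are hypothetical, and the comparison assumes the codeset is spelled "UTF-8", as glibc reports it; other platforms may use a different spelling.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Return true if a codeset name, as returned by nl_langinfo(CODESET),
   denotes UTF-8. Only the spelling "UTF-8" is recognized. */
bool codeset_is_utf8(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0;
}

/* Decide whether Unicode box-drawing output looks safe for the locale
   taken from the environment. setlocale() applies the usual precedence:
   LC_ALL, then LC_CTYPE, then LANG. Note that this changes the
   process-wide LC_CTYPE locale as a side effect. */
bool locale_supports_unicode_art(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Testing for UTF-8 explicitly is the more conservative choice: an ISO-8859-1 locale, for instance, is not ASCII but still cannot display the box-drawing characters.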

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 23:17 UTC (Sun) by ermo (subscriber, #86690) [Link]

In my very humble opinion, you are to be commended for taking on-board the feedback re. character sets and various TERM scenarios and, in response, deciding to create a range of options that will most likely satisfy most use-cases from serial lines to full blown modern fully unicode capable virtual terminal emulators.

The only suggestion I have is that you might want to specifically test TERM=linux in a linux virtual console with a LANG=en_US.UTF-8 locale enabled with a few different fonts (e.g. latarcyrheb-(size) and Terminus ter-v(size)n) to ensure that the conservative -fdiagnostics-text-art-charset=unicode option works like you intend it to in that scenario.

I believe you can see for yourself which box-art characters are enabled ootb in a linux virtual console in UTF-8 mode for a given console font by invoking `showconsolefont` (part of the kbd package). The outcome may surprise you, and not necessarily in a good way.

Thanks again for engaging. I look forward to being able to take advantage of this new functionality in the future.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 5:21 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (15 responses)

> Since everybody uses UTF-8 these problems are completely gone.

Yeah that's what I had been told repeatedly. Due to this, last time I installed a new distro on my machine, I adopted it and rolled back one week later. Too much pain. You just need to have a program you're debugging that accidentally prints a 8-bit byte by accident at the end of stdout to have a garbled terminal. Ditto whenever you grep for something in any of the many text files you wrote in the last 30 years and it prints an accentuated character. This encoding is viral, it only works when 100% of the contents you work with already works and encourages you to convert all your data (including historic ones) and to reinstall all your systems all at once, otherwise you put garbage everywhere. I have way less trouble in iso, occasionally switching to utf-8 for the rare annoying applications that require it to display eye-candy stuff than doing the opposite!

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 7:08 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> I have way less trouble in iso, occasionally switching to utf-8 for the rare annoying applications that require it to display eye-candy stuff than doing the opposite!

You can just spend a couple of hours to set up everything to utf-8 _once_, and it'll keep working forever. Old files can be converted on the as-needed basis. And if they are pure ASCII, then no conversion is even necessary.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:32 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (5 responses)

Yeah, sure, connecting to the myriad of remote machines I have access to, suddenly switching them all at once, pissing off other users and sometimes their owners, not to mention the numerous ones which do not have that crappy option. And files, I'm sorry, but no. I'm certainly not going to replace all my files' contents, old source code, e-mails etc.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 9:13 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Honestly, I have no idea how people manage to dig themselves into such a hole. At this point, you really need to go out of your way to NOT use utf-8 on remote hosts. I haven't seen a single case of a distro that does NOT default to it.

(Also, one-byte encodings suck. I can tell that as a survivor of KOI-8, CP-1251, CP-855, and the good old GOST standard encoding).

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 15:43 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

> I haven't seen a single case of a distro that does NOT default to it.

That's exactly why you don't have this problem in the first place.

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 16:41 UTC (Tue) by zdzichu (subscriber, #17118) [Link] (2 responses)

So what distros did you see?

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 19:16 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

I get the distinct impression that modern distros "work fine" for a suitable value of "work".

The problem appears to be that wtarreau is looking after a LOT of boxes, of assorted ages, many of which predate universal unicode.

And which - for whatever reason - he does not have the ability, or authority, to upgrade.

Cue one unholy mess.

Cheers,
Wol

Recent improvements in GCC diagnostics

Posted Oct 18, 2023 14:46 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

Exactly ;-)

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 9:44 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (2 responses)

> You just need to have a program you're debugging that accidentally prints a 8-bit byte by accident at the end of stdout to have a garbled terminal.

That terminal's UTF-8 mode is seriously defective.

There's a well-established norm (print � – Unicode code point U+FFFD REPLACEMENT CHARACTER – and carry on) for how terminals should behave in that situation, and I would very much not describe the result as "garbling" the terminal.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 20:28 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

> That terminal's UTF-8 mode is seriously defective.

Quite possibly. But this also highlights a useful rule of thumb: Plain text usually isn't.

The vast majority of terminals and terminal emulators in actual use today do not render plain text. They render rich text, using in-band signalling with an ANSI standard set of escape codes, plus a huge variety of non-standard extensions. Those extensions are (poorly) managed by terminfo(5) and the TERM environment variable, which have been subjected to exactly the same problem as the browser User-Agent string (except with xterm instead of Mozilla/5.0). SSH is an especially bad pain point, because the *remote* host's terminfo is consulted rather than the local host (meaning that you cannot synchronize the installation of a new terminal emulator with the installation of its terminfo files, unless you do a simultaneous installation on all machines everywhere that you might possibly want to log into). If I had to guess, I would suggest that this might have nothing whatsoever to do with text encoding, and everything to do with one of those terrible mechanisms malfunctioning in some ridiculous way.

I mean, either that, or it's a terminal from the 90's that still thinks "Unicode" means "UCS-2." But I would like to believe that wtarreau is competent enough to avoid using such a monstrosity after "adopting" UTF-8.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 20:39 UTC (Mon) by Wol (subscriber, #4433) [Link]

The trouble is too many developers band-aid their own paper cut, rather than asking what is the real problem and fixing that.

Cheers,
Wol

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 14:49 UTC (Sun) by dvdeug (guest, #10998) [Link] (4 responses)

> Ditto whenever you grep for something in any of the many text files you wrote in the last 30 years and it prints an accentuated character

Which accented character? You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.

And that's pretty idiosyncratic. Do you want to view changelogs on Debian? They're UTF-8 encoded, and have the original script names of Arab and Japanese developers, among others. You can't trust that any text file that comes from anywhere will be encoded in ISO-8859-1.

> it only works when 100% of the contents you work with already works

One can make that complaint about just about any character set that's larger than 8-bit; even some large 8-bit character sets, like CP1252 and worse VISCII (which puts characters in C0 slots), will break stuff that expects ISO-8859-1. The set of character sets that protect C0 and C1 space and use one byte per character, with no combining characters, work fairly well together, even if they may be illegible. But that's not feasible for many, and can still leave people the puzzle of figuring out what character set is supposed to be used to interpret the text.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:41 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (3 responses)

> You might have been lucky enough to be using ISO-8859-1 since it came out 35 years ago, but just about anyone else might have problems with various Mac, DOS, and character sets supporting other languages. CJK languages all need more space than one codepage will supply.

Yes but this was already well known. All of us coming from the DOS world were used to seeing 1-for-1 replacement. I was even used to reading a "é" when it was written "Ä" on screen. The problem with UTF-8 is the variable size that breaks when facing unexpected sequences, particularly the rollback since it was decided that it was probably robust enough to support backspace instead of storing it into a buffer. As a result the linux terminal itself is broken. Just boot on a console with init=/bin/sh, set your locale to latin1, press "é" then backspace and discover how you eat the prompt. I mentioned this 10+ years ago already and was told "we know but it would be difficult to do better"...

> Do you want to view changelogs on Debian?

I don't, but there are way less problems reading UTF-8 on ISO than the opposite, because at worst I get a few chars I don't care about and that's all, which is much better than invisible chars remaining stuck in the middle of nowhere, the invisible non-breakable space that some mistakenly insert in their command lines using alt+space that breaks their command-lines, RTL stuff that makes your cursor go wild when editing a line etc.

Don't get me wrong, I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by replacing standard chars with new ones that don't bring any value, or even emojis (since when a character needs to contain other colors than the font ones?).

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 12:36 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> I just don't like the huge abuse that's being made by replacing standard chars with new ones that don't bring any value, or even emojis (since when a character needs to contain other colors than the font ones?).

Since people want to be able to express themselves in ways that culture has made common. Unicode is way more descriptive than prescriptive and that's for the best IMNSHO. IRC had :) and whatnot. With more pixels available, people would obviously want to do more too. I'm not the greatest fan of emoji, but it is far better than slinging raw images around.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 14:32 UTC (Mon) by dvdeug (guest, #10998) [Link] (1 responses)

> The problem with UTF-8 is the variable size that breaks when facing unexpected sequences,

Which is Unix's responsibility; had Microsoft had their way, we'd be using UTF-16.

> I do understand that some other languages need more bits to store their characters, I just don't like the huge abuse that's being made by ...

That's a cop-out. None of the complaints above have anything to do with emoji. They all have to do with the inevitable problem of having more bits and of having both right-to-left and left-to-right languages. There's nothing any solution could have done much better in that sense. Either we have a constant length code of 16 or 32 bits, or we have a variable length code like UTF-8, or we have a codepage switching mechanism (all of those that have supported CJK have also been variable length; a single byte codepage switching mechanism would be horribly inefficient for Chinese).

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 15:38 UTC (Mon) by rschroev (subscriber, #4164) [Link]

UTF-16 is also variable size. Its predecessor UCS-2 was fixed size, but it soon became clear that two bytes simply isn't enough. Microsoft's (and Java's, I believe) attempt to commit to UCS-2 in order to avoid variable size didn't pay off; they have to deal with it just as Unix does.

Even with the fixed-length UTF-32 there is the fact that glyphs are often composed of multiple code points.

None of this is the responsibility of Unix. It's just the consequence of the complexity of human language.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 20:53 UTC (Sun) by atai (subscriber, #10977) [Link]

but if you are a developer your code may be used by Chinese and if you use free software/open source code you may compile code written by Chinese.

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 8:45 UTC (Mon) by geert (subscriber, #98403) [Link] (1 responses)

> I continue to use iso-8859-1

Oops, you forgot to upgrade to iso-8859-15 when trading in your FRF for EUR ;-)

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 15:46 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

Yeah I don't care, I might have seen that char once, maybe twice at all in a terminal. Usually that char is displayed on a Windows OS, for example at the end of an invoice, I don't need it ;-)

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 11:13 UTC (Thu) by jezuch (subscriber, #52988) [Link] (2 responses)

I guess it's because of this attitude that when I order anything from Amazon to my flat at Orężna street, I get mojibake like every time. Look, we used to have 5 different encoding standards, but we worked it out 20 years ago. No problems since then.

Except when dealing with American retailers 🤷‍♂️

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 14:07 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

But no worries, if you live at an address containing an apostrophe you still get double encoding issues in all sorts of places, lol. Occasionally a site rejects it. I don't expect that one to ever go away.

Recent improvements in GCC diagnostics

Posted Oct 19, 2023 14:38 UTC (Thu) by Wol (subscriber, #4433) [Link]

Or you live in a town like Scunthorpe ...

Cheers,
Wol

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 7:31 UTC (Tue) by spacefrogg (subscriber, #119608) [Link]

First, the moment you cannot just mark and copy or search through diagnostic output (because you translated it into graphics), the moment it becomes useless. It must be loggable, pipeable to less and pasteable to stackoverflow or another pastebin, or it is just fancy garbage.

Second, using weird Unicode glyphs puts a high demand on the fonts used. People have needs, and especially those who don't go with the distribution's default font have good reason not to. The more glyphs your output produces, the fewer fonts you can reliably use on your terminal.

So, I am very happy to read ASCII art, when it gets the point across. There is no shame in keeping simple things simple.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 19:42 UTC (Fri) by eru (subscriber, #2753) [Link] (17 responses)

For the C++ case, it would be useful if GCC "understood" the standard container templates, so that silly errors with them would produce diagnostics in terms of the higher-level data type (like vector or map) the user is trying to use, instead of pages of spewage.

Recent improvements in GCC diagnostics

Posted Oct 13, 2023 20:49 UTC (Fri) by khim (subscriber, #9252) [Link] (15 responses)

Wasn't that supposed to be fixed with concepts and C++23?

I have no idea what actually happened, but if I remember correctly C++20 added concepts themselves to the language and C++23 was supposed to cover standard library with enough concept-related markup to produce nice error messages.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 19:54 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (12 responses)

Better error handling is indeed how the C++ 20 Concepts were sold to its end users. This is because unlike the C++ 0x Concepts proposed 15 years ago or so and subsequently ripped back out before C++ 11, C++ 20 Concepts doesn't actually do very much so without a promise of better error messages it's hard to see why anybody should care.

The big problem is that unlike C++ 0x Concepts [which were roughly Rust's traits, but as a C++ feature, and were voted back out of the standard before C++ 11 was published] these concepts aren't actually checked, they're just textual substitution again, like macros, like templates, like so many of the unmaintainable nightmares of C++. So the machine may be as clueless as you are about what the problem really is, and it's very hard to guess what's worth communicating.

For example, suppose I propose to sort some Geese. In Rust, the sort function isn't defined for Geese unless they are Ord and Ord is a named trait for types which have total order. Is a Goose, in fact, Totally Ordered with respect to all others? If so, Goose impl Ord explains how that works, otherwise there's no sort function. So the compiler can say that I can't sort Geese because they aren't Ord, simple.

In C++ the sort function is defined for the concept std::totally_ordered_with<T,U> which basically comes down to "there seem to be some comparison operators defined for this type, so, YOLO". Maybe sorting my Geese works? Maybe the result is that my program is silently an ill-formed C++ program with no meaning? Maybe it crashes? Who knows.

We can watch this happen in real time, in Rust my misfortunate::OnewayGreater<T> for example is a wrapper which insists it's always the greatest, and nevertheless a correct Rust sort (including the ones provided by the standard library) will do... something. I mean they can't sort it, because the type is lying, sorted isn't a possible state, but we're guaranteed nothing crazy happens.

The equivalent type in C++ is likely to cause a crash, but honestly anything might happen, anything at all.

Against this backdrop, improving error messages is... fraught.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 21:20 UTC (Sat) by khim (subscriber, #9252) [Link] (11 responses)

> The equivalent type in C++ is likely to cause a crash, but honestly anything might happen, anything at all.

Sure, but the situation is no different from how unsafe traits and unsafe functions work in Rust. That difference has nothing whatsoever to do with templates or concepts.

> nevertheless a correct Rust sort (including the ones provided by the standard library) will do... something.

Sure, but it's not hard to write a sort that wouldn't work like that. Just use `unsafe` and you can create all kinds of issues, not too dissimilar from C++.

> In Rust, the sort function isn't defined for Geese unless they are Ord and Ord is a named trait for types which have total order.

Which means that in Rust you cannot use the sort function unless you lie about the properties of your type, but even if you do, the sort function couldn't use that information anyway.

That's called “defensive programming” and, again, has nothing to do with traits, templates or concepts.

> So the machine may be as clueless as you are about what the problem really is, and it's very hard to guess what's worth communicating.

No. When the requirements of the concept are not satisfied the situation is exactly the same as with Rust, only C++ is a bit more flexible (in Rust a trait always has to be implemented explicitly even if it's obvious that the type satisfies all the requirements, while in C++ the trait is implemented implicitly).

It's when a template uses something not defined in the concept that C++20 differs from Rust.

I would argue that both languages are doing it wrong:

• The requirement to always have an explicitly implemented trait is onerous, awful and problematic. It's a frequent source of frustration, and the kludges that people use to paper over it (newtype and macros) are not pretty.

• The fact that there is no way in C++ to even say “this function is not supposed to be using anything except for what it explicitly `requires`” means that it's very hard to rely on some property of some type that is not listed in the concept. That, too, is a frequent source of frustration.

But having used both, I would say that I much prefer C++ templates and concepts. I don't know why you think templates are “unmaintainable nightmares of C++”: I have used both C++ templates and Rust proc macros and I would say that development and debugging of a proc macro is much more error-prone and problematic.

Just, please, don't compare generics in Rust with C++ templates. They may be similar syntactically, but on semantic level C++ templates have to be compared with Rust's proc macro system, not with generics.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 11:03 UTC (Sun) by dvdeug (guest, #10998) [Link] (1 responses)

>Just, please, don't compare generics in Rust with C++ templates. They may be similar syntactically, but on semantic level C++ templates have to be compared with Rust's proc macro system, not with generics.

That's like saying 2+2 in Python is equivalent to BigInteger(2).plus(BigInteger(2)) in Java. I mean, in some sense it is (actually, something more powerful), but 99% of the time you don't need more than 2+2 in Java. Rust generics are designed to cover most of the cases that C++ templates are; if you're talking about those cases, templates and generics are the right comparison.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 14:08 UTC (Sun) by khim (subscriber, #9252) [Link]

> Rust generics are designed to cover most of the cases that C++ templates are; if you're talking about those cases, templates and generics are the right comparison.

The problem is that “crazy and impenetrable error messages” that C++ is so famous for are not from these cases. And they can be covered by concepts just fine.

It's heavy and convoluted TMP that is intrinsically linked with these. Where you can use std::is_same_v, std::enable_if_t, if constexpr and other such things. And Rust generics are completely unsuitable for these use cases; only macros can do something similar.

> That's like saying 2+2 in Python is equivalent to BigInteger(2).plus(BigInteger(2)) in Java. I mean, in some sense it is (actually, something more powerful), but 99% of the time you don't need more than 2+2 in Java.

Very good example. Yes, 99% of the time you don't care about the limitations of Java integers. But when you start talking, specifically, about how convoluted and awkward the syntax of Java is for user-defined types, then the fact that you can use overloaded operators to mix integers with bignums and user-defined types in Python, but not in Java, becomes important.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 12:05 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (8 responses)

The requirement you don't like - to explicitly implement Ord rather than concluding that Goose is Ord because hey I can compare one Goose to another and YOLO - is exactly why traits are able to deliver semantic value which includes the better error messages. My choice to impl Ord for misfortunate::AlwaysGreater and co. was something I did, as the author of the type for illustrative purposes. In C++ this kind of thing happens entirely by accident because it's implied. My misfortunate library cautions it would make no sense in C++, any code that wasn't painstakingly reviewed by experts would have dozens of worse problems unintentionally.

In C++ specifically, the standard just says if a Concept has semantics, those are required, _but_ failing to meet the semantic requirements is Ill-Formed No Diagnostic Required, aka your program has no meaning and may do anything but you may not get compiler errors or warnings.

Yes, it's also true that C++ doesn't check your sort function to ensure that: having said with Concepts that it only requires comparisons it doesn't then try to add things together, or call a method named foo() on them -- but while that contributes further to the terrible diagnostics situation (because now sort may have *secret* requirements on top of the advertised restrictions and who can say whose fault that is when the program doesn't compile?) it's potentially possible for good libraries to get this right, whereas there's nothing to be done about core language YOLO design.

It doesn't really make sense to compare C++ templates to a proc macro. For a start only template meta-programming comes close to the point where a proc macro is necessary, for the sort of trivial templates ordinary C++ programmers are writing the generics actually are the equivalent functionality. Indeed take sort which we already discussed, in Rust sort is a generic function, it isn't a proc macro, and in C++ of course it's a bunch of templates.

My favourite standard library Rust function is generic, and very short, you can't write equivalent C++ but if you could it would necessarily be a template:

pub fn drop<T>(_x: T) { }

But also at the far end of the scale, proc macros are far more powerful. If C++ templates could install software doubtless by now there'd be C++ projects which just use cmake by installing it even if you don't want it. For a proc macro while installing software would be _incredibly_ rude it's nowhere close to the edge of what's possible.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 14:53 UTC (Sun) by khim (subscriber, #9252) [Link] (7 responses)

> For a start only template meta-programming comes close to the point where a proc macro is necessary

Yes. But only these produce hard-to-penetrate error messages when concepts are properly used which, essentially, means that only these are interesting. And they are done with proc macros in Rust, which leads to much harder to decipher error messages than TMP.

> for the sort of trivial templates ordinary C++ programmers are writing the generics actually are the equivalent functionality.

Not really. In C++ you have a choice between flexibility and nice error messages. And can pick solution from the full range of possibilities. From simple auto foo(auto x, auto y) { return x + y; } to nicely-defined library with concepts and everything in between.

In Rust you also have a choice but it's much more drastic: you either can use traits or macros and that decision can not be easily changed.

When in C++ the appropriate course would be to just say “oh, yeah, we may want to deal with both integers and floats in that code so let's replace int arguments with auto”, in Rust you immediately have to deal with a bazillion concepts even if your goal is simply to test your algorithm with 32-bit floats and 64-bit floats to gauge its stability.

> But also at the far end of the scale, proc macros are far more powerful.

Seriously? Is it some kind of sick joke? Show me how to change the kArgumentsCount template variable into a proc macro:

#include <cmath>
#include <format>
#include <iostream>

// kArgumentsCount is a variable template whose definition is not shown here.
int main() {
    constexpr auto SinArguments = kArgumentsCount<sin>;
    constexpr auto PowArguments = kArgumentsCount<pow>;
    constexpr auto FmaArguments = kArgumentsCount<fma>;
    std::cout << std::format("sin have {} argument(s)\n", SinArguments);
    std::cout << std::format("pow have {} argument(s)\n", PowArguments);
    std::cout << std::format("fma have {} argument(s)\n", FmaArguments);
}
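The definition of kArgumentsCount is not part of the snippet above. One way it could be written in C++17 is sketched below; the helper `arity` and the stand-in functions `unary`, `binary` and `ternary` are made up for illustration (they replace the <cmath> names, which may be overloaded), and the sketch only handles plain function pointers.

```cpp
#include <cstddef>

// Assumed helper: recover the parameter count from a function pointer type.
template <typename R, typename... Args>
constexpr std::size_t arity(R (*)(Args...)) {
    return sizeof...(Args);
}

// One possible definition of the variable template; F is a function
// passed as a non-type template argument.
template <auto F>
inline constexpr std::size_t kArgumentsCount = arity(F);

// Hypothetical stand-in functions with one, two and three parameters.
double unary(double x) { return x; }
double binary(double x, double y) { return x + y; }
double ternary(double x, double y, double z) { return x * y + z; }

// The counts are available at compile time.
static_assert(kArgumentsCount<unary> == 1);
static_assert(kArgumentsCount<binary> == 2);
static_assert(kArgumentsCount<ternary> == 3);
```

The count comes purely from the function's type, which is the kind of information a token-based macro does not receive.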

Proc macro can do arbitrary things outside of Rust, that's true. But inside of Rust they are much more limited than C++ templates.

Rust did many things right, but its metaprogramming capabilities are both harder to use and more limited than in C++. Which would have been acceptable if not for crazy attempts of some Rust zealots to portray C++ templates as some kind of failure.

C++ did many things wrong and Rust did many things right, that's true, but specifically templates in C++ are both more powerful and easier to use than Rust's traits. Zig does even better on the “easy to use” front (but the flip-side is much worse “error messages” front).

Rust went after simple error messages for simple cases and awful complexity for complex cases. C++ made the other choice. That's just a simple objective fact; I don't understand why it is so hard to admit it. Maybe because if you admitted it then the main perceived problem of C++ templates (awful error messages) would disappear? And if the other, more acute problem, the monomorphisation bloat, were accepted, then the question of “why hasn't Rust done what Swift did?” would raise its ugly head.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 16:31 UTC (Sun) by mb (subscriber, #50428) [Link] (5 responses)

>constexpr auto SinArguments = kArgumentsCount<sin>;

What do you do with this argument count in a real program?

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 17:29 UTC (Sun) by khim (subscriber, #9252) [Link] (4 responses)

Real program would, of course, do more than just counting numbers. It may, e.g., transparently and accurately log these parameters. Or marshal them and execute code via RPC. But counting them is just a very simple, minimal and self-contained task.

There are lots of uses for reflection capabilities in meta-programming; even that keynote that failed to be a keynote was mostly about these things.

Again: it's OK to admit that the choice that Rust made is more limiting than what C++ did, but produces better error messages. It may even be the right thing to do. I, personally, never had issues with TMP and its error messages (the biggest practical issues were limitations of MSVC), but people are different; some may value hand-holding more than expressivity.

It's all about trade-offs, and for someone the fact that you can't write generic code in Rust as easily as you can in C++ or Zig may be a nice trade-off (because for the use cases where Rust generics work adequately they do provide better error messages).

But all that talk about how C++ templates are awful because they are implicit and thus dangerous is a total red herring in my experience: as tialaramex himself noted one may easily lie to the compiler in Rust, too (and this may lead to very dangerous consequences with traits like Send and Sync) and the fact that in Rust you have to use sort_by instead of sort doesn't make code any safer for the user of floats: if someone needs to sort floats then s/he would do that and would, most likely, not even think twice about the fact that “to avoid bugs” Rust makes that really inconvenient.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 18:21 UTC (Sun) by mb (subscriber, #50428) [Link] (3 responses)

>It may, e.g., transparently and accurately log these parameters. Or marshal them and execute code via RPC.

I still don't see how counting the number of arguments of otherwise opaque function types helps here.

If you want to wrap an actual function call (or any other statement) and print out the result, then you can actually do that with macros: https://doc.rust-lang.org/std/macro.dbg.html

>Again: it's OK to admit that the choice that Rust made is more limiting than what C++ did

There's nothing to "admit".
Rust doesn't implement many C++ features by choice. It is limited by design.
The most visible things that Rust doesn't implement are classes and exceptions.

There are many more things that are different or just outright missing in Rust. It's why it's called Rust and not C++.

>as tialaramex himself noted one may easily lie to the compiler in Rust, too (and this may lead to very
>dangerous consequences with traits like Send and Sync)

That is why these traits are unsafe to implement.
In unsafe blocks you can do unsafe things. That's why they exist.

>and the fact that in Rust you have to use sort_by instead of sort

I don't get it. When is it not possible to use sort()?

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 21:19 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

> If you want to wrap an actual function call (or any other statement) and print out the result, then you can actually do that with macros: https://doc.rust-lang.org/std/macro.dbg.html

Sure, but what about arguments? Shouldn't I be able to print them? You couldn't even print the number of arguments, let alone their values!

You can play some tricks and expand `dbg` a tiny bit, but you cannot do things which in C++ are not just possible, but easy. A template `LogTiming` that receives a function as an argument and then returns another, wrapped, function that can be used in place of the original one is something developers with a C++ or Java background take for granted.
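A wrapper of that kind could look roughly like this in C++17. This is only a sketch: the `name` parameter and the log format are assumptions, the example function `add` is made up, and the wrapped callable is assumed to return a non-void value and to be callable when const.

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Hypothetical LogTiming: takes a callable and returns a new callable with
// the same call syntax that also reports how long each call took.
template <typename F>
auto LogTiming(const char* name, F f) {
    return [name, f = std::move(f)](auto&&... args) {
        const auto start = std::chrono::steady_clock::now();
        // Forward the arguments unchanged to the wrapped callable.
        auto result = f(std::forward<decltype(args)>(args)...);
        const auto elapsed = std::chrono::steady_clock::now() - start;
        std::clog << name << " took "
                  << std::chrono::duration_cast<std::chrono::microseconds>(
                         elapsed).count()
                  << " us\n";
        return result;
    };
}

// Made-up function to wrap.
int add(int a, int b) { return a + b; }
```

With this, `auto timed_add = LogTiming("add", add);` gives a callable where `timed_add(2, 3)` returns the same value as `add(2, 3)` and writes a timing line to the log stream.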

But Rust couldn't do anything like that. A proc macro is limited to the list of tokens it receives and couldn't do anything about context; it doesn't have access to types, variables and so on.

> Rust doesn't implement many C++ features by choice. It is limited by design.

When something is not implemented “by design” it doesn't become subject of keynote (even if, ultimately, failed one).

> The most visible things that Rust doesn't implement are classes and exceptions.

And again: classes are not supported because nobody knows how to do them safely (it's not clear whether they can even be done safely at all), and exceptions are, now, officially supported. In reality they were always supported, Rust just pretended that they don't work.

> I don't get it. When is it not possible to use sort()?

When you are dealing with types that don't have a total order. Like f32 or f64. And, apparently, the fact that you can sort them with std::sort in C++ but have to use sort_by is supposed to make everything so much better.

But are you sure that someone who would just port PHP-style comparison to Rust would even think twice about how this comparison behaves?

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 23:22 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

Panic isn't an exception. Because Rust reified ControlFlow (like, literally, core::ops::ControlFlow, a sum type which represents the idea that maybe we should stop now or maybe we should press on) as well as Result it remains clear-headed about whether what we're talking about is success/failure or stop/continue the two separate ideas baked indivisibly into C++ exceptions.

This means in Rust we can express the idea that when we succeed we'll exit immediately, since that's just a Try where our branch() turns success into ControlFlow::Break

Do you think the ABI stability for C-unwind constitutes "official support" for exceptions? I believe you've badly misunderstood. With this ABI if you've got some C++ code X, which calls some Rust code Y, and then the Rust code calls some further C++ code Z, but Z throws, it's OK if X catches it. Rust will get out of the way, and technically this might be survivable. This does *not* provide for a C++ exception thrown by Z to be somehow "caught" by Rust in Y, nor for a Rust panic in Y to be "caught" as an exception in X.

As to sort_by() yes, this is much clearer. Our hypothetical PHP-porter is confronted with the question, how do they provide a function (or lambda) that does f(&a: f32, &b: f32) -> Ordering ? Let's look at two likely choices:

1. They write their own, but, Rust forces them to either decide what they meant, or, panic. Chances are they decide to panic for NaN and various other tricky cases. If the software never sees a NaN this works, if it does they panic. Everything remains totally safe unlike C++ where it's (say it with me) Ill-Formed No Diagnostic Required and so our program had no defined meaning even if this sort never occurred. [If you said "Undefined Behaviour" you're wrong, UB is a runtime occurrence, this is IFNDR which is much worse and happens during compilation]

2. They find f32::total_cmp which matches exactly the desired signature. This function will cheerfully order every 32-bit floating point value. Does it do what you expected? Maybe. But it definitely puts them in some agreed order.

In C++ of course they do less work, and as a reward they get... undefined results. Very on brand.
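For comparison, the C++ side can also be made well-defined by passing std::sort a comparator that stays a strict weak ordering when NaN is present. The sketch below uses made-up names (`less_nan_last`, `sort_nan_last`) and simply places every NaN after all numbers; it is not the same ordering as f32::total_cmp, which also distinguishes signed zeros and NaN payloads.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical comparator: numbers in ascending order, all NaNs last.
// All NaNs compare equivalent to each other, so this remains a strict
// weak ordering, which is what std::sort requires.
bool less_nan_last(double a, double b) {
    if (std::isnan(a)) return false;  // NaN is never less than anything
    if (std::isnan(b)) return true;   // every number is less than NaN
    return a < b;
}

// Sort in place using the comparator above.
void sort_nan_last(std::vector<double>& v) {
    std::sort(v.begin(), v.end(), less_nan_last);
}
```

The point is the same as in the Rust case: someone has to decide where NaN goes, and here that decision is written down in the comparator instead of being left to the default operator<.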

Recent improvements in GCC diagnostics

Posted Oct 16, 2023 6:03 UTC (Mon) by mb (subscriber, #50428) [Link]

>Sure, but what about arguments?

I just think you are demanding a solution for a problem that doesn't exist in the real world.
As I said: Rust does not implement all features of all other programming languages. It doesn't implement many C++ features by choice. There's nothing to "admit".

Of course, you can also print the arguments, if you want that:
https://play.rust-lang.org/?version=stable&mode=debug...

If you want to print the number of arguments, then just println!("2"); in this case.

>and exceptions are, now, officially supported

No, they aren't.

>But are you sure that someone who would just port PHP-style comparison to Rust would even think twice about how this comparison behaves?

Yes, I agree that PHP people certainly wouldn't think twice before doing things.
That's why PHP is what PHP is.

Recent improvements in GCC diagnostics

Posted Oct 15, 2023 22:50 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

> Seriously? Is it some kind of sick joke? Show me how to change the kArgumentsCount template variable into a proc macro

Sure, for each kArgumentsCount template proceed as follows. First, we need a counter, let's call that k and set it initially to zero. Now, fork the compiler, it's fine, we're a proc macro so we're "allowed" to do that (obviously it's a terrible idea, but so was Template Meta-Programming and look where we are now...) and attempt to compile code which calls the function with k arguments of inferred type, if that won't compile we blow up the compiler with a chosen error code, otherwise we blow up with a different error code, the surviving compiler reaps the error code and either continues (forking again with k += 1) or it knows the final value.

Alternatively, and I kinda like this approach, "just" do full blown analysis as the runtime helpers do (via the Language Server protocol) so that we can ask ourselves the answer to this question (or other reflective questions) immediately. If we find that we woke up in a compiler which lacks our retro-fitted analysis feature that's fine, emit such an analysis, link it to the compiler and replace the running compiler with our improved synthetic one in the ordinary way.

Are either of these a good idea? No. But you see nor is the kArgumentsCount template and yet...

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 11:33 UTC (Tue) by jwakely (subscriber, #60262) [Link] (1 responses)

> I have no idea what actually happened, but if I remember correctly C++20 added concepts themselves to the language and C++23 was supposed to cover standard library with enough concept-related markup to produce nice error messages.

You do not remember correctly. There was no plan to "cover standard library with enough concept-related markup". The existing parts of the standard library are largely untouched in new standards, we don't spend months/years retrofitting new features into everything.

Recent improvements in GCC diagnostics

Posted Oct 17, 2023 11:36 UTC (Tue) by jwakely (subscriber, #60262) [Link]

... except for constexpr, which is being gradually retrofitted into large parts of the std::lib.

Recent improvements in GCC diagnostics

Posted Oct 14, 2023 14:10 UTC (Sat) by ibukanov (subscriber, #3942) [Link]

Bad template errors, even with older C++ like C++17 or even C++14, these days reflect the quality of the C++ library implementation. With static_assert properly applied at the right places one can get reasonable error messages.
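As a rough illustration of that approach in C++17, a library entry point can check its requirements up front and fail with one readable message. The names `has_less`, `checked_sort` and `Goose` below are made up for this sketch and are not taken from any real library.

```cpp
#include <algorithm>
#include <type_traits>
#include <utility>
#include <vector>

// Detection trait: true when `a < b` compiles for two const T values.
template <typename T, typename = void>
struct has_less : std::false_type {};

template <typename T>
struct has_less<T, std::void_t<decltype(std::declval<const T&>() <
                                        std::declval<const T&>())>>
    : std::true_type {};

// Hypothetical wrapper that reports a missing operator< in one line.
template <typename T>
void checked_sort(std::vector<T>& v) {
    static_assert(has_less<T>::value,
                  "checked_sort: element type must provide operator<");
    // Skip instantiating std::sort when the check fails, so the message
    // above is not followed by errors from inside the library.
    if constexpr (has_less<T>::value) {
        std::sort(v.begin(), v.end());
    }
}

// A type without operator<; calling checked_sort on a vector of these
// would trigger the static_assert message.
struct Goose {};
```

This only checks that the operator exists, not that it behaves as a sensible ordering, so it improves the message without addressing the semantic concerns raised elsewhere in the thread.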