Malcolm: Prevent Trojan Source attacks with GCC 12

[Posted January 12, 2022 by corbet]

David Malcolm describes some GCC improvements to defend against bidirectional-text attacks in source code.

My colleague Marek Polacek and I implemented a new warning for GCC 12, -Wbidi-chars, for detecting Trojan Source attacks involving Unicode control characters. Marek implemented the guts of the warning, but when I tried it out on the examples provided by the Trojan Source researchers, I found I had trouble understanding the initial results—precisely because of the obfuscation itself.
So for GCC 12, I've added a new flag to GCC diagnostics, indicating that the diagnostic itself relates to source code encoding. When any such diagnostic is printed, GCC will now escape non-ASCII characters in the source code.
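As a rough illustration of why escaping helps, here is a minimal Python sketch that rewrites each non-ASCII character as a visible marker. The function name `escape_non_ascii` and the `<U+XXXX>` marker format are assumptions made for this example; this is not GCC's code, only the general idea.

```python
def escape_non_ascii(line: str) -> str:
    """Replace every non-ASCII character with a visible <U+XXXX> marker."""
    return "".join(
        ch if ord(ch) < 0x80 else f"<U+{ord(ch):04X}>"
        for ch in line
    )


if __name__ == "__main__":
    # U+202E is RIGHT-TO-LEFT OVERRIDE, an invisible bidi control character.
    print(escape_non_ascii("/* check \u202e admin */"))
```

Once the control character is spelled out, it can no longer reorder the text that a reader sees in the diagnostic.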

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 0:48 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (2 responses)

I'm not familiar with the details of mixed-direction processing; perhaps an Arabic, Farsi, or Hebrew speaker can comment.

It seems it wouldn't be that rare to have a comment with text in a right-to-left language in source code that otherwise uses ASCII-subset identifiers. Would we expect to see these direction-boundary characters in that case, but properly nested? It would also seem that we wouldn't expect a string that only has characters in a left-to-right language to be reversed, although it seems this could be done with proper nesting. To exploit that, I guess we'd need two variables that are the reverse of each other, but it would be hard to sneak that by.

Ideally GCC should warn against dangerous and suspect uses without discriminating against people who want to write comments or have strings in their native language.

(I haven't been active as a GCC contributor in a very long time, though I once was).

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 1:07 UTC (Thu) by andresfreund (subscriber, #69562) [Link] (1 responses)

> Ideally GCC should warn against dangerous and suspect uses without discriminating against people who want to write comments or have strings in their native language.

Seems some attempts at that have been made?

From the post:

> We call a tokenization boundary such as a comment or string literal a bidirectional context in the warning because the obfuscation happens when there are differences between the structure as seen by the C tokenizer of the logical ordering of the characters on the one hand and the structure perceived by a human reader of the visual ordering of the code as implemented by the Unicode bidirectional algorithm on the other.

> The default is -Wbidi-chars=unpaired, in which the warning complains about unpaired characters within such a bidirectional context. A stronger form of the warning is -Wbidi-chars=any, in which the warning complains about any bidirectional control characters in the source code:
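To make the difference between the two modes concrete, here is a simplified Python sketch. It is an approximation under stated assumptions, not GCC's algorithm: the names `has_any_bidi` and `has_unpaired_bidi` are made up for this example, the input is assumed to be the text of a single comment or string literal, and the pairing rule is stricter and simpler than the full Unicode bidirectional algorithm.

```python
# Explicit directional formatting characters, mapped to the character
# that is expected to close each of them.
OPENERS = {
    "\u202a": "\u202c",  # LRE, closed by PDF
    "\u202b": "\u202c",  # RLE, closed by PDF
    "\u202d": "\u202c",  # LRO, closed by PDF
    "\u202e": "\u202c",  # RLO, closed by PDF
    "\u2066": "\u2069",  # LRI, closed by PDI
    "\u2067": "\u2069",  # RLI, closed by PDI
    "\u2068": "\u2069",  # FSI, closed by PDI
}
CLOSERS = {"\u202c", "\u2069"}  # PDF, PDI
MARKS = {"\u200e", "\u200f"}    # LRM, RLM: standalone, nothing to pair


def has_any_bidi(text: str) -> bool:
    """'any' style check: is there a bidi control character at all?"""
    return any(ch in OPENERS or ch in CLOSERS or ch in MARKS for ch in text)


def has_unpaired_bidi(context: str) -> bool:
    """'unpaired' style check within one comment or string literal."""
    stack = []
    for ch in context:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch in CLOSERS:
            # A closer with nothing open, or the wrong kind of closer.
            if not stack or stack.pop() != ch:
                return True
    return bool(stack)  # anything still open at the end is unpaired
```

With this sketch, a comment containing a properly closed RLI ... PDI pair passes the "unpaired" check but still trips the "any" check, which matches the distinction described in the quoted text.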

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 2:31 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

Unless I am misunderstanding something, that sounds entirely correct to me: The default behavior should be to allow legitimate uses of bidi characters, and there's also a stricter option for people who want to code entirely in LTR and only write RTL characters with \u escapes, localization files, and such (or the reverse, for that matter - most punctuation characters are bidi-neutral, and you can even use preprocessor directives to "hide" all of the LTR English keywords like int behind an RTL macro, so that you can write mostly or entirely RTL C if you really want to).
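For the "\u escapes" approach, a small helper can turn a native-script string into C universal character names so that the source file itself stays ASCII. This is only a sketch: `to_ucn` is a hypothetical name, and the function does not escape quotes or backslashes, so it is not a complete C string-literal encoder.

```python
def to_ucn(text: str) -> str:
    """Rewrite non-ASCII characters as C universal character names."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")    # \uXXXX form
        else:
            out.append(f"\\U{cp:08X}")    # \UXXXXXXXX form
    return "".join(out)


if __name__ == "__main__":
    # Four Hebrew letters, written with Python escapes to keep this file ASCII.
    print(to_ucn("\u05e9\u05dc\u05d5\u05dd"))
```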

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 7:55 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (21 responses)

It would be so nice to have a simple "-Wnon-ascii" option. Most of the low-level code that's not user-facing doesn't need to use chars beyond ASCII and often uses that only by accident. By having this by default in a Makefile, it would be much easier for contributors to preserve that rule and keep the code both safe and widely interoperable.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 8:02 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

Just checked, and in the kernel, out of 10.9 MB of code in kernel/, only 8 bytes are non-ASCII, less than 1 ppm! In drivers/ it's much more (many more contributors adding their names in changelogs). But it proves that non-ASCII is absolutely not needed in such layers, that an ASCII-only rule is quite feasible, and that it could trivially be enforced with such an option.
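A measurement like this can be reproduced with a few lines of Python. This is a sketch, not the command actually used above; the function names and the choice to look only at `.c` and `.h` files are assumptions.

```python
import os


def non_ascii_bytes(data: bytes) -> int:
    """Count the bytes with the high bit set (anything outside ASCII)."""
    return sum(1 for b in data if b >= 0x80)


def count_tree(root: str, suffixes=(".c", ".h")):
    """Return (total_bytes, non_ascii_bytes) for matching files under root."""
    total = bad = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            with open(os.path.join(dirpath, name), "rb") as fh:
                data = fh.read()
            total += len(data)
            bad += non_ascii_bytes(data)
    return total, bad
```

Calling `count_tree("kernel")` from the top of a source tree gives the two numbers; `bad / total * 1e6` is the parts-per-million figure. For the numbers quoted above, 8 bytes out of 10.9 MB is roughly 0.7 ppm.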

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 12:42 UTC (Thu) by anton (subscriber, #25547) [Link] (13 responses)

What is the problem you see with left-to-right non-ASCII characters?

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 12:59 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (12 responses)

I just don't need them and they do not render everywhere similarly. Plus I absolutely hate to have an editor display characters that I cannot read or spell and even worse, cannot type on the keyboard (i.e. if I accidentally mangle them). The situation with multiple alphabets in source code nowadays has reached a point where it's a total joke. Combine everyone's caprices and happily forget that the initial purpose was to write code to be understood by a computer without having to learn machine code...

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 3:51 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

> initial purpose was to write code to be understood by a computer without having to learn machine code...

I don't know if Grace spells this out, it may have seemed too obvious to her, but of course from the outset the purpose is that not only computers but _humans_ can understand the code, the humans will be writing it, and the humans will be reading it.

To achieve what you're talking about there's no need to be able to name a symbol prepare_dictionary() or bootloader_prefix or ERROR_FILE_NOT_FOUND, as the machine is perfectly happy with symbols named A00000001 through AFFFFFFFF. Humans, in contrast, find it helpful to have recognisable names. You might do other humans the same courtesy I think.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 12:59 UTC (Fri) by gspr (guest, #91542) [Link] (10 responses)

But maybe other people need them? Other people whose code is centered around a language other than English? Or people whose names are not ASCII-safe?

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 13:34 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (9 responses)

> But maybe other people need them?

That’s why he wanted it as an option.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 14:14 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (8 responses)

> > But maybe other people need them?

> That’s why he wanted it as an option.

That's it.

In general in computer languages, the intersection between what everyone can deal with is ASCII. The rest is causing trouble to *some* participants. Sure, within a company or a bunch of buddies from the same school or country you can write in your own language and not care about the trouble caused to anyone else trying to participate in your project. But when you start to have to deal with characters that do not exist on your keyboard, the same one that you're using to write "main()", "#include" or "const unsigned", it starts to become annoying.

I'm really amazed by the fact that many people speak a lot about inclusivity these days and that at the same time we seem to be doing everything possible to complicate participation in world-wide projects using eccentricities like this. I'm not a native English speaker myself, yet I make the effort of writing all my comments in this language, my doc as well, naming variables and functions this way etc, hoping that they're accessible to others. Sometimes I make mistakes in the naming and it takes me a lot of effort to find the most suitable names. So be it, I'm doing my best. But I long ago stopped writing using my native language (French), using accents or even other non-ASCII characters that I used to find convenient to refer to paragraphs etc, just because it was a pain for others to deal with (e.g. having to find another occurrence in the file and copy-paste it everywhere it's needed is not respectful of others).

Thus indeed I would like to have an option to make sure these extremely rare and most often accidental practices disappear from code I'm in charge of, without having to be rude to contributors. It's much better for them to see a warning during "make" than having someone ask them to write something differently in a comment.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 15:06 UTC (Fri) by gspr (guest, #91542) [Link] (6 responses)

> In general in computer languages, the intersection between what everyone can deal with is ASCII.

But the intersection of what everyone can deal with and what is necessary for everyone is probably empty. In that case, settling for ASCII is rather arbitrary.

> The rest is causing trouble to *some* participants.

So does ASCII!

> Sure, within a company or a bunch of buddies from the same school or country you can write in your own language and not care about the trouble caused to anyone else trying to participate in your project. But when you start to have to deal with characters that do not exist on your keyboard, the same one that you're using to write "main()", "#include" or "const unsigned", it starts to become annoying.

Annoying… for you, yes. The person whose name is not ASCII-safe might see the situation differently. (This is not a personal gripe; my name is ASCII-safe and I almost exclusively write and code in English)

> I'm really amazed by the fact that many people speak a lot about inclusivity these days and that at the same time we seem to be doing everything possible to complicate participation in world-wide projects using eccentricities like this.

Eccentricities like what?

> I'm not a native English speaker myself, yet I make the effort of writing all my comments in this language, my doc as well, naming variables and functions this way etc, hoping that they're accessible to others.

Well, that's great. I do, too. But I find that using non-ASCII symbols, especially in comments, to describe mathematically motivated code is extremely useful and clarifying.

> Sometimes I make mistakes in the naming and it takes me a lot of effort to find the most suitable names. So be it, I'm doing my best. But I long ago stopped writing using my native language (French), using accents or even other non-ASCII characters that I used to find convenient to refer to paragraphs etc, just because it was a pain for others to deal with (e.g. having to find another occurrence in the file and copy-paste it everywhere it's needed is not respectful of others).

OK, so you chose to forgo your native language for the sake of what's convenient for you. You may disagree with people who don't want to forgo theirs, but it's a bit weird to write them off.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 15, 2022 10:51 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (5 responses)

As an example of an eccentricity, ASCII has _case_, which is a really weird feature where some of the symbols are available in two varieties with almost but not quite the same meaning. But it only has case for its set of twenty-six Latin letters, not for the digits, for example, even though digits can have case; we just didn't bother mapping that and it fell out of use. It's so rarely used, let alone needed, for the digits that Unicode didn't even bother distinguishing either. But case was preserved for the Latin letters despite this.

On the other hand, ASCII lacks the proper quote marks having chosen to go with typewriter-style "straight" quotes to save space, and it can't spell some English words in the conventional way because it lacks accented letters. It is an odd duck. Like C it was probably a good choice in the decade when I was born, but is not The Right Thing today.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 15, 2022 11:25 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (3 responses)

> it can't spell some English words in the conventional way because it lacks accented letters.

I suspect most native English speakers probably never spelled naïve, coöperate, and fiancé(e) with the accented letters (even 35 years ago when I was in primary school and we all still had to use pen(cil) and paper for ~100% of schoolwork) anyway :)

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 15, 2022 12:25 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (2 responses)

The "coö-" spellings are certainly out of favor (except at the New Yorker). I still use "naïve" and "fiancé". I think "café" is probably among the words that keeps the accent the most IME. Also, "Pokémon" is common enough in certain circles (though probably copy/pasted).

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 15, 2022 12:37 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (1 responses)

Plenty of native English speakers spell that trademarked foreign proper noun without the acute, for a variety of different reasons. (can't input acute accents on their HID; don't care about diacritical marks at all; managed to reprogram their brains to assume (augmented) Latin vowels for any word that isn't obviously English or spelled-by-an-English-person so don't need the accent to pronounce it correctly; are weird half-purist weebs (if they were really purist they'd spell it in katakana); ...)

(In Pokemon fandom you'll even find people who deliberately de-capitalize it on the grounds that in-universe it's a humdrum ordinary word.)

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 17, 2022 1:18 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> (In Pokemon fandom you'll even find people who deliberately de-capitalize it on the grounds that in-universe it's a humdrum ordinary word.)

Yeah, but *everybody* does that if the title is a common noun in-universe. Mass Effect fans will write "the mass effect" (and I believe the official in-game codex uses this style as well), Portal fans refer to "the portal gun" (its official name is "the Aperture Science Handheld Portal Device," and so some fans call it the "ASHPD"), and I have never heard of anyone capitalizing "hobbit" except in the actual title of Tolkien's book. This is completely standard English.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 17, 2022 5:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

It's not just scripts, either. English (and many other western European languages, to be fair) has this really weird feature called "tense," where you have to indicate whether something happened in the past or the not-past, for every single sentence that you write. This is grammatically required; future constructions, for example, can be written in several different ways (using modal "will", using the "[be] going to" construction, just specifying a time as in "Tomorrow, I go to the store", etc.), but every single one of those constructions absolutely *must* be in the not-past tense, and every single sentence that actually takes place in the past must be in the past tense (can't write "*Yesterday, I go to the store"). There are plenty of languages that just don't require a tense, so you don't have to describe when everything happens if you don't feel that it is relevant.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 16:00 UTC (Fri) by marcH (subscriber, #57642) [Link]

Right, like when banning words like "dummy" or "blacklist" without realizing how _American_ English the inclusivity effort is. Very ironic.

American English is the lingua franca of computing, that ship has sailed. Pretending it's not is just making things more difficult.

Another recent irony is "pronouncing names correctly". Then the ignorant patronizing talks about "first" and "last" names instead of "given" or "family". But more importantly, it assumes that American speakers are capable of making sounds not in their language, which is obviously not true. They're not even capable of pronouncing most European names that look like English ones and it's not something most adults (in any country) can easily change. That's why many Chinese people take English "nicknames" at work, simply because they know tonal languages are extremely difficult to adjust to and the important thing is the ability to communicate.
- Paying attention and pronouncing people's names _as they desire_: yes of course, that's a very basic respect.
- Pronouncing names "correctly": of course not, we can't do that.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 13:49 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (5 responses)

Why would this need to be part of the compiler and not, say, a patch verification step? Why accept non-ASCII in code which isn't compiled on your machine? Exclude things like AUTHORS or MAINTAINERS so that names can be spelled properly, but ensure everything else is ASCII.
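Such a verification step could look roughly like the Python sketch below. The names `find_non_ascii_lines` and `check_files` are made up for this example, the exemption list simply mirrors the AUTHORS and MAINTAINERS files mentioned above, and how the list of changed files reaches the script (pre-commit hook, CI job) is left as an assumption.

```python
import os
import sys

# File names that are allowed to contain non-ASCII text (e.g. people's names).
EXEMPT = {"AUTHORS", "MAINTAINERS"}


def find_non_ascii_lines(data: bytes):
    """Return the 1-based numbers of lines containing a non-ASCII byte."""
    return [
        lineno
        for lineno, line in enumerate(data.splitlines(), start=1)
        if any(b >= 0x80 for b in line)
    ]


def check_files(paths):
    """Return (path, line number) pairs for every offending line."""
    problems = []
    for path in paths:
        if os.path.basename(path) in EXEMPT:
            continue
        with open(path, "rb") as fh:
            for lineno in find_non_ascii_lines(fh.read()):
                problems.append((path, lineno))
    return problems


# Only act when file names were passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    found = check_files(sys.argv[1:])
    for path, lineno in found:
        print(f"{path}:{lineno}: non-ASCII byte")
    sys.exit(1 if found else 0)
```

A non-zero exit status is what lets a hook or CI job reject the patch.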

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 14:09 UTC (Thu) by smoogen (subscriber, #97) [Link]

Layers of defense. The patch verification only catches some of the ways the code could get into a source repository that someone compiling later will download from. Because people clone repositories, fix things, and those repositories then get cloned and used to compile by someone else, you end up with a lot of places where code could be inserted.

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 13, 2022 18:53 UTC (Thu) by kreijack (guest, #43513) [Link] (3 responses)

> Exclude things like AUTHORS or MAINTAINERS so that names can be spelled properly, but ensure everything else is ASCII.

This topic was already discussed in depth under the first article about this "Trojan". Anyway, there are other cases where non-ASCII text must be allowed, like
- in comments
- in strings

My opinion is that for identifiers (like names of functions, classes or variables) it is acceptable to allow an ASCII-only character set. However, most people don't agree.

Finally, I have to point out that the problem described here is not due to allowing "non-ASCII" characters, but due to
- allowing the bidirectional Unicode control characters
- allowing the homoglyphs
Both sets of characters above are a SMALL subset of the full Unicode set.
And last but not least, have a look at https://en.wikipedia.org/wiki/IDN_homograph_attack#Homogr... , which describes how the problem may happen even when using the ASCII subset.
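As an illustration of the homoglyph half of the problem, here is a Python heuristic that flags identifiers mixing letters from more than one script. It is a sketch under assumptions: Python's standard library does not expose the Unicode script property, so the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC") is used as a stand-in, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(identifier: str) -> set:
    """Approximate the set of scripts used by the letters of an identifier."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # First word of the character name, e.g. "LATIN SMALL LETTER P".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier: str) -> bool:
    """True when the identifier's letters come from more than one script."""
    return len(scripts_in(identifier)) > 1


if __name__ == "__main__":
    # The second letter is U+0430 CYRILLIC SMALL LETTER A, not a Latin "a".
    print(is_mixed_script("p\u0430ypal"))
```

A check like this does nothing for lookalikes inside ASCII itself, such as the digit 0 and the letter O, which is the case the linked section describes.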

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 14:18 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (2 responses)

> And last but not least, have a look at https://en.wikipedia.org/wiki/IDN_homograph_attack#Homogr... , which describes how the problem may happen even when using the ASCII subset.

Yep! For the record, when I was a student, I once had fun remapping another person's keyboard so that pressing the digit "0" (zero) would instead send letter "O". That person was typing JTAG sequences with hundreds of 1/0 bits in strings and never understood why there were these strange errors (due to the tool in place having very cryptic messages).

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 14, 2022 16:51 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

Reminds me of this story: https://www.reddit.com/r/talesfromtechsupport/comments/3v...

Malcolm: Prevent Trojan Source attacks with GCC 12

Posted Jan 17, 2022 13:14 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

That's excellent!