Malcolm: Prevent Trojan Source attacks with GCC 12
My colleague Marek Polacek and I implemented a new warning for GCC 12, -Wbidi-chars, for detecting Trojan Source attacks involving Unicode control characters. Marek implemented the guts of the warning, but when I tried it out on the examples provided by the Trojan Source researchers, I found I had trouble understanding the initial results, precisely because of the obfuscation itself. So for GCC 12, I've added a new flag to GCC diagnostics, indicating that the diagnostic itself relates to source code encoding. When any such diagnostic is printed, GCC will now escape non-ASCII characters in the source code.
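To make the escaping idea concrete, here is a minimal sketch of what "escape non-ASCII characters when quoting source" could look like. It is not GCC's code; the function name `escape_non_ascii` and the `<U+XXXX>` marker format are assumptions chosen for illustration.

```python
def escape_non_ascii(line):
    """Replace every non-ASCII code point with a visible <U+XXXX> marker."""
    return "".join(
        ch if ord(ch) < 128 else f"<U+{ord(ch):04X}>"
        for ch in line
    )


# A line hiding a RIGHT-TO-LEFT OVERRIDE (U+202E) inside a comment.
hidden = "/* check \u202e */ if (is_admin)"
print(escape_non_ascii(hidden))
# prints: /* check <U+202E> */ if (is_admin)
```

Once the invisible character is shown as an explicit marker, the quoted line reads in logical order again, which is the property the diagnostic needs.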
Posted Jan 13, 2022 0:48 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (2 responses)
I'm not familiar with the details of mixed-direction processing; perhaps an Arabic, Farsi, or Hebrew speaker can comment.
It seems it wouldn't be that rare to have a comment with text in a right-to-left language in source code that otherwise uses ASCII-subset identifiers. Would we expect to see these direction-boundary characters in that case, but properly nested? It would also seem that we wouldn't expect a string that only has characters in a left-to-right language to be reversed, it seems this could be done with proper nesting. To exploit that I guess we'd need two variables that are the reverse of each other but it would be hard to sneak that by.
Ideally GCC should warn against dangerous and suspect uses without discriminating against people who want to write comments or have strings in their native language.
(I haven't been active as a GCC contributor in a very long time, though I once was).
Posted Jan 13, 2022 1:07 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
Seems some attempts at that have been made?
From the post:
> We call a tokenization boundary such as a comment or string literal a bidirectional context in the warning because the obfuscation happens when there are differences between the structure as seen by the C tokenizer of the logical ordering of the characters on the one hand and the structure perceived by a human reader of the visual ordering of the code as implemented by the Unicode bidirectional algorithm on the other.
> The default is -Wbidi-chars=unpaired, in which the warning complains about unpaired characters within such a bidirectional context. A stronger form of the warning is -Wbidi-chars=any, in which the warning complains about any bidirectional control characters in the source code:
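As a rough illustration of the difference between the two modes quoted above, the sketch below checks one context (say, the text of a single comment or string literal) for bidirectional control characters. This is a simplified model, not GCC's implementation: the names `find_bidi`, `OPENERS` and the `mode` parameter are hypothetical, only the nine embedding, override and isolate characters are covered, and the Unicode rule that a PDI also closes embeddings opened inside the isolate is not modelled.

```python
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Each opening control character and the closer it is expected to pair with.
OPENERS = {
    "\u202a": PDF,  # LRE, left-to-right embedding
    "\u202b": PDF,  # RLE, right-to-left embedding
    "\u202d": PDF,  # LRO, left-to-right override
    "\u202e": PDF,  # RLO, right-to-left override
    "\u2066": PDI,  # LRI, left-to-right isolate
    "\u2067": PDI,  # RLI, right-to-left isolate
    "\u2068": PDI,  # FSI, first strong isolate
}
BIDI_CONTROLS = set(OPENERS) | {PDF, PDI}


def find_bidi(context, mode="unpaired"):
    """Return the offsets of offending bidi control characters in one context."""
    if mode == "any":
        return [i for i, ch in enumerate(context) if ch in BIDI_CONTROLS]

    stack = []  # (offset, expected closer) for each opener still open
    stray = []  # closers that do not match the innermost opener
    for i, ch in enumerate(context):
        if ch in OPENERS:
            stack.append((i, OPENERS[ch]))
        elif ch in (PDF, PDI):
            if stack and stack[-1][1] == ch:
                stack.pop()
            else:
                stray.append(i)
    # Anything still open when the context ends is unpaired.
    return sorted(stray + [offset for offset, _ in stack])
```

With this model, a properly closed override such as `"a\u202eb\u202cc"` passes in the `unpaired` mode but is still reported in the `any` mode, while an override left open at the end of the context is reported in both.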
Posted Jan 13, 2022 2:31 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted Jan 13, 2022 7:55 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (21 responses)
Posted Jan 13, 2022 8:02 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link]
Posted Jan 13, 2022 12:42 UTC (Thu)
by anton (subscriber, #25547)
[Link] (13 responses)
Posted Jan 13, 2022 12:59 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (12 responses)
Posted Jan 14, 2022 3:51 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link]
I don't know if Grace spells this out, it may have seemed too obvious to her, but of course from the outset the purpose is that not only computers but _humans_ can understand the code, the humans will be writing it, and the humans will be reading it.
To achieve what you're talking about there's no need to be able to name a symbol prepare_dictionary() or bootloader_prefix or ERROR_FILE_NOT_FOUND, as the machine is perfectly happy with symbols named A00000001 through AFFFFFFFF. Humans, in contrast, find it helpful to have recognisable names. You might do other humans the same courtesy I think.
Posted Jan 14, 2022 12:59 UTC (Fri)
by gspr (guest, #91542)
[Link] (10 responses)
Posted Jan 14, 2022 13:34 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
That’s why he wanted it as an option.
Posted Jan 14, 2022 14:14 UTC (Fri)
by wtarreau (subscriber, #51152)
[Link] (8 responses)
> That’s why he wanted it as an option.
That's it.
In general in computer languages, the intersection of what everyone can deal with is ASCII. The rest is causing trouble to *some* participants. Sure, within a company or a bunch of buddies from the same school or country you can write in your own language and not care about the trouble caused to anyone else trying to participate in your project. But when you start to have to deal with characters that do not exist on your keyboard, the same one that you're using to write "main()", "#include" or "const unsigned", it starts to become annoying.
I'm really amazed by the fact that many people speak a lot about inclusivity these years and that at the same time we seem to be doing everything possible to complicate participation in world-wide projects using eccentricities like this. I'm not a native English speaker myself, yet I make the effort of writing all my comments in this language, my doc as well, naming variables and functions this way etc, hoping that they're accessible to others. Sometimes I make mistakes in the naming and it takes me a lot of effort to find the most suitable names. So be it, I'm doing my best. But I long ago stopped writing using my native language (French), using accents or even other non-ASCII characters that I used to find convenient to refer to paragraphs etc, just because it was a pain for others to deal with (e.g. having to find another occurrence in the file and copy-paste it everywhere needed is not respectful of others).
Thus indeed I would like to have an option to make sure these extremely rare and most often accidental practices disappear from code I'm in charge of, without having to be rude to contributors. It's much better for them to see a warning during "make" than having someone ask them to write something differently in a comment.
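A check of this kind does not have to wait for a compiler option; a small script run from "make" can print compiler-style warnings. The following is only a sketch under stated assumptions: `find_non_ascii` and `format_warnings` are hypothetical helper names, the message format merely imitates a compiler diagnostic, and the text is assumed to be already decoded.

```python
def find_non_ascii(text):
    """Yield (line, column, character) for each non-ASCII character, 1-based."""
    # Split on "\n" only, so that Unicode line separators are reported
    # instead of being silently treated as line breaks.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield lineno, col, ch


def format_warnings(filename, text):
    """Return one compiler-style warning string per non-ASCII character."""
    return [
        f"{filename}:{lineno}:{col}: warning: non-ASCII character U+{ord(ch):04X}"
        for lineno, col, ch in find_non_ascii(text)
    ]


sample = "int main(void)\n{\n    /* caf\u00e9 */\n}\n"
for warning in format_warnings("demo.c", sample):
    print(warning)
# prints: demo.c:3:11: warning: non-ASCII character U+00E9
```

Because the output is only a warning, a contributor sees it during the build without anyone having to ask them to rewrite a comment.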
Posted Jan 14, 2022 15:06 UTC (Fri)
by gspr (guest, #91542)
[Link] (6 responses)
But the intersection of what everyone can deal with and what is necessary for everyone is probably empty. In that case, settling for ASCII is rather arbitrary.
> The rest is causing trouble to *some* participants.
So does ASCII!
> Sure, within a company or a bunch of buddies from the same school or country you can write in your own language and not care about the trouble caused to anyone else trying to participate in your project. But when you start to have to deal with characters that do not exist on your keyboard, the same one that you're using to write "main()", "#include" or "const unsigned", it starts to become annoying.
Annoying… for you, yes. The person whose name is not ASCII-safe might see the situation differently. (This is not a personal gripe; my name is ASCII-safe and I almost exclusively write and code in English)
> I'm really amazed by the fact that many people speak a lot about inclusivity these years and that at the same time we seem to be doing everything possible to complicate participation in world-wide projects using eccentricities like this.
Eccentricities like what?
> I'm not a native English speaker myself, yet I make the effort of writing all my comments in this language, my doc as well, naming variables and functions this way etc, hoping that they're accessible to others.
Well, that's great. I do, too. But I find that using non-ASCII symbols, especially in comments, to describe mathematically motivated code is extremely useful and clarifying.
> Sometimes I make mistakes in the naming and it takes me a lot of effort to find the most suitable names. So be it, I'm doing my best. But I long ago stopped writing using my native language (French), using accents or even other non-ASCII characters that I used to find convenient to refer to paragraphs etc, just because it was a pain for others to deal with (e.g. having to find another occurrence in the file and copy-paste it everywhere needed is not respectful of others).
OK, so you chose to forego your native language for the sake of what's convenient for you. You may disagree with people who don't want to forego theirs, but it's a bit weird to write them off.
Posted Jan 15, 2022 10:51 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (5 responses)
On the other hand, ASCII lacks the proper quote marks having chosen to go with typewriter-style "straight" quotes to save space, and it can't spell some English words in the conventional way because it lacks accented letters. It is an odd duck. Like C it was probably a good choice in the decade when I was born, but is not The Right Thing today.
Posted Jan 15, 2022 11:25 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (3 responses)
I suspect most native English speakers probably never spelled naïve, coöperate, and fiancé(e) with the accented letters (even 35 years ago when I was in primary school and we all still had to use pen(cil) and paper for ~100% of schoolwork) anyway :)
Posted Jan 15, 2022 12:25 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Posted Jan 15, 2022 12:37 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
(In Pokemon fandom you'll even find people who deliberately de-capitalize it on the grounds that in-universe it's a humdrum ordinary word.)
Posted Jan 17, 2022 1:18 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
Yeah, but *everybody* does that if the title is a common noun in-universe. Mass Effect fans will write "the mass effect" (and I believe the official in-game codex uses this style as well), Portal fans refer to "the portal gun" (its official name is "the Aperture Science Handheld Portal Device," and so some fans call it the "ASHPD"), and I have never heard of anyone capitalizing "hobbit" except in the actual title of Tolkien's book. This is completely standard English.
Posted Jan 17, 2022 5:41 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
Posted Jan 14, 2022 16:00 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
Right, like when banning words like "dummy" or "blacklist" without realizing how _American_ English the inclusivity effort is. Very ironic.
American English is the lingua franca of computing, that ship has sailed. Pretending it's not is just making things more difficult.
Another recent irony is "pronouncing names correctly". Then the ignorant patronizing talks about "first" and "last" names instead of "given" or "family". But more importantly, it assumes that American speakers are capable of making sounds not in their language, which is obviously not true. They're not even capable of pronouncing most European names that look like English ones and it's not something most adults (in any country) can easily change. That's why many Chinese people take English "nicknames" at work, simply because they know tonal languages are extremely difficult to adjust to and the important thing is the ability to communicate.
Posted Jan 13, 2022 13:49 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (5 responses)
Posted Jan 13, 2022 14:09 UTC (Thu)
by smoogen (subscriber, #97)
[Link]
Posted Jan 13, 2022 18:53 UTC (Thu)
by kreijack (guest, #43513)
[Link] (3 responses)
This topic was already deeply discussed in the first article about this "Trojan". Anyway, there are other cases where NON ASCII code must be allowed, like
My opinion is that for identifiers (such as the names of functions, classes or variables) it is acceptable to allow an ASCII-only character set. However, most people don't agree.
Finally, I have to point out that the problem described here is not due to allowing "non ascii" characters, but due to
Posted Jan 14, 2022 14:18 UTC (Fri)
by wtarreau (subscriber, #51152)
[Link] (2 responses)
Yep! For the record, when I was a student, I once had fun remapping another person's keyboard so that pressing the digit "0" (zero) would instead send letter "O". That person was typing JTAG sequences with hundreds of 1/0 bits in strings and never understood why there were these strange errors (due to the tool in place having very cryptic messages).
Posted Jan 14, 2022 16:51 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Jan 17, 2022 13:14 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link]
What is the problem you see with left-to-right non-ASCII characters?
- Paying attention and pronouncing people's names _as they desire_: yes of course, that's a very basic respect.
- Pronouncing names "correctly": of course not, we can't do that.
- in comments
- in strings
- allowing the bidirectional Unicode control characters
- allowing the homoglyphs
Both sets of characters above are a SMALL subset of the full Unicode set.
And last but not least, take a look at https://en.wikipedia.org/wiki/IDN_homograph_attack#Homogr... , which describes how the problem may happen even when using the ASCII subset.
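The homoglyph point, including the ASCII-only case, can be sketched by folding look-alike characters to a common "skeleton" and flagging identifiers that collide. The table below is a tiny hand-picked assumption for illustration only; a real tool would more likely start from the Unicode confusables data (UTS #39). The names `CONFUSABLES`, `skeleton` and `find_collisions` are hypothetical.

```python
from collections import defaultdict

# Hand-picked look-alikes, folded to one representative (illustrative only).
CONFUSABLES = {
    "0": "O",       # ASCII digit zero vs capital letter O
    "1": "l",       # ASCII digit one vs lowercase letter l
    "I": "l",       # ASCII capital letter I vs lowercase letter l
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(identifier):
    """Fold every known look-alike character to its representative."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in identifier)


def find_collisions(identifiers):
    """Group distinct identifiers that share the same skeleton."""
    groups = defaultdict(set)
    for name in identifiers:
        groups[skeleton(name)].add(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

For instance, `find_collisions(["value0", "valueO", "total"])` groups `value0` with `valueO` even though both are pure ASCII, which matches the observation that restricting identifiers to ASCII does not remove the problem by itself.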
Malcolm: Prevent Trojan Source attacks with GCC 12
Malcolm: Prevent Trojan Source attacks with GCC 12
Malcolm: Prevent Trojan Source attacks with GCC 12