The "Trojan Source" vulnerability [LWN.net]

The "Trojan Source" vulnerability

Posted Nov 1, 2021 15:22 UTC (Mon) by mattdm (subscriber, #18) [Link]

We have scanned Fedora dist-git (spec files and patches, not expanded source) and did not find anything. We're going to add some mitigations to protect against possible future attacks, too.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:00 UTC (Mon) by dskoll (subscriber, #1630) [Link] (3 responses)

I opened the C examples in emacs. For the commenting-out.c, early-return.c, and invisible-function.c examples, the Emacs C syntax highlighter gave obviously-odd highlighting results. The homoglyph-function.c and stretched-string.c examples evaded the syntax highlighter.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:41 UTC (Mon) by siddhesh (guest, #64914) [Link]

> I opened the C examples in emacs. For the commenting-out.c, early-return.c, and invisible-function.c examples, the Emacs C syntax highlighter gave obviously-odd highlighting results. The homoglyph-function.c and stretched-string.c examples evaded the syntax highlighter.

Homoglyphs are hard to track, but for BIDI almost all editors I looked at gave it away in some way or another. At the very least the control characters affected syntax highlighting. In emacs one sees underscores at points where direction changes and even the cursor jumps around as you scroll. Vim does not render RLO/LRO and shows them as <202e>, etc.
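For anyone who wants the Vim-style view without opening an editor, here is a minimal Python sketch. The set of code points and the function names `reveal_bidi` and `find_bidi` are my own choices for illustration, not taken from any editor or from the paper's tooling.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates, and their terminators (assumed list for this sketch).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def reveal_bidi(text):
    """Replace each bidi control with a visible <xxxx> marker."""
    return "".join(
        f"<{ord(ch):04x}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )

def find_bidi(text):
    """Return (line, column, name) for every bidi control in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits
```

Running `reveal_bidi` over a diff before reviewing it gives roughly what Vim shows, at the cost of not seeing the text as an RTL reader would.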

The "Trojan Source" vulnerability

Posted Nov 1, 2021 18:51 UTC (Mon) by Deleted user 129183 (guest, #129183) [Link] (1 responses)

> For the commenting-out.c, early-return.c, and invisible-function.c examples, the Emacs C syntax highlighter gave obviously-odd highlighting results.

Yeah, it feels that the severity of this vulnerability has been largely overstated. KWrite (and Kate, obviously), for example, doesn’t only highlight the source weirdly, but also in some cases aligns the line to the right:

https://i.imgur.com/i58JRYS.png

And come to think of it, the placement of the comments in some of those examples is mostly contrived and not something you would expect in typical code. For example, if you tried to submit such code to Linux, you would probably get a message telling you to reformat the code to make it conform to the usual coding style, rather than have it merged.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 21:16 UTC (Mon) by iabervon (subscriber, #722) [Link]

I think the problem arises when people look at a PR on GitHub or an email message in their mail reader, and it looks visually like it's fine. Sure, it would cause your cursor to behave very strangely if you tried to move through it, and a text editor would likely reveal that the resulting file consists of a very strange sequence of code points, but anyone just looking at a display of the diff won't notice. If the patch is to a file that doesn't need work often, it could be years before anyone looks at it in a text editor.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:00 UTC (Mon) by mchehab (subscriber, #41156) [Link] (6 responses)

I wrote a tool to check UTF-8 chars some time ago.

Just checked at the Kernel (next-20211101). Nothing wrong there, but I guess it is time to send another series of patches in order to avoid UTF-8 symbols that are too close to ASCII chars (like MINUS SIGN, and dash symbols). Perhaps I should consider adding it to scripts/.
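To illustrate the kind of check meant here (this is not the tool mentioned above, just a hedged sketch), the following Python flags dash-like characters that are easy to mistake for the ASCII hyphen-minus. The choice of "general category Pd plus U+2212 MINUS SIGN" and the function name are assumptions made for the example.

```python
import unicodedata

def ascii_lookalike_dashes(text):
    """List non-ASCII characters that look like the ASCII '-'.

    Flags everything in general category Pd (dash punctuation) other
    than '-' itself, plus U+2212 MINUS SIGN, which is category Sm.
    """
    hits = []
    for offset, ch in enumerate(text):
        if ch == "-":
            continue
        if unicodedata.category(ch) == "Pd" or ch == "\u2212":
            hits.append((offset, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits
```

A real scripts/ checker would also need an allow-list for documentation and comments where typographic dashes are intentional.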

The "Trojan Source" vulnerability

Posted Nov 1, 2021 19:05 UTC (Mon) by Deleted user 129183 (guest, #129183) [Link] (5 responses)

> Nothing wrong there, but I guess it is time to send another series of patches in order to avoid UTF-8 symbols that are too close to ASCII chars (like MINUS SIGN, and dash symbols).

That would be an overreaction, I think. Linux is written solely in C, and I think there is no case when the compiler wouldn’t complain if you used minus sign instead of the typical hyphen in the actual code. I believe that C also doesn’t allow non-ASCII characters in identifiers anyway? And unless your validation script actually parsed the C code it would throw too many false positives – for example if a dash was just used in a comment.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 21:16 UTC (Mon) by dvdeug (guest, #10998) [Link] (4 responses)

Yes, C has allowed Unicode identifiers since C99, and while supporting unescaped characters is not required, most compilers do so now--GCC since GCC 10.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 22:35 UTC (Mon) by Deleted user 129183 (guest, #129183) [Link] (3 responses)

> Yes, C has allowed Unicode identifiers since C99, and while supporting unescaped characters is not required, most compilers do so now--GCC since GCC 10.

Interesting. Though Linux apparently uses the C89 standard (apparently Torvalds is kinda boomer about programming languages), so I guess this is still not a concern.

So if that’s indeed the case, I think that GCC (and other, niche compilers) should implement warnings about the use of an identifier which could be visually confused with other identifiers. But unfortunately, a lot of people do not really pay attention to compiler warnings – how many times during the development of Linux have we seen cases when people were told “fix all compiler warnings before you submit your code to be merged”?

The "Trojan Source" vulnerability

Posted Nov 1, 2021 22:43 UTC (Mon) by dvdeug (guest, #10998) [Link]

Code like

poll=1
pol|=l
pol1=1

? It's hardly a simple or new problem, and arguably the best place to deal with it is in the editor, which should highlight mixed-script or even non-ASCII identifiers.
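As a rough idea of what such editor-side highlighting could do, here is a Python sketch. The standard library has no script property, so the script is approximated by the first word of the Unicode character name (LATIN, CYRILLIC, GREEK, ...); that heuristic and the names `scripts_of` and `flag_identifiers` are assumptions for illustration only.

```python
import re
import unicodedata

# An identifier-like token: a letter or underscore followed by word chars.
IDENT = re.compile(r"[^\W\d]\w*")

def scripts_of(identifier):
    """Approximate the scripts used, via the first word of each letter's name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def flag_identifiers(source):
    """Map each suspicious identifier to 'mixed-script' or 'non-ASCII'."""
    flagged = {}
    for match in IDENT.finditer(source):
        ident = match.group()
        if len(scripts_of(ident)) > 1:
            flagged[ident] = "mixed-script"
        elif not ident.isascii():
            flagged[ident] = "non-ASCII"
    return flagged
```

Note that this does nothing for the pure-ASCII poll/pol1 case above; that one needs a confusables table or a good font.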

The "Trojan Source" vulnerability

Posted Nov 2, 2021 3:52 UTC (Tue) by Paf (subscriber, #91811) [Link] (1 responses)

Torvalds has been specifically commenting recently about his excitement about cleaning up the kernel enough to move to more recent C standards.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 8:44 UTC (Tue) by eru (subscriber, #2753) [Link]

The kernel already uses a number of later-than-C89 features, like named fields in struct initializers, and of course various GNU extensions.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:12 UTC (Mon) by linuxrocks123 (subscriber, #34648) [Link] (3 responses)

Isn't this just Section 2.6 of TR-36, written in 2014?

https://unicode.org/reports/tr36/

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:49 UTC (Mon) by siddhesh (guest, #64914) [Link] (2 responses)

Homoglyphs, yes (more like confusables in general) but not BIDI control characters based text reversing, especially across code comments and literals. The bit about comments is important because compilers tend to ignore comments altogether and if they had to add diagnostics to warn on unmatched BIDI controls, they'd now have to parse code as a user sees it, which means parsing comments too. That's a performance overhead some parsers may not want.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 22:26 UTC (Mon) by mpg (subscriber, #70797) [Link] (1 responses)

As a matter of fact, GCC already has -Wcomment: "Warn whenever a comment-start sequence /* appears in a /* comment, or whenever a backslash-newline appears in a // comment." Since this warning is enabled by -Wall which is widely used, I guess the performance hit is more than acceptable, and I don't think checking for balance of bidi controls would be significantly harder or slower. Clang has something similar.
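To give a feel for how cheap a balance check is, here is a hedged Python sketch that works line by line. A compiler would scope the check to each comment or string literal rather than to the whole line, so treat the per-line granularity, the pairing table, and the function name as simplifying assumptions, not as what any compiler actually implements.

```python
# Each opener maps to the terminator that must close it:
# embeddings/overrides end with PDF (U+202C), isolates with PDI (U+2069).
OPENERS = {
    "\u202a": "\u202c", "\u202b": "\u202c",
    "\u202d": "\u202c", "\u202e": "\u202c",
    "\u2066": "\u2069", "\u2067": "\u2069", "\u2068": "\u2069",
}
CLOSERS = {"\u202c", "\u2069"}

def unbalanced_bidi_lines(text):
    """Return the 1-based numbers of lines whose bidi controls do not nest."""
    bad = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        stack = []
        ok = True
        for ch in line:
            if ch in OPENERS:
                stack.append(OPENERS[ch])
            elif ch in CLOSERS:
                # A closer must match the most recent unclosed opener.
                if not stack or stack.pop() != ch:
                    ok = False
                    break
        if not ok or stack:
            bad.append(lineno)
    return bad
```

This is a single pass with a small stack, which is in the same cost class as scanning a comment for a nested /*.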

The "Trojan Source" vulnerability

Posted Nov 2, 2021 1:19 UTC (Tue) by siddhesh (guest, #64914) [Link]

It's not an issue for gcc and clang (in fact Marek Polacek has already proposed -Wbidirectional for gcc and there are clang-tidy patches proposed for clang by Serge Guelton) but it's a tougher choice for dynamic language engines.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:43 UTC (Mon) by flussence (guest, #85566) [Link] (15 responses)

This has been used forever online as a source of subtle trolling and the world hasn't ended thus far. I suspect this sudden panic now is because the spreading blight of Chromium-based text editors built by people who don't know how to build text editors has hit critical mass.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 18:54 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

There's also the fact that a lot of people never got around to reading "Reflections on Trusting Trust" by Ken Thompson.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 21:58 UTC (Mon) by khim (subscriber, #9252) [Link] (13 responses)

Nah, it's not about editors. Editors haven't changed recently and Github is still Github.

The main reason it's now an issue is GCC: starting from GCC 10 proper support for Unicode is now in place which means most C/C++ compilers support unicode, which, in turn, means attacks like these are now feasible.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 8:22 UTC (Tue) by Villemoes (subscriber, #91911) [Link] (1 responses)

Yeah, and it would be nice if the compilers grew a '-Werror=non-ascii-in-code' that projects could use to say "we do want to make use of some features in C99/C11/C++987, but not random unicode chars in identifiers, TYVM".

The "Trojan Source" vulnerability

Posted Nov 2, 2021 16:26 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

This works, right up until you want to incorporate some permissively-licensed code whose copyright statement (in a comment at the top of the source file, where it belongs) names an author with a non-ASCII name. Then you need to allow Unicode in comments, or remove the copyright statement (which is usually a license violation, so don't do that). But if you allow Unicode in comments, then you can still do some BIDI shenanigans, if I'm understanding this attack correctly.

Therefore, proper mitigation requires at least one of the compiler or the editor to recognize and identify BIDI attacks, as distinct from other uses of Unicode.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 9:35 UTC (Tue) by flussence (guest, #85566) [Link] (10 responses)

The only proper Unicode support for a C compiler is to obey Annex D.1 of the standard w.r.t. permitted codepoint ranges in identifiers, and to parse UTF-8/16/32 multibyte sequences correctly within comments and string literals to find the delimiters correctly. Anything beyond that is a syntax error.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 12:25 UTC (Tue) by khim (subscriber, #9252) [Link] (9 responses)

> Anything beyond that is a syntax error.

Can, please, stop that “anyone who dares to use anything except English should be shot” attitude?

The simple fact of life: the vast majority of Earth's population uses languages which don't fit into US ASCII. The fact that it took so long for C/C++ to accept that fact (most other languages accepted it much quicker) doesn't make it less true.

Sure, there is always some tension between respecting other languages' needs and security (proper support for Arabic, e.g., requires some quite tricky sequences which are not yet supported by C/C++/Rust), but this “my way or the highway” attitude is not helpful.

It's one thing to propose something like that for some project (which may decide that participants without knowledge of English just don't matter); it's completely different when you push for that to become the default (not to mention the insane proposition to make it the only supported mode of operation).

The "Trojan Source" vulnerability

Posted Nov 2, 2021 14:28 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> Can, please, stop that “anyone who dares to use anything except English should be shot” attitude?

The Annex D list of accepted characters in identifiers is huge; your ire is misdirected.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 14:51 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

This old draft of the C standard from 2007 shows that Annex D.1 covers a lot more than just the Latin alphabet needed for English. There's also Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, Georgian, Hiragana, Katakana, Bopomofo, unified CJK, and Hangul code points in there, and this is a 15 year old draft; I would expect that if the code points on offer are not enough to write a significant language conveniently, the standard would move on and permit more code points.

That's over 20 different writing forms, covering a majority of the world's written languages, taken into account; yes, there are rarer languages that can't be written nicely in the forms covered in Annex D.1, and that is definitely a shame that I'd love to see corrected in the long run, but it's not "English or bust" - as a near-monolingual English speaker, I can only decode 3 of those into plausible sounds, and there are some scripts in there that I'd only be able to go as far as identifying as "written language, probably Asian in origin".

That said, I would prefer compilers to go beyond Annex D.1; it says it's a reproduction of Annex A of ISO TR 10176, and I'd prefer compilers to reference that Annex directly, since it gets expanded over time (e.g. the 2007 version of the C standard references the 1998 version of TR 10176, and the 2003 version added several scripts to the list).

The "Trojan Source" vulnerability

Posted Nov 2, 2021 17:55 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

If I were making up rules from scratch and didn't have to worry about backcompat, I'd probably come up with something like this:

* Each translation unit shall have a character encoding, which shall be determined in an implementation-defined fashion. Implementations are strongly encouraged to provide support for UTF-8 at a minimum, and to default to UTF-8 if no encoding is configured (we don't mandate this because some implementations are explicitly designed to be used on EBCDIC systems and can't reasonably default to UTF-8). If any byte sequence in a translation unit is not a valid encoding of a Unicode character, then the program is ill-formed. If the selected encoding contains characters which are not in Unicode, then for the purposes of the following rules, the implementation shall behave as if those non-Unicode characters were of general category Co (private use).
* Any character with Unicode major category L, M, or N (letters, combining diacritics, numbers) is valid in identifiers, but the first character must not be an N.
* U+005F (underscore) is valid in identifiers (but identifiers starting with it are reserved). No other punctuation character (major category P) is valid in identifiers.
* All other characters are invalid in identifiers.
* If, anywhere in a translation unit, a character with major category M appears in such a way that it does not combine with an adjacent character, then the program is ill-formed.
* If any character in a translation unit belongs to general categories Cc, Cf, Cn, Zl, or Zp, then the program is ill-formed, except for U+0009 (tab), U+000D (CR) and U+000A (LF). If CR appears at all, it must always be immediately followed by LF, unless the implementation specifies otherwise (i.e. you can support bare CR, if you really want to).
* Any characters in Unicode general category Co (private use) render a translation unit ill-formed unless the implementation specifies otherwise (in which case, the implementation shall also specify which general category the character is to be interpreted as, enable the user to configure this, or both). Implementations are encouraged to only support whatever subset of the private use area is actually needed (which may be configurable), rather than blindly treating all private use characters the same as regular characters.
* If, within any translation unit, more than one distinct character in general category Zs appears, the program is ill-formed (i.e. you have to pick one space character and use it throughout the TU - no mixing and matching).
* Unless explicitly restricted to identifiers, all of the above rules shall apply to every character in a translation unit, including comments and string literals. If invalid characters or byte sequences need to be placed in a string literal, developers should use an appropriate escape sequence.
* If two identifiers are equivalent under NFKD normalization, and within the same scope, then they shall be interpreted as the same identifier, even if they appear in distinct translation units (i.e. you have to do NFKD in name-mangling).
* If any two distinct (non-equivalent under NFKD) identifiers are identical under standard Unicode confusables normalization, then the program is ill-formed. If those identifiers appear in distinct translation units, no diagnostic is required.
* If any two distinct identifiers are identical after applying NFKD, removing all characters of major category M, and then applying NFKD a second time, then the program is ill-formed (this is mostly to prevent people from sticking a diacritical mark somewhere it's hard to see). If those identifiers appear in distinct translation units, no diagnostic is required.
* Implementations are encouraged to emit diagnostics of the above two bullets to the greatest extent feasible, even when it is not required (for example, during link-time optimization). Implementations may also choose to apply one or both of these normalizations as part of the compilation process, since this would only break ill-formed programs (but if you do that, you might also break ABI compatibility with other implementations, so this would require coordination between implementations to avoid that problem).
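
A few of the rules above can be sketched with nothing but Python's unicodedata module. This is only an illustration under stated assumptions: it covers the identifier-character rule, NFKD equivalence, and the "strip combining marks" comparison, but not the confusables rule (the standard library ships no confusables table), and all function names are made up for the example.

```python
import unicodedata

def is_valid_identifier(identifier):
    """Major category L, M, or N (or underscore); first char must not be N."""
    if not identifier:
        return False
    for ch in identifier:
        if ch != "_" and unicodedata.category(ch)[0] not in "LMN":
            return False
    return unicodedata.category(identifier[0])[0] != "N"

def nfkd(identifier):
    """Compatibility decomposition used to decide identifier equivalence."""
    return unicodedata.normalize("NFKD", identifier)

def same_identifier(a, b):
    """True if the two spellings must be treated as one identifier."""
    return nfkd(a) == nfkd(b)

def skeleton_without_marks(identifier):
    """NFKD, drop all major-category-M characters, then NFKD again."""
    stripped = "".join(
        ch for ch in nfkd(identifier)
        if not unicodedata.category(ch).startswith("M")
    )
    return nfkd(stripped)

def collide_after_mark_removal(a, b):
    """True if a and b are distinct identifiers that differ only by marks."""
    return (not same_identifier(a, b)
            and skeleton_without_marks(a) == skeleton_without_marks(b))
```

With these definitions, a precomposed "café" and "cafe" followed by a combining acute are the same identifier, while "cafe" and "café" in one scope would make the program ill-formed.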

Unfortunately, I'm sure there are a few hundred ancient programs that this would break in one way or another. Oh well.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 20:19 UTC (Wed) by Grimthorpe (subscriber, #106147) [Link] (1 responses)

>That said, I would prefer compilers to go beyond Annex D.1; it says it's a reproduction of Annex A of ISO TR 10176, and I'd prefer compilers to reference that Annex directly, since it gets expanded over time (e.g. the 2007 version of the C standard references the 1998 version of TR 10176, and the 2003 version added several scripts to the list.

I can understand the thinking there, but it leads to a specific version of a standard changing over time.

The point of a specific version of a standard is that it doesn't change once published; any changes require a new version.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 20:32 UTC (Wed) by farnz (subscriber, #17727) [Link]

Which is why I said compilers should go beyond, not that the standard should change; the C standard needs to be fixed in time, but Annex D.1 does not limit what a conforming compiler can handle; a strictly conforming program can't use characters outside Annex D.1, but that doesn't mean that a compiler has to limit itself to handling those characters.

To a reasonable approximation, no-one writes C that's strictly conforming, and everyone relies on some extensions to the standard. A future version of the C standard is likely to use a later version of TR 10176 anyway, and therefore compilers jumping the gun is not unreasonable.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 14:53 UTC (Tue) by khim (subscriber, #9252) [Link]

Oops. Sorry for misreading your comment.

Yeah. Annex D is a good step in the right direction, but it's inconsistent. Read P1949 for more details.

Unicode is complicated, but without it most people in this world can not use their native language.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 16:48 UTC (Tue) by mpg (subscriber, #70797) [Link] (2 responses)

> proper support for Arabic, e.g., requires some quite tricky sequences which are not yet supported by C/C++/Rust

As someone who's learning Arabic, I'm curious to know more. Can you explain the details or share a reference? I never looked deeply into how Arabic is encoded (beyond playing a bit with python's unicodedata to show codepoints in a string with their category and name), but always thought it would be quite straightforward and I'm a bit surprised to learn that tricky sequences are required.
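For reference, the kind of unicodedata poking described above can be as small as the following sketch; the function name `describe` and the output layout are just one assumed way to do it.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]
```

Format characters show up with category Cf, which makes them easy to spot among letters (categories starting with L).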

The "Trojan Source" vulnerability

Posted Nov 2, 2021 16:56 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

Read the Wikipedia article about ZWNJ: https://en.wikipedia.org/wiki/Zero-width_non-joiner

It's not allowed in identifiers (more-or-less for the reasons which parent article describes) yet sometimes you can't write text correctly without it.

I'm not sure if that's actually a practical issue for writing code with identifiers in Arabic (I have friends who are native Arabic speakers, but don't know it myself).

The "Trojan Source" vulnerability

Posted Nov 2, 2021 22:18 UTC (Tue) by mpg (subscriber, #70797) [Link]

Thank you for the reference! My first thought was that I'm not aware of any circumstance where a ZWNJ would be required in order to write Arabic properly, and the wikipedia page does not give any example in Arabic (only in other languages that happen to use the Arabic script). But then I checked the Arabic version of the page, and also the first example it gives is in Farsi, not in Arabic, but then it mentions it's also used in Arabic for acronyms. So, I've learned something new about proper writing in Arabic, thank you for that!

(And now I'm left wondering how I'm supposed to input this ZWNJ if I ever want to write acronyms properly. In the few occasions I saw native speakers write acronyms they just used a good old space, but that's probably just because their computer system doesn't make it easy for them to achieve a better result. And of course people who write code seem more likely to know how to input a ZWNJ than the average native speaker.)

Anyway, back to identifiers, perhaps it would be possible to allow ZWNJ in identifiers only in places where it would have a visible effect, that is, only between letters that would normally be joined. That would avoid "trojan source" issues while preserving the full range of legitimate uses. The cost of course would be more implementation complexity.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 16:48 UTC (Mon) by bkw1a (guest, #4101) [Link] (1 responses)

Maybe this is a dumb question (character sets confuse me!), but is there a way to get emacs or other editors to highlight non-7-bit-ASCII characters?

The "Trojan Source" vulnerability

Posted Nov 2, 2021 5:55 UTC (Tue) by zaitseff (subscriber, #851) [Link]

Not sure about highlighting, but you can search for non-ASCII characters: Press CTRL-U CTRL-S, then type [^[:ascii:]]. Press CTRL-S for the next such character: works nicely on the Trojan Source examples.
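Outside Emacs, roughly the same search can be scripted. This is a minimal Python sketch of the `[^[:ascii:]]` idea; the function name and the (line, column, code point) output format are assumptions for illustration.

```python
import re

# Anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def non_ascii_positions(text):
    """Return (line, column, code point) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in NON_ASCII.finditer(line):
            hits.append((lineno, m.start() + 1, f"U+{ord(m.group()):04X}"))
    return hits
```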

The "Trojan Source" vulnerability

Posted Nov 1, 2021 20:56 UTC (Mon) by Psychonaut (guest, #86437) [Link] (1 responses)

While the details differ, the basic idea behind this vulnerability (i.e., using control characters in strings or comments to hide or obfuscate the true meaning of source code) is nothing new. I remember back on the Commodore 64, you could abuse the so-called "quote mode" and "insert mode" of the built-in full-screen editor to type literal backspace or cursor movement characters in a BASIC REMark statement. Clever use of this would allow you to apparently delete portions of the source code, or to overwrite them with arbitrary text, when LISTing the program to the screen.

^H in Apple II BASICs

Posted Nov 2, 2021 17:53 UTC (Tue) by david.a.wheeler (subscriber, #72896) [Link]

You could sneak ^H into Apple II BASIC as well. I know you could do it in Apple II Integer BASIC, and I think you could do it in the Applesoft BASIC as well. Again, it meant that the source code you *saw* wasn't what was actually there.

The "Trojan Source" vulnerability

Posted Nov 1, 2021 21:44 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

This is where a sensible text editor in uxterm really shines and web tools really fail ;-)

The "Trojan Source" vulnerability

Posted Nov 2, 2021 4:19 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (74 responses)

That's great news! I've always been extremely irritated by UNICODE's ubiquity, and more importantly UTF-8 that allows to mix UNICODE with plain ASCII and that allows whatever stupidities to be interleaved with machine-readable text.

I've long been having fun with homoglyphs, sending mails to friends with links and saying "look, there's a 404 on your site" or "your domain does not seem to exist anymore, did you renew it ?". But I didn't even know it was possible to change the direction in the middle of the text without affecting the charset. It seems to be U+202B or U+202E that does it. That's great, lots more fun to come, including in Git's author field or commit messages!

I understand the value of this for sites like wikipedia that need to combine lots of texts together, but such sites already have to resort to other solutions to write math or chemical expressions, so we could reasonably imagine that it ought to be a document-level attribute to specify an encoding or direction, and that it could be made easier to embed multiple documents. But switching charsets and directions at the letter level does not correspond to something humans commonly do (most of them use a single charset at once), nor something that computers need. We just created this need by making it possible :-/ For me source code only ought to be ASCII. Editors are sufficiently advanced to help decode areas that need to be decoded differently when hovering on them for example, without everything being mixed natively and using same fonts and colors.
Anyway that's already a lost battle...

The "Trojan Source" vulnerability

Posted Nov 2, 2021 6:00 UTC (Tue) by zaitseff (subscriber, #851) [Link] (5 responses)

I think anyone writing in two or more languages where one of those languages is right-to-left and the other left-to-right might disagree with you: much more common in non-Western countries than you might suspect...

The "Trojan Source" vulnerability

Posted Nov 2, 2021 6:39 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (4 responses)

I'm not dismissing the number of people using different alphabets or writing directions at all. I mean they don't need to *mix* characters from various sets without quoting. For example, why would you find the Cyrillic "H" (equivalent to the Latin "N") between Latin characters? It's not an "H", it uses the same representation but is a character to be used within another alphabet. For me there should be instructions to switch the alphabet and this ought to always appear only on word boundaries. The quoted-printable mode was not much different from this (or at least made it easy to check). Also why would you want to write using a Latin alphabet in RTL mode, except to have fun? Especially when *some* characters are switched directions (parentheses, brackets etc) but not all! At the very least we ought to see completely reversed characters, as used to appear in ancient Greek where you could write both directions and the characters' directions indicated where to start from.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 6:53 UTC (Tue) by LtWorf (subscriber, #124958) [Link]

I think your complicated solution of word boundaries wouldn't even solve the issue the article is about.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 9:26 UTC (Tue) by dottedmag (subscriber, #18590) [Link]

In string literals that's a common thing, there are names that are combinations of Cyrillic and Latin letters.

Visually escaping literals and comments is a problem, yeah.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 22:11 UTC (Tue) by bartoc (guest, #124262) [Link] (1 responses)

People actually _do_ need to mix characters from different languages. Many, many people are multilingual. Many, many people want to be able to write direct quotes in a different language than the surrounding text.

Still more people want to be able to write all kinds of math characters in various surrounding texts, using various appropriate modifiers. Besides, allowing the mixing means you don't have to figure out which encoding a document is in, and latin is a decent choice for the first 127 characters, given how much metadata and such is transmitted in ascii, and the fact that it has so few characters.

Compared to other common encodings UTF-8 is by far the best, as it's self-synchronizing (allowing easy searching and error recovery) and pretty compact.
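The self-synchronizing property is easy to demonstrate: continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset you can find the next character boundary without decoding from the start. A minimal Python sketch (the function name is made up for the example):

```python
def next_char_start(data, offset):
    """Advance from an arbitrary byte offset to the next code point start.

    UTF-8 continuation bytes match 0b10xxxxxx; lead bytes and ASCII
    bytes never do, so skipping continuation bytes resynchronizes.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset
```

Landing in the middle of a multi-byte sequence therefore costs at most three extra byte reads, which is what makes searching and error recovery cheap.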

The "Trojan Source" vulnerability

Posted Nov 4, 2021 7:35 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

Mixing scripts in text is very natural for anyone who speaks multiple languages; that leads to mixed scripts within “words” since computer languages are fond of concatenation when constructing identifiers (an “evil” English-derived trait, also present in many human languages).

The "Trojan Source" vulnerability

Posted Nov 2, 2021 6:57 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (10 responses)

> That's great news! I've always been extremely irritated by UNICODE's ubiquity, and more importantly UTF-8 that allows to mix UNICODE with plain ASCII and that allows whatever stupidities to be interleaved with machine-readable text.

Do you speak any other language besides English?

Because English happens to be the ONLY language in the whole world that can be correctly written limiting yourself to ASCII.

Well maybe Latin too, but not any modern day Latin derivations.

And yes of course all of the European languages mix ASCII + other stuff for their non ASCII stuff.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 9:09 UTC (Tue) by eru (subscriber, #2753) [Link] (2 responses)

> Because English happens to be the ONLY language in the whole world that can be correctly written limiting yourself to ASCII.

Nitpicking: there is at least one other living language: Swahili, spoken by millions in East Africa (sometimes as a second language; it is a lingua franca there).

Incidentally, those of us that need more than A-Z (like me) can be thankful to emojis, requiring them to be supported has wonderfully increased Unicode support availability.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 21:32 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Emojis also really helped with a case which might have eventually been solved by the dominance of UTF-8 but meanwhile sucked.

A bunch of software implemented UCS-2 ("Unicode" as it was conceived when it was strictly a 16-bit code in the 1980s) and then said, well, we've basically done Unicode. Before Emojis were popularized, anybody explaining why that's wrong is just saying technical mumbo jumbo you don't care about. But when your users can't write U+1F4A9 Pile of Poo suddenly the fact you're limited to the Basic Multilingual Plane jumps out as the problem it was all along.

Everything relying on MySQL is affected for example, because MySQL's "utf8" is an alias for "utf8mb3" aka "Not UTF8, but we hoped you wouldn't notice".
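The arithmetic behind this is short enough to show. In this Python sketch (the function name and dictionary keys are illustrative assumptions), a character outside the Basic Multilingual Plane needs four bytes in UTF-8 and two 16-bit units in UTF-16, which is exactly what a three-byte-maximum "utf8mb3" or a strict UCS-2 implementation cannot store.

```python
def encoding_cost(ch):
    """Report where a single character lives and what it costs to encode."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "in_bmp": ord(ch) <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # utf-16-be avoids the byte-order mark, so bytes // 2 is the unit count.
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }
```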

The "Trojan Source" vulnerability

Posted Nov 6, 2021 2:08 UTC (Sat) by ghane (guest, #1805) [Link]

... and Malay, as used in Malaysia and Singapore. I can think of other examples, with less assurance, but the common theme seems to be that the writing, spelling, and general orthography was designed by a British educator or priest, and he had only a (what we now call) ASCII pen and paper.

The downside of this has been that certain spoken sounds and accents have been lost in Malay (they still exist, but no one under 50 uses them), because the written (purely phonetic) form is what is taught. Since only one digraph exists ("sy", as in "SHoulder") the Arabic loan word "khabar" (news) must be written as "kabar", and ends up pronounced that way by non-natives, and most native speakers too.

This was seen as a good thing, as it enabled literacy rates to rise very fast in the 60s and 70s. This has been cemented by the T9 cellphone keyboard and US-en keyboards.

And so, although most Malaysians are somewhat bilingual (everyone has school friends whose native language is Malay, Hokkien, or Tamil), the idea that you would need anything but a Latin keyboard is surprising here.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 12:10 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

> Because English happens to be the ONLY language in the whole world that can be correctly written limiting yourself to ASCII.

Not even then because "café" and "naïve" are (adopted) English words and they came with their spelling and pronunciations. Sure, English orthography is weird enough that the unadorned versions are plausible, but it is certainly easier with the accents on them.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 13:56 UTC (Tue) by Funcan (guest, #44209) [Link]

Most English speakers would write both of those without their diacritics, and indeed a quick web search suggests that is overwhelmingly the case online.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 21:16 UTC (Tue) by rodgerd (guest, #58896) [Link]

New Zealand English relies heavily on macrons for loanwords, since it's the difference between talking about a parrot and a shit.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 17:09 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

> Because English happens to be the ONLY language in the whole world that can be correctly written limiting yourself to ASCII.

I don't think this is strictly true. There's the trivial case of Scots, which uses essentially the same orthography as English. But there's also German, which normally uses umlauts and eszetts but which can use digraphs (e.g. ue or ss) as an expedient if it's limited to the Latin alphabet. There are also oddball cases, like the use of Romaji in Japanese; it isn't standard to write Japanese using only Romaji, but it's certainly possible.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 19:55 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

You can shoehorn anything into the Latin alphabet if you try hard enough (I've even seen click phonemes transcripted as "!"). But this is trivially true of almost any widely-adopted alphabet - you might need to invent or borrow a symbol or two (or use extra diacritical marks), but you can nearly always come up with something that works. I imagine you can write English in the Cyrillic, Greek, or Arabic scripts, with some effort.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 21:01 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> I've even seen click phonemes transcripted as "!"

The standard transcription of isiXhosa uses 'c', 'x', and 'q' (and digraphs/trigraphs containing same, since it has a total of 18 clicks) for its click consonants.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 5:48 UTC (Wed) by LtWorf (subscriber, #124958) [Link]

Well you could probably be understood writing English without X W Y Z.

But I assure you that in Scandinavia you won't generally be understood if you pronounce ö as o and ä as e, and so reading and knowing where a vowel was killed in the name of ASCII might get complicated.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 12:01 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> so we could reasonably imagine that it ought to be a document-level attribute to specify an encoding or direction, and that it could be made easier to embed multiple documents

Text files have no document-level attributes and no mechanism for embedding documents.

The "Trojan Source" vulnerability

Posted Nov 2, 2021 14:52 UTC (Tue) by Wol (subscriber, #4433) [Link] (7 responses)

> But switching charsets and directions at the letter level does not correspond to something humans commonly do (most of them use a single charset at once)

Actually, almost everyone using a right-to-left language does this quite a lot. I believe numbers are pretty much universally left-to-right ...

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 7, 2021 16:57 UTC (Sun) by nix (subscriber, #2304) [Link] (6 responses)

What? Arabic numbers are written right-to-left, even in left-to-right languages. (It's in the name!)

The "Trojan Source" vulnerability

Posted Nov 7, 2021 17:09 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (5 responses)

We call them Arabic numerals because they were introduced to Europe by Arab merchants; the Arabs had previously acquired them from Hindu mathematicians :)

Both Western Arabic numerals (the ones you and I use) and Eastern Arabic numerals (the ones traditionally used in the Middle East) are written with the MSD on the left.

The "Trojan Source" vulnerability

Posted Nov 7, 2021 22:40 UTC (Sun) by Gaelan (guest, #145108) [Link] (4 responses)

I wonder if people who speak RTL languages think of Arabic numerals as "big endian, written left-to-right" (i.e. the same way we do), or as "little endian, written in the normal order"?

The "Trojan Source" vulnerability

Posted Nov 8, 2021 0:29 UTC (Mon) by karkhaz (subscriber, #99844) [Link] (1 responses)

See my two comments on this topic: https://lwn.net/Articles/829994/ and here https://lwn.net/Articles/830017/

Not sure what you mean by the normal order, but when writing in Arabic, numbers are 'little-endian' in that when you read them aloud, you read in the opposite direction to the surrounding text. With words, you read a sentence from right to left. When your eyes encounter a number in the middle of a sentence, you 'skip ahead' to the leftmost digit (which is the highest-magnitude one, as with English), and begin reading the number from left to right. When you've finished reading the number, your eyes jump past the number (leftward) again and continue reading the rest of the sentence from right to left.

(Though in practice, nobody reads numbers one digit at a time, in Arabic or any other language. Unless the number is very long, your eyes can probably parse the entire number with a single glance.)

When hand-writing Arabic, you write from right to left, and if you need to write a number, you move your hand leftwards to leave a gap large enough to fit the whole number that you intend to write. You then write the number left-to-right, as with English. One of the disadvantages of RTL languages is that when handwriting as a right-handed person, you're much more likely to smudge the ink because your hand glides over text that you've only just written.

The "Trojan Source" vulnerability

Posted Nov 8, 2021 15:49 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

To the extent that they can be applied to durable storage outside of computer systems, such as on paper, the terms "little-endian" and "big-endian" refer to the order of the digits as they are placed with respect to the surrounding text—for example, compared to the order in which you would number the words or paragraphs. If the text is right-to-left and numbers are written with the least significant digit on the right, then the numbers are little-endian. (In computers, a little-endian field in a structure does not become big-endian just because a program loads or stores the most significant byte first; only the locations of the digits matter, not the order in which they are read or written.)

Reading aloud is akin to a serial communication protocol, and the order used for serialization can differ from the order used for storage (i.e. the written order). From your description, in Arabic (as in English) the numbers are read in big-endian order since the most significant digit is pronounced first.
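
The distinction between storage order and serialization order can be sketched in Python with the standard `struct` module. The value 0x0102 is an arbitrary example; nothing here is specific to any particular file format or protocol.

```python
import struct

value = 0x0102  # most significant byte 0x01, least significant byte 0x02

little = struct.pack("<H", value)  # stored layout: 02 01
big = struct.pack(">H", value)     # stored layout: 01 02

# "Reading aloud" a little-endian field most-significant-byte first changes
# the order of serialization only; the stored layout is untouched.
msb_first = bytes(reversed(little))

print(little.hex(), big.hex(), msb_first.hex())
```

The field in `little` stays little-endian regardless of the order in which a reader chooses to walk over its bytes.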

The "Trojan Source" vulnerability

Posted Nov 8, 2021 0:35 UTC (Mon) by dtlin (subscriber, #36537) [Link] (1 responses)

From what I understand, in Arabic the 1's and 10's digits are read little-endian, but all other digits are big-endian. For example, 25 is خمسة وعشرون (five and twenty), 125 is مائة وخمسة وعشرون (hundred five and twenty).

The "Trojan Source" vulnerability

Posted Nov 8, 2021 13:23 UTC (Mon) by mpg (subscriber, #70797) [Link]

That's my understanding as well. Though I'd like to point out that numbers are read the same way in German (that is, "five and twenty", "hundred five and twenty"), so it's probably not related to the direction of the surrounding text (RTL for Arabic, LTR for German).

The "Trojan Source" vulnerability

Posted Nov 2, 2021 19:48 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (46 responses)

> But switching charsets and directions at the letter level does not correspond to something humans commonly do (most of them use a single charset at once),

* Suppose you are writing a newspaper article about some guy from the UK who did something newsworthy. But you're in Iran. Are you really going to respell the person's name into the Arabic abjad? What if their name is phonotactically invalid in Arabic?
* What if you're living in Israel (Hebrew is RTL), and you want to write a message to someone, telling them the address of a location in the United States. You certainly won't respell that, if you actually want your recipient to find it!
* Obviously, this goes both ways, so you can just as easily need to embed Hebrew or Arabic into LTR writing.
* This is not a matter of formatting, either. This is a fundamental requirement to lay out the text correctly, so you can't just kick it upstairs to the rich text folks. It should be handled even by a plain text system.
* And, once you need to embed one script inside of another, somebody has to write an algorithm for figuring out which way the BIDI-ambiguous characters should go (particularly U+0020 SPACE). It is obviously impossible for this algorithm to always get everything right, so now you need override characters to correct it.
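
As a rough illustration of the last point, Python's standard `unicodedata` module exposes the bidirectional class that the layout algorithm starts from. The sample characters and the `find_bidi_controls` helper are my own choices for this sketch, not part of any tool mentioned in the thread.

```python
import unicodedata

# Escapes are used so that no raw control characters appear in this snippet.
samples = [
    "a",        # Latin letter
    "\u05d0",   # HEBREW LETTER ALEF
    "\u0627",   # ARABIC LETTER ALEF
    "1",        # European digit
    " ",        # U+0020 SPACE
    "\u202e",   # RIGHT-TO-LEFT OVERRIDE
]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.bidirectional(ch)}")

# Bidi classes of the explicit embedding, override and isolate controls.
EXPLICIT_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, character name) for each explicit bidi control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in EXPLICIT_CONTROLS
    ]

print(find_bidi_controls("ab\u202ecd"))
```

Letters carry a strong direction (L, R or AL), while the space is class WS and takes its direction from context, which is where the algorithm has to guess and where the override characters come in.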

The "Trojan Source" vulnerability

Posted Nov 2, 2021 22:56 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> Are you really going to respell the person's name into the Arabic abjad?

Japanese writers respell European personal and place names into katakana all the time, which does vastly more violence to the phonology of English than writing it in Persian script(1) does.

(1) Arabic script extended with letters for consonants found in Persian but not Arabic; coincidentally, these turn out to be the same consonants found in English but not Arabic.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 3:30 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (5 responses)

> * Suppose you are writing a newspaper article about some guy from the UK who did something newsworthy. But you're in Iran.

This is exactly the type of thing I was mentioning previously: here it's about complex documents and you can use plenty of typesetting options offered by the software involved for writing this newspaper, you can switch to other charsets or directions just like you can change the font style (bold/italics) or include math formulas or paste images in the middle of the text. All stuff that isn't covered by UNICODE and is properly supported by LaTeX or HTML for example. I remember having had to write "&eacute;" in HTML to see an "é" in a page, and so what? It did work fine, and editors were made to ease this input.

You'll notice that for regular use, even for pure humans it's difficult to switch directions on the same line. You have to leave room for what you need to write without being certain it will fit. And for printers like Gutenberg in their time, it would have required to push all the characters on the same line in the same order anyway. We just created a new requirement that didn't exist for pure convenience or laziness.

Similarly it's not correct to use multiple charsets inside a same word. Some glyphs look similar, and they will all be read as coming from the same charset by a human (or even an OCR software), which is the problem caused by homoglyphs, which is another problem that was purely created by UNICODE and that didn't exist in the real world.

I sincerely think that being able to state the charset on a per-word basis would fit the usage real humans have of charsets. And the writing direction could then be applied to the charset itself.

> * And, once you need to embed one script inside of another, somebody has to write an algorithm

Yes absolutely but just like your browser will decide where to place an image or like LaTeX will decide how to render a math formula. We must not confuse the encoding and the software.

But anyway we have what we have now, which covers much more than what we need and causes tons of problems. Just think that with the help of IDN it might even be possible to register the domain "<RTL>nwl<LTR>.net" and try to impersonate "lwn.net", or to do the same using some homoglyphs, though there are not that many possible in "lwn", maybe the "l" could be replaced with something looking like an l. It's not something we needed.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 3:59 UTC (Wed) by sfeam (subscriber, #2841) [Link] (1 responses)

"Similarly it's not correct to use multiple charsets inside a same word."

That may be true in English, but do not be hasty to overgeneralize. For instance there is a fascinating collection of posts on LanguageLog (topic: Diglossia and digraphia) documenting how letters or glyphs from one language are becoming borrowed elements of another. For example, one post reports that 啾C烤雞 has become a fast-food item in Taiwan, using C as a component of the borrowed term "juicy chicken". Another reports: 'Note that "快D" is Hong Kong's very common spelling for "hurry up" (again, I believe there is no equivalent Chinese character available for the "D")', with a followup post noting that the use of the English letter D in the written form of Hong Kong Cantonese was first described back in 1982 [references given].

Here's a nice one discussing the English➝Japanese➝Mandarin evolution of "up主" link.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 8:12 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

But then what you're describing is nothing more than an evolving alphabet, just like the "u" and "v" split from the Latin alphabets ~500 years ago or "i" and "j" split ~200 years ago. And mixing charsets to borrow some characters from a set just means that we admit not being able to make some such charsets evolve, and in this case only the original characters ought to be considered, regardless of any origin, and then there ought not be any alias (i.e. Cyrillic mostly borrowed from Greek, including some letters originally from the RTL parts, and in this case there's no reason some of these chars would differ from the Latin ones when they both share the same origin).

The "Trojan Source" vulnerability

Posted Nov 3, 2021 4:11 UTC (Wed) by dtlin (subscriber, #36537) [Link]

Mixing glyphs from different charsets within a single word is not an error in many real life usages, such as the modern Taiwanese adoption of the Latin letter Q to represent a sound that has no corresponding hanzi, or probably more troubling, Wakhi which mixes Cyrillic and Greek letters into its Latin script.
There are places where Unicode contains functionality which is handled elsewhere – for example, interlinear annotation control characters for ruby text exist even though markup is recommended to be used instead – but denying mixed script words is ignoring reality.
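
To make the mixed-script point concrete, here is a rough Python sketch. The standard library has no script property, so the first word of each character's Unicode name is used as a crude stand-in for its script; that is an assumption of this sketch, and it mislabels digits and punctuation. The function name is hypothetical.

```python
import unicodedata

def rough_scripts(word):
    """Approximate the set of scripts in a word from Unicode character names."""
    scripts = set()
    for ch in word:
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "LATIN CAPITAL LETTER D" -> "LATIN"
            scripts.add(name.split()[0])
    return scripts

hurry_up = "\u5febD"   # the Hong Kong spelling quoted above: CJK ideograph + Latin D
spoofed = "p\u0430y"   # Latin p, Cyrillic a (U+0430), Latin y

print(rough_scripts("lwn"))
print(rough_scripts(hurry_up))
print(rough_scripts(spoofed))
```

The legitimate "快D" and the spoofed word both come back as mixed-script, so a check like this cannot tell real usage from an attack on its own.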

The "Trojan Source" vulnerability

Posted Nov 3, 2021 13:14 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

>the problem caused by homoglyphs, which is another problem that was purely created by UNICODE

This turns out not to be the case.

GOST 19768-74 (1974, defining KOI-8, the Soviet 8-bit encoding for mixed Latin and Russian Cyrillic text) and ELOT 928 (1986, defining the Greek 8-bit encoding for mixed Greek and Latin text, which was adopted by ISO a year later as ISO/IEC 8859-7) both predate Unicode and both cause homoglyphs to exist.

(It should be noted in passing that users of the Cyrillic and Greek alphabets have embraced Unicode with great enthusiasm.)
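
A small Python sketch of the same observation. It uses Python's `koi8_r` codec as a stand-in for the KOI-8 family, which is an assumption on my part since GOST 19768-74 itself is not available as a codec, and `iso8859_7` for the Greek case.

```python
latin_a = "A"
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA

# Each legacy 8-bit encoding has room for both the Latin letter and its
# look-alike, at different byte values.
koi8_latin = latin_a.encode("koi8_r")
koi8_cyrillic = cyrillic_a.encode("koi8_r")
greek_latin = latin_a.encode("iso8859_7")
greek_alpha_byte = greek_alpha.encode("iso8859_7")

print(koi8_latin.hex(), koi8_cyrillic.hex())
print(greek_latin.hex(), greek_alpha_byte.hex())
```

Two distinct single-byte values that usually render identically is all a homoglyph needs; no Unicode is involved.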

The "Trojan Source" vulnerability

Posted Nov 3, 2021 23:18 UTC (Wed) by mpg (subscriber, #70797) [Link]

> with the help of IDN it might even be possible to register the domain "<RTL>nwl<LTR>.net"

I don't think that particular one is possible: https://datatracker.ietf.org/doc/html/rfc5892#page-51 has "200E..2064 ; DISALLOWED # LEFT-TO-RIGHT MARK..INVISIBLE PLUS".

Regarding homoglyph attacks, some registries have rules to prevent them, for example .eu uses homoglyph bundling: https://eurid.eu/en/register-a-eu-domain/domain-names-wit... and browsers have heuristics as well.

But more importantly, we need to remember https://xkcd.com/538/ and look at the real world: homoglyph attacks don't actually matter that much for phishing because much simpler attacks are already effective enough. The IDN faq has links to a couple of studies: https://unicode.org/faq/idn.html#1
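
For what it's worth, the RFC 5892 row quoted above can be checked mechanically. This Python sketch only encodes that one quoted range, not the full IDNA rules, and it assumes the "<RTL>" and "<LTR>" placeholders in the example domain stand for the override characters U+202E and U+202D.

```python
# The single row quoted from RFC 5892: 200E..2064 ; DISALLOWED
DISALLOWED_RANGE = range(0x200E, 0x2064 + 1)

def in_quoted_disallowed_range(ch):
    """True if the character falls inside the quoted DISALLOWED row."""
    return ord(ch) in DISALLOWED_RANGE

# Hypothetical rendering of the "<RTL>nwl<LTR>.net" example.
label = "\u202enwl\u202d.net"

blocked = [f"U+{ord(ch):04X}" for ch in label if in_quoted_disallowed_range(ch)]
print(blocked)
```

Both override characters fall inside the row, and so do the marks U+200E and U+200F, while the plain ASCII letters do not.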

The "Trojan Source" vulnerability

Posted Nov 3, 2021 14:58 UTC (Wed) by mpg (subscriber, #70797) [Link] (38 responses)

While I fully agree with your last two points, I don't think the arguments in the first three points are quite correct.

> Are you really going to respell the person's name into the Arabic abjad?

As a matter of fact, yes of course you are. Here's an experiment everyone can do, no need to speak multiple languages:

1. pick a wikipedia page about an event that you know is going to name a number of people, for example https://en.wikipedia.org/wiki/2016_United_States_presiden...
2. locate the "languages" section in the side bar and open any number of translations in languages that are not written in the Latin alphabet
3. check if you see any names written in the Latin alphabet in the middle of the text - you won't (except perhaps for pictures of campaign posters in this case).

I hope this will convince you that people are in fact transliterating names all the time. Now I'm going to argue that it's indeed the most sensible thing to do in most circumstances.

> What if their name is phonotactically invalid in Arabic?

The thing about loss of phonetic information in transliterations is, it doesn't matter at all to the intended reader: the phonetic information that's lost mostly doesn't make sense to the reader anyway. That is, the reader will likely not hear the difference, and is probably unable to pronounce the name correctly. (Assuming of course the reader doesn't speak the source language.)

Also, even when using the same alphabet (non-random example: a French person reading the name of a Polish colleague), a lot of phonetic information is lost: the mapping between (groups of) letters and sounds varies wildly across languages that use the Latin alphabet (and for a given language, can also be pretty complex and very far from one-to-one). So, information loss is far from being a specific feature of transliteration.

It's actually even possible for less phonetic information to be lost when transliterating. As an example based on personal experience, if I show my name written in the Latin alphabet to a random Arabic-speaking person (who doesn't happen to also speak French) the most likely result is they're going to pronounce it a bit like an English speaker who's not familiar with that name would pronounce it, which is quite different to how it's pronounced in French. If I show them an Arabic transliteration instead, their pronunciation is going to be closer to the original.

Of course, when you're not familiar at all with the writing system of the name (like, if I were to read a Russian name in Cyrillic for example), then the phonetic information is exactly zero, while a transliteration would at least convey non-zero (if approximative) information on how the name is pronounced.

Finally, apart from phonetics, in an unfamiliar writing system, even the most basic operations like testing for equality can't be taken for granted: if you see the same name written in different fonts, will you recognize that it's the same? If you see two blobs that look similar, can you confidently conclude that they're indeed the same name?

So yes, transliterating is the right thing to do: it's the only reasonable option when the target reader can't be assumed to be familiar with the original writing system, and even when they are, not transliterating imposes a high cognitive load on the reader, while not transmitting more phonetic information unless the reader is actually familiar enough with the original language, not just its alphabet.

> * Obviously, this goes both ways, so you can just as easily need to embed Hebrew or Arabic into LTR writing.

I think we should acknowledge that the situation is very far from being symmetrical. Take your first point and "reverse" it: "you're writing an article in a UK newspaper about some guy from Iran who did something newsworthy. Are you really going to respell the person's name into the Latin alphabet? What if their name contains sounds that are not part of English?" Unless we live in _very_ different bubbles, all the names you're seeing in your news sources are systematically transliterated to English, and most of the time you've never seen them written in their original writing system even once. And most of them are full of sounds to which the transliteration can't do justice. (Case in point: did you know that "Arab" in Arabic starts with a consonant sound, before the first "a" sound?)

I think your examples rely on the implicit assumption that everyone has at least some familiarity with the Latin/English alphabet. I'm not saying that's not true, I'm just saying the situation is very asymmetrical.

> * What if you're living in Israel (Hebrew is RTL), and you want to write a message to someone, telling them the address of a location in the United States. You certainly won't respell that, if you actually want your recipient to find it!

I think that's a more convincing example than the name of a person in a newspaper, but even then, that depends on what you want your recipient to be able to do with the address. If I'm given an address in a writing system I'm not familiar with, about the only things I can do with that is (1) copy-paste it and (2) show it to a native speaker. I can't pronounce a reasonable approximation of it. I can't break it into parts. Other than copy-pasting, I can't copy it (either using a keyboard on another device, or on a piece of paper). It's not even sure I can recognize the street name on a street sign or a map because it's going to be using a different font (and I can't tell which part of the address is the street name). Granted, if I'm using a navigation app, all I need to do is to copy-paste the name into the app. But now consider if I'd been given GPS coordinates instead: compared to the native name, GPS coordinates make less sense to a native speaker, but OTOH I'm able to type them on a keyboard or write them down on paper. So, I don't think it's that clear-cut.

Have you travelled to parts of the world where the main language does not use the Latin alphabet? Do you remember being given addresses in the local writing system or an English transliteration? I can't answer for you, but as a data point, I just checked the French-speaking travel guide about Jordan that's in my library: there's not a single Arabic character in the book, and all addresses are given in French/English transliteration. (Which makes sense considering it's printed on paper so I can't copy-paste from it.)

So, while I agree that writing bidirectional text is useful and needs to be supported (including in plain text), I don't think the particular examples you're giving are that convincing, because if we followed that line of reasoning, we should see a lot more non-Latin scripts in the parts of the world that use the Latin alphabet, and that's just not what we observe.

I think perhaps a stronger argument would be that in practice there is a dominant writing system, which is written in a certain direction, and if your native language happens to be written in the other direction, then you'll end up having to handle bidirectional text, because the dominant writing system is everywhere (especially when computers are involved). Said otherwise, for people whose native language is written LTR, bidi is a nice to have for some cases, but for people whose native language is written RTL, it's a must. It's all too easy to forget about that when we're in the former category.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 15:45 UTC (Wed) by mpg (subscriber, #70797) [Link]

Sorry for the disproportionately long response, I hope it doesn't come off as aggressive, that certainly wasn't the intention.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 17:19 UTC (Wed) by Wol (subscriber, #4433) [Link]

> The thing about loss of phonetic information in transliterations is, it doesn't matter at all to the intended reader: the phonetic information that's lost mostly doesn't make sense to the reader anyway. That is, the reader will likely not hear the difference, and is probably unable to pronounce the name correctly. (Assuming of course the reader doesn't speak the source language.)

I learnt Russian many moons ago. I rapidly learnt that they have more letters than we do :-) Because they have more sounds than we do.

Ever wondered why Japanese/Chinese speak an "l" when they read an "r"? Ever wondered why many Europeans speak a "d" when they read "th"? It's because they CAN'T HEAR IT.

Far-eastern babies can easily tell the difference between "l" and "r". I guess European babies can easily tell the difference between "th" and "d". But because they only ever hear adults using ONE of those sounds, the brain loses the ability to hear the other - it forces all sounds to sound like what it expects to hear.

So when I talk to a native Russian speaker, and they use their consonant that is actually half-way between "th" and "d", I *cannot* hear what they actually said, my brain tells me "that's a "d"" or "that's a "th"", despite the reality being it was neither.

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 3, 2021 20:16 UTC (Wed) by tzafrir (subscriber, #11501) [Link] (2 responses)

> 1. pick a wikipedia page about an event that you know is going to name a number of people, for example https://en.wikipedia.org/wiki/2016_United_States_presiden...
> 2. locate the "languages" section in the side bar and open any number of translations in languages that are not written in the Latin alphabet
> 3. check if you see any names written in the Latin alphabet in the middle of the text - you won't (except perhaps for pictures of campaign posters in this case).
>
> I hope this will convince you that people are in fact transliterating names all the time. Now I'm going to argue that it's indeed the most sensible thing to do in most circumstances.

Only up to a point. "Git" is transliterated in Hebrew and Arabic but not in Chinese and Greek. "LWN.net" remains "LWN.net" in Arabic, Korean, Hebrew, Russian and Chinese.

Transliterating is indeed sensible, to keep the text coherent, but not always practical. And no, it's not always simple to figure out the original name from the transliterated one. Or even figure out how to properly pronounce it.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 21:31 UTC (Wed) by mpg (subscriber, #70797) [Link] (1 responses)

> Only up to a point. "Git" is transliterated in Hebrew and Arabic but not in Chinese and Greek. "LWN.net" remains "LWN.net" in Arabic, Korean, Hebrew, Russian and Chinese.

Good point, I only checked people's names, not other names. (And I probably should not have written "all the time", I meant it's a really common thing, not that it's literally universal.) I'll note, though, that the Farsi page for LWN.net uses "LWN.net" in the title but also a transliteration of it right above the box with the logo. (For the other languages, not being familiar with their writing system, I wouldn't recognize a transliteration of a known word if I saw one.)

(By the way, Wikipedia often gives both a transliteration and the original writing, often with the IPA transcription too, and sometimes even a recording, which I find very helpful. Continuing with my previous example, if you look at the article for each candidate rather than the article for the election, the original name is usually given in the first paragraph. But I think that's because we're looking at an encyclopedia, not a newspaper.)

I guess it makes sense that things whose name is also a domain name or a command name are less often transliterated, because you need the original name to visit the site or use the command.

> And no, it's not always simple to figure out the original name from the transliterated one. Or even figure out how to properly pronounce it.

I agree with the first point, but for pronunciation, I've already explained why I think transliteration is not the issue, and that it can even help in some cases. I guess it all comes down to what you're most likely to need to do with the name: if you want to read it easily and pronounce some approximation of it, transliteration is your friend; if you need to type it on your shell prompt or in your browser's address bar, the original name is what you want.

The "Trojan Source" vulnerability

Posted Nov 7, 2021 17:08 UTC (Sun) by nix (subscriber, #2304) [Link]

> (By the way, Wikipedia often gives both a transliteration and the original writing, often with the IPA transcription too, and sometimes even a recording, which I find very helpful. Continuing with my previous example, if you look at the article for each candidate rather than the article for the election, the original name is usually given in the first paragraph. But I think that's because we're looking at an encyclopedia, not a newspaper.)

A hilarious consequence of Wikipedia's transliteration rules combined with people not being able to figure out what words in foreign writing systems mean (or, it seems, even whether they are different words): http://itre.cis.upenn.edu/~myl/languagelog/archives/00518...

The "Trojan Source" vulnerability

Posted Nov 3, 2021 20:47 UTC (Wed) by kleptog (subscriber, #1183) [Link] (29 responses)

I think you even understate the issues with transliterating names.

When you're born you're (usually) given a name in the local language with the local script. If you then move overseas to a place with a different script you get "effects". For example, you apply for a visa and that visa will include a transliteration of your name into the local script of the country. And of course it wouldn't do for you to specify the transliteration yourself, no, it has to be done using some "official transliteration", which of course varies over time and place.

Which means that you can collect official documents all of which ostensibly have your name, but which write it different ways. Which of course leads to these same officials complaining that your documentation isn't consistent. I've even seen cases where members of the same family end up getting differently spelled family names. Or just get it flat-out wrong. Because the officials cannot read the original script they also cannot check if it's correct. And they certainly aren't going to take your word for it.

You do make a really good point about how all the names you see in newspapers are all transliterated into the local language. I suppose states publish lists of the official transliterations of their leaders' names. And there's a definite Latin script privilege.

Lots of good comments in this thread btw.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 21:11 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> When you're born you're (usually) given a name in the local language with the local script. If you then move overseas to a place with a different script you get "effects".

Which, so I've heard, can even include a completely new name.

Allegedly, in Iceland all official Icelandic documentation must have local Icelandic names. So apply for citizenship and you have to take a local name ...

And that was (in practice, if not in law) true for America not that long ago, in that many Ellis Island officials, if they didn't understand your original name, would give you an Americanised version, or even a completely new name.

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 3, 2021 22:37 UTC (Wed) by TomH (subscriber, #56149) [Link]

Whilst many immigrants did change their names the idea that it was done by immigration officials at Ellis Island is largely a myth I'm afraid.

See https://genealogy.stackexchange.com/a/3825/45 for some discussion and pointers to more articles on the subject...

The "Trojan Source" vulnerability

Posted Nov 3, 2021 22:05 UTC (Wed) by mpg (subscriber, #70797) [Link] (11 responses)

Indeed, transliteration can be a total nightmare on official papers. (My partner, who's a social worker, has seen lots of horrifying stories about this.) That's certainly not a problem I want to minimize, but unfortunately using the original script on official papers would not really be a practical solution either, because you need something that local officials can work with too.

I think having a standardized lossless transliteration would help a lot. I hear Mandarin has an official romanization called pinyin, which I hope makes the administrative life of Chinese people in Latin script countries less painful. But defining such standards for each pair of languages/scripts seems like a daunting task. I guess there's just some irreducible complexity in working across language/script barriers.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 22:51 UTC (Wed) by sfeam (subscriber, #2841) [Link] (4 responses)

"unfortunately using the original script on official papers would not really be a practical solution either, because you need something that local officials can work with too"

Maybe. But let me quote directly from the ballot instructions for our local (US) election this week:

While your signature doesn't need to be written in cursive or even legible, your signature on the return envelope does have to match what's on file. [...] You can always update your signature by returning a paper registration form or coming to see us in person at a Vote Center.

As I understand it, this extends to writing your name in non-English characters, so long as you are consistent in what you use.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 23:41 UTC (Wed) by mpg (subscriber, #70797) [Link] (3 responses)

I think the set of operations that local officials need to be able to perform on your signature is very reduced: compare two versions (presumably) written by you for equality. By contrast, the set of operations to perform on a name is larger, from comparing two versions of a name in different fonts, to copying the name from an application form (or your birth certificate) into a computer system for processing, which is something I know I wouldn't want to do (and probably wouldn't do very reliably if I tried) in a script I'm not familiar with.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 3:02 UTC (Thu) by sfeam (subscriber, #2841) [Link] (2 responses)

"copying the name from an application form into a computer system for processing" is not guaranteed to be something a typical local official can do in places like Japan and China where the set of characters used for names includes some that are not in any available font, or not in unicode at all. And even if there is a code point that would serve, the official may well not know how to enter it. See for example A Limitation on Names in the PRC".

This may be getting a bit far afield from the issue of source code vulnerabilities, but the relevant point is that people have legitimate reasons for wanting to be able to write words, including possibly their own name, that include code points outside a single standard character set.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 16:30 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

> This may be getting a bit far afield from the issue of source code vulnerabilities, but the relevant point is that people have legitimate reasons for wanting to be able to write words, including possibly their own name, that include code points outside a single standard character set.

I think if you want to be that accommodating then you need to accept an image rather than mixed character sets. People may well have legitimate reasons for wanting to be able to write words, including possibly their own name, that include glyphs outside *any* standard character set. There isn't much you can do with something like that, however, beyond display it back exactly as it was entered. It's more reasonable to insist on Unicode everywhere for words which are actually intended to be understood by other people and/or processed by computers, and deal with any missing glyphs by getting them added to the standard.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 17:09 UTC (Thu) by mpg (subscriber, #70797) [Link]

I had no idea, thanks for sharing, the article was an interesting read.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 7:58 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> I hear Mandarin has an official romanization called pinyin, which I hope makes the administrative life of Chinese people in Latin script countries less painful.

Pinyin needs tonal marks (diacritics) to be understandable. And guess what gets omitted all the time?

The "Trojan Source" vulnerability

Posted Nov 4, 2021 21:28 UTC (Thu) by mpg (subscriber, #70797) [Link]

Aw, why can't things just be simple for once?

Indeed, most locals won't know how to type those diacritics (most French people don't even know how to type É, À, Ç, which are supposed to be part of French writing), so they're gonna omit them, and we lose uniqueness again, in that perhaps one document will say your name is Chěng, another will have Cheng, and perhaps yet another is going to spell it Cheng3 (which IIUC is how you're supposed to indicate tone if you can't use the diacritics), and local officials are going to start questioning if those are really all the same...
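To make the collision concrete, here is a minimal Python sketch of what typically happens when software (or a clerk) drops the tone marks. The helper name `strip_diacritics` is made up for illustration; it uses the standard `unicodedata` module to decompose each character and discard the combining marks.

```python
import unicodedata


def strip_diacritics(name):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    # unicodedata.combining() is non-zero for combining marks such as
    # the caron in "ě" or the acute accent in "é".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two differently toned spellings end up as the same ASCII string.
print(strip_diacritics("Chěng"), strip_diacritics("Chéng"))
```

Once the marks are gone there is no way to recover which tone was meant, which is exactly the ambiguity described above.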

Sigh.

The "Trojan Source" vulnerability

Posted Nov 8, 2021 1:52 UTC (Mon) by ghane (guest, #1805) [Link] (3 responses)

I live in a Chinese-majority-speaking country, but everyone I know speaks good English, the exceptions being elderly parents, who speak basic English. I have survived here for many years with nothing but English.

The taught form of Chinese (Mandarin-dialect) writing is in Simplified Characters, students also learn Pinyin. Whenever I have actually seen it used (store fronts, etc), it is written un-accented. I have been told that the assumption is that if you really cared, you would be able to read the Chinese characters, and if you can not, then the un-accented letters are good enough, you wouldn't understand what the tones mean. This works.

I understood early on that Peking had not been renamed, it was simply written in a different script, which reused the glyphs found on an English typewriter. But it was still a shock some years ago in Taiwan, when I wanted to visit the tomb of Chiang Kai Shek, and no one, including my English-fluent work colleagues, knew who I was talking about, till I wrote it down. Then it was "Oh, sure, of course". They wrote it down *exactly* the same way as me, but pronounced it totally differently, and insisted they were pronouncing it as I had written it down.

I blame Wade and Giles.

The "Trojan Source" vulnerability

Posted Nov 8, 2021 7:45 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

I always understood China had one written language, but many spoken languages, as in all the languages used the same glyph for the same word, but two different regional languages might actually say them as two completely different words.

I know we actually use slightly different spellings and the pronunciation is similar, but like our lake/loch/lough - all pretty much the same word for the same thing. But we also have mere, and there are probably others I can't think of right now ...

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 8, 2021 9:01 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

That sometimes happens, but not always. Pretty often different languages would use different characters.

The "Trojan Source" vulnerability

Posted Nov 8, 2021 10:13 UTC (Mon) by farnz (subscriber, #17727) [Link]

Arabic's an interesting one in that regard; there are multiple versions of spoken Arabic in use in the Middle East, and (e.g.) Iraqi Arabic is identifiably different from Syrian Arabic to the point where they are close to becoming different languages. But because of the importance in Islam of memorising the Quran accurately, and the respect accorded to those who can recite the Quran from memory, that split has never actually happened.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 23:47 UTC (Thu) by khim (subscriber, #9252) [Link] (14 responses)

> I've even seen cases where members of the same family end up getting differently spelled family names.

And there are cases where you need differently spelled family names, but officials refuse to accept these.

Simplest example: Vladimir Putin has two daughters: Maria Putina and Katerina Putina.

Russians would say it's the exact same family name (it's Putin for male and Putina for female), but German authorities, of course, refuse to accept it, which means a single mother cannot get compatible documents in Germany and Russia because Germans would insist on an unchanged family name while Russian authorities would expect a changed one.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 9:02 UTC (Fri) by anselm (subscriber, #2796) [Link] (13 responses)

> a single mother cannot get compatible documents in Germany and Russia because Germans would insist on an unchanged family name while Russian authorities would expect a changed one.

I'm in Germany and I can assure you that the concept of kids having different surnames than their parents is not unknown here. We do have a reputation for rampant bureaucracy but we're not that dense. In particular, it is absolutely possible in the eyes of the law for a single mother to have a surname that is completely different from that of her child (and that means completely different, not just an appended “a”).

The "Trojan Source" vulnerability

Posted Nov 5, 2021 10:59 UTC (Fri) by khim (subscriber, #9252) [Link] (12 responses)

> We do have a reputation for rampant bureaucracy but we're not that dense.

You may not be that dense, but your authorities… oh yeah. Please read that article, specifically this passage: "you can apply for a letter from the Kenyan embassy in Berlin, then explains this tradition thus allowing you to name your child based on your traditions". Kenyans may have more trouble because of their traditions, but believe me, it's not as simple as saying that you are Putina, but Russian, and thus a male child needs to have the surname Putin.

Apply for the letter from the embassy! That's what is needed to resolve cases like these.

Now, I'm not saying that German authorities are completely heartless. Far from it. They would actually help you, suggest what forms you would need to fill in, what letters from the embassy to bring and so on… but naming a child properly would still remain quite a quest.

P.S. Thankfully, in the case I observed personally the situation was not that dire: the father was available, mother and father had the same Cyrillic surname (different from the German law POV), so it was a simple matter of requesting to pick the surname of a different parent. But in the case of a single parent and no official custody… or parents with different surnames… this may become quite a challenge.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 15:09 UTC (Fri) by anselm (subscriber, #2796) [Link] (11 responses)

> that article

“That article” is a reasonable overview of how the German bureaucracy deals with people's names. The first-name rules can, in practice, be boiled down to “whatever you can get past the person at the registrar's office”; some of them are stricter than others but in general, if you can show that, e.g., an unusual first name is common elsewhere in the world, you'll be fine. I have no words for parents who think it is a good idea to name their offspring “Adolf Hitler”; there is no legal problem if you wanted to name your son “Adolf” (perhaps after his great-grandfather) but for obvious reasons it's not exactly a popular name for children hereabouts today.

As far as surnames are concerned, the tradition is indeed that women who marry will adopt their husband's surname and that will also be the surname of any children resulting from the union. Today it is not uncommon for men to adopt their wife's surname instead, for either the husband or wife to adopt the other's surname as a “double name”, or for both husband and wife to just keep whatever names they were using before they got married. As a married couple you then need to decide what your children's surname will be, and all of your children must have the same one, but you can defer that decision until your first kid is officially registered. (If you then get divorced, revert to the surname you had before your marriage, and keep custody of your kid whose surname is that of your ex-spouse, it is possible for your surname to be completely different from your kid's surname, to everybody's confusion.)

In Germany in general, changing your first name and especially your surname (without getting married or divorced) is very difficult – not like in, say, the UK, where you can adopt a new name basically whenever you like by making an official declaration. This is probably because we like things to be nice and orderly here, and allowing people to have arbitrary name changes somehow doesn't feel right.

All of this of course means that things may indeed not be straightforward for people who arrive in Germany from places where the traditions are different. They're usually dealt with reasonably in the end. If you want to see a place with really strict naming rules, consider Iceland, where the state will force you to give your kids Icelandic names from an officially approved list (although to be fair you can submit new names to be added to that list), and where your surname derives from your parents' first name, so if you're Olaf and your dad is Erik, you'll be Olaf Eriksson but your son Thor will be Thor Olafsson.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 15:32 UTC (Fri) by khim (subscriber, #9252) [Link] (7 responses)

> This is probably because we like things to be nice and orderly here, and allowing people to have arbitrary name changes somehow doesn't feel right.

Yeah, sure, but let's consider a very simple example (a small modification of a real-world example). Assume a different world where Vladimir is not the father of Marina but her son instead.

Thus we have Marina Putina who gave birth to two twins. Boy and girl. Boy should be Vladimir Putin while girl needs to be Katerina Putina. Would German authorities accept that? I'm afraid this would be in direct violation of the rule that "all of your children must have the same one, but you can defer that decision until your first kid is officially registered".

> If you then get divorced, revert to the surname you had before your marriage, and keep custody of your kid whose surname is that of your ex-spouse, it is possible for your surname to be completely different from your kid's surname, to everybody's confusion.

Oh, sure, but that's another story. I knew a family where mother, father and child all had different surnames because of a complex history. We are just talking about the birth certificate here.

> If you want to see a place with really strict naming rules, consider Iceland, where the state will force you to give your kids Icelandic names from an officially approved list (although to be fair you can submit new names to be added to that list), and where your surname derives from your parents' first name, so if you're Olaf and your dad is Erik, you'll be Olaf Eriksson but your son Thor will be Thor Olafsson.

Wow. What happens if mother doesn't know who was father of the child? I guess they may do DNA paternity testing today (although in some cases this may be difficult, e.g. if two potential fathers were marines who drowned before the child was born), but what did they do before it became available?

The "Trojan Source" vulnerability

Posted Nov 5, 2021 16:00 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

> What happens if mother doesn't know who was father of the child?

If the mother doesn't know who the father is, or doesn't want the father involved in the child's life, then her child gets a matronymic instead of a patronymic.

This has apparently been acceptable practice for a very long time, with Wikipedia's page about Icelandic names citing not only modern examples, but also a mediæval poet called Eilífr Goðrúnarson.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 17:49 UTC (Fri) by anselm (subscriber, #2796) [Link] (5 responses)

> Thus we have Marina Putina who gave birth to two twins. Boy and girl. Boy should be Vladimir Putin while girl needs to be Katerina Putina. Would German authorities accept that?

If Mrs Putina was living in Germany at the time, with German citizenship, then as far as German authorities are concerned both her kids would have to be registered as “Putin”. Those are the rules. (If she moved back to Russia later then the girl would presumably get to be “Putina” there, if that's OK with the Russian authorities.) Alternatively Mr Putin and Mrs Putina could opt to have both kids be called “Putina”, but Mr Putin and/or his son, later in life, might not be enthusiastic. I don't know exactly what happens if the Russian citizens Mr Putin and Mrs Putina are visiting Germany and Mrs Putina happens to give birth to twins while they're here. In Germany, citizenship status generally depends on the citizenship status of one's parents, so in that case the kids would be Russian and not German, and Russian rules would apply. (Mr Putin and Mrs Putina would presumably go to the Russian embassy/consulate to have the birth registered, and German authorities would not be involved at all.)

The general rule is that if you already have a name and become a naturalised German citizen, you get to keep your name as it is, but you can optionally have it adjusted to suit German conventions. So if Ms. Putina, upon assuming German citizenship, would, for example, prefer to be called “Putin”, that's not a problem. The same would presumably apply to her children if they had been born outside Germany. (These are exceptions to the general rule that you're not supposed to change your surname at all except when you marry, get a divorce, or become adopted.)

The "Trojan Source" vulnerability

Posted Nov 5, 2021 18:07 UTC (Fri) by khim (subscriber, #9252) [Link] (4 responses)

Note that you may live in Germany without having German citizenship (if you have a Blue Card, e.g.).

Your children would become German citizens (in addition to having the citizenship of their parents) and that is when shit hits the fan.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 22:50 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> > In Germany, citizenship status generally depends on the citizenship status of one's parents, so in that case the kids would be Russian and not German, and Russian rules would apply.

And what if Russian rules conflict?

A friend of mine was born in Nigeria. So although she was a British national, as a "national born abroad" her children had no automatic right to inherit. The first two were born in England so that wasn't a problem. The third was born in Bahrain and could not inherit. Fortunately the little girl's father was British-born so she inherited from him.

Oh - and as for citizenship depending on the status of the parents, why can't I claim German nationality? My mum was born in Germany to a German mother but she wasn't entitled to citizenship, so I can't inherit ... (actually, I could try claiming under the Jewish refugee rules, but it's a bit of a long shot...)

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 6, 2021 0:52 UTC (Sat) by anselm (subscriber, #2796) [Link] (1 responses)

> Oh - and as for citizenship depending on the status of the parents, why can't I claim German nationality? My mum was born in Germany to a German mother but she wasn't entitled to citizenship, so I can't inherit ...

Your mum was presumably born before 1975. Before 1975, the rule was that for someone to obtain German citizenship at birth, their father¹ had to be German, or their unmarried mother had to be German. If they had a German mother who was married to a non-German, that was just too bad as far as their (then non-existent) German citizenship was concerned. Without knowing more about the specific circumstances, it seems that your mum might be a victim of that unfair rule, and since in that case she's not entitled to German citizenship by birth, neither are you.

The law was changed in 1975 after the German federal constitutional court had pointed out its blatant asymmetry, and the rules are now more relaxed. If the current rules had been in effect when your mum was born, she would indeed have been entitled to German citizenship, and that would quite likely have applied to you, too, even if your mum had been permanently living in the UK at the time of your birth (there are more rules).

1. “Father” here means “person legally married to their mother”, not “male person involved in their conception”. That particular rule still applies.

The "Trojan Source" vulnerability

Posted Nov 6, 2021 19:53 UTC (Sat) by Wol (subscriber, #4433) [Link]

> Your mum was presumably born before 1975. Before 1975, the rule was that for someone to obtain German citizenship at birth, their father¹ had to be German, or their unmarried mother had to be German.

:-)

I started secondary school that year ...

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 6, 2021 0:26 UTC (Sat) by anselm (subscriber, #2796) [Link]

> Your children would become German citizens (in addition to having the citizenship of their parents) and that is when shit hits the fan.

The rule is that if at least one parent has lived in Germany for eight years or more and has an indefinite right of residence in the country, any of their children who are born in Germany are entitled to German citizenship (this is generally not viewed as a disadvantage). They can have German citizenship in addition to whatever citizenship they may get from their parents. When they're 21, they may be asked to pick one or the other unless they have spent a significant part of their youth in Germany (eight years of residence, or six years of school, or graduation from school, or completion of a vocational qualification), in which case they may keep both. I don't know offhand how the Putin/Putina issue is dealt with in such a case, but since the parents aren't German citizens, the restriction that all their children must have the same surname may not apply; if Mr Putin and Mrs Putina go to the Russian embassy to make sure that their Germany-born kids have Russian citizenship as Vladimir Putin and Katerina Putina, then when the question comes up of whether little Katerina can also have German citizenship because her mum and/or dad has been living in Germany for the last eight years, she may already have the Russian “Putina” surname and the German authorities may well pick that up and run with it (as we said before, if you're naturalised in Germany you get to keep whatever name you already have). But maybe not.

Generally, countries award citizenship to newborn children based on the citizenship of their parents (ius sanguinis, or “law of the blood”) or based on whether they're born in the country (ius soli, or “law of the ground”). Germany mostly goes by ius sanguinis (i.e., if one of your parents is German, you get to be German, too, even if you're born abroad) but we operate the ius soli exception discussed above for the benefit of the children of migrant workers who came to Germany from abroad and settled here (the EU makes this very easy now, but in the 1960s there was a large influx of workers from places like Turkey, and in consequence there are now many second-generation or third-generation immigrants who are German citizens of Turkish extraction). This is in contrast to, say, the United States, which is very much a ius soli country – if you're born in the US, you're entitled to US citizenship no matter what your parents' citizenship is (unless your parents are foreign diplomats), but that's not how it works in Germany.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 16:25 UTC (Fri) by Wol (subscriber, #4433) [Link]

> In Germany in general, changing your first name and especially your surname (without getting married or divorced) is very difficult – not like in, say, the UK, where you can adopt a new name basically whenever you like by making an official declaration.

My wife went through THREE different names the week we got married. She'd been married before so turned up at the rehearsal with one name, and left with a completely different one (new first name, original surname) courtesy of my best man who is a Solicitor. Then a few days later we got married and she left with my surname :-)

Cheers,
Wol

The "Trojan Source" vulnerability

Posted Nov 7, 2021 17:16 UTC (Sun) by nix (subscriber, #2304) [Link] (1 responses)

> I have no words for parents who think it is a good idea to name their offspring “Adolf Hitler”

The problem is that of course he was a powerful man in the 30s and 40s and back then if Germany had had a nontrivial empire it would likely have been common to name people in non-German parts of that empire after someone powerful in the imperial capital.

Which is how we get M. K. Stalin, the leader of Tamil Nadu and a notable threat to the BJP (because popular, successful, and competent). Literally named after Joseph Stalin, shortly before the latter's death.

I suppose it might have been barely possible to get his parents to pick a different name, but his father, M. Karunanidhi, was a fairly powerful figure even back then... tricky.

The "Trojan Source" vulnerability

Posted Nov 7, 2021 23:26 UTC (Sun) by anselm (subscriber, #2796) [Link]

> The problem is that of course he was a powerful man in the 30s and 40s and back then if Germany had had a nontrivial empire it would likely have been common to name people in non-German parts of that empire after someone powerful in the imperial capital.

Possibly. But in 21st-century Germany, even the neo-Nazis seem to realise that it's not a great idea. Even calling your son “Adolf” (without the “Hitler” – after, say, your beloved late grandfather) probably amounts to setting him up for a huge amount of schoolyard grief that he could just as well do without.

Having said that, in Spanish-speaking countries “Jesus” appears to be a reasonable name for boys but although it is perfectly legal here in Germany, it is practically a non-starter. According to statistics, only 2 out of 100,000 kids born in Germany per year are called Jesus, and the general consensus even among religious people is that it's not exactly doing them a favour.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 7:49 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (2 responses)

> So yes, transliterating is the right thing to do:

It is not, because there is no such thing as an authoritative Latin (for example) transliteration: different languages use the same symbols to encode different sounds, therefore a transliteration only makes sense when targeting a specific language.

Many transliterated names make zero sense because they are used in language where phonetics are different from the language they were transliterated to in the first place.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 15:31 UTC (Thu) by mpg (subscriber, #70797) [Link]

I fully agree that there's (usually) no such thing as a universal "Latin" transliteration, and transliterations need to be language-specific. And I think that's actually what people are doing: recycling my little Wikipedia experiment we can look at translations of https://en.wikipedia.org/wiki/Dmitri_Shostakovich - this time in Latin-based scripts - and observe the variety of transliterations used in different languages using (various extensions of) the same script.

> a transliteration only makes sense when targeting a specific language.

I don't disagree, but all texts I can think of target a specific language, just by virtue of being written in that language.

> Many transliterated names make zero sense because they are used in language where phonetics are different from the language they were transliterated to in the first place.

Yes, of course if people make the mistake of using a transliteration that was made for a different language than theirs, the results are not going to be good. But that's a pretty avoidable mistake, and again, that's not at all specific to transliteration: if I look at a Polish name, I have little idea how it's actually pronounced, regardless if the name is originally Polish or a Polish-oriented transliteration of a Russian name.

So, I stand by my opinion that transliterating (in a way that's appropriate for the target language) is the right thing to do most of the time for names of people in the context of a news article. (For names of command-line programs or websites, you probably want the original name instead or in addition. In the context of an encyclopedia, I think giving both a transliteration, the original name, plus an IPA transcription and when possible a recording, as wikipedia does, is ideal. In the context of administrative documents such as visas etc, honestly I don't know what the ideal solution would be: non-standard transliterations are a source of important problems, but I feel like using the original script would be a source of other problems; I think there's some irreducible complexity here.)

The "Trojan Source" vulnerability

Posted Nov 4, 2021 15:48 UTC (Thu) by mpg (subscriber, #70797) [Link]

I mean, don't get me wrong: transliterations are terrible and I avoid them like the plague while learning Arabic, not least because as you say everybody's using a different system which is a pain in the neck. But outside the specific context of learning a language, I fail to see what the better alternative would be. So, I have two questions for you:

1. Say you're writing an article in English about the relationships between پاکِستان and भारत. How are you going to refer to the names of the various people involved?

2. If transliterating is the wrong thing to do, why do you think it's so common? I'm not saying the majority is always right, it often isn't, but as a general rule I think when disagreeing with the majority, it's good to be able to articulate what we think it is that all these people are missing.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 19:06 UTC (Wed) by jem (subscriber, #24231) [Link]

>I've always been extremely irritated by UNICODE's ubiquity, and more importantly UTF-8 that allows to mix UNICODE with plain ASCII

What's the problem with mixing Unicode with plain ASCII? Are you also irritated by the ISO-8859-1 encoding, which also allows mixing a subset of Unicode with plain ASCII?
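For anyone who wants to check this, here is a small Python sketch (the helper name `same_in_all_three` is just for illustration). It shows that pure ASCII text is byte-for-byte identical in ASCII, ISO-8859-1 and UTF-8, and that every ISO-8859-1 byte maps to the Unicode code point with the same number.

```python
def same_in_all_three(text):
    """True if text encodes to identical bytes in ASCII, Latin-1 and UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        return False  # contains at least one non-ASCII character
    return ascii_bytes == text.encode("latin-1") == text.encode("utf-8")


# Each ISO-8859-1 byte value b decodes to the code point U+00bb (same number).
latin1_matches_unicode = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(256)
)

print(same_in_all_three("int main"))  # plain ASCII
print(same_in_all_three("caffè"))     # "è" is outside ASCII
print(latin1_matches_unicode)
```

The only difference is how the non-ASCII part is serialized: "è" is one byte in ISO-8859-1 and two bytes in UTF-8.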

The "Trojan Source" vulnerability

Posted Nov 2, 2021 17:05 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

Just limit non-ASCII stuff to comments and string constants. And yes, I speak several languages that use completely non-Latin scripts.

The "Trojan Source" vulnerability

Posted Nov 3, 2021 3:03 UTC (Wed) by dtlin (subscriber, #36537) [Link] (4 responses)

In the very first example on that page, the non-ASCII characters are only in the comments. The trick is that directionality affects the layout of the rest of the line.
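Which is why a check has to look for the directionality controls themselves, wherever they occur. A minimal Python sketch, assuming the explicit embedding, override and isolate characters are the ones of interest (the function name `find_bidi_controls` is made up here):

```python
# Explicit bidirectional formatting characters: embeddings, overrides
# and isolates, with their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(text):
    """Return (line, column, name) for every bidi control character in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits


# The control sits inside a comment, yet it is still reported.
sample = "int x = 1; /* \u202e } if */\nreturn 0;\n"
print(find_bidi_controls(sample))
```

It deliberately does not try to decide whether a hit is inside a comment or a string, since the whole point is that the rendering of the rest of the line is affected either way.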

The "Trojan Source" vulnerability

Posted Nov 3, 2021 20:09 UTC (Wed) by kreijack (guest, #43513) [Link] (3 responses)

Cyberax told something on which I agree: for the identifiers use only the ascii charset. This would reduce the risk of "homoglyphs" and solve another issue, multi-language interoperability: the ASCII charset is the most common subset of characters recognized by everyone, so it creates fewer problems from an interoperability point of view.

As an Italian, I never thought about writing a function name like caffè().

For the other Unicode code points that affect directionality, I would allow them only in comments that span full lines; something like

/*
here I would allow code points that affect directionality
*/

/* here I don't */

For literal strings, I would allow all characters (even the homoglyph ones, though I am sure that raises some concerns) except the ones that affect directionality.

But I don't know Unicode deeply enough to rule out that other risks exist.
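A rough sketch of how such a policy could be checked, written in Python against Python source using the standard tokenize module. This is an adaptation, not the proposal itself: Python has no block comments, so "comments that span a full line" is approximated by a comment with nothing but whitespace before it on its line. The function name check_policy and the message strings are made up for this example, and source that does not tokenize cleanly may raise an error instead of producing a report.

```python
import io
import tokenize

BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def check_policy(source):
    """Return (line, message) pairs for tokens that break the policy."""
    problems = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        has_bidi = any(ch in BIDI_CONTROLS for ch in tok.string)
        if tok.type == tokenize.NAME and not tok.string.isascii():
            problems.append((tok.start[0], "non-ASCII identifier"))
        elif tok.type == tokenize.COMMENT:
            # Stand-in for a full-line comment: only whitespace may
            # precede the comment on its physical line.
            whole_line = tok.line.lstrip().startswith("#")
            if has_bidi and not whole_line:
                problems.append((tok.start[0], "bidi control in trailing comment"))
        elif has_bidi:
            problems.append((tok.start[0], "bidi control in string or code"))
    return problems
```

Non-ASCII text in strings and comments passes untouched; only identifiers and directionality controls are restricted, which matches the split described above.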

The "Trojan Source" vulnerability

Posted Nov 3, 2021 21:37 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

I generally agree with your approach, which I mostly share. However, I also try as much as possible to avoid non-ASCII in literal strings, not for this particular reason but because not everyone has all characters in their fonts. It's not uncommon, even in browsers, to stumble on sites with some stupid character appearing as a square with 4 hex digits in it because the character is missing from the local font. Most often that's acceptable on a site (except when it's the logo of a special navigation button), but it can be cumbersome for developers from other countries to have to be very careful not to break anything when dealing with opaque stuff.

But while I mostly work on low-level code where interactions with humans are essentially technical and always in English, I also know that when dealing with higher-level stuff you cannot avoid localization, and then you have little choice but to deal with charsets in literals. I have probably not written a single line of French text in a program in the last 15-20 years, so I agree that it helps to stay away from those monstrosities.

The "Trojan Source" vulnerability

Posted Nov 4, 2021 23:56 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> Cyberax told something on which I agree: for the identifiers use only the ascii charset.

That's always possible but not always feasible. E.g. if you write code for accounting program then you have to deal with the fact that many terms from local law can be translated and transliterated differently to US-ASCII (basically the story of names discussed just above with a bit smaller amount of craziness).

In such a case the ability to use the terms exactly as they are written in official documents is a godsend.

Of course that means that a programmer who cannot understand the appropriate script wouldn't be able to easily edit such a program, but then it may be a blessing in disguise: chances are high that such a person wouldn't be able to understand requests made by officials either, which means that it makes sense to separate that tricky part of the code from the ORM or web-server sources anyway.

The "Trojan Source" vulnerability

Posted Nov 5, 2021 7:52 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> E.g. if you write code for accounting program then you have to deal with the fact that many terms from local law can be translated and transliterated differently to US-ASCII

Then you place a fat big "#pragma ignore_unicode_rules" at the top of the file and affirm that you are OK with keeping all the pieces.

Everything else should limit Unicode-ness to string constants and comments.

This is already what's happening effectively in practice. For example, I tried to find a project on Chinese Github ( https://gitee.com/ ) that actually uses Chinese for identifiers and I couldn't find any. But there are plenty of Chinese-language comments. E.g.: https://gitee.com/dotnetchina/TimeCrontab/blob/master/src... (from the first project on their "Explore" page).

Trojan source is a special case of underhanded code

Posted Nov 2, 2021 17:39 UTC (Tue) by david.a.wheeler (subscriber, #72896) [Link] (1 responses)

As I have posted elsewhere:

The “Trojan Source” paper is interesting. Unicode bidirectional commands have been exploited in other contexts, but this is the first paper I can recall that specifically discusses bidi in source code.

However, I think it’s important to realize this is a special case of “underhanded code” aka “underhanded source code” aka “maliciously misleading code”. Underhanded code is source code crafted so that the source code looks like it does one thing to human reviewers, but it actually does something else. Homoglyphs are a common mechanism of attack (e.g., 1/l or O/0), as are misleading indentation, etc.
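As an illustration of why homoglyphs are awkward to catch, here is a small Python heuristic that flags identifiers mixing scripts. Python's standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of each character's Unicode name; that shortcut, and the names scripts_used and looks_mixed, are assumptions for the example only.

```python
import unicodedata


def scripts_used(identifier):
    """Approximate the scripts in an identifier from Unicode character names."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # Names usually start with the script, e.g. "LATIN SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(identifier):
    """True when letters from more than one script appear together."""
    return len(scripts_used(identifier)) > 1
```

A check like this catches a Cyrillic letter hidden in an otherwise Latin name, but it says nothing about pure-ASCII lookalikes such as 1/l or O/0, which need a different kind of review.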

The first reference I can find to underhanded code is the 2004 Obfuscated V Contest (http://graphics.stanford.edu/~danielh/vote/vote.html) created by Daniel Horn.

Below are some of the related works that discuss underhanded code / maliciously misleading code. My 2020 paper gives a more complete list, and it also describes a brief experiment in *countering* underhanded code. It turns out that a lot of underhanded code can be countered by relatively simple measures... but those measures have to be implemented to work :-). My 2020 paper is here:
https://www.ida.org/research-and-publications/publication...
https://www.ida.org/-/media/feature/publications/i/in/ini...

--- David A. Wheeler

=== SOME RELATED WORKS ===

The Obfuscated V Contest (http://graphics.stanford.edu/~danielh/vote/vote.html) was created by Daniel Horn in 2004 and is the earliest “underhanded” programming contest that I found. It was a contest to create source code that looked like it did one thing, but actually did another.

Underhanded C Contest (http://www.underhanded-c.org/) has run for many years. Per its FAQ, “The Underhanded C Contest is an annual contest to write innocent-looking C code implementing malicious behavior.”

Underhanded Crypto Contest (https://underhandedcrypto.com/). As of this
time, it has run from 2014 to 2018. The contest website does not directly note
the 2018 winners; however, the 2018 winners are presented and discussed in
a DefCon 26 presentation [Caudill 2018]. The set of all entries is available on
GitHub (https://github.com/UnderhandedCrypto/entries).

Underhanded Solidity Coding Contest (USCC) (https://u.solidity.cc/; details
are available at its GitHub site https://github.com/Arachnid/uscc). Solidity is
a contract-oriented programming language for writing smart contracts that can
be implemented on blockchain platforms such as Ethereum. The
announcement of the winners of the first (2017) contest is available at
[Johnson 2017], and the complete set of 2017 winners is posted on GitHub at
https://github.com/Arachnid/uscc/tree/master/submissions-.... The
developers of Solidity used the contest results to improve their tooling.

The “Write a program that makes 2+2=5” discussion on StackExchange at
https://codegolf.stackexchange.com/questions/28786/write-...
makes-2-2-5 shows how to do that in a variety of programming languages.

The “Underhanded code contest: Not-so-quick sort” (https://codegolf.stackexchange.com/questions/19569/underhanded-code-contest-not-so-quick-sort) is a small underhanded code contest. The goal of this contest was to “Write a program, in the language of your choice, that reads lines of input from standard input until EOF, and then writes them to standard output in ASCIIbetical order, similar to the sort command-line program. ... The underhanded part... is to prove that your favored platform is ‘better,’ by having your program deliberately run much more slowly on a competing platform.”

“April Fools Day!” (https://codegolf.stackexchange.com/questions/114891/april-fools-day) is a small underhanded code contest with a few underhanded code samples. The goal is to “write a program or function which appears to print the first ten numbers of any integer sequence (on OEIS, the answerer may choose which sequence), but instead prints the exact text ‘Happy April Fool’s Day!’ if and only if it is run on April 1st of any year.”

The “Underhanded Python” posting (https://gist.github.com/L3viathan/e47d359470d5e18a357c67d9e4328c16) is quite clever. It uses the fact that “//” opens a comment in other languages to fool the reader. It is revealed by syntax coloring but even vim syntax coloring was not obvious enough to immediately reveal the attack.

The 2003 attack on the Linux kernel source code. An attacker attempted to
subvert the Linux kernel in 2003 through underhanded code that used =
instead of ==. This is discussed in [Corbet 2003] and [Felten 2013].

My PhD dissertation "Fully Countering Trusting Trust through Diverse Double-Compiling" discusses how to counter the "trusting trust" problem & includes a section about maliciously misleading source code. See: https://dwheeler.com/trusting-trust/

The JavaScript Misdirection Contest announced the winner on September 27, 2015: http://misdirect.ion.land/

My paper "Initial Analysis of Underhanded Source Code", (by David A. Wheeler, April, 2020, IDA document: D-13166),
discusses underhanded code and the effectiveness of several potential countermeasures. It also includes a number of citations to other works on underhanded code. See:
https://www.ida.org/research-and-publications/publication...
https://www.ida.org/-/media/feature/publications/i/in/ini...

Note that my 2020 paper includes references to many other related works (it includes a literature survey of such work).

Trojan source is a special case of underhanded code

Posted Nov 2, 2021 19:35 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

‮a saw I nehw draobyek a no 'O' ot '0' eht deppamer I nehw em sdnimeR⁦
‮dnah yb secneuqes GATJ lla gnipyt saw ohw yug roop eht taht os tneduts⁦
‮sretcarahc dab deretne dah eh taht gnizilaer tuohtiw liaf meht was⁦
‮(-: seno tcerroc eht esu ot niatrec gnieb etipsed⁦

‮yaw wen a yllaer s'tahT .desrever-ylbuod si tnemmoc siht yaw eht yb hA⁦
‮!ti evol I !elbatsap-ypoc yllaivirt ton txet dna nuf evah ot⁦