
Trojan Source and Python

By Jake Edge
November 16, 2021

The Trojan Source vulnerabilities have been rippling through various development communities since their disclosure on November 1. The oddities that can arise when handling Unicode, and bidirectional Unicode in particular, in a programming language have led Rust, for example, to check for the problematic code points in strings and comments and, by default, refuse to compile if they are present. Python has chosen a different path, but work is underway to help inform programmers of the kinds of pitfalls that Trojan Source has highlighted.

On the day of the Trojan Source disclosure, Petr Viktorin posted a draft of an informational Python Enhancement Proposal (PEP) to the python-dev mailing list. He noted that the Python security response team had reviewed the report and "decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language". He agreed with that decision, in part because there are plenty of other kinds of "gotchas" in Python (and other languages), where readers can be misled—purposely or not.

But there is a need to document these kinds of problems, both for Python developers and for the developers of tools to be used with the language, thus the informational PEP. After some adjustments based on the discussion on the mailing list, Viktorin created PEP 672 ("Unicode-related Security Considerations for Python"). It covers the Trojan Source vulnerabilities and other potentially misleading code from a Python perspective, but, as its "informational" status would imply, it is not a list of ways to mitigate the problem. "This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind."

ASCII

It starts by looking at the ASCII subset of Unicode, which has its own, generally well-known, problem spots. Characters like "0" and "O" or "l" and "1" can look the same depending on the font; in addition, "rn" may be hard to distinguish from "m". Fonts designed for programming typically make it easier to see those differences, but human perception can sometimes still be outwitted:

However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name accessibi1ity_options can still look indistinguishable from accessibility_options, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in responsbility_chain_delegate.

Beyond that, the ASCII control codes play a role. For example, NUL (0x0) is treated by CPython as an end-of-line character, but editors may display things differently. Even if the editor highlights the unknown character, putting a NUL at the end of a comment line might be easily misunderstood, as the following example shows:

[...] displaying this example:
# Don't call this function:
fire_the_missiles()
as a harmless comment like:
# Don't call this function:⬛fire_the_missiles()

Backspace, carriage return (without line feed), and escape (ESC) can be used for various visual tricks, particularly when code is output to a terminal. Python allows more than just ASCII in its programs, however; Unicode is legal for identifiers (e.g. function, variable, and class names) as described in PEP 3131 ("Supporting Non-ASCII Identifiers"). But, as PEP 672 notes: "Only 'letters and numbers' are allowed, so while γάτα is a valid Python identifier, 🐱 is not." In addition, non-printing control characters (e.g. the bidirectional overrides used in one of the Trojan Source vulnerabilities) are not allowed in identifiers.
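
The identifier rules quoted above can be checked from Python itself with str.isidentifier(). This is a minimal sketch; the dictionary labels and sample strings are only illustrations, written as escapes so the code points are unambiguous:

```python
# Sketch: ask Python which strings it accepts as identifiers.
candidates = {
    "gata_greek": "\u03b3\u03ac\u03c4\u03b1",  # the Greek word for cat
    "cat_emoji": "\U0001F431",                 # the cat-face emoji
    "bidi_override": "a\u202eb",               # 'a', RIGHT-TO-LEFT OVERRIDE, 'b'
}

results = {label: text.isidentifier() for label, text in candidates.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the Greek word should be reported as a valid identifier.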

Homoglyphs

But the other Trojan Source vulnerability relates to "homoglyphs" (or "confusables" as the PEP calls them). Characters in various alphabets can be similar or the same as those in other languages: "For example, the uppercase versions of the Latin b, Greek β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В, respectively." That can lead to identifiers that look the same, but are actually different; there are other oddities as well:

Additionally, some letters can look like non-letters:

  • The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ is a Python identifier, not a string.
  • The East Asian word for ten looks like a plus sign, so 十= 10 is a complete Python statement. (The “十” is a word: “ten” rather than “10”.)
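
One way to tell such look-alikes apart is to ask the standard unicodedata module for their names. The sketch below is only an illustration (the variable names are mine, not the PEP's); it uses escapes for the three capital letters and the two letter-like characters from the list above:

```python
import unicodedata

# Three capital letters that many fonts draw identically.
lookalikes = ["\u0042", "\u0392", "\u0412"]  # Latin B, Greek Beta, Cyrillic Ve
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# All three are distinct code points, so they make distinct identifiers.
distinct = len(set(lookalikes)) == 3

# Letters that resemble punctuation are still valid identifier characters.
okina_name = "\u02bbHello\u02bb"  # looks like a quoted string
ten = "\u5341"                    # looks like a plus sign
print(okina_name.isidentifier(), ten.isidentifier())
```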

Though there are symbols that look like letters in another language, symbols are not allowed in Python identifiers, obviating the reverse problem. Another surprising aspect might be in the conversion of strings to numbers in functions such as int() and float(), or even in str.format():

Some scripts include digits that look similar to ASCII ones, but have a different value. For example:
>>> int('৪୨')
42
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
five

The second example uses the indexing feature of str.format() to pick the Nth value from its arguments; in that case, the value of the number is five, even though it looks vaguely like a zero. Then there are the confusions that can arise from bidirectional text.
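
To see what value a digit from another script actually carries, unicodedata.digit() can be consulted before trusting how the character looks. A small sketch, with the digits from the examples above written as escapes (the variable names are illustrative):

```python
import unicodedata

# Digits from other scripts, written as escapes so the code points are unambiguous.
bengali_four = "\u09ea"
oriya_two = "\u0b68"
arabic_indic_five = "\u0665"

for ch in (bengali_four, oriya_two, arabic_indic_five):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.digit(ch)}")

# int() accepts decimal digits from any script.
value = int(bengali_four + oriya_two)
print(value)
```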

Bidirectional text

The presence of identifiers written in right-to-left scripts changes the way code is displayed, though not the way CPython interprets it, which may be puzzling to those who are used to left-to-right ordering; the Unicode bidirectional algorithm determines how such code is displayed. For example:

In the statement ערך = 23, the variable ערך is set to the integer 23.

The example above might be clear enough from context for someone reading it who is used to reading left-to-right text, but another of the PEP's examples takes things further:

In the statement قيمة - (ערך ** 2), the value of ערך is squared and then subtracted from قيمة. The opening parenthesis is displayed as ).

Another extended example gets to the heart of the Trojan Source bidirectional vulnerability. It starts by showing the difference a single right-to-left character makes in a line of code, then looks at the effects of the invisible Unicode code points that change or override directionality within a line.

Consider the following code, which assigns a 100-character string to the variable s:
s = "X" * 100 #    "X" is assigned

When the X is replaced by the Hebrew letter א, the line becomes:

s = "א" * 100 #    "א" is assigned

This command still assigns a 100-character string to s, but when displayed as general text following the Bidirectional Algorithm (e.g. in a browser), it appears as s = "א" followed by a comment.

[...] Continuing with the s = "X" example above, in the next example the X is replaced by the Latin x followed or preceded by a right-to-left mark (U+200F). This assigns a 200-character string to s (100 copies of x interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as s = "x" followed by an ASCII-only comment:

s = "x‏" * 100 #    "‏x" is assigned

Readers who normally use left-to-right text may find it interesting to paste some of that code into Python or to try working with it in an editor; the behavior is not intuitive, at least for me. The uniname utility may be useful for peering inside to see the code points.
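
For readers without uniname at hand, a few lines of Python can do similar peering. This is a rough stand-in rather than a replacement for that tool; the describe() helper is a hypothetical name introduced here:

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# The string literal from the example: Latin x followed by a right-to-left mark.
unit = "x\u200f"
s = unit * 100

print(len(s))  # counts code points, including the invisible marks
for line in describe(unit):
    print(line)
```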

There are other Unicode code points that affect directionality, but the effects of all of them terminate at the end of a paragraph, which is usually interpreted as the end of line by various tools, including Python. Using those normally invisible code points can have wide-ranging effects as seen in Trojan Source and noted in the PEP:

These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python's comments always extend to the end of a line), but it doesn't render them harmless.
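
A tool that wants to flag these code points could lean on the bidirectional class that unicodedata reports for each character. The following is a sketch of that idea under my own assumptions, not something the PEP prescribes; find_bidi_controls() and the sample text are hypothetical, and line numbers follow str.splitlines(), which may split on more characters than Python's tokenizer does:

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
EXPLICIT_BIDI_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(source):
    """Return (line, column, name) for each explicit bidi control in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits

# Hypothetical source text with an override in a string and an isolate in a comment.
sample = 'ok = 1\nrole = "user\u202e"  # \u2066 check\n'
hits = find_bidi_controls(sample)
for hit in hits:
    print(hit)
```

The right-to-left and left-to-right marks (U+200F, U+200E) have ordinary strong bidi classes, so a stricter checker would need to list them separately.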

Normalization

Another topic covered in the PEP is the normalization of Unicode code points for identifiers. In Unicode, there are often several different ways to generate the same "character"; using Unicode equivalence, it is possible to normalize a sequence of code points to produce a canonical representation. There are four ways to do so, however; Python uses NFKC to normalize all identifiers, but not strings, of course.

There are some interesting consequences stemming from that, which can also be confusing. For example, there are multiple variants of the letter "n" in Unicode, several in mathematical contexts, all of which are normalized to the same value, leading to oddities like:

>>> xⁿ = 8
>>> xn
8

In a followup message, Paul McGuire posted a particularly graphic demonstration of how a simple program can be transformed into something almost unreadable via normalization. Treating strings differently means that functions like getattr() will behave differently than a lookup done directly in the code. An example from the PEP (ab)uses the equivalence of the ligature "ﬁ" (U+FB01) with the two-letter string "fi" to demonstrate that:

>>> class Test:
...     def ﬁnalize(self):
...         print('OK')
...
>>> Test().ﬁnalize()
OK
>>> Test().finalize()
OK
>>> getattr(Test(), 'ﬁnalize')
Traceback (most recent call last):
  ...
AttributeError: 'Test' object has no attribute 'ﬁnalize'
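
The same normalization can be applied by hand with unicodedata.normalize(), which is one way code doing string-based lookups could match what the compiler does for identifiers. A minimal sketch; the variable names are illustrative and the ligature and superscript are written as escapes:

```python
import unicodedata

class Test:
    def finalize(self):
        return "OK"

ligature_name = "\ufb01nalize"  # starts with U+FB01 LATIN SMALL LIGATURE FI
superscript_name = "x\u207f"    # x followed by SUPERSCRIPT LATIN SMALL LETTER N

# NFKC is the normalization form Python applies to identifiers.
normalized = unicodedata.normalize("NFKC", ligature_name)
print(normalized == "finalize")
print(unicodedata.normalize("NFKC", superscript_name))

# Strings are not normalized, so the raw ligature name is not found...
print(hasattr(Test(), ligature_name))
# ...but normalizing it first finds the method the compiler defined.
print(getattr(Test(), normalized)())
```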

Similarly, using the import statement to refer to a module in the code will normalize the identifier, but using importlib.import_module() with a string does not. Beyond that, various operating systems and filesystems also do normalization; "On some systems, ﬁnalization.py, finalization.py and FINALIZATION.py are three distinct filenames; on others, some or all of these name the same file."

Reaction

The reaction to the PEP has been quite positive overall, as might be expected. There were some questions about whether it should be a PEP or part of the standard documentation. For now, Viktorin is content to keep it as a PEP, but thinks it may make sense to integrate it into the documentation at some point; "I went with an informational PEP because it's quicker to publish", Viktorin said.

The conversation also turned toward changes that could be made to Python to help avoid some of the problems and ambiguities that arise. Several suggestions that might seem to be reasonable at first blush are too heavy-handed, likely resulting in too many false positives or, even, effectively banning some common languages (e.g. Cyrillic), as Serhiy Storchaka pointed out. For the most part, it was agreed that these kinds of problems should be detected by linters and other tools that can be configured based on the project and its code base. There may be some appetite for disallowing explicit ASCII control codes in strings and comments, however, as Storchaka suggested:

All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth to ban them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human readable, there are no [reasons] to include control characters in them. There is a precedence of emitting warnings for some superficial escapes in strings.
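
A linter-style check along those lines is short to write. The sketch below is my own illustration of the idea, not code from the thread; find_control_chars() is a hypothetical helper, and it only looks at the ASCII C0 range:

```python
# Control characters that the message above lists as acceptable.
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f"}

def find_control_chars(source):
    """Return (offset, code point) for each disallowed C0 control character."""
    return [
        (offset, f"U+{ord(ch):04X}")
        for offset, ch in enumerate(source)
        if ord(ch) < 0x20 and ch not in ALLOWED_CONTROLS
    ]

# Hypothetical input: a NUL hidden in a comment and a backspace after a statement.
sample = "# Don't call this function:\x00fire_the_missiles()\n\tx = 1\x08\n"
found = find_control_chars(sample)
print(found)
```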

As can be seen here and in the PEP, Viktorin provided a whole cornucopia of things for Python developers of various stripes to consider. While the exercise was motivated by the Trojan Source vulnerabilities, the "problems" are more widespread. There is a fine line between supporting various writing systems used by projects worldwide and discovering oddities—malicious or otherwise—in a particular code base. Developers of tools targeting the Python ecosystem will find much of interest in the PEP.



Trojan Source and Python

Posted Nov 17, 2021 8:26 UTC (Wed) by smurf (subscriber, #17840) [Link]

In fact, some parts of this PEP are applicable to many other programming languages.

There's only one language I can think of that is likely to have even more problems along these lines: Perl6.

Trojan Source and Python

Posted Nov 17, 2021 10:25 UTC (Wed) by ale2018 (guest, #128727) [Link] (38 responses)

Microsoft Word used to have (or maybe still has, it's quite some time I don't use it) a macro language that was translated in the international editions. In Italian versions, if became se. Italian macros wouldn't run on English versions of Word, and vice-versa. Some companies reacted by requiring all branches install only original English versions of any Microsoft software.

Does it make sense to use non-English words for identifiers and comments? Some programmers find it useful at times. For example, they say they can easily spot which parts of the code were customized locally. I stopped doing so after I found Hebrew comments that I couldn't understand. In a globalized world, using non-English words hampers a program's maintainability.

Non-English has to be used in strings, of course. However, such usage is normally relegated to a few special files. In that case, routinely running uniname -a on all imported software files looks like a practical precaution. Thanks to Jake for recalling that utility.

Trojan Source and Python

Posted Nov 17, 2021 11:10 UTC (Wed) by mbunkus (subscriber, #87248) [Link] (34 responses)

It absolutely makes sense to be able to use your native language in comments. There are millions of products out there that are country-specific or company-specific and will never, ever require internationalization or an international development team. Requiring all of those to use a foreign language that a lot of developers probably don't speak well is… a rather strange notion to me. The whole world doesn't speak English (or any other language that can be expressed solely in ASCII).

Trojan Source and Python

Posted Nov 17, 2021 18:24 UTC (Wed) by mfuzzey (subscriber, #57966) [Link] (33 responses)

>There are millions of products out there that are country-specific or company-specific and will never, ever require internationalization or an international development team

Famous last words. Companies have a habit of buying other companies or expanding into new markets; what may be true today could change tomorrow.
I remember the pain when the company I work for (in France) bought a German company and got a load of source code with German comments and variable names...

> The whole world doesn't speak English (or any other language that can be expressed solely in ASCII).

Completely agree, for the *general population*

> foreign language that a lot of developers probably don't speak well

But this is specifically about *developers*, most of whom will be used to reading language / library / framework documentation in English since that is often all that exists (or is up to date).
Furthermore it's only written English (within a fairly narrow technical domain) that is really required to write code / comments, no need to actually speak it (which is harder)

Trojan Source and Python

Posted Nov 17, 2021 18:37 UTC (Wed) by mbunkus (subscriber, #87248) [Link] (32 responses)

If you really think that the earth's developer population in general overwhelmingly speaks very good English, I can only shake my head at you. I'm in Germany and know a lot of developers whose English is worse than mediocre, who have real difficulties explaining themselves in English, even if it's written & they have a lot of time. And this is Germany, which is a pretty anglophile country, not, let's say, Russia or China.

You're also completely missing the point, especially with your "within a fairly narrow technical domain". Someone else brought up bookkeeping, which is one country-specific thing I was thinking about. Tax law, any other law-related software will always be so incredibly specific to the country it's built for that there isn't a point in translating all that highly, highly domain-specific and completely non-technical material. They need to talk about country-specific laws in terms that are mostly unfamiliar to the developers (they haven't studied law, after all), and you bet they will not know the correct translation into English & back — if there even is an English equivalent in the first place. Requiring English comments for these types of software is… I only have swear words for it.

Being able to understand Qt's or Python's reference documentation has nothing on the ability to talk about Ehegattensplitting, for example, for which there doesn't seem to be an English term, and dictionaries tend to describe it with something like "tax system in which husband and wife each pay income tax on half the total of their combined incomes". Yeah… no.

Trojan Source and Python

Posted Nov 17, 2021 21:16 UTC (Wed) by Wol (subscriber, #4433) [Link]

> Ehegattensplitting, for example, for which there doesn't seem to be an English term,

Because the English system is nowhere near as sensible, and treats a married couple where one partner earns £60K *very* *differently* from an almost identical couple earning £30K each.

After all, the "marginal rate of tax" ("tax" as in "deductions from gross income" and not what is officially/legally called "tax") is well over 100% for those unlucky enough to be struggling to find work ... When I managed to find a second day's work back when I was struggling a few years ago, I was lucky. It was only 75%.

Cheers,
Wol

Trojan Source and Python

Posted Nov 17, 2021 22:14 UTC (Wed) by Nahor (subscriber, #51583) [Link] (30 responses)

> If you really think that the earth's developer population in general overwhelmingly speaks very good English

I don't know where you got that "very good" English from, nobody said that. One doesn't need to be "very good", just adequate (although being good is recommended of course).

> I'm in Germany and know a lot of developers whose English is worse than mediocre

And maybe that should be seen as a problem with the developer or the hiring process. Developers are never hired just because they are good at writing code, they need to be good at analyzing issues, working with a team, communicating, ... English ought to be another skill with a minimum requirement for developers. There is virtually no chance that your developers won't need English at some point or another (documentation, stackoverflow, tips&tricks blogs, ...).

> Tax law, any other law-related software will always be so incredibly specific to the country it's built ...
> ...
> Being able to understand Qt's or Python's reference documentation has nothing on the ability to talk about Ehegattensplitting

It doesn't matter what the software does, or where it's used. What matters is that your software uses Qt or Python and so your developers must be able to understand the documentation. If you have a problem with your code, you'll likely search for hints and pointers online, in a vastly English world.
What also matters is who is writing and maintaining said software; it's not uncommon to temporarily contract foreign companies/developers to speed up the development, or because of their specialized knowledge (e.g. databases or writing similar software for other countries).

Which then leads back to mfuzzey's comment: "Famous last words". If you've been able to get away without English, you've been extremely lucky and/or writing simple enough software in a niche enough market that you didn't need external help.

Trojan Source and Python

Posted Nov 17, 2021 23:43 UTC (Wed) by Wol (subscriber, #4433) [Link] (28 responses)

> It doesn't matter what the software does, or where it's used. What matters is that your software uses Qt or Python and so your developers must be able to understand the documentation. If you have a problem with your code, you'll likely search for hints and pointers online, in a vastly English world.

Given that I'm a native English speaker, and regularly find the official documentation incomprehensible, I very much doubt that's true!

They will likely search for hints and pointers online, in their own native-language world.

Cheers,
Wol

Trojan Source and Python

Posted Nov 18, 2021 0:48 UTC (Thu) by Nahor (subscriber, #51583) [Link] (27 responses)

> They will likely search for hints and pointers online, in their own native-language world.

That's beside the point. Nobody claimed that documentation was never translated, or that websites/blogs are never in non-English languages.

The question is: is it sufficient? The "pro-English" claim is that eventually, it will come back to English:
- Sooner or later (and usually the first), you'll have to look for English answers because you won't find them in your native language.
- In all likelihood, your code will eventually be read by non-native developers.

As a developer, not being fluent in English, or writing code in one's native language is bound to become a problem eventually, thus, IMHO, ought to be avoided.

Trojan Source and Python

Posted Nov 18, 2021 15:59 UTC (Thu) by kleptog (subscriber, #1183) [Link] (26 responses)

> The question is: is it sufficient? The "pro-English" claim is that eventually, it will come back to English:

I think this underestimates the strongest filter bubble there is: Google returns results in your native language. It doesn't support results across multiple languages very well at all. So when you start searching for stuff it's going to stick to your local language.

Sure, good developers will probably have their search language set to English to get good results, but a large number of developers are going to stick to their own language unless there's no alternative. Or read auto-translated versions of everything.

> As a developer, not being fluent in English, or writing code in one's native language is bound to become a problem eventually, thus, IMHO, ought to be avoided.

This is completely unrealistic. Maybe 18% of the world population speaks english and all those others want to program computers as well. Chances are we're going to get (or already have) multiple large development communities in different languages which don't interact very much.

Trojan Source and Python

Posted Nov 18, 2021 18:08 UTC (Thu) by Nahor (subscriber, #51583) [Link] (25 responses)

> I think this underestimates the strongest filter bubble there is: Google returns results in your native language. It doesn't support results across multiple languages very well at all. So when you start searching for stuff it's going to stick to your local language.

That's not a good faith discussion. I never said that coders won't search in their native language, or that it wouldn't be the first choice even. All I said is that they will eventually have to deal with English.
As for your comment, Google does allow switching to a different language when doing a search. It even asks you if it can't find enough good results to your query. And even if it didn't, it's not like Google is the only source of info either.

> Maybe 18% of the world population speaks english

And what is the population of coders who speak English? That's a more relevant metric.

Trojan Source and Python

Posted Nov 19, 2021 9:08 UTC (Fri) by Wol (subscriber, #4433) [Link] (24 responses)

> That's not a good faith discussion. I never said that coders won't search in their native language, or that it wouldn't be the first choice even. All I said is that they will eventually have to deal with English.

Why? If it's a half-way decent and popular system they're using, there will be plenty of decent native language docs.

If you're going to throw comments like "not good faith" around, I don't think assuming non-English-speaking developers are incompetent at looking after themselves is good faith either.

Let's put it very bluntly. Most documentation is crap (at teaching people how to USE software). My experience is very much "the documentation only makes sense AFTER you already know what it's trying to say". In other words you NEED the self-help forums. And once you've learnt to use them, why bother with the official site?

Oh - and if we look just at Europe and America, I would guess the number of people who speak Spanish/Portuguese/Italian natively outnumbers the number of people who speak English - by a LONG way.

Cheers,
Wol

Trojan Source and Python

Posted Nov 19, 2021 9:28 UTC (Fri) by geert (subscriber, #98403) [Link] (13 responses)

> Oh - and if we look just at Europe and America, I would guess the number of people who speak Spanish/Portuguese/Italian natively outnumbers the number of people who speak English - by a LONG way.

But nobody else speaks Italian ;-)

Seriously, I doubt your statement.
https://en.wikipedia.org/wiki/List_of_languages_by_number...

Trojan Source and Python

Posted Nov 19, 2021 12:54 UTC (Fri) by Wol (subscriber, #4433) [Link] (6 responses)

I'm lumping those three languages together because, like many European languages, they are mostly mutually comprehensible. Stick a Spaniard, Italian, and Portuguese in the same room and they would probably be able to communicate, albeit with some difficulty.

Likewise the Germanic nations - Holland, Germany, Austria, Denmark etc. I've even heard tales of British soldiers being able to communicate with Germans using Platt-Deutsch!

So if we add the populations of Italy, Spain, Portugal and Romania together (have I missed any other Latin-speaking European nations?), that's probably not far off the population of the US. And then we throw South America into the mix ... (And how many USians speak Spanish first, English second?)

Cheers,
Wol

Trojan Source and Python

Posted Nov 19, 2021 13:05 UTC (Fri) by geert (subscriber, #98403) [Link]

You've missed the francophones.

Trojan Source and Python

Posted Nov 23, 2021 9:09 UTC (Tue) by anton (subscriber, #25547) [Link] (4 responses)

Dialects of German are often mutually incomprehensible. As an Austrian I certainly don't understand anyone speaking Platt (a German dialect) or Dutch (not a German dialect). I actually even have problems with some of the more extreme cases of Bavarian (my native dialect family) in Austria, never mind Alemannic (another dialect family spoken in Austria).

As for Denmark, Danish is a Northern Germanic language, so even further away than Platt or Dutch.

We do have Standard German as Lingua Franca in Germany and Austria and much of Switzerland, however, and it is the written form of German in any case, so wrt. comments there are no language problems between German speakers. But I typically do not understand Dutch comments.

Trojan Source and Python

Posted Nov 23, 2021 18:25 UTC (Tue) by mpr22 (subscriber, #60784) [Link] (3 responses)

Platt, Dutch, and Standard German are each from a different major branch of West Germanic, and have diverged by a sufficient amount, over a sufficiently long period, that calling the first two "dialects" (whether of German or of each other) is, by many reasonable standards for "dialect" vs. "language", unambiguously wrong :)

Trojan Source and Python

Posted Nov 24, 2021 15:30 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

Or we have the opposite, two different languages which have converged, namely Scots and English :-)

Cheers,
Wol

Trojan Source and Python

Posted Nov 24, 2021 16:45 UTC (Wed) by pizza (subscriber, #46) [Link] (1 responses)

What's funny is that Scots and English share a common ancestor (namely "Old English") but diverged considerably after the grand Norman adventure of 1066.

Trojan Source and Python

Posted Nov 25, 2021 12:48 UTC (Thu) by Wol (subscriber, #4433) [Link]

Even funnier is that Scots is actually Anglish, while English is Saxon ...

Cheers,
Wol

Trojan Source and Python

Posted Nov 19, 2021 13:04 UTC (Fri) by Wol (subscriber, #4433) [Link] (3 responses)

Just taken a look at your link ...

English is in third place BEHIND SPANISH.

Then add Italian and Portuguese together, and just those two together run English fairly close ...

So for every native English speaker there are about TWO native Romance speakers ... (I'm only including those languages that are close and in part at least mutually comprehensible, ie I've left out French which is also classified Romance).

Then of course "Written Technical Romance" probably has a much higher mutual comprehensibility than the basic language ...

Cheers,
Wol

Trojan Source and Python

Posted Nov 19, 2021 17:11 UTC (Fri) by micka (subscriber, #38720) [Link] (1 responses)

Sorry, I know I was preemptively excluded as a french, so my opinion is probably rejected, but I don't really expect italian speaking persons and spanish persons to understand each other that easily.
What I know is that french and occitan are not really mutually intelligible for example. And I sure don't know what to do with written italian text, not even mentioning spoken italian.

Trojan Source and Python

Posted Nov 19, 2021 19:18 UTC (Fri) by rghetta (subscriber, #39444) [Link]

Well, an italian can often understand enough spanish (and vice versa) for basic needs, such as asking for directions, food and accommodation. Even for a bit of small talk, but nothing more. A technical discussion certainly not.

Trojan Source and Python

Posted Nov 19, 2021 18:03 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

To be fair, this list is probably more relevant to this discussion: https://en.wikipedia.org/wiki/List_of_languages_by_total_...

When you sort by L2 (non-native speakers), the top 5 are:

1. English
2. Standard Arabic (excl. dialects)
3. Hindi (excl. Urdu)
4. Mandarin Chinese (incl. Standard Chinese, excl. other Chinese)
5. French

There is room for argument over where lines should be drawn, and in particular the decision to exclude most dialects probably inflates the L2 counts for more fragmented languages to some extent (i.e. it counts people who maybe "should" be L1 as L2 instead). Standard Arabic is listed as having no L1 (native) speakers at all, because it's not a vernacular language, so perhaps it should not be on the list, and we can similarly argue over whether Chinese should be one language or several. However, this inflation is actually useful, as it serves to illustrate the context in which language learning primarily occurs.

My read of this: Other than English, people are mainly learning (second) languages and/or dialects which are useful within their part of the world, rather than (say) learning Latin or some other European language just because they can. Learning a language is hard work, and it's not something the average person is going to do unless there is substantial value in doing it. English happens to be valuable enough to hit that threshold, even on a global scale, but people who are not engaged in international trade and commerce have less reason to learn it.

I *do not* agree with the broader assertion that all programming "should" be in English. I am simply providing data.

Trojan Source and Python

Posted Dec 3, 2021 16:01 UTC (Fri) by phanser (guest, #60087) [Link] (1 responses)

these numbers are really strange: no more french speakers than french population, let me laugh

Trojan Source and Python

Posted Dec 3, 2021 19:06 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

> no more french speakers than french population, let me laugh

It's counting the number of people with French as their *first* language, not everyone who speaks French. That figure would not include the entire population of France: Though French is the only official language in France, a 1999 survey only counted about 86% of the population in Metropolitan France where the French language was predominantly or exclusively spoken in their home before the age of five (i.e. as their "mother tongue"). That number might have increased somewhat in the past 20 years, but there would still be some fraction of those born there who learned another language before French, plus immigrants from non-French-speaking countries.

Trojan Source and Python

Posted Nov 19, 2021 12:49 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

> Oh - and if we look just at Europe and America, I would guess the number of people who speak Spanish/Portuguese/Italian natively outnumbers the number of people who speak English - by a LONG way.

There's probably a large number of native Spanish/Portuguese/Italian speakers who are computer programmers and know enough English to be able to puzzle out documentation, code comments, and programmer support web sites. That number is likely to be considerably larger than the number of native Spanish/Portuguese/Italian speakers who are computer programmers and are comfortable writing accurate and grammatical English prose for documentation, code comments, and programmer support web sites.

Requiring people everywhere in the world to be fluent authors of English prose (never mind fluent speakers of English) as a prerequisite to doing computer programming is a silly notion. There are sure to be people with very good programming skills whose English is so-so, and who's going to tell them they can't be programmers?

Trojan Source and Python

Posted Nov 19, 2021 14:13 UTC (Fri) by farnz (subscriber, #17727) [Link]

On top of that, machine translation isn't completely dire these days. It's not great, and I wouldn't want to stake my life on it, but it's good enough that with school level Classical Greek, Latin, and French (not fluent in anything other than English), and a basic grasp of the Cyrillic alphabet, I can use machine translation for all the EU languages (including Bulgarian) and not only get the sense of what's written but also be able to spot where the machine translation is leading me astray.

More incredibly, to me, I can use two separate machine translations (Bing and Google, as examples) for Taiwanese, Korean or Japanese, and end up either understanding the sense of what's written, or being aware that I'm getting a bad translation - and I've had that confirmed by native speakers of those three languages. In all three cases, I'm aware that I'm not getting a good translation, but it's good enough that I can follow along with (say) a Twitter thread, or a news article. I believe this also works for other languages, but those are the three I've had native speakers check for me.

Assuming that machine translation to other languages is as good as machine translation to English, I can well believe that there are entire communities out there of developers who can't read, write or speak English at all, but who can use machine translation to have a one-way sharing of code from the English speaking world into their community. And, of course, by the nature of such communities, it'd be surprising if I ever heard about them: because the high paid tech world is currently English language dominated and open to quite a lot of migration, if your English gets good enough to tell me about such a community, you're probably leaving it behind anyway.

About the only place I can think of that could support such a community in a form where people don't abandon it as soon as their English is good enough to leave is China, but without some ability to read Chinese myself, I'm not in a position to see whether there is a top-flight Chinese language development community, or if the best Chinese developers learn enough English to emigrate to a high paid tech job outside China.

Trojan Source and Python

Posted Nov 19, 2021 18:30 UTC (Fri) by Nahor (subscriber, #51583) [Link] (7 responses)

"Why"?

Because most native-language speakers are NOT software developers, so saying things like "the number of people who speak Spanish/Portuguese/Italian natively outnumbers the number of people who speak English" is meaningless. It doesn't matter if a farmer, or an insurer, or a taxi driver, ... understands English or not (well, the taxi driver should but for other reasons). Using that as a reason to support native languages in code only serves to confuse the issue.

Because it doesn't matter if a language is "native" or not, what matters is if a language is understood by coders or not. Most coders are not native-English speakers, I agree with that, but the vast majority do understand English just fine.



And let's not forget that the initial issue is about Unicode characters. All European languages (as far as I know at least) can be written just fine using basic ASCII characters. While some of those languages have characters that don't have an ASCII version, like accented ones, they all have an ASCII fallback, e.g. German "ß" is "ss", or French "é" is "e". And even some non-European languages also have Romanized versions, e.g. Japanese.

I still claim that English should be the way to write code, but if you're dead set on supporting native languages, then European languages would still be supported to some degree by ASCII only programming languages.

Trojan Source and Python

Posted Nov 22, 2021 19:08 UTC (Mon) by khim (subscriber, #9252) [Link] (6 responses)

> Most coders are not native-English speakers, I agree with that, but the vast majority do understand English just fine.

Depends on the country. Most coders in Egypt or India would know English very well, because software is usually taught in English there; without knowledge of English you wouldn't be able to become a coder.

The situation is entirely the opposite in Russia: not only do most programmers not know English, the most popular Russian book-keeping program doesn't have an English interface, doesn't have English documentation and its scripting language doesn't even have English keywords! Then why would someone who writes code for it know English?

> All European languages (as far as I know at least) can be written just fine using basic ASCII characters.

Have you, somehow, moved across universes? In my universe Greece and Greek alphabet were in Europe before Latin (which forms basis for ASCII) even existed.

Trojan Source and Python

Posted Nov 22, 2021 20:15 UTC (Mon) by smurf (subscriber, #17840) [Link]

Plus most European languages need non-ASCII characters. Umlauts, accents, diphthongs, inverted exclamation/question marks, interesting capitalization rules (Turkish: their uppercase i is not I but İ, and of course there's an ı-without-a-dot to go with the I), interesting sorting rules …
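The Turkish casing point can be checked from a Python prompt. The following is only a small sketch of what Python's built-in, locale-independent case mappings do with the dotted and dotless i; the variable names are made up for illustration.

```python
# Python's str.upper()/str.lower() apply Unicode's default case mappings and
# take no locale into account, so Turkish i/I pairs do not round-trip.
dotted_capital = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # ı  LATIN SMALL LETTER DOTLESS I

lowered = dotted_capital.lower()

# The lowercase form is two code points: "i" plus COMBINING DOT ABOVE.
print([f"U+{ord(c):04X}" for c in lowered])

# Dotless ı uppercases to plain ASCII "I", which lowercases to dotted "i".
print(dotless_small.upper() == "I")
print(dotless_small.upper().lower() == dotless_small)
```

A Turkish reader would expect "i" to uppercase to "İ" and "I" to lowercase to "ı"; the built-in methods do neither, so code that needs those rules has to bring its own locale-aware mapping.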

Trojan Source and Python

Posted Nov 24, 2021 15:27 UTC (Wed) by Wol (subscriber, #4433) [Link]

What about Cyrillic? Which can be ascii'ised, but at what cost?

Pretty much ALL the slavic languages, ime, have what was described to me in my russian classes as "false friends" - those letters for which the syntax - the graphical symbol - exists in both the local and American alphabet, but for which the semantics - the letter itself - are completely different.

The cyrillic H is a roman N.
The cyrillic C is a roman S.
The Polish L-bar is I believe a roman W.

Etc etc. It really pisses me off when I see fake Russian written in the roman alphabet mis-using cyrillic symbols ...

You're really asking for trouble trying to transliterate that lot.

Cheers,
Wol

Trojan Source and Python

Posted Nov 27, 2021 21:04 UTC (Sat) by Nahor (subscriber, #51583) [Link] (3 responses)

> The situation is entirely the opposite in Russia: not only do most programmers not know English, the most popular Russian book-keeping program doesn't have an English interface, doesn't have English documentation and its scripting language doesn't even have English keywords!

I've never argued for English to replace native language for everything. We are talking about source code and comments in them. So a UI language is irrelevant. That doesn't affect the language used to write comments, and variable names.

> Then why would someone who writes code for it know English?

The whole article is about using Unicode to mix different languages. If your scripting language is in Russian, there is indeed no point in using English. People using that script will need to know Russian, so every comment and variable can be in Russian.
The issue at hand is when people are mixing spoken languages in their code, when the source code accepts RTL and LTR characters, or accepts homoglyphs. If you use a Russian scripting language, the IDE/compiler should require only Russian characters.

>> All European languages (as far as I know at least) can be written just fine using basic ASCII characters.
> Have you, somehow, moved across universes?

Fine, read it as "Most European languages" if that makes you feel better. That doesn't change the crux of the argument.

Trojan Source and Python

Posted Nov 28, 2021 9:26 UTC (Sun) by mpg (subscriber, #70797) [Link] (1 responses)

> If you use a Russian scripting language, the IDE/compiler should require only Russian characters.

I'm afraid that's unrealistic. That would prevent, for example, the inclusion of URLs in comments, because https is written with Latin characters. Same if you want to read or write files with some usual extension such as .csv, which also requires Latin characters. So, Latin characters should be allowed at least in comments and string literals.

I think it's important for those of us whose native language is written using a Latin script to keep in mind that we're the only ones who can (and will, unless we take explicit steps like learning a non-Latin language) live happily in a mono-script environment. Speakers of languages using other writing systems just don't have that option if they want to use a computer / the internet, as you can't do that without Latin characters. (Same goes with directionality: speakers of LTR languages are the only ones who can live happily in a mono-directional environment, speakers of RTL languages just don't have that luxury.)

There's a fundamental asymmetry here, which like most others, is all too easy to ignore when we're on the privileged side of it.

Trojan Source and Python

Posted Dec 2, 2021 0:56 UTC (Thu) by khim (subscriber, #9252) [Link]

> Speakers of languages using other writing systems just don't have that option if they want to use a computer / the internet, as you can't do that without Latin characters.

Even before internet. BESM-6 predates it, it doesn't support lowercase yet supported two scripts. And no, back then, back in 1968 they had no need to think about dumb users. Only very knowledgeable personnel worked with computers back then.

Trojan Source and Python

Posted Dec 2, 2021 0:42 UTC (Thu) by khim (subscriber, #9252) [Link]

> That doesn't affect the language used to write comments, and variable names.

Seriously? Last time I checked, variable names taken from Google Translate are much worse than native ones.

At least native ones can be understood by a native speaker. The ones from Google Translate can't be understood by anyone. You can name them i1, i2, i3 and it wouldn't make the code any less comprehensible.

Comments fare a bit better, but ultimately don't clarify anything.

I think you haven't read what I wrote correctly. Developers don't know English. Not users. Developers.

> If you use a Russian scripting language, the IDE/compiler should require only Russian characters.

That wouldn't work either. They don't understand English but they do understand math. And math uses latin script usually.

For them, programs with all these fn and for are not English, but math. And while would be read like this.

> That doesn't change the crux of the argument.

It does if you recall that Europe is minority of Earth population. And not even that important in the IT industry.

Trojan Source and Python

Posted Nov 18, 2021 12:58 UTC (Thu) by jschrod (subscriber, #1646) [Link]

For many people, being able to read English documentation (passive vocabulary) is quite different to being able to write understandable English comments or even understandable English documentation (active vocabulary).

Since you equate these two, you don't understand the problem, IMNSHO.

Trojan Source and Python

Posted Nov 17, 2021 11:14 UTC (Wed) by dtlin (subscriber, #36537) [Link]

AppleScript is another programming language that had different "dialects", with different keywords to match different human languages, and the Japanese dialect even changed the grammar to better match Japanese. According to https://www.cs.utexas.edu/~wcook/Drafts/2006/ashopl.pdf, these would compile to the same program:

English: the first letter of every word whose style is bold
Japanese: スタイル=ボールドであるすべての単語の最初の文字 (roughly in order, style=bold be all of word of first of letter)

OpenOffice is a well-known large project that was commented in a non-English language (German). LibreOffice has largely translated the comments to English, but as far as I know that hasn't been done in AOO. Does it make sense in other cases? It probably depends… but as Python is often used as an educational language I think comments matching the local language make sense; students only need to learn 50-ish English keywords instead of English as a whole to get started.

At $WORK project, most non-ASCII characters are indeed in a few specific files holding translated strings. But there's quite a few exceptions: we make use of an external library that takes $ to get around the fact that $ is a special character with sometimes tricky escaping, format("• %s", thing) happens in quite a few places and is more readable than format("\u2022 %s", thing), and the BiDi control characters are directly used in a utility that simulates an RTL locale using LTR languages (for ease of testing RTL locales by people who only know LTR languages). The constants are in escaped form, but the unit test uses Unicode RLO/RLM/PDF directly, so that it actually *looks* as you'd expect when reviewing it.

Trojan Source and Python

Posted Nov 17, 2021 16:06 UTC (Wed) by khim (subscriber, #9252) [Link]

> Does it make sense to use non-English words for identifiers and comments?

Sometimes. I know one example where it's the right thing to do: book-keeping software. Rules which you have to follow are written in some local language, they are changing often and it's, essentially, hopeless to even try to translate these without insane amount of coordination (if you only know English then here is one example: Expanded memory vs Extended memory… these are two very different things — but words “expanded” and “extended” have the exact same list of possible translations to Russian… when you try to translate foreign law into English you often have the same problem).

And it's usually cheaper to find a local guy who can understand the law as it's written (and can talk to local lawyers who often don't know English) instead of trying to translate everything to English.

Trojan Source and Python

Posted Nov 20, 2021 11:03 UTC (Sat) by kreijack (guest, #43513) [Link]

> Microsoft Word used to have (or maybe still has, it's quite some time I don't use it)
> a macro language that was translated in the international editions. In Italian versions, if became se.
> Italian macros wouldn't run on English versions of Word, and vice-versa.
> Some companies reacted by requiring all branches install only original English versions of any Microsoft software.

:) Yes, I remember how hard it was to write macros in Italian....
I am happy to inform you that VBA macros are now written in English.

> Does it make sense to use non-English words for identifiers and comments?

I would treat differently identifiers and comments:
Regarding the identifiers, my opinion is that it should be mandatory to write the identifiers in simple ascii (_a-zA-Z0-9). Preferably in English [*]. To be clear 'caffè' is (IMHO) unacceptable as identifier (however caffe, without accent makes my eyes bleed; so please never use caffe in an identifier)

For comments, the compiler/language should allow any Unicode character. Instead the editor should warn if what is shown could be interpreted differently by the compiler (highlighting the homoglyphs and the bidirectional text outside the comment).

However, I understand companies that ask that the internal documentation (and so even the program source) be written in English.

[*] When I use the term English, I am referring to a "very basic" technical English as the most common language. Let me put it another way: using very high level English for comments is a problem too.

Some programmers find it useful at times. For example, they say they can easily spot what parts of code were customized locally. I stopped doing so after I found Hebrew comments that I couldn't understand. In a globalized world, using non-English words hampers a program's maintainability.

Non-English has to be used in strings, of course. However, such usage is normally relegated to a few special files. In that case, routinely running uniname -a on all imported software files looks like a practical precaution. Thanks to Jake for recalling that utility.
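A check along the lines suggested above can be sketched in a few lines of Python. This is only an illustration, not what any particular editor or uniname does: the function name `find_bidi_controls` and the sample string are invented here, and the check covers only the explicit embedding, override, and isolate controls (it ignores the LRM/RLM marks and does not look at homoglyphs at all).

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override, and isolate
# controls (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                        "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(text):
    """Return (line, column, character name) for each bidi control in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical snippet with a right-to-left override hidden inside a string.
sample = 'access = "user"\nif access != "admin\u202e":\n    pass\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

A tool built on this would still have to decide where such characters are legitimate, for example in comments or translated strings written in an RTL language.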

Trojan Source and Python

Posted Nov 17, 2021 11:17 UTC (Wed) by intgr (subscriber, #39733) [Link] (9 responses)

I think Python's behavior in many cases is irresponsibly sloppy.

> For example, NUL (0x0) is treated by CPython as an end-of-line character

There should be no legitimate reason why a NUL character should appear in a text file, it should be disallowed. But treating it as a newline is a whole new level of silly.

> >>> int('৪୨')
> 42

Is there *really* anybody who finds this behavior of accepting non-Arabic numerals as digits useful? Do people in other countries interact with information systems in non-Arabic numerals? Shouldn't str(intvalue) then also produce localized numbers? I suspect this partial behavior is extremely rarely useful, but very frequently dangerous to everyone.

This also affects regular expressions, the \d escape also allows non-Arabic numerals. So validation regular expressions that are sound in other languages will accept nonsense in Python.

I'm sure this creates lots of security vulnerabilities. You don't want Tibetan numerals in IP addresses, Tamil numerals in social security numbers, etc. This can mess up uniqueness checks in databases, break integrations with other systems, etc.
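The regular expression behavior described above is easy to reproduce, and the `re` module does offer ways to opt out of it. A short sketch, assuming the two characters in the quoted example are BENGALI DIGIT FOUR (U+09EA) and ORIYA DIGIT TWO (U+0B68); the pattern names are only for illustration:

```python
import re

# Assumed to be the same two characters as in the int() example quoted above.
mixed = "\u09ea\u0b68"

unicode_digits = re.compile(r"\d+")            # default for str patterns
ascii_digits = re.compile(r"\d+", re.ASCII)    # restricts \d to [0-9]
explicit_range = re.compile(r"[0-9]+")         # spells the range out

print(int(mixed))
print(bool(unicode_digits.fullmatch(mixed)))   # True: \d accepts any Unicode decimal digit
print(bool(ascii_digits.fullmatch(mixed)))     # False
print(bool(explicit_range.fullmatch(mixed)))   # False
```

Validation patterns copied from other languages are therefore safer in Python when they use `[0-9]` or pass `re.ASCII`.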

Trojan Source and Python

Posted Nov 17, 2021 16:23 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

First: I'm 99% sure you have mixed Latin numerals (0123456789) and Arabic numerals (٠١٢٣٤٥٦٧٨٩). I would assume you meant Latin numerals in your text.

> Do people in other countries interact with information systems in non-Arabic numerals?

Yes. Well… I can only say about actual Arabic numerals since I have just returned from Egypt. Yes, you would see ١٢٣.٥ in your bill, not 123.5.

Sadly Uber would say that the car which is supposed to pick you up has the number 1234 when it would, actually, have ١٢٣٤ on its license plate. Took me a few days to adjust.

But yeah, it's usually all-or-nothing. You never want to accept a mix of several numeral systems. Large shops in Egypt may actually switch their system to print 123.5 in your bill, but they would never print 1٢3.٥
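The "all-or-nothing" rule can be enforced before handing a string to `int()`. The sketch below is one possible approach, not an established recipe: the helper names are invented, and it relies on the assumption that Unicode decimal digits (category Nd) are encoded in contiguous runs of ten, so that subtracting a digit's value from its code point identifies the numeral system it belongs to.

```python
import unicodedata


def digit_blocks(text):
    """Return the set of 'zero' code points of the decimal digits in text."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # Code point minus digit value gives the code point of that
            # numeral system's zero (assumes contiguous 0..9 runs).
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def parse_single_script_int(text):
    """Parse an integer, rejecting digits drawn from more than one system."""
    if len(digit_blocks(text)) > 1:
        raise ValueError(f"mixed numeral systems in {text!r}")
    return int(text)


print(parse_single_script_int("1234"))
print(parse_single_script_int("\u0661\u0662\u0663\u0664"))  # ١٢٣٤
try:
    parse_single_script_int("1\u06623\u0665")                # 1٢3٥
except ValueError as exc:
    print("rejected:", exc)
```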

Trojan Source and Python

Posted Nov 17, 2021 17:30 UTC (Wed) by intgr (subscriber, #39733) [Link]

> First: I'm 99% sure you have mixed Latin numerals (0123456789) and Arabic numerals (٠١٢٣٤٥٦٧٨٩).

I'm 99% sure you're confusing numerals and alphabets. The Latin alphabet/script is the system of letters used by most western languages, but there are no "Latin numeral" digits.

0123456789 are commonly called Arabic numerals or Western Arabic numerals: https://en.wikipedia.org/wiki/Arabic_numerals

٠١٢٣٤٥٦٧٨٩ are Eastern Arabic numerals: https://en.wikipedia.org/wiki/Eastern_Arabic_numerals

Trojan Source and Python

Posted Nov 17, 2021 16:24 UTC (Wed) by dtlin (subscriber, #36537) [Link] (6 responses)

Python is not alone, Java's Integer.parseInt("৪୨") == 42 as well. Raku is at least consistent, with both +"৪୨" == 42 and ৪୨ == 42.

But in general String → Int cannot be assumed to be an invertible operation. Many languages will convert at least one of "042", "052", "0o52", or "0x2a" to 42. IP uniqueness by string has never been the case, you will find that `ping 0177.1` works exactly the same as `ping 127.0.0.1`.
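For Python specifically, the non-invertibility is visible without leaving ASCII. A small sketch (the dictionary is just a convenient way to line the cases up):

```python
# Several distinct strings that Python's int() maps to the same integer.
spellings = {
    "42": int("42"),
    "042": int("042"),          # leading zeros are accepted in base 10
    " 42\n": int(" 42\n"),      # surrounding whitespace is stripped
    "4_2": int("4_2"),          # underscores between digits (Python 3.6+)
    "0o52": int("0o52", 0),     # base 0 honours literal prefixes
    "0x2a": int("0x2a", 0),
}

for text, value in spellings.items():
    print(repr(text), "->", value)

# With base 0 the string must look like a Python literal, so "042" is refused.
try:
    int("042", 0)
    base0_leading_zero = "accepted"
except ValueError:
    base0_leading_zero = "rejected"
print(base0_leading_zero)
```

Any uniqueness check therefore has to run on a canonical form, such as the parsed integer or `str()` of it, not on the raw input string.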

Trojan Source and Python

Posted Nov 17, 2021 17:20 UTC (Wed) by intgr (subscriber, #39733) [Link] (2 responses)

> But in general String → Int cannot be assumed to be an invertible operation.

Sure. My point is not about invertibility, but that if you want your app to be localized with non-Arabic digits, the built-in library is inadequate anyway. I think it's inconsistent and not actually useful to allow non-Arabic digits as input, but offer no way to convert integers into strings in non-Arabic systems.

> Java's Integer.parseInt("৪୨") == 42 as well.

At least java.util.regex.Pattern defines \d to be equivalent to [0-9] so it's not infected with the same regex issue.

> IP uniqueness by string has never been the case

There are lots of validation regexes floating around that allow only IPv4 addresses in normalized forms, like https://www.oreilly.com/library/view/regular-expressions-... Expecting uniqueness with these makes sense.

The non-normal forms of IPv4 are used so rarely, it makes more sense to treat them as input mistakes.

Trojan Source and Python

Posted Nov 17, 2021 17:36 UTC (Wed) by dtlin (subscriber, #36537) [Link] (1 responses)

> but offer no way to convert integers into strings in non-Arabic systems.

They do, though. Java: NumberFormat.getInstance(locale).format(integer) or String.format(locale, "%d", integer)

> java.util.regex.Pattern defines \d to be equivalent to [0-9]

Not if UNICODE_CHARACTER_CLASS is given.

> There are lots of validation regexes floating around that allow only IPv4 addresses in normalized forms

And doesn't use \d so it isn't relevant.

Trojan Source and Python

Posted Nov 18, 2021 15:14 UTC (Thu) by edgewood (subscriber, #1123) [Link]

> And doesn't use \d so it isn't relevant.

I found one example using \d on the first page of a DDG search, then stopped looking. Raymond Chen of Microsoft no less.

I was actually surprised that there weren't more examples on the first page, but one's enough.

Trojan Source and Python

Posted Nov 19, 2021 23:16 UTC (Fri) by guus (subscriber, #41608) [Link] (1 responses)

Even GCC allows this, albeit with a warning, see: https://godbolt.org/z/cboahT7rh. Clang returns an error though.

Trojan Source and Python

Posted Nov 22, 2021 12:27 UTC (Mon) by smurf (subscriber, #17840) [Link]

Umm, no. Completely different meaning of "int" here.

Also 'x', with x not being one single character / codepoint / what-have-you, is not legal C.

Trojan Source and Python

Posted Nov 20, 2021 20:34 UTC (Sat) by flussence (guest, #85566) [Link]

Since you mention Raku here, it's worth adding that it *does* have data types for storing numbers with a preserved string representation (IIRC it was originally needed for round tripping command line parameters correctly):

given «0o52» {
   dd $_;  # IntStr.new(42, "0o52")
   dd +$_; # 42
   dd ~$_; # "0o52"
}

Things like IntStr.new(0x7F_00_00_01, '127.0.0.1') also work as you'd expect, though I'm not aware of any user-space code actively using it.

Trojan Source and Python

Posted Nov 18, 2021 4:17 UTC (Thu) by flussence (guest, #85566) [Link] (5 responses)

I'm curious what use there could be for allowing the ASCII form-feed in code, in this day and age…

The only time I've seen that appear recently is in raw archival copies of RFCs, and it didn't exactly work as intended there in, well, anything.

Trojan Source and Python

Posted Nov 18, 2021 9:40 UTC (Thu) by anselm (subscriber, #2796) [Link] (3 responses)

> I'm curious what use there could be for allowing the ASCII form-feed in code, in this day and age…

It's a form of whitespace, and a fairly harmless one at that. If you want to crack down on whitespace in code that is actually a nuisance, start with the ASCII TAB.

Trojan Source and Python

Posted Nov 18, 2021 10:36 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Bah. I'd rather have source code formatted with all-tabs (for start-of-line code indents) instead of the current craze to enforce all-spaces-only. That'd at least allow me to change it via my editor setting when e.g. a nontrivial function demands less visual clutter.

In any case, Python3 sensibly forbids mixed tabs-and-spaces indents, so that's one problem it doesn't have. As opposed to C (or Perl or …), where confusing indents can cause (or hide) bugs easily, no Unicode LTR overrides required.

if (A)
    if (B)
        c()
else
    d()

is just the least convoluted example of many along these lines. GCC warns about this one AFAIR, but it's fairly easy to construct cases where it doesn't.
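Python 3's rejection of inconsistent tab/space indentation, mentioned above, can be observed with `compile()`. A minimal sketch; the source string is a made-up example in which one line of a block is indented with a tab and the next with eight spaces:

```python
# Second line is indented with a tab, third with eight spaces. Python 2
# treated these as the same indentation level; Python 3 refuses to guess.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<example>", "exec")
    result = "accepted"
except TabError as exc:
    result = f"rejected: {exc.msg}"

print(result)
```

`TabError` is a subclass of `IndentationError`, so the problem is reported at compile time, before any of the code runs.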

Trojan Source and Python

Posted Nov 18, 2021 15:29 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

> Bah. I'd rather have source code formatted with all-tabs (for start-of-line code indents) instead of the current craze to enforce all-spaces-only. That'd at least allow me to change it via my editor setting when e.g. a nontrivial function demands less visual clutter.

Unfortunately, patches break this because the tab "shrinks" for indented lines, but lines without leading tabs get shifted over by one space, which breaks the visual alignment between lines with and without tabs at the front and makes the diff harder to review. I also dislike the use of tabs for "any 8 spaces" rather than "tab to indent, spaces to align" since that removes the only benefit of tabs: being able to change the visual indentation size (which seems to be how most tab-using projects actually work AFAIK).

For example, take this line[1]. The last change for this line does not render the alignment properly in the diff viewer (either terminal or web): <https://github.com/git/git/commit/ad464a4e84b502fdfd4671f...>. The line after this should have no tabs, but instead just be spaces because it has zero indentation, but needs to be aligned with the `(` on the previous line. Unfortunately, I know of no editor that actually helps out with this instead of blindly treating "tab size spaces in a row? use a tab".

[1] https://github.com/git/git/blob/cd3e606211bb1cf8bc57f7d76...

Trojan Source and Python

Posted Nov 20, 2021 0:32 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

If you use tabs, you have to have consensus on how many spaces one tab is worth (so that you can wrap at 80, or 100, or 120, or whatever made up number you wrap at). If you use spaces, you have to have consensus on how many spaces go into an indent. IMHO, those two problems are functionally equivalent, and frankly this is an incredibly silly thing to argue over in the first place, so my preference is "whatever the prevailing style guide says." PEP 8 is a "spaces not tabs" style, so I use spaces in my Python code.

Trojan Source and Python

Posted Nov 18, 2021 17:12 UTC (Thu) by jem (subscriber, #24231) [Link]

Emacs recognizes Ctrl-L as a page delimiter, and provides commands to move over pages: forward-page (C-x ]) and backward-page (C-x [).

Trojan Source and Python

Posted Nov 18, 2021 16:26 UTC (Thu) by pixnion (subscriber, #83415) [Link]

I've found it very useful to highlight any non-ordinary ascii bytes in vim, since basic ascii characters are the only ones I expect in source code. In other words, UTF-8 etc. will be highlighted as well. Obviously not a very international way of doing things, but it works for me.

This vimrc line does the trick: match Error "[^\x09\x20-\x7E]"

It happens that I sometimes accidentally end up adding some weird control character due to some random keystroke, and the highlighting helps me detect it quickly.
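The same character class can be reused outside the editor, for instance in a pre-commit check. This is only a sketch of that idea: the function name and the sample text are made up, and the pattern is a direct transcription of the vim one above (anything other than TAB and printable ASCII is flagged; newlines are handled by splitting the text into lines first).

```python
import re

# Same class as the vim pattern: flag everything except TAB and 0x20..0x7E.
SUSPECT = re.compile(r"[^\x09\x20-\x7E]")


def suspicious_characters(text):
    """Return (line, column, code point) for each flagged character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in SUSPECT.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits


# Hypothetical input: a no-break space typed by accident instead of a space.
sample = "x = 1\ny\u00a0= 2\n"
print(suspicious_characters(sample))
```

As with the vim rule, this also flags every legitimate non-ASCII character, so it only suits code bases that are meant to be ASCII-only.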

Trojan Source and Python

Posted Nov 19, 2021 23:20 UTC (Fri) by guus (subscriber, #41608) [Link]

The people standardizing Unicode have put some thought into what subset of Unicode characters is suitable for programming languages, see for example https://www.unicode.org/reports/tr31/.

Erk. Unicode normalisation.

Posted Nov 28, 2021 22:21 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

Haven’t we already seen in PHP that applying Unicode transformations to program source code is a Bad Idea?


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds