Debian opens a can of username worms

By Joe Brockmeier
December 5, 2024

It has long been said that naming things is one of the hard things to do in computer science. That may be so, but it pales in comparison to the challenge of handling usernames properly in applications. This is especially true when multiple applications are involved, and they are all supposed to agree on what characters are, and are not, allowed. The Debian project is facing that problem right now, as two user-creation utilities disagreed about which names are allowable. A plan is in place to sort this out before the release of Debian 13 ("trixie") sometime next year.

The useradd utility is part of the shadow-utils project, which includes programs for managing user and group accounts. The shadow-utils suite is included in Debian's passwd package. For historical reasons, and to avoid confusion with the upstream project, Debian's version of the shadow-utils sources is often referred to as "src:shadow".

Most Debian users don't work with useradd, or groupadd, directly. Instead, Debian has long supplied its own adduser (and addgroup) utilities, originally written by founder Ian Murdock. These act as simpler front ends to useradd and use Debian-supplied system defaults for creating users' home directories and configurations. It should be noted that useradd, et al., have become much more full-featured since Debian's utilities were introduced, but the project continues to maintain them nonetheless.

Little Bobby Tables

In June, Debian developer and src:shadow maintainer Chris Hofstaedtler filed a bug against the adduser package. The src:shadow package had dropped a Debian-specific patch, originally introduced in 2003 by Karl Ramm, to allow characters far beyond what were allowed by the upstream shadow-utils project. In the patch, Ramm wrote:

I can't come up with a good justification as to why characters other than ':'s and '\0's should be disallowed in group and usernames (other than '-' as the leading character). Thus, the maintenance tools don't anymore.

Hofstaedtler said that he had puzzled out some of the patch's purpose from old bug reports that had been "fixed" by the patch, and those asked for two things not allowed by the upstream shadow-utils: usernames with upper-case characters or that are purely numeric. Hofstaedtler said that upper-case names had been allowed in the upstream shadow-utils project "a long time ago", but it seemed like a bad idea to allow purely numeric usernames.

The patch enabled much more than upper-case and purely numeric names, though. With the patch dropped in version 1:4.15.2-2 of the shadow source package, one of adduser's tests—which explicitly allowed a username reminiscent of a famous xkcd comic ("bob;>/hacked")—had failed:

For src:shadow, I would really like to not have a divergence from upstream in this regard. I think if we have clear requirements then we (I) can submit them upstream and I would expect upstream to accept patches.

I do feel that making the case for "bob;>/hacked" would be very hard.

Hofstaedtler said that the patch had been reapplied for the time being and was included again in version 1:4.15.2-3, but he asked if username requirements could be sorted out in time for the Debian "trixie" release. If the patch were dropped entirely, then useradd would restrict usernames to the POSIX standard, with the exception of allowing a "$" character at the end of a username.

Debian developer and adduser maintainer Marc Haber replied in late October that other tests were failing as well, and thought that "useradd upstream is being too picky here". Since adduser depends on useradd, it could not create users that useradd would reject, so he said he would like to synchronize on what would be allowed or not.

As part of the research into what should be allowed in usernames, Haber took over Debian's UserAccounts wiki page, which outlines Debian's username tools and policies, and started looking into whether the project should relax its requirements around usernames.

Limits on usernames

One of the questions that bubbles up when looking at usernames is not just allowable characters, but the allowable length of the username. The documentation for shadow-utils does not specify a length for usernames or what encoding is being used.

However, in order to be portable between systems, the POSIX standard says that usernames should not include non-ASCII characters. The standard says that usernames should be "composed of characters from the portable filename character set". That set comprises the digits 0 through 9, upper-case and lower-case "a" through "z", the period (.), underscore (_), and hyphen (-). It also specifies that usernames should not begin with a hyphen.

It is, however, possible to assign characters outside that set with the tools at hand. But Linux distributions usually put up some guardrails in the adduser and useradd configurations to prevent administrators from creating usernames with non-ASCII characters unintentionally. These configurations can be overridden with adduser's --allow-bad-names option or useradd's --badname option.
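The POSIX rule described above can be written down as a short check. The following is a minimal Python sketch; the function name is hypothetical, and it is not how useradd or adduser implement their validation (it ignores, for example, the trailing "$" that useradd tolerates).

```python
import re

# POSIX portable filename character set: A-Z, a-z, 0-9, '.', '_' and '-'.
# The first character may be anything from the set except the hyphen.
_PORTABLE_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_posix_portable_username(name: str) -> bool:
    """Hypothetical helper: True if name uses only the portable set."""
    return _PORTABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["kris", "firstname.lastname", "-dash",
                      "bob;>/hacked", "\u00e9mollier"]:
        print(f"{candidate!r}: {is_posix_portable_username(candidate)}")
```

Note that this sketch accepts purely numeric names and names containing periods, both of which the portable set permits even though individual tools may reject them by default.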

In November, Haber posted a message on debian-devel that he had "opened an especially nasty can of worms" and was finding that things were more complicated than he had understood. He sought input and opinions on a number of questions about whether Debian should allow non-ASCII characters for usernames, how to do that if so, and if it was more appropriate to document username guidance in Debian's Policy Manual rather than its wiki. His suggestion was to allow UTF-8 for regular user accounts, but to restrict to ASCII for system accounts created by Debian packages.

Richard Lewis asked if enabling UTF-8 would open the door to "some of the abuse described" in a 2021 LWN article about flaws in Unicode handling that led to security exploits. He said that it seemed to be a bad idea to make the change, even if it would be nicer for users to have the option.

Haber said that he was not sure if it would be dangerous to allow UTF-8 usernames, "since we can expect other commands to gracefully handle a byte stream, can't we?" Additionally, local administrators already can loosen restrictions to allow UTF-8 usernames, but Debian does not test for such use cases. Debian would become "more robust" if it assumed UTF-8 characters would be used in usernames. "Vulnerabilities that could be exploited by having non-ascii user names are already here and present today, just not uncovered yet."

It would be reasonable, Timo Röhling said, to mitigate possible homograph attacks by disallowing mixed alphabets "such as cyrillic and latin letters in the same name". Haber said that was not going to help if a user could directly write to /etc/passwd, and he was unwilling to implement that himself in adduser. He would accept code and test cases written by others, though.
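A mixed-alphabet check of the kind Röhling suggested might be sketched as follows in Python. The standard library does not expose the Unicode script property, so this sketch uses the first word of each character's Unicode name as a rough stand-in for its script. That shortcut and the function names are assumptions made for illustration; nothing like this exists in adduser.

```python
import unicodedata


def scripts_used(name: str) -> set[str]:
    """Crude script guess: first word of each letter's Unicode name."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # "LATIN SMALL LETTER A" -> "LATIN",
            # "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixes_scripts(name: str) -> bool:
    """True if the letters in name appear to come from more than one script."""
    return len(scripts_used(name)) > 1


if __name__ == "__main__":
    latin = "admin"
    spoofed = "\u0430dmin"  # first letter is CYRILLIC SMALL LETTER A
    print(scripts_used(latin), mixes_scripts(latin))
    print(scripts_used(spoofed), mixes_scripts(spoofed))
```

A check like this only catches names that mix scripts. A name written entirely in lookalike letters from a single script would pass, and, as Haber pointed out, it does nothing for names written to /etc/passwd by other means.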

Keyboards

Security concerns aside, there are other practical problems with supporting non-ASCII usernames. Étienne Mollier noted that he had "one weird enough" character in his first name that posed a problem if he had to log in using a keyboard layout that lacked the capability to transcribe the lower-case or upper-case 'e' acute characters ("é" or "É"). For that reason, he said, he felt better about keeping a full ASCII username and "wouldn't feel strongly if unicode support for login never happens". But it would be good if the gecos field of the passwd file had proper Unicode support to properly display users' real names.

Not only was it difficult to type "é" on some keyboards, it could also be encoded in multiple ways. Gioele Barabucci pointed out that it could be "e with acute", which is encoded in Unicode as U+00E9, or it could be "e, combined with an [acute] accent", which would be U+0065 plus U+0301:

If a keyboard input system provides the former sequence of bytes, but the username is stored in the login infrastructure using the latter sequence of [bytes], then a naive comparison will not find the user "émollier" in the system. Unicode defines in Annex 15 a few normalization forms as a way to work around this problem. But a correct use of these normalization forms still requires coordination and standardization among all programs accessing the data.

He asked if POSIX or other standards provided a normalization form for UTF-8 encoded usernames. Peter Pentchev responded that POSIX said to stick to the portable filename character set to ensure portability. Haber argued that it should be up to local admins to decide whether they wanted their local user database to be portable. "I don't think that we should restrict local admins who don't need that kind of portability."
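The two encodings Barabucci described can be compared directly. This short Python sketch uses the standard unicodedata module to show that a naive comparison fails while a comparison after normalization succeeds. NFC is chosen here only for illustration; as the discussion makes clear, no standard mandates a particular form for usernames.

```python
import unicodedata

precomposed = "\u00e9mollier"   # U+00E9, "e with acute"
decomposed = "e\u0301mollier"   # U+0065 followed by U+0301

# The strings render identically but differ in code points and bytes.
print(precomposed == decomposed)            # False
print(len(precomposed.encode("utf-8")))     # 9
print(len(decomposed.encode("utf-8")))      # 10

# After normalizing both to the same form, they compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                       # True
```

The comparison only works if every program that reads or writes the name applies the same normalization, which is the coordination problem Barabucci raised.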

Simon McVittie recommended that Debian consider adopting systemd's user name syntax and concepts of "strict mode" and "relaxed mode". The systemd tooling adheres to a strict naming convention when creating usernames, but it has a relaxed convention for accepting usernames created by other tools. McVittie said that seemed like a good principle for Debian to follow, even if its specific rules might differ from systemd's.

Haber seemed to agree in part, but said systemd's strict mode was "even stricter than what we currently allow for system accounts", and he did not like that systemd's policies (especially with systemd-homed, which LWN covered recently) were not configurable.

This time it's personal

The discussion, perhaps not surprisingly, brought out some strong feelings about how names and usernames were represented. Especially when, as Hofstaedtler noted, usernames can be important to some users:

I see and type my username hundreds times a day, people use it to address me in written and spoken conversations with it, etc.

If it were my uid, which I see maybe once a week and don't have to remember, I wouldn't care.

Indeed, it's not uncommon in open-source communities or within organizations to use a person's username rather than their given name—so it is unsurprising that some people feel strongly that usernames should be composed of a wider range of characters than POSIX recommends. Others dislike the practice of conflating usernames with real-world names, and see little reason to go to any trouble to go beyond ASCII.

Johannes Schauer Marin Rodrigues supported allowing more than ASCII in usernames. He said it would be good for Debian to put pressure on other projects to provide Unicode support. "We cannot find these kind of bugs if we accept translating everybody's given name to the American alphabet." Bálint Réczey, though, asked that Debian avoid opening that can of worms and imposing needless work on upstreams. "Keep what works reasonably well for decades."

A plan

Haber initially seemed bullish on allowing UTF-8 usernames in Debian "as a courtesy to those people who need non-ascii user names to write their name" and as an opportunity to find "bugs that are already here" in Debian's software. He acknowledged that it is late in the development cycle for trixie. But, since it was currently possible to create usernames with UTF-8 characters, he did not want to tighten restrictions in trixie versus Debian 12, only to revisit those restrictions for Debian 14. In a reply to Mollier he wondered about what advice to give in Debian's documentation "once we have decided to officially allow UTF-8 login names".

On December 3, however, Haber said that he "finally understood" that UTF-8 support would require more than the ability to create a UTF-8 encoded username and write it to /etc/passwd. Homograph characters, such as U+00E9 (é) and U+0065 plus U+0301 (é), could be used with adduser to create two separate users with lookalike usernames:

At the least, adduser should reject creating étienne if étienne already exists - those are different user names but look the same, and if you don't cut-and-paste user names instead of typing them you're bound to hit the wrong user depending on HOW you type and what input medium you use. Not good.

Haber said that he was the only active developer working on adduser and did not have time to implement a check against lookalike usernames in time for the trixie release. Worse, he said, the Perl module that he would use (Unicode::Precis) was not packaged for Debian and had not had a release in more than five years.
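The kind of check Haber describes could look roughly like the following Python sketch. It uses NFKC normalization from the standard library as a stand-in for the PRECIS rules that a module such as Unicode::Precis would apply; that substitution, the function names, and passing the existing names in as a list are all assumptions made for illustration, not a description of adduser.

```python
import unicodedata


def canonical(name: str) -> str:
    """Reduce a name to one normalized spelling (NFKC, as an assumption)."""
    return unicodedata.normalize("NFKC", name)


def collides(new_name: str, existing_names: list[str]) -> bool:
    """True if new_name normalizes to the same string as an existing name."""
    wanted = canonical(new_name)
    return any(canonical(existing) == wanted for existing in existing_names)


if __name__ == "__main__":
    existing = ["\u00e9tienne"]                 # stored with U+00E9
    print(collides("e\u0301tienne", existing))  # same name, decomposed form
    print(collides("etienne", existing))        # plain ASCII, a different name
```

Normalization alone handles the two spellings of "é" but not lookalike letters from different scripts, which is part of why the full problem is larger than it first appears.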

The next version of adduser, Haber said, would reject UTF-8 usernames by default. They would still be allowed when using the --allow-bad-names option, but he said he wanted to deprecate that option name in favor of something that doesn't use the word "bad". The --allow-all-names option will continue to pass everything verbatim to useradd.

Mollier thanked Haber for his work on the problem, and suggested some alternatives to the bad names option. Barabucci also thanked Haber for taking the time to research the issue, to which Haber replied dryly, "I have learned many things."

Haber's current course of action for adduser seems the most prudent. There may be a day when it is more practical to expand the allowed characters for usernames, but the work required to do so right now is far greater than the benefits that users would gain in the process.

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 5, 2024 17:22 UTC (Thu) by isotopp (subscriber, #99763) [Link] (6 responses)

Login Names in Unix have always been very restricted, they are not just part of 'ls' output, but also are transported in some tar formats as owners, are automatically parts of mail addresses and have other, not cataloged use-cases.

If you allow utf-8 here, and relax length restrictions, it is unclear and unknowable what will happen downstream with other applications.

If you want to login as 'Kristian Köhntopp', it is probably useful to have an LDAP like name canonicalization mechanism that does a lookup to get a unix username and then tries the password with that. Anything else is very likely to break unexpected things.

In my personal opinion, even a --badnames option is wrong.

Or you go, and actually perform the work to define a username format for Unix (not just Linux), catalog use-cases and make sure that they actually work with full UTF-8, and whatever relaxed length limit you define. And then be prepared to handle a login with クリス (kurisu) instead of kris.

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 6, 2024 9:33 UTC (Fri) by kleptog (subscriber, #1183) [Link] (2 responses)

I've gotten used to using --allow-bad-names all the time at my work because the default prohibits periods (".") and our company uses them in usernames as firstname.lastname.

Though I've now checked the docs and apparently it's possible to change the allowed characters in the configuration file so maybe that's a better approach for ansible deployed machines.

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 6, 2024 19:28 UTC (Fri) by raven667 (subscriber, #5198) [Link]

A place I used to work was purchased by a company which had standardized on "Firstname Lastname" in their AD for username/samAccountName so when we started joining the Unix (Mac/Linux) machines to their AD with winbind I was surprised with how few things actually broke, "/home/First Last/Documents" was OK in most GUI tools, but there were definitely things that broke. Eventually they did re-standardize on usernames without spaces, but I was surprised it worked at all.

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 12, 2024 12:29 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

On old systems this was a valid command

$ chown user.group somefile

OK, somewhere along the Red Hat line, chown started being more picky and insisting on user:group, but even so, might there be legacy boxes out there sharing usernames via some centralized system?

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 6, 2024 10:21 UTC (Fri) by smurf (subscriber, #17840) [Link]

Heh. I had to troubleshoot a system with a UTF8ified username last year. Let me tell you, the number of programs that format columns by counting bytes is, umm, truly staggering.

Anyway. A little bit of safety should be in everybody's interest, i.e. no mixed-charset names, and use some normal form to check for existing usernames.

Writing the above sentences is significantly easier than implementing them. While cyrillic vs. greek definitely is a problem, but latin vs. CJK? not so much IMHO. Normalize to exactly which normal form using which version of the Unicode standard? What do I do on the console, type \U4E52\U4E53 instead of 乒乓? what if my username is "🧪420"?

On the other hand … I never type my username anyway. When logging in on the GUI I click on my avatar, when connecting to a remote system with SSH or whatever it's the default, and a fresh text-only console login is easy because there the username is "root". 😎
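The column-formatting problem mentioned at the top of this comment is easy to reproduce. The Python sketch below contrasts padding by UTF-8 byte count with padding by character count; both helper names are hypothetical and are not taken from any real tool.

```python
def pad_by_bytes(text: str, width: int) -> str:
    """Pad as a byte-counting program would: width measured in UTF-8 bytes."""
    missing = width - len(text.encode("utf-8"))
    return text + " " * max(missing, 0)


def pad_by_chars(text: str, width: int) -> str:
    """Pad by counting code points instead of bytes."""
    return text.ljust(width)


if __name__ == "__main__":
    for user in ["kris", "k\u00f6hntopp"]:
        # The byte-padded column comes out one cell short for the second name.
        print(pad_by_bytes(user, 12) + "|")
    for user in ["kris", "k\u00f6hntopp"]:
        print(pad_by_chars(user, 12) + "|")
```

Counting code points is still only an approximation of display width: combining marks occupy no column and many CJK characters and emoji occupy two.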

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 6, 2024 19:14 UTC (Fri) by rgb (guest, #57129) [Link] (1 responses)

> Not only was it difficult to type "é" on some keyboards, it could also be encoded in multiple ways.
I think that says it all. Unicode is made to display text, not to create IDs.

Anything but POSIX portable filename set with a conservative length restriction is dangerous

Posted Dec 6, 2024 19:35 UTC (Fri) by raven667 (subscriber, #5198) [Link]

I think you are right here and restricting the username to a limited subset of bytes that existing tools don't have any trouble interpreting and displaying makes sense, but the GECOS field definitely should be extended to support full UTF-8 encoded names as a courtesy and to be friendly to actual humans and their real written names they want to use. Having machine-readable username/uid/gid(s) distinct from a human display name makes sense and changes the requirements quite a bit where having two different encodings for "é" or "🧪420" isn't really a problem that needs to be solved.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 5, 2024 18:52 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (14 responses)

Maybe this is my Anglophone chauvinism speaking, but you can already set an arbitrary human-readable display name in the GECOS field, and most login GUIs prefer to display that name over the username when both are available. Is it really critical to allow non-ASCII characters in the username itself? How many people are trying to log into a command line environment *and* cannot type in ASCII?

Doesn't the GECOS field already cover some of this use case?

Posted Dec 5, 2024 20:57 UTC (Thu) by zeha (subscriber, #61580) [Link] (13 responses)

> Maybe this is my Anglophone chauvinism speaking

Yes.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 5, 2024 22:00 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

As someone speaking several languages with non-Latin alphabets, sometimes it makes sense to stick to ASCII. Otherwise, you're just setting yourself for a world of pain. Imagine entering Chinese text on a terminal in text mode.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 9, 2024 9:28 UTC (Mon) by taladar (subscriber, #68407) [Link] (2 responses)

Or imagine dealing with users with a mix of Chinese, Thai, Japanese, Cyrillic,... usernames on the same system in your audit logs.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 9, 2024 17:53 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

That's not the worst. Unbalanced right-to-left switches are.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 10, 2024 6:39 UTC (Tue) by pvaneynd (subscriber, #898) [Link]

It's even worse than that. You need to know which language a text is in to know which fonts to use to display it.
The main cause of this is the https://en.wikipedia.org/wiki/Han_unification in unicode, which maps different Chinese, Korean, Japanese and Vietnamese characters to the same unicode code point. So the whole "let's just use UTF-8" isn't remotely enough :(.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 6, 2024 14:32 UTC (Fri) by khim (subscriber, #9252) [Link] (8 responses)

I would say it's “yes” and “no”, simultaneously.

I have meet a lot of people who simply don't know English well enough to type name in ASCII.

Unfortunately the majority of them I have meet when they cried on various forums about how unfair it is that they “have only just used Cyrillic (Arabic, Farsi, etc) name” – and now have so many broken programs they couldn't even count them all.

Yes, it's deeply anglophonic, yes, it's unfair, true, people genuinely suffer if your force that on them…

But the experience says that it's still better for them to lean 1 (one) English world (their account name) once then suffer through innumerable programs that don't support any other names properly.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 6, 2024 21:46 UTC (Fri) by epk (guest, #174765) [Link]

I must sadly approve of this answer.

And it's not as though a non-Latin-alphabet username would really help that much, since so much text - especially in path names and URLs - is in English. There is, however, the full name of each user, and I'm guessing that should be much easier to have non-Latin UTF-8 in. And for non-computer-literate users who need a lot of hand-holding, they might actually see mostly/only their full names.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 7, 2024 10:18 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

I would like to reiterate that I was specifically thinking of logging in at the console, since (as I mentioned) the GUIs are already displaying the GECOS field (which supports full Unicode) in an interactive picker, at least under most reasonable configurations. I do not understand how you're going to get very far at the command line if you can't type in ASCII.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 7, 2024 10:59 UTC (Sat) by khim (subscriber, #9252) [Link]

You wouldn't. Obviously people who couldn't type ASCII wouldn't ever do (and don't plan to do) anything in the command line.

That's fine, the majority of computer users don't ever use command line and are not interested in the command line (many don't even know it exists).

But even for them using non-ASCII letters in the $HOME is PITA. Simply because programs stop working – and changing $HOME temporarily brings another layer of pain.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 18, 2024 21:34 UTC (Wed) by ssmith32 (subscriber, #72404) [Link]

>for them to lean 1 (one) English world

Yes, what a shame they can't just lean one world. Or just learn to live in one English world.
Or just lean on one word to avoid learning English.
Or something.

(yeah, cheap shot, but come on, if you're gonna get on a soapbox about folks learning to spell one word, you really should double check that you spelled *word" correctly, and avoid inadvertently proclaiming that there is one English World - it's the kind of thing that could end up really getting under a Scot's skin).

Doesn't the GECOS field already cover some of this use case?

Posted Dec 21, 2024 21:18 UTC (Sat) by steffen780 (guest, #68142) [Link] (3 responses)

I used to have a non-ASCII character in my surname, an "ö". After spending many hours of travel mentally preparing for the possibility of a heavily armed but dumb border guard making trouble for me because the UK travel company's software supplier apparently didn't realise that travellers on international journeys might have non-ASCII names (turned the "ö" into like 5 random characters) I started booking tickets with "o" instead. I figured that the kind of ignoranus who wrote the ticketing software above wouldn't notice the missing dots - and anyone who does notice will understand my explanation why I gave a false name. And this wasn't in 1960, this was ca 2001 or later. Remember, this was with a travel company. Not a local bus company, these people did substantial international travel. In fact the brand was "Eurolines" - but they couldn't even handle German names.

Similarly, until 2010 or so I would not use äöüß in filenames. Ever. To this day I still only use my native languages properly for low-risk "user-only" files - so I might use it for a LibreOffice file or a video, but I would not use it for a login username, anything in /etc, and so on. I just don't want the extra hassle. But I'm fairly advanced with IT - how is a typical user supposed to know that some software still can't handle such things, many DECADES after the problem was partially solved with Unicode? Do we really expect children today to learn a 1950s (!) encoding just so they know what characters they can use in a username? Surely there's more useful things that can be taught instead. E.g. pretty much anything else ;)

That being said: I wouldn't hold my breath for non-ASCII login usernames to become reliably usable with the infamous "long tail" of software. But huge progress has been made, and I think it's important to keep going.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 22, 2024 12:26 UTC (Sun) by NAR (subscriber, #1313) [Link] (1 responses)

One of the restrictions I set when we chose our children's name was to avoid accented characters - for the very same reason, to avoid possible problems during travel. For various reasons I lifted this restriction for our third child - of course he was the one born abroad :-) I was very (and pleasantly) surprised when the British clerk managed to produce a proper ó for the birth certificate - I think she saved us quite a headache.

> we really expect children today to learn a 1950s (!) encoding

What they need to know is the English alphabet. And as English is the international language nowadays, we can expect them to learn this while they learn English. Besides, we're using lot of stuff "hardcoded" in the previous centuries, from the metric system to normal gauge, the Latin alphabet itself, etc. the list of characters in the original ASCII charset is just one of them.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 22, 2024 13:55 UTC (Sun) by zdzichu (subscriber, #17118) [Link]

> surprised when the British clerk managed to produce a proper ó for the birth certificate

I wouldn't be surprised, given the number of Polish people in the UK.

Doesn't the GECOS field already cover some of this use case?

Posted Dec 23, 2024 9:25 UTC (Mon) by taladar (subscriber, #68407) [Link]

Optimistically you could consider the problem "solved" by Unicode two decades ago, pessimistically it isn't even fully there yet today (e.g. some popular database products still do or only recently switched to full UTF-8 encodings as the default and did not support anything outside the Unicode BMP in their previous default encoding).

I wouldn't call that "solved it many decades ago".

Once upon a time in the past ...

Posted Dec 5, 2024 19:11 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (14 responses)

… people used a weird text messaging system called email to send, well, text messages to users supposed to be delivered into their mailboxes at certain hosts. The email address of a user user on host host would be user@host. IOW, a unix username is also what RFC5322 calls the local-part of an email address. And this means ASCII, even according to the newest specification.

https://datatracker.ietf.org/doc/html/rfc5322

Once upon a time in the past ...

Posted Dec 5, 2024 20:25 UTC (Thu) by storner (subscriber, #119) [Link] (2 responses)

I think you are mistaken about requiring ASCII for the name of the mailbox. The RFC you refer to says (section 3.4.1):

The local-part portion is a domain-dependent string. In addresses,
it is simply interpreted on the particular host as a name of a
particular mailbox.

"Domain-dependent" means that there are really no rules as to which characters can be used. It can even be quoted to allow whitespace.

Once upon a time in the past ...

Posted Dec 5, 2024 21:02 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

The grammar rule defining local-part,

local-part = dot-atom / quoted-string / obs-local-part

is at the beginning of the RFC page which contains the statement

The local-part portion is a domain-dependent string.

The claim that domain-dependent would mean "no requirements" is thus obviously wrong. dot-atom and quoted-string are defined in sections 3.2.3 ("Atom") and 3.2.4 ("Quoted Strings"). Drilling down to the actual character set specifications always ends with a subset of ASCII, the most liberal one being the one for quoted strings which includes whitespace and all printable characters, ie, codepoints 32 - 126.
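For readers who want to try the grammar out, the unquoted dot-atom form of local-part from RFC 5322 section 3.2.3 can be approximated with a regular expression. This Python sketch covers only dot-atom, not quoted-string or the obsolete forms, and the function name is hypothetical.

```python
import re

# atext from RFC 5322 section 3.2.3: ASCII letters, digits, and
# the characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def is_dot_atom(local_part: str) -> bool:
    """Hypothetical helper: True if local_part is a plain RFC 5322 dot-atom."""
    return _DOT_ATOM.fullmatch(local_part) is not None


if __name__ == "__main__":
    for candidate in ["kris", "firstname.lastname", "a..b",
                      "First Last", "\U0001F612"]:
        print(f"{candidate!r}: {is_dot_atom(candidate)}")
```

Everything the pattern accepts is ASCII, which matches the point being made here about the base specification.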

Once upon a time in the past ...

Posted Dec 5, 2024 21:04 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

Slight correction: The quoted-string character set excludes \ and ".

Once upon a time in the past ...

Posted Dec 6, 2024 4:38 UTC (Fri) by jheiss (subscriber, #62556) [Link] (10 responses)

RFC 6532 extends 5322 to allow UTF-8 in email addresses in message headers, and 6531 extends 5321 (SMTP) to allow UTF-8 in SMTP addressing. There are some draft documents in the IETF mailmaint working group which try to set some guidelines about mixed languages and other possibly confusing situations, but servers that implement 6531/6532 could allow any combination of UTF-8 characters in usernames.

Once upon a time in the past ...

Posted Dec 9, 2024 13:24 UTC (Mon) by zdzichu (subscriber, #17118) [Link] (9 responses)

Just tried it. I've put 😒 (unamused face) into /etc/aliases for consumption by my Postfix. Then I've sent an email from Gmail to 😒@pipebreaker.pl. It came through!

I'll remove this alias next week. Until then you can email me at the above address to check your email setup ;)

Once upon a time in the past ...

Posted Dec 9, 2024 14:42 UTC (Mon) by geert (subscriber, #98403) [Link]

git send-email doesn't like it:

| error: unable to extract a valid address from: 😒@pipebreaker.pl
| What to do with this address? ([q]uit|[d]rop|[e]dit):

Where's the "[f]orce" option? ;-)

Once upon a time in the past ...

Posted Dec 9, 2024 15:51 UTC (Mon) by dskoll (subscriber, #1630) [Link]

The Postfix server on my LAN had no problems with 😒@... but my Sendmail relay host rejected it.

<😒@pipebreaker.pl>: host 192.168.xx.yy[192.168.xx.yy] said: 501 5.1.3 8-bit
    character in mailbox address "<p???@pipebreaker.pl>" (in reply to RCPT TO
    command)

Once upon a time in the past ...

Posted Dec 9, 2024 17:21 UTC (Mon) by raven667 (subscriber, #5198) [Link] (2 responses)

Fun, I tried thunderbird through o365 and Gmail, we'll see how it goes. The Gmail phone app rejected the address with an error popup as did the outlook phone app by marking it red and silently disabling the send button. No bounces yet.

Once upon a time in the past ...

Posted Dec 9, 2024 18:24 UTC (Mon) by zdzichu (subscriber, #17118) [Link] (1 responses)

Got one from you, replied.

I also got one from Alejandro, but did not manage to reply with fancy From:

SMTPUTF8 is required, but was not offered by host smtp3.kernel.org[44.230.10.245]

Once upon a time in the past ...

Posted Dec 9, 2024 20:05 UTC (Mon) by raven667 (subscriber, #5198) [Link]

My attempt through O365 was thwarted because our ProofPoint anti-malware/phishing system doesn't support utf8 addresses, but the `550 5.1.17 SMTPSEND.Utf8RecipientAddress; UTF-8 recipient address not supported.` bounce was filtered into Junk, lol.

The reply to Gmail with an ASCII address worked but the utf8 From reply was also filtered into Junk, but it did work.

I'm guessing that fake bounce messages are more commonly used for spam/phishing than real notifications which is why they are Junked repeatedly on totally different systems.

Once upon a time in the past ...

Posted Dec 9, 2024 18:24 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

I don't even have that key on my keyboard. Good luck to those trying to do funny things with UTF-8, often the first problem is to permit others to type them. That's by far the least inclusive thing that was ever invented in the computer world :-/

GNUS

Posted Dec 10, 2024 8:18 UTC (Tue) by SiB (subscriber, #4048) [Link] (1 responses)

Emacs GNUS:
Address ‘😒@pipebreaker.pl’ (=?utf-8?Q?=F0=9F=98=92?=@pipebreaker.pl) might be bogus. Continue? (y or n) y
Sending...
Sending via mail...
message-send-mail-with-sendmail: Sending...failed to 2024-12-10 09:17:47 1tKvRD-000000003Sv-2FNl bad addresses found in headers;

Exim4

Posted Dec 10, 2024 8:27 UTC (Tue) by SiB (subscriber, #4048) [Link]

Correction: The error message came from exim4:

=?utf-8?Q?=3C=F0=9F=98=92?=@pipebreaker.pl>: malformed address: >
may not follow =?utf-8?Q?=3C=F0=9F=98=92?=@pipebreaker.pl

Somehow, gnus was confused by the <> around the address. Without <>, exim4 did a graylisted delivery attempt.
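The `=?utf-8?Q?...?=` strings in these messages are RFC 2047 encoded words. Decoding them shows what was actually sent, including the stray `<` (`=3C`) that ended up inside the encoded word in the Exim error. A small sketch using Python's standard email package; the helper name is an invention for this example.

```python
from email.header import decode_header


def decode_encoded_word(text: str) -> str:
    """Decode RFC 2047 encoded words in text into a plain string."""
    parts = []
    for raw, charset in decode_header(text):
        if isinstance(raw, bytes):
            # Encoded words come back as bytes plus their declared charset.
            parts.append(raw.decode(charset or "ascii"))
        else:
            parts.append(raw)
    return "".join(parts)


print(decode_encoded_word("=?utf-8?Q?=F0=9F=98=92?="))     # the emoji alone
print(decode_encoded_word("=?utf-8?Q?=3C=F0=9F=98=92?="))  # "<" plus the emoji
```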

Once upon a time in the past ...

Posted Dec 10, 2024 11:29 UTC (Tue) by farnz (subscriber, #17727) [Link]

I've tried this with Exim and KMail; I can send just fine, but because I've not turned on Exim SMTPUTF8 support (in part because I need to test what Cyrus IMAPd thinks of UTF-8 in local parts), no reply comes through.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 19:19 UTC (Thu) by rhowe (subscriber, #102862) [Link] (12 responses)

Thinking about usernames in the OS more broadly, winbind (I think by default, but if not it's certainly one of the main options) generates usernames of the form "DOMAIN\username" and I know of at least one deployment which uses this.

Now, these users do not exist in the passwd file and therefore aren't created via useradd or adduser so this isn't directly relevant to the issue being discussed here, but it is certainly legitimate for usernames to contain "funky" characters and indeed potentially problematic ones. For example, if something were to treat the backslash as an escape character then all sorts of fun could occur from injecting of newlines into logs to injection of null terminators. Inadequate quoting in shell scripts being a prime example.

Also, both the domain and username portions are determined by the records in Windows' Active Directory and therefore need to follow the rules for that system. For the 'sAMAccountName' field, it's documented at https://learn.microsoft.com/en-us/windows/win32/adschema/... where interestingly it's defined as a Unicode string but not containing any of: "/ \ [ ] : ; | = , + * ? < >
The more modern userPrincipalName attribute is defined as following RFC822 which is not very helpful given the broad nature of that RFC: https://learn.microsoft.com/en-us/windows/win32/adschema/...
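As a rough illustration of the sAMAccountName rule quoted above, here is a Python sketch that reports which listed characters a name contains. The character set is transcribed from this comment, and counting the leading double quote as part of the list is an assumption; the function name is hypothetical.

```python
# Characters listed above as disallowed in sAMAccountName.
# Including the double quote is an assumption about how the list should be read.
FORBIDDEN = set('"/\\[]:;|=,+*?<>')


def forbidden_chars(name: str) -> list:
    """Return the listed characters that appear in name, sorted."""
    return sorted(set(name) & FORBIDDEN)
```

Run against a full winbind-style name such as "DOMAIN\username", it flags the backslash separator, which is why the domain and account parts have to be checked separately.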

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 19:31 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (11 responses)

RFC822 is the historic internet email RFC. That's still the local-part of an email address which is a sequence of words separated by . characters, word being defined in 3.3 as either an atom or a quoted string. In the given context, this also means no UTF8 and some restrictions beyond that.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 19:40 UTC (Thu) by dskoll (subscriber, #1630) [Link] (10 responses)

The local-part of your email address doesn't have to be your UNIX user name, though. It often is for convenience, but while the local-part of my email address is dianne, that is not my UNIX login name.

So appealing to email as a reason to restrict UNIX login names is not a great argument. I think a better argument is simply to make life easier for programs that need to deal with login names and that don't want to worry about UTF-8 canonicalization, etc.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 19:51 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (9 responses)

Regardless of what someone's public email address might be, username@hostname, hostname here both referring to the actual hostname and the host FQDN, is always also a valid email address. The implication of this is mostly that "programs dealing with login names" include any MTA ever written for UNIX and, very likely, all other programs ever written to handle email on UNIX. IOW, to name the (probably) most scary example: if you want to allow UTF8 in usernames, are you prepared to patch sendmail and procmail to support that?

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 20:57 UTC (Thu) by zeha (subscriber, #61580) [Link] (2 responses)

It was already discovered that various MTAs and MUAs cannot deal with non-ascii in gecos, so clearly these programs no longer matter.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 21:09 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

These were some technical remarks about usernames and not an invitation to an open-ended policy discussion about who dictates (or believes he should really get to dictate) what has to "matter" to other people.

Real-world non-alphanumeric usernames

Posted Dec 6, 2024 19:43 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> It was already discovered that various MTAs and MUAs cannot deal with non-ascii in gecos, so clearly these programs no longer matter.

I think it's worth the effort to identify and fix those programs so people can use their real name for display in the way they prefer to see it regardless of what language they use. If there is no one maintaining a particular MTA or MUA or whatever that breaks because of this, then you've learned that unmaintained software eventually breaks when the world changes around it, but this kind of change could be eased into over several release cycles by making it optional while bug reports and testing are done, before accepting it as the default and a blocker.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 21:23 UTC (Thu) by dskoll (subscriber, #1630) [Link] (5 responses)

No, that would not be fun, but still... appealing to email addresses as a reason to restrict usernames isn't a good argument. Some email systems store email in ways that don't necessarily depend on UNIX login names at all (for example, Cyrus IMAP.)

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 21:59 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (4 responses)

Email systems which weren't based on UNIX have existed since before UNIX gained any networking capabilities (AFAICT, even before UUCP) but that's beside the point. A UNIX system is also an email system and this system is based on using the UNIX username as local-part of an internet email address. That's just a technical fact people considering extending the username syntax to include octets outside of the range of printable ASCII characters might want to take into account. Or not, depending on what their priorities are.

Real-world non-alphanumeric usernames

Posted Dec 5, 2024 23:26 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Email systems which weren't based on UNIX have existed since before UNIX gained any networking capabilities

I think the birth of email actually predates the birth of Unix?

Cheers,
Wol

UNIX and email

Posted Dec 5, 2024 23:47 UTC (Thu) by KJ7RRV (subscriber, #153595) [Link]

> A UNIX system is also an email system and this system is based on using the UNIX username as local-part of an internet email address.

I think I'm misunderstanding this part? It seems to mean that all UNIX systems are email servers; is that correct?

Real-world non-alphanumeric usernames

Posted Dec 6, 2024 0:15 UTC (Fri) by dvdeug (guest, #10998) [Link] (1 responses)

> A UNIX system is also an email system

A fully POSIX-compliant UNIX system has an email system, though in the modern world, very few UNIX systems are connected to Internet email. I wouldn't say it's not UNIX if it doesn't have an email system. I removed mailutils, mailx, and mailcap from my Debian unstable system, and nothing depended on them. The concept of open access to email via Internet has been lost, and system-wide email isn't very useful on a single-user system.

Real-world non-alphanumeric usernames

Posted Dec 6, 2024 4:42 UTC (Fri) by KJ7RRV (subscriber, #153595) [Link]

Thank you! I didn't realize that POSIX requires email; that explains it.

usernames are a low-level implementation detail

Posted Dec 6, 2024 4:55 UTC (Fri) by marcH (subscriber, #57642) [Link]

> people use it to address me in written and spoken conversations with it, etc.

Just ask these people to stop. Then, do as many other people do and simply treat _both_ usernames and uids as low-level implementation details; that's what they are.

Asking all programs in the universe to agree on some UTF-8 subset for usernames is totally unrealistic. This discussion and article barely scratch that surface.

> I see and type my username hundreds times a day

Not sure what the problem is here. Surely, anyone can find something in ASCII that's not unpleasant to look at?

The simple and reliable way forward is to allow UTF-8 in some non-key, free-form, pure display field like "gecos" or similar and pressure applications to display that in User Interfaces and as many places as possible - while still relying on portable, unique and bug-free ASCII usernames in code and other implementation details. Isn't it what's happening already?

French people who believe É does not exist

Posted Dec 6, 2024 5:03 UTC (Fri) by marcH (subscriber, #57642) [Link] (27 responses)

> Étienne Mollier noted that he had "one weird enough" character in his first name that posed a problem if he had to log in using a keyboard layout that lacked the capability to transcribe the lower-case or upper-case 'e' acute characters ("é" or "É")

A French "fun fact" is that many French people wrongly believe that É, À, Ù etc. "do not exist" because... the default _Windows_ keyboard layout for France makes these incredibly hard to type! fr_FR layouts on Mac and Linux are not affected and neither are some other French-speaking countries.

é/É is one of the most common characters in French.

Note this is a pure software issue: there's no relevant physical difference between Windows and Mac keyboards.

https://www.google.com/search?q=majuscules+accentu%C3%A9es

Even more fun: you can tell whether newspapers and other editors use Windows or not by simply looking at their front page. Examples:

https://www.lemonde.fr/ -> Économie

https://www.liberation.fr/ -> Economie

French people who believe É does not exist

Posted Dec 6, 2024 6:49 UTC (Fri) by victrid (subscriber, #163116) [Link] (3 responses)

I think that's what GECOS field is all about.

In fact, you can type Japanese characters directly on the keyboard, but you cannot expect to see them in text mode. Supporting CJK characters included in UTF-8 is too complicated compared to supporting Latin-1.

Imagine desperate ops logging in to rescue via the console and nothing except blank diamond symbols can be displayed.

French people who believe É does not exist

Posted Dec 6, 2024 11:30 UTC (Fri) by mbunkus (subscriber, #87248) [Link] (2 responses)

I'm kinda confused when you say "In fact, you can type Japanese characters directly on the keyboard, but you cannot expect to see them in text mode.". There are tons of CLI programs out there with translations into languages that include more than ASCII characters, including but not limited to Chinese Traditional, Chinese Simplified, Japanese, and Korean. They can display their messages just fine.

French people who believe É does not exist

Posted Dec 6, 2024 19:59 UTC (Fri) by wahern (subscriber, #37304) [Link]

Maybe they had in mind VGA text mode console screens. But a little Googling suggests that many consoles these days do display at least some CJK characters. The UEFI specification explicitly references VT-UTF8, for example, but I would assume products targeted at East Asian customers had solutions (e.g. PC-98) long before these problems were addressed in common standards.

French people who believe É does not exist

Posted Dec 7, 2024 3:05 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> They can display their messages just fine.

Not in the pure text mode and even with graphical framebuffers it's hit-and-miss.

French people who believe É does not exist

Posted Dec 7, 2024 13:09 UTC (Sat) by geuder (subscriber, #62854) [Link] (22 responses)

Windows might be one factor.

I studied French as a foreign language in school about a decade before Windows existed. We were taught that on capital letters accents are completely optional, leaving them out is not a mistake.
No idea whether the same has ever been told in French schools.

Last time I looked it up I found there is a strong recommendation to use accents on capital letters, too.

Of course computers should allow you to follow recommendations.

Unfortunately Unicode is a train wreck when it comes to security / canonicalization issues, so in real life the whole world needs to restrict them to limitations anglophonic computer programmers imposed on them when ASCII was invented. For URLs I find this much worse than for user names.

French people who believe É does not exist

Posted Dec 8, 2024 1:36 UTC (Sun) by marcH (subscriber, #57642) [Link] (21 responses)

You're right, it started before Windows; with older, foreign "machines" like... typewriters. But Personal Computers put a keyboard in front of virtually everyone and then Windows really hurt.

> No idea whether the same has ever been told in French schools.

It was but it's a bit more complicated.

French schools have always taught _cursive_. It's faster when you know it. I think they still do. I've read that other countries with Romance languages tend to teach cursive too? I don't think upper cursive ever had accents for some reason and that's indeed what they used to teach in schools. BTW lower cursives are still standard in France but I think upper cursives are dying. You can still find samples easily; just search for "cursive majuscules".

But serious professionals in the _printing_ industry never stopped using accents on capitals. All dictionaries always had them and all professional guidelines always required them.

I've many times "stunned" French people who were arguing for the lack of accents on capitals by simply... opening a dictionary or book found on their own shelves. I think taht's bucaese fnleut rerades do not pay aoietttnn to ivainuddil cacrahetrs.

BTW Macs have been dominant in the publishing industry for a long time, not sure now.

There is also a subtle difference between "majuscule" (grammar) and "capitale" (typography) that only professionals tend to know. That difference shows when you write the initial character with a "big" capital and the rest in small caps. Most people in France make no difference and just say "majuscule" for both.
https://fr.wikipedia.org/wiki/Capitale_et_majuscule

> Unfortunately Unicode is a train wreck when it comes to security / canonicalization issues, so in real life the whole world needs to restrict them to limitations anglophonic computer programmers imposed on them when ASCII was invented.

+1, writing is everything but an exact science; don't let it pollute code and hard logic or you'll be in a world of pain.

It may feel good trying not to be an evil American dominating world culture once again but the cold truth is: ASCII is universal also because it's dead SIMPLE; not just because it came from the top superpower.

Diacritical marks on capital letters

Posted Dec 8, 2024 3:38 UTC (Sun) by KJ7RRV (subscriber, #153595) [Link]

It seems to be common for diacritical marks on capitals to be different from on lowercase letters: Greek, for example, uses them, but positions them before the letter rather than above, e.g. Ἔ vs. ἔ. They are omitted, however, in all-caps text.

Interestingly, my phone puts the diacritics on Ἔ above the character before it; I'm not sure if that's correct formatting or a bug. It might be part of the reason why diacritics aren't used in all-caps text; when they're used on initial capitals, they're (almost, at least) always used after a space, whereas in all-caps, they would not be.

French people who believe É does not exist

Posted Dec 8, 2024 15:44 UTC (Sun) by ballombe (subscriber, #9523) [Link] (19 responses)

I completely agree with you. Two points:
1. Usual cursive majuscules do not carry accents.
2. French keyboards (azerty) have labels for accented minuscules but not for accented majuscules, so as a result a lot of people do not know how to input them at all.

French people who believe É does not exist

Posted Dec 8, 2024 16:18 UTC (Sun) by farnz (subscriber, #17727) [Link] (18 responses)

As someone only peripherally aware of the French 80s computer options; did the home grown systems (Minitel, Thomson MO5 and TO-7 plus other Nanoréseau machines, Groupe Bull minis and mainframes etc) provide convenient ways to enter accented majuscules, or were they also restricted to minuscules?

French people who believe É does not exist

Posted Dec 8, 2024 19:39 UTC (Sun) by marcH (subscriber, #57642) [Link] (17 responses)

I obviously forgot how it worked on the Minitel, but because its lifespan overlapped a lot with the internet, there is a ton of information about the former available on the latter.

Pictures and documents clearly show that accents were available as dead keys (no dedicated key for é or other)

Section 3.3.2 of this scanned specification mentions accented capitals explicitly:
https://www.minitel-alcatel.fr/documents/M1_1983-1984/STU...

The encoding was apparently ISO 2022 G2
https://en.wikipedia.org/wiki/ISO/IEC_2022

So unlike Windows, it looks like accented capitals were not missed by the Minitel!
That does not mean they were popular but it was clearly possible.

French people who believe É does not exist

Posted Dec 10, 2024 20:42 UTC (Tue) by rschroev (subscriber, #4164) [Link] (16 responses)

Haven't accents always been available as dead keys on Windows? I can't be sure but I seem to remember I've always been able to type accented letters on Windows, even those that don't have a dedicated key (on Belgian azerty, not French, but I don't think that makes a difference in this case). Or maybe that only worked for lowercase letters, not for uppercase?

Or is the issue here that people often don't know how to use dead keys these days?

French people who believe É does not exist

Posted Dec 11, 2024 9:58 UTC (Wed) by taladar (subscriber, #68407) [Link]

Not sure if it does make a difference for those two particular layouts but which dead keys exist and behave as dead keys is absolutely part of the keyboard layout.

French people who believe É does not exist

Posted Dec 11, 2024 12:30 UTC (Wed) by Wol (subscriber, #4433) [Link] (10 responses)

> Or is the issue here that people often don't know how to use dead keys these days?

That's almost certainly true for the Anglo-Saxon world. There's nothing on my (105-key UK keyboard) that looks like a "compose" key, and I wouldn't know where to start. I used to have a Cyrillic keyboard, but that had the old DIN connector, so is long gone ...

Cheers,
Wol

French people who believe É does not exist

Posted Dec 11, 2024 12:49 UTC (Wed) by farnz (subscriber, #17727) [Link] (9 responses)

Compose is different to a dead key. A dead key is a key you can press that appears to do nothing, but where the next keypress is modified by the dead key - for example, if you have a dead key for `, then pressing it does nothing, but pressing ` followed by e gets you è.

A compose key is a mode switch for the keyboard; the next few (at least two) keypresses are combined into a single character. I've configured Shift-CapsLock as a compose key, so I typed è as Shift-CapsLock, e, `.

French people who believe É does not exist

Posted Dec 11, 2024 12:55 UTC (Wed) by Wol (subscriber, #4433) [Link] (8 responses)

Ah. So a "dead key" is basically like a normal key, except it displays a particular accent, so it can't display until the following key is pressed and they display together. In other words, a dead key is "compose", "accent" combined into one?

Cheers,
Wol

French people who believe É does not exist

Posted Dec 11, 2024 13:06 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

Exactly, and it results in a different set of tradeoffs. A compose key is strictly more flexible than dead keys, since I can type things like Compose o c, and get ©, and both Compose , c and Compose c , get me ç. A dead key is clearer to the user - you type accent, base character, and always get the combination of the accent with the base character.
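The difference described above can be modelled as two lookup tables: a dead key always maps (accent, base) to one character, while a compose sequence is an arbitrary key sequence that may have several spellings for the same result. The Python sketch below uses only the examples from this sub-thread; both tables are hypothetical excerpts, not a real keyboard layout or X11 Compose file.

```python
# Hypothetical excerpts built from the examples in this thread.
DEAD_KEYS = {
    ("`", "e"): "\u00e8",  # è: accent first, then the base letter
}

COMPOSE = {
    ("o", "c"): "\u00a9",  # ©
    (",", "c"): "\u00e7",  # ç
    ("c", ","): "\u00e7",  # ç again: order does not matter for this entry
    ("e", "`"): "\u00e8",  # è typed as Compose, e, `
}


def dead_key(accent, base):
    """Look up an accent followed by a base character; None if unknown."""
    return DEAD_KEYS.get((accent, base))


def compose(*keys):
    """Look up the keys pressed after the compose key; None if unknown."""
    return COMPOSE.get(tuple(keys))
```

The compose table can hold entries such as © that have no accent-plus-letter reading at all, which is the extra flexibility mentioned above.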

How to enter weird characters under X11

Posted Dec 12, 2024 0:16 UTC (Thu) by geuder (subscriber, #62854) [Link] (1 responses)

The compose aka multi key is truly amazing, someone must have had too much time...


$ grep -c '^<Multi'  /usr/share/X11/locale/en_US.UTF-8/Compose

3580

I don't even use en_US locale, but somehow the definitions seem to get included anyway. Some of the more useless examples:


<Multi_key> <C> <C> <C> <P>             : "☭"   U262D # HAMMER AND SICKLE

<Multi_key> <p> <o> <o>                     : "💩"  U1F4A9 # PILE OF POO

Not sure whether all of this is available under Wayland or is it another reason to postpone upgrading ☺

(the last character was typed as <Multi_key> <colon> <parenright>)

How to enter weird characters under X11

Posted Dec 12, 2024 11:20 UTC (Thu) by yaap (subscriber, #71398) [Link]

Compose works on Wayland for me, with KDE Plasma 6. Just tried your second option, got it in color on konsole ;)

Dead keys

Posted Dec 11, 2024 23:10 UTC (Wed) by rschroev (subscriber, #4164) [Link] (3 responses)

Dead keys come from mechanical typewriters: a dead key prints an accent, but does not advance the carriage like normal keys do; that's why it's called "dead". Therefore the next character is printed in the same position, resulting in a letter with an accent above it.

On those mechanical typewriters the vertical position of the dead key accents is fixed obviously, on the correct height for lowercase letters but not high enough for uppercase letters. The typewriter won't stop you (and can't stop you) from using a dead key in combination with uppercase letters, but the result will not be satisfactory.

Computers are smarter than mechanical typewriters and can produce the correct glyph, with the correct vertical position of the accent to match the letter it's combined with. On computers the user experience is different though: when you press a dead key, nothing seems to happen. No output appears. Only on the next key press is output generated. What that output looks like depends on the combination. For example, if I press ´ followed by a, I get á. But if I press ´ followed by space I simply get ´, and combined with z it gives me ´z (because z with an acute accent doesn't exist, I guess).

Belgian azerty has a number of dedicated keys for the most common letters with accents, so you don't need to use dead keys for those (though you could, if you wanted to): é è à ù (ù doesn't seem all that common to me though; I would think ê is more common). All other accented characters require the use of one of the dead keys (^ ¨ ´ ` ~). ^ is a special case in that it appears twice on the keyboard: once as a dead key to produce e.g. ê , and once as a normal key to produce ^ (which I can also produce by first pressing the dead key ^ followed by space, just as I can with all the other dead keys). I don't know why ^ is special enough to get two appearances.

Side note: "Dedicated" is not entirely the correct term here: all those keys produce other letters when combined with Shift, and often also with AltGr. For example, é is on the same key as 2 (Shift) and @ (AltGr) (yes, azerty keyboards require Shift to type digits, which is why people using azerty have a somewhat higher tendency to use the numerical keypad for numerical entry).

Second side note: The term "dead key" is not exactly correct either, since being dead or not is not a feature of the key itself anymore like it was on mechanical typewriters. For example the key with ^ produces a very normal non-dead [ when used with AltGr. There are no fully dead keys anymore (on Belgian azerty, at least).

(That's probably more than you wanted to know about dead keys and accents on azerty keyboards)

Dead keys

Posted Dec 20, 2024 21:55 UTC (Fri) by sammythesnake (guest, #17693) [Link] (2 responses)

> On those mechanical typewriters the vertical position of the dead key accents is fixed obviously, on the correct height for lowercase letters but not high enough for uppercase letters. The typewriter won't stop you (and can't stop you) from using a dead key in combination with uppercase letters, but the result will not be satisfactory.

I imagine it would be optimistic to expect you to have such a typewriter on hand to experiment with, but the first thing I'd try would be to use the SHIFT key while pressing the accent key...

Dead keys

Posted Dec 21, 2024 12:05 UTC (Sat) by johill (subscriber, #25196) [Link] (1 responses)

IIRC (but I haven't used one in probably about two decades) that's simply how you get the other (another) accent, e.g. for é vs è. I think we still have one somewhere, so I guess I could check.

Dead keys

Posted Dec 21, 2024 16:42 UTC (Sat) by rschroev (subscriber, #4164) [Link]

Exactly. Each dead key on a typewriter has two accents: one when you press the key normally, another one when you press the key in combination with shift.

French people who believe É does not exist

Posted Dec 12, 2024 9:06 UTC (Thu) by MortenSickel (subscriber, #3238) [Link]

àÀ öôÔÖÈè typed by pressing the accent dead key (some needing shift or AltGr) and then the relevant key (typing on a Norwegian keyboard having æøåÆØÅ as separate keys, but no other accented letters) I happened to learn about the dead keys since I am often typing German.

But I have no idea if the first letters in the comment come out as an accented letter or as an accent+letter combo.
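One way to answer that question is to look at the code points directly: a precomposed letter is a single code point, while an accent+letter combo is a base letter followed by a combining mark. A short Python sketch with the standard unicodedata module; the helper names are invented for this example.

```python
import unicodedata


def describe(text):
    """List each code point in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def has_combining_marks(text):
    """True if any character in text is a combining mark."""
    return any(unicodedata.combining(ch) for ch in text)


print(describe("\u00e0"))   # precomposed a-grave: one code point
print(describe("a\u0300"))  # 'a' followed by a combining grave accent
```

Pasting the characters from a comment into `describe()` shows which of the two forms the keyboard and browser actually produced.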

French people who believe É does not exist

Posted Dec 12, 2024 5:23 UTC (Thu) by marcH (subscriber, #57642) [Link] (3 responses)

> Haven't accents always been available as dead keys on Windows?

Not for é (extremely common character) with the default Windows layout in France. é has its own key. It becomes É with caps lock on a mac but it becomes a number on Windows. There are some dead keys available but not for the acute accent.

There are apps that let you visualize any keyboard layout.

French people who believe É does not exist

Posted Dec 20, 2024 22:05 UTC (Fri) by sammythesnake (guest, #17693) [Link] (2 responses)

"caps lock" and "shift" don't usually do the same thing - caps lock is for capitals only, but shift *also* changes lots of other keys, mostly punctuation.

E.g. pressing the "2" key will do the same with or without caps lock, but shift will change that to a double quote or @ symbol or whatever.

I don't ever use windows, but I'd like to see if the "é" key does the same thing with shift as with caps lock, or something different. My (possibly optimistic) guess is that caps lock would give you the Élusive character :-P

French people who believe É does not exist

Posted Dec 21, 2024 15:10 UTC (Sat) by Wol (subscriber, #4433) [Link]

And iirc typewriters didn't have caps-lock, they had shift-lock. (Ie they mechanically locked the shift key down, hence the name.)

I seem to remember one computer layout that had both, certainly shift-lock can be damn useful and its (apparent) lack on modern keyboards could be a pain. I don't feel that any more, it's been too long ago, but if I had it back I'd probably find a use for it :-)

Cheers,
Wol

French people who believe É does not exist

Posted Dec 21, 2024 18:00 UTC (Sat) by marcH (subscriber, #57642) [Link]

> E.g. pressing the "2" key will do the same with or without caps lock, but shift will change that to a double quote or @ symbol or whatever.

With a France keyboard, it depends whether you use mac or Windows.

> I don't ever use windows, but I'd like to see if the "é" key does the same thing with shift as with caps lock, or something different.

On Windows it does the same thing. That's the problem and one of the reasons why É is so hard to get on Windows.

> My (possibly optimistic) guess is that caps lock would give you the Élusive character :-P

Instead of wrongly guessing, you could just search the internet or open one of the references already mentioned above.

It's bad

Posted Dec 6, 2024 19:41 UTC (Fri) by rgb (guest, #57129) [Link]

As long as UTF-8 names are not fully supported, the "bad" in "--allow-bad-names" serves as a crucial hint to the unaware user that "bad" things can happen.

Sing it: https://genius.com/Michael-jackson-bad-lyrics

RFC 8265 defines how to normalize and compare Unicode usernames

Posted Dec 6, 2024 20:48 UTC (Fri) by gioele (subscriber, #61675) [Link] (1 responses)

> He asked if POSIX or other standards provided a normalization form for UTF-8 encoded usernames.

Later in the thread [1] Michal Politowski pointed out that RFC 8265 "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" and its sibling RFC 8264 "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols" do in fact describe which normalization forms should be used when comparing Unicode usernames (as well as a number of other low-level details).

[1] https://lists.debian.org/debian-devel/2024/11/msg00507.html

RFC 8265 defines how to normalize and compare Unicode usernames

Posted Dec 7, 2024 13:19 UTC (Sat) by geuder (subscriber, #62854) [Link]

Are there commonly available (of course on LWN commonly means packaged in distros...) libraries for those RFCs? The numbers look rather high to me, so my hunch is not yet widely used or even available. No idea whether they will ever be, but of course if Debian could make a start that would be a good try. I think the answer was in the article, too much work, no time :(

Read the Unicode standard

Posted Dec 6, 2024 23:56 UTC (Fri) by peter-b (guest, #66996) [Link] (1 responses)

Usernames are identifiers and, as such, there is a Unicode Standard Annex which specifies a stable and reliable scheme for Unicode identifiers:

UAX #31 Unicode Identifiers and Syntax
https://www.unicode.org/reports/tr31/

Please use it instead of endlessly arguing over a problem already solved by domain experts.

Read the Unicode standard

Posted Dec 7, 2024 5:40 UTC (Sat) by gioele (subscriber, #61675) [Link]

> Usernames are identifiers and, as such, there is a Unicode Standard Annex which specifies a stable and reliable scheme for Unicode identifiers:
>
> UAX #31 Unicode Identifiers and Syntax
> https://www.unicode.org/reports/tr31/
>
> Please use it instead of endlessly arguing over a problem already solved by domain experts.

Annex 31 is not prescriptive enough. For example, when it comes to normalization forms:

> UAX31-R4. Equivalent Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any.

Instead, PRECIS and RFC 8265 "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" are more prescriptive and actionable. From <https://www.rfc-editor.org/rfc/rfc8265.html#section-3.3.1>:

> 4. Normalization Rule: Apply Unicode Normalization Form C (NFC) to all strings.
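To make the quoted rule concrete, here is a minimal Python sketch that compares two usernames after applying NFC. It covers only the normalization step quoted above; the RFC 8265 profiles contain other rules that this sketch does not attempt, and the function name is hypothetical.

```python
import unicodedata


def same_username(a: str, b: str) -> bool:
    """Compare two names after Unicode Normalization Form C (NFC).

    Only the NFC step is implemented; this is not a PRECIS implementation.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00c9tienne"  # É as one code point
decomposed = "E\u0301tienne"  # E followed by a combining acute accent

print(precomposed == decomposed)               # raw strings differ
print(same_username(precomposed, decomposed))  # equal once normalized
```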

adduser and useradd

Posted Dec 8, 2024 1:44 UTC (Sun) by marcH (subscriber, #57642) [Link]

> It has long been said that naming things is one of the hard things to do in computer science.

No, it's very easy! Look:

> Most Debian users don't work with useradd, or groupadd, directly. Instead, Debian has long supplied its own adduser (and addgroup) utilities. These act as simpler front ends to useradd.

Couldn't resist sorry.

Damages i18n has done?

Posted Dec 9, 2024 9:43 UTC (Mon) by taladar (subscriber, #68407) [Link] (18 responses)

I sometimes wonder what the economic damage amounts to that i18n and l10n have done over the years by localizing and translating things that would be better off untranslated and unlocalized.

I am thinking of examples like scripts breaking because systems unnecessarily auto-switch output of basic unix utilities to use commas instead of periods or to use column headers in a different language, imports failing because a locale set a different default character set, admins and users looking for error messages online but not finding any results because the results are partitioned by language of the error message or even ridiculous examples like the VBA keyword translations,...

Don't get me wrong, for user-facing input and output of course it should be supported to display and produce content in every language but some people just take it too far into the low level details without thinking about the question if the natural language used is even a significant barrier to the intended audience and which negative impacts translations can have.

Damages i18n has done?

Posted Dec 9, 2024 10:28 UTC (Mon) by mbunkus (subscriber, #87248) [Link] (16 responses)

Only people well-versed in English can question whether English is "actually a significant barrier". Unless you define your "target audience" as "all English speaking admins" instead of "all admins", of course. SMH.

Damages i18n has done?

Posted Dec 9, 2024 12:22 UTC (Mon) by taladar (subscriber, #68407) [Link]

If your admin does not speak English and is stuck in a walled garden with other admins who do not speak English I would argue that that will actually make it harder for them to solve any and all issues with localized error messages because there is going to be very little cross-communication between that walled garden and the larger international community so they might literally be better off searching for the error message they can't read in the English threads about it they also can't understand but that contain the actual solution.

Like it or not, there is a reason a lingua franca is common in many fields among experts and that reason is that the only thing worse than content in a single language you don't speak is content split over dozens of languages where each speaker speaks none of the others.

Damages i18n has done?

Posted Dec 9, 2024 17:10 UTC (Mon) by raven667 (subscriber, #5198) [Link] (11 responses)

Clearly all error messages should be standardized to Esperanto (j/k)

What might be useful in that case is to promote the use of unique ASCII identifiers for error messages that make them searchable across languages or text edits, e.g. %FOO-PLORT-12345. This is a well-worn solution to this problem.

Damages i18n has done?

Posted Dec 10, 2024 11:14 UTC (Tue) by taladar (subscriber, #68407) [Link] (5 responses)

That only really works if the parameters in the error message (e.g. filenames, ports,...) do not matter to finding others with the problem though.

Damages i18n has done?

Posted Dec 10, 2024 11:46 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

The well-worn solution has messages that look like %FOO-PLORT-12345:"filename","example.com","2001:db8:1::42/64"%. The idea is that you look up %FOO-PLORT-12345 in your catalogue of possible messages, and get told that it's "could not download {1} over HTTP from https://{2}/ (resolved IP {3})". You can then fill in the parameters (by hand, back in the day, computer can do it now), and discover what the error meant.
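
A rough Python sketch of that lookup, assuming the wire format is exactly as in the example above (an ID, a colon, then comma-separated quoted parameters between percent signs). The CATALOGUE dict and the expand function are hypothetical names, not taken from any real system.

```python
import csv
import re

# Hypothetical catalogue: message ID -> template with 1-based placeholders.
CATALOGUE = {
    "FOO-PLORT-12345": "could not download {1} over HTTP from https://{2}/ (resolved IP {3})",
}

# Assumed format: %<ID>:"param","param",...%
MESSAGE_RE = re.compile(r"^%(?P<id>[A-Z0-9-]+):(?P<params>.*)%$")


def expand(raw: str, catalogue: dict = CATALOGUE) -> str:
    """Look up the message ID and fill its parameters into the template."""
    match = MESSAGE_RE.match(raw)
    if match is None:
        raise ValueError(f"not a catalogued message: {raw!r}")
    # The csv module takes care of the quoted, comma-separated parameters.
    params = next(csv.reader([match.group("params")]))
    template = catalogue[match.group("id")]
    # Placeholders start at {1}, so pad position 0 with a dummy value.
    return template.format(None, *params)


print(expand('%FOO-PLORT-12345:"filename","example.com","2001:db8:1::42/64"%'))
```

The ID stays searchable regardless of which language the catalogue text is in; a per-locale catalogue would simply be a second dict keyed the same way.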

Damages i18n has done?

Posted Dec 11, 2024 9:45 UTC (Wed) by taladar (subscriber, #68407) [Link] (3 responses)

While that does seem like a good solution to the issue I can't say I have ever encountered a program using that in 20+ years of professional Linux administration.

Damages i18n has done?

Posted Dec 11, 2024 10:50 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

The UNIX world never went this way; I encountered it interacting with mainframes and minicomputers, back 30-odd years ago.

Damages i18n has done?

Posted Dec 11, 2024 12:13 UTC (Wed) by Wol (subscriber, #4433) [Link]

That sounds exactly like what I was thinking of - of course, using a database, there was a MESSAGES file which the error function searched - keyed on message id and language, then it printed the appropriate error message for the locale. Prefixed by the message id, to make it easy to search for / report a problem. If you're dealing with a support team who speak a different language, the message id makes much more sense than the error message.

Cheers,
Wol

Damages i18n has done?

Posted Dec 11, 2024 18:42 UTC (Wed) by raven667 (subscriber, #5198) [Link]

I've seen this standard in various vendor software, eg Cisco IOS and variants use a similar kind of system

https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/16_xe/s...
https://www.cisco.com/c/en/us/td/docs/ios/12_2/sem2/syste...

found an example from some IBM system that is this style where every log is numbered

https://publibz.boulder.ibm.com/epubs/pdf/ispzmc90.pdf

grabbing one at random ISRB0001 is searchable and leads to further docs https://www.ibm.com/docs/en/zos/2.4.0?topic=codes-ispf-me... which would be a searchable tag even if the text of the message was localized or changed between versions

Damages i18n has done?

Posted Dec 10, 2024 17:23 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (4 responses)

It's not just about error messages, though. If you don't speak any of the languages a tool is available in (including its help output, man page etc.), then that tool is most likely completely unusable for you.

It's kind of hard to know how many people across the world do not speak English. There are several statistics out there that say that up to 1.45 billion people do speak English[1], but there are 8.2 billion people across the globe (or something like that). For whatever reason. Lack of education (or even educational possibilities), too young, too old, learning disabilities, socio-economic pressure & limitations etc. etc. "Just learn English" is not going to cut it just yet, maybe never.

For example, I started using computers when I was eight, I think. I was able to learn to program in it because the manual it came with was in German, even though the software itself was in English. I could not speak English at that point, but having documentation in my native language enabled me to at least associate several English words (PRINT, IF…) & short phrases (SYNTAX ERROR IN…) with their German counterparts, but only because I had the German stuff to learn from. If that hadn't been available, I might only have started doing stuff with computers years later if ever at the scale I'm doing it now. Having stuff available in your own language enables you to learn, to use, to create. Saying things like "everyone needs to learn English in our field" and "i18n has cost businesses a lot" is really thinking from inside a certain bubble, and it's really excluding & limiting.

All I'm asking for here is to be more open to make software, especially Open Source software, available and usable to all, not just the English-speaking system admin clique.

[1] https://www.statista.com/statistics/266808/the-most-spoke...

Damages i18n has done?

Posted Dec 11, 2024 9:57 UTC (Wed) by taladar (subscriber, #68407) [Link] (3 responses)

Please note that I was specifically talking about error messages and auto-switching of output to another language on relatively low level interfaces, the kind most likely used directly only by relatively skilled computer users.

I am absolutely in favor of translating interfaces used by laymen (but only those parts they want to skip over anyway like error messages) and documentation.

I have the opposite experience to yours though, when I was younger, in the 1990s, a lot of computer books were translated by clueless translators so every publishing house had a different German version of the standardized English IT terminology and some of the coding examples in programming books were broken because the translators didn't understand how to translate e.g. a regex replacing part of a string.

Similarly, even in entertainment media, once I learned English I noticed how many of the German dubs contain English idioms that do not exist in German and were just translated word for word (presumably to make the lip-sync work).

I am also not talking about the cost to business here; I am talking about the cost i18n imposes on the communication itself by making it worse, not the financial cost.

Damages i18n has done?

Posted Dec 11, 2024 12:20 UTC (Wed) by Wol (subscriber, #4433) [Link]

> Similarly, even in entertainment media, once I learned English I noticed how many of the German dubs contain English idioms that do not exist in German and were just translated word for word (presumably to make the lip-sync work).

This! As someone whose German is passable, and whose French has mostly been forgotten (plus ancient smatterings of Russian and Khmer), so much information is passed *by reference* in conversation, that if you're not a native speaker it's extremely easy to miss what is actually being said. Or (as has happened to me) the "meaning as written" can be very different to the "meaning as understood", so you end up saying something completely different from what you thought you had said!

Cheers,
Wol

Damages i18n has done?

Posted Dec 11, 2024 16:50 UTC (Wed) by mbunkus (subscriber, #87248) [Link] (1 responses)

> Please note that I was specifically talking about error messages and auto-switching of output to another language on relatively low level interfaces

You're trying to enforce permanence on human language here. Error messages may change for a number of reasons, including them being unclear or even plain wrong, having to be extended to include additional information, include examples to the user how to fix the error/use the program correctly, or just stylistic changes. Even error messages written in English might contain non-ASCII characters if they include user-generated content, and that might not even be validly encoded (e.g. a file name). Note that all of those can happen with English as well.

If you want "I don't want to have to change my things, ever", then you're in the well-trodden territory of e.g. REST APIs & similar. Argue for your low-level tools to implement best practices from those APIs, including:

- structured, versioned output
- a status indicator
- machine-parseable, stable error codes (that don't change) alongside human-readable error messages (that are subject to change & translation)
- one imposed language on all identifiers, most likely English (e.g. hash keys, status strings etc.)

That gets you everything you want while also allowing the tools to be translated, their messages changed in whatever way, to be easier to use by more people. This is something that I would very much like to see as well.
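
A small Python sketch of what such output could look like. The field names, the code E_FILE_NOT_FOUND and both functions are invented for illustration and do not come from any existing tool.

```python
import json

SCHEMA_VERSION = 1  # hypothetical version number for the output format


def render_error(code: str, message: str, **details) -> str:
    """Emit a structured error: stable code plus a translatable message."""
    return json.dumps(
        {
            "version": SCHEMA_VERSION,
            "status": "error",
            "error": {"code": code, "message": message, "details": details},
        },
        ensure_ascii=False,
        sort_keys=True,
    )


def is_missing_file(output: str) -> bool:
    """A script matches on the stable code, never on the message text."""
    doc = json.loads(output)
    return doc["status"] == "error" and doc["error"]["code"] == "E_FILE_NOT_FOUND"


english = render_error("E_FILE_NOT_FOUND", "file not found", path="/tmp/x")
german = render_error("E_FILE_NOT_FOUND", "Datei nicht gefunden", path="/tmp/x")

print(english)
print(is_missing_file(english), is_missing_file(german))
```

Scripts keyed on the code keep working whether the message is reworded or translated, while the person at the terminal still gets text in their own language.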

As for two examples, the "ip" tool & the "restic" backup command have JSON output in addition to the well-known, default human-readable one. It's easy to handle. Unfortunately in both cases error messages (and in the case of Restic certain verbose status messages) are still printed as human-readable messages instead of using JSON for it as well, falling short of what I'd like to see.

Damages i18n has done?

Posted Dec 16, 2024 10:28 UTC (Mon) by taladar (subscriber, #68407) [Link]

Oh, I would absolutely be for machine-readable output for all of those situations.

Unfortunately as long as you have some sort of output that isn't fully pre-specified (like an enum) but a free form value you would then soon get the feature request to translate those parts of the output too because someone wants to build some sort of user-facing UI based on the machine-readable output.

My argument is more that certain messages should not be translated because translations are literally hurting communication when compared to the use of a single language.

Damages i18n has done?

Posted Dec 9, 2024 18:43 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

This is the 2020s already; automatic translators exist. Google exists. You can consider the English-language text as message IDs that you use for lookups, kinda like un-automated GNU gettext.

Damages i18n has done?

Posted Dec 10, 2024 11:18 UTC (Tue) by taladar (subscriber, #68407) [Link] (1 responses)

The existence of automated translators that badly translate the message without anyone who speaks the language ever looking at it to see if it makes sense is half of the problem because reversing that nonsense-translation to look up the actual error message or communicate with the developers about it is often not possible or unreasonably hard.

Damages i18n has done?

Posted Dec 11, 2024 6:33 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Again, realistically ChatGPT and Google will provide you enough context. And it's not like error messages are James Joyce's poems, they are typically very formulaic.

I maintain a couple of small projects, and I actually communicated with Arabic speakers via a translator.

Damages i18n has done?

Posted Dec 9, 2024 14:47 UTC (Mon) by Wol (subscriber, #4433) [Link]

> I am thinking of examples like scripts breaking because systems unnecessarily auto-switch output of basic unix utilities to use commas instead of periods or to use column headers in a different language, imports failing because a locale set a different default character set, admins and users looking for error messages online but not finding any results because the results are partitioned by language of the error message or even ridiculous examples like the VBA keyword translations,...

And yet it should be so easy to fix ... I use lilypond, whose default language is DUTCH. Yet it works in English fine (except all the docu is in American :-)

You just need something like "#pragma English" or whatever, to say what language the keywords are. lily has "#include english.ly", which redefines all the notes as the American names (mostly the same as English). And you could redefine everything else if you chose - although it does help that most music terms are universal (and Italian!).

Cheers,
Wol

It's the human

Posted Dec 9, 2024 18:30 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (6 responses)

As usual, the problem in the computer world is the human, and since that one cannot be fixed (at least in legal ways), the poor little computer is forced to adapt at any cost and will even be blamed later for introducing vulnerabilities.

With that said, I don't understand why some people want their *name* as a user name. Maybe just because that's the way it's presented. Would it be called "a system identifier", it wouldn't be a problem at all. I've had logins made of one letter and 6 digits for many years and nobody complained at all. It could be said that as a convenience, since the system supports this or that alphabet, you're free to select a system identifier that more or less looks like your name provided that it's available, and that would be fine.

The problem really seems to be how it was presented to users in the first place.

Also those trying hard to get access to "root" (or "administrator" in some other environments) suddenly love their new permissions regardless of the accepted character set to write these identifiers.

Maybe the situation could progressively be reversed by changing the way tools present these logins to call these "system identifiers" exclusively and recalling the list of allowed chars at creation time.

It's the human

Posted Dec 10, 2024 7:13 UTC (Tue) by micka (subscriber, #38720) [Link]

Well, yes you could just attribute an uuid instead of asking for a username, then it would be clear it doesn’t need to support utf8.

It's the human

Posted Dec 10, 2024 7:50 UTC (Tue) by mb (subscriber, #50428) [Link] (3 responses)

> With that said, I don't understand why some people want their *name* as a user name.

Well, because it's easy to remember.
I don't want to remember a cryptic user name and a cryptic password.

What would help is if the system would never actually ask for the user name, if there's only a single (non-system) account.
Just automatically create the user "main" in the background while installing and never tell the user about it, unless a second account is created.

It's the human

Posted Dec 10, 2024 8:08 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> What would help is if the system would never actually ask for the user name, if there's only a single (non-system) account.

> Just automatically create the user "main" in the background while installing and never tell the user about it, unless a second account is created.

I don't know how (un)usual my setup is, but my main system is a desktop. And because I hate people screwing up my defaults, it's a very firm policy that EVERYone has their own account on that system. I suspect that is pretty much a normal setup for geeks ...

Security may be *****, but as it's a home system we're not worried about family members.

So that policy wouldn't work for us, or for a lot of other people I suspect ...

Cheers,
Wol

It's the human

Posted Dec 10, 2024 9:41 UTC (Tue) by mb (subscriber, #50428) [Link]

Of course it would work for you, too.
Just create any amount of users you want.

It's just not how normal people use computers, though. See Raspberry Pi OS. It just creates a "pi" user and doesn't bother the user with details. It would be even better if it never showed this to the user, as long as pi is the only name.

It's the human

Posted Dec 10, 2024 13:13 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

You can let users choose within certain limits. Others use tri/quadri/pentagrams. I used to have "tarw" when I was a student for example. I've also had 2 different numeric accounts as a student. Nobody complained, the account was there for the whole year, plenty of time to remember it. Also for a very long time you were limited to 8 chars. Many valid names wouldn't fit so that was another good justification to use a short name different from your real one. Actually it's only at home and on my work laptop that I'm using my first name since I'm not creative and it's short. People I know with long or composed names (for example "jean-françois") just use a short variation such as "jf" or "jeff" to avoid conflicts. On the few machines I am/have been managing with up to ~20 persons, often short names are used, with roughly 50% matching the user's first name, indicating that it's more like "I don't know what to use" than "I want my name".

BTW, just look here on LWN: most logins are short (including yours which is among the shortest). When I registered I tried "willy" and it was already taken by Matthew so I switched back to something simple. In any case I needed to note it somewhere, so I could have used anything else, like people have on gmail for example.

It's the human

Posted Dec 10, 2024 17:09 UTC (Tue) by mbunkus (subscriber, #87248) [Link]

> With that said, I don't understand why some people want their *name* as a user name.

Easier to remember. Emotional attachment. The fact that outside of computers you usually do use your name & not some arbitrarily restricted identifier to refer to yourself. Far less understanding for technical quirks & anachronisms in the general, not-too-tech-savvy public.

When I see some of my relatives having problems remembering their PINs for their EC cards which they use several times a week, I'm sure they're not keen on having to remember arbitrary identifiers for websites on end.

There are plenty of reasons. I guess you wouldn't consider them valid or important enough. Others may disagree.

Is this a real problem?

Posted Dec 10, 2024 13:08 UTC (Tue) by alx.manpages (subscriber, #145117) [Link] (3 responses)

Is this a real discussion? Or are we debating the sex of the angels?

- How many people are currently using --badname (or the equivalent in the wrapper programs)?

I suspect the number is low.

Those that understand computers most likely avoid it,
knowing that it can trigger bugs in so many places.
I would say the number is exactly 0,
except maybe for a few cases just for fun testing a system.

Even more so if they have ever used different keyboards for input.
Restricting oneself to [a-z] for passwords is a good recommendation for similar reasons.
You might get locked out of your own system if you can't type the symbol.

There might be some people that do use UTF8 symbols in their usernames,
and I expect it's people that have no clue of how that works. So:

- How many of the people already using it would continue using it if you explained the possible consequences to them?

Then, can we justify discussing support for one feature that has been available for a very long time (in Debian) with close to 0 users --especially those informed--, where we suspect it's quite dangerous?

Have we learnt something from allowing \n in file names?

Is this a real problem?

Posted Dec 12, 2024 13:43 UTC (Thu) by MortenSickel (subscriber, #3238) [Link] (2 responses)

The question is rather,

How many people are annoyed each and every day that there is a limit on the allowable characters in their username?

Since usernames usually are closely connected to your given name, for my (Norwegian) friend Bjørn, the name bjorn is not his real name, and although bjoern to a certain degree is a correct spelling, it feels wrong, and for people using entirely different scripts such as Cyrillic, Arabic or ... it is even worse.

So the answer is clearly, the usernames should allow UTF-8, but as this article has clearly shown, it is not an easy part to get there, but hopefully one day. The way it is today is more or less a user name standard saying "ascii ought to be enough for anybody"

Is this a real problem?

Posted Dec 16, 2024 10:33 UTC (Mon) by taladar (subscriber, #68407) [Link]

The real question is how many admins and users would be much more than just slightly annoyed that they can't ban someone because that person performing some harmful action uses a UTF-8 username they can neither pronounce nor type.

Is this a real problem?

Posted Dec 17, 2024 14:11 UTC (Tue) by tao (subscriber, #17563) [Link]

Isn't it enough to be able to use UTF-8 in the full name field? In most larger multi-user systems you'll inevitably end up with name collisions anyway, so even with UTF-8 support in the username your friend would probably end up named something like bjørn3 (or, more likely, bj<5letters of last name>3).