Debian opens a can of username worms
It has long been said that naming things is one of the hard things to do in computer science. That may be so, but it pales in comparison to the challenge of handling usernames properly in applications. This is especially true when multiple applications are involved, and they are all supposed to agree on what characters are, and are not, allowed. The Debian project is facing that problem right now, as two user-creation utilities disagreed about which names are allowable. A plan is in place to sort this out before the release of Debian 13 ("trixie") sometime next year.
The useradd utility is part of the shadow-utils project, which includes programs for managing user and group accounts. The shadow-utils suite is included in Debian's passwd package. For historical reasons, and to avoid confusion with the upstream project, Debian's version of the shadow-utils sources is often referred to as "src:shadow".
Most Debian users don't work with useradd, or groupadd, directly. Instead, Debian has long supplied its own adduser (and addgroup) utilities, originally written by founder Ian Murdock. These act as simpler front ends to useradd and use Debian-supplied system defaults for creating users' home directories and configurations. It should be noted that useradd, et al., have become much more full-featured since Debian's utilities were introduced, but the project continues to maintain them nonetheless.
Little Bobby Tables
In June, Debian developer and src:shadow maintainer Chris Hofstaedtler filed a bug against the adduser package. The src:shadow package had dropped a Debian-specific patch, originally introduced in 2003 by Karl Ramm, to allow characters far beyond what were allowed by the upstream shadow-utils project. In the patch, Ramm wrote:
I can't come up with a good justification as to why characters other than ':'s and '\0's should be disallowed in group and usernames (other than '-' as the leading character). Thus, the maintenance tools don't anymore.
Hofstaedtler said that he had puzzled out some of the patch's purpose from old bug reports that had been "fixed" by the patch, and those asked for two things not allowed by the upstream shadow-utils: usernames with upper-case characters and usernames that are purely numeric. Hofstaedtler said that upper-case names had been allowed in the upstream shadow-utils project "a long time ago", but it seemed like a bad idea to allow purely numeric usernames.
The patch enabled much more than upper-case and purely numeric names, though. With the patch dropped in version 1:4.15.2-2 of the shadow source package, one of adduser's tests (using the username "bob;>/hacked") had failed.
For src:shadow, I would really like to not have a divergence from upstream in this regard. I think if we have clear requirements then we (I) can submit them upstream and I would expect upstream to accept patches.
I do feel that making the case for "bob;>/hacked" would be very hard.
Hofstaedtler said that the patch had been reapplied for the time being (it was included again in version 1:4.15.2-3), but he asked if username requirements could be sorted out in time for the Debian "trixie" release. If the patch were dropped entirely, then useradd would restrict usernames to the POSIX standard, with the exception of allowing a "$" character at the end of a username.
Debian developer and adduser maintainer Marc Haber replied in late October that other tests were failing as well, and thought that "useradd upstream is being too picky here". Since adduser depends on useradd, it cannot create users that useradd would reject, so he said he would like to synchronize on what would be allowed.
As part of the research into what should be allowed in usernames, Haber took over Debian's UserAccounts wiki page, which outlines Debian's username tools and policies, and started looking into whether the project should relax its requirements around usernames.
Limits on usernames
One of the questions that bubbles up when looking at usernames is not just allowable characters, but the allowable length of the username. The documentation for shadow-utils does not specify a length for usernames or what encoding is being used.
However, in order to be portable between systems, the POSIX standard says that usernames should not include non-ASCII characters. The standard says that usernames should be "composed of characters from the portable filename character set". That set consists of the digits 0 through 9, the upper-case and lower-case letters "a" through "z", the period (.), underscore (_), and hyphen (-). It also specifies that usernames should not begin with a hyphen.
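As an illustration only, the character set described above can be expressed as a short Python check. The function name is_portable_username is hypothetical, and this sketch is not the validation logic used by useradd or adduser, which apply their own additional rules:

```python
import re

# Portable filename character set as described in the text:
# A-Z, a-z, 0-9, period, underscore, and hyphen, with the extra
# rule that the first character must not be a hyphen.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_username(name: str) -> bool:
    """Return True if name is non-empty, uses only the portable
    character set, and does not start with a hyphen."""
    return PORTABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["bob", "-bob", "bob;>/hacked", "\u00e9tienne"]:
        print(repr(candidate), is_portable_username(candidate))
```

Note that a purely numeric name passes this check, since digits are part of the portable set; rejecting such names is a separate policy decision layered on top.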
It is, however, possible to assign characters outside that set with the tools at hand. But Linux distributions usually put up some guardrails in the adduser and useradd configurations to prevent administrators from creating usernames with non-ASCII characters unintentionally. These configurations can be overridden with adduser's --allow-bad-names option or useradd's --badname option.
In November, Haber posted a message on debian-devel that he had "opened an especially nasty can of worms" and was finding that things were more complicated than he had understood. He sought input and opinions on a number of questions about whether Debian should allow non-ASCII characters for usernames, how to do that if so, and if it was more appropriate to document username guidance in Debian's Policy Manual rather than its wiki. His suggestion was to allow UTF-8 for regular user accounts, but to restrict to ASCII for system accounts created by Debian packages.
Richard Lewis asked if enabling UTF-8 would open the door to "some of the abuse described" in a 2021 LWN article about flaws in Unicode handling that led to security exploits. He said that it seemed to be a bad idea to make the change, even if it would be nicer for users to have the option.
Haber said that he was not sure if it would be dangerous to allow UTF-8 usernames, "since we can expect other commands to gracefully handle a byte stream, can't we?" Additionally, local administrators already can loosen restrictions to allow UTF-8 usernames, but Debian does not test for such use cases. Debian would become "more robust" if it assumed UTF-8 characters would be used in usernames. "Vulnerabilities that could be exploited by having non-ascii user names are already here and present today, just not uncovered yet."
It would be reasonable, Timo Röhling said, to mitigate possible homograph attacks by disallowing mixed alphabets "such as cyrillic and latin letters in the same name". Haber said that was not going to help if a user could directly write to /etc/passwd, and he was unwilling to implement that himself in adduser. He would accept code and test cases written by others, though.
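To make the mixed-alphabet idea concrete, here is a rough Python sketch. It is not code from adduser (which is written in Perl) or from anyone in the discussion. Python's standard library does not expose the Unicode Script property, so this sketch assumes that the first word of each letter's Unicode character name ("LATIN", "CYRILLIC", and so on) is a good enough stand-in; the function names are hypothetical.

```python
import unicodedata


def scripts_used(name: str) -> set:
    """Approximate the set of scripts used by the letters in name.

    Heuristic (an assumption, not the real Script property): take the
    first word of each alphabetic character's Unicode name. Digits,
    punctuation, and combining marks are ignored.
    """
    scripts = set()
    for ch in name:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def has_mixed_scripts(name: str) -> bool:
    """Return True if the letters in name come from more than one script."""
    return len(scripts_used(name)) > 1


if __name__ == "__main__":
    print(has_mixed_scripts("bob"))        # all Latin letters
    print(has_mixed_scripts("b\u043eb"))   # middle letter is Cyrillic U+043E
```

A production-quality check would need real script data and a policy for scripts that are legitimately combined, which is part of why the discussion treated this as nontrivial work.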
Keyboards
Security concerns aside, there are other practical problems with supporting non-ASCII usernames. Étienne Mollier noted that he had "one weird enough" character in his first name that posed a problem if he had to log in using a keyboard layout that lacked the capability to transcribe the lower-case or upper-case 'e' acute characters ("é" or "É"). For that reason, he said, he felt better about keeping a full ASCII username and "wouldn't feel strongly if unicode support for login never happens". But it would be good if the gecos field of the passwd file had proper Unicode support to properly display users' real names.
Not only was it difficult to type "é" on some keyboards, it could also be encoded in multiple ways. Gioele Barabucci pointed out that it could be "e with acute", which is encoded in Unicode as U+00E9, or it could be "e, combined with an [acute] accent", which would be U+0065 plus U+0301:
If a keyboard input system provides the former sequence of bytes, but the username is stored in the login infrastructure using the latter sequence of [bytes], then a naive comparison will not find the user "émollier" in the system. Unicode defines in Annex 15 a few normalization forms as a way to work around this problem. But a correct use of these normalization forms still requires coordination and standardization among all programs accessing the data.
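The mismatch Barabucci describes can be reproduced with Python's standard unicodedata module. This is only a demonstration of the two encodings and of NFC normalization; the helper name same_after_nfc is made up for this example, and nothing here reflects how any login program actually compares names.

```python
import unicodedata

# Two ways to spell the same visible username.
precomposed = "\u00e9mollier"   # U+00E9 (e with acute)
decomposed = "e\u0301mollier"   # U+0065 followed by U+0301 (combining acute)


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The UTF-8 byte sequences differ, so a naive comparison fails.
    print(precomposed.encode("utf-8"))        # b'\xc3\xa9mollier'
    print(decomposed.encode("utf-8"))         # b'e\xcc\x81mollier'
    print(precomposed == decomposed)          # False
    # After normalization the two spellings compare equal.
    print(same_after_nfc(precomposed, decomposed))  # True
```

As the quote notes, normalization only helps if every program that reads or writes the name agrees on the same form.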
Barabucci asked if POSIX or other standards provided a normalization form for UTF-8 encoded usernames. Peter Pentchev responded that POSIX said to stick to the portable filename character set to ensure portability. Haber argued that it should be up to local admins to decide whether they wanted their local user database to be portable. "I don't think that we should restrict local admins who don't need that kind of portability."
Simon McVittie recommended that Debian consider adopting systemd's user name syntax and concepts of "strict mode" and "relaxed mode". The systemd tooling adheres to a strict naming convention when creating usernames, but it has a relaxed convention for accepting usernames created by other tools. McVittie said that seemed like a good principle for Debian to follow, even if its specific rules might differ from systemd's.
Haber seemed to agree in part, but said systemd's strict mode was "even stricter than what we currently allow for system accounts", and he did not like that systemd's policies (especially with systemd-homed, which LWN covered recently) were not configurable.
This time it's personal
The discussion, perhaps not surprisingly, brought out some strong feelings about how names and usernames were represented. Especially when, as Hofstaedtler noted, usernames can be important to some users:
I see and type my username hundreds times a day, people use it to address me in written and spoken conversations with it, etc.
If it were my uid, which I see maybe once a week and don't have to remember, I wouldn't care.
Indeed, it's not uncommon in open-source communities or within organizations to use a person's username rather than their given name—so it is unsurprising that some people feel strongly that usernames should be composed of a wider range of characters than POSIX recommends. Others dislike the practice of conflating usernames with real-world names, and see little reason to go to any trouble to go beyond ASCII.
Johannes Schauer Marin Rodrigues supported allowing more than ASCII in usernames. He said it would be good for Debian to put pressure on other projects to provide Unicode support. "We cannot find these kind of bugs if we accept translating everybody's given name to the American alphabet."
Bálint Réczey, though, asked that Debian avoid opening that can of worms and imposing needless work on upstreams. "Keep what works reasonably well for decades."
A plan
Haber initially seemed bullish on allowing UTF-8 usernames in Debian "as a courtesy to those people who need non-ascii user names to write their name" and as an opportunity to find "bugs that are already here" in Debian's software. He acknowledged that it is late in the development cycle for trixie. But, since it was currently possible to create usernames with UTF-8 characters, he did not want to tighten restrictions in trixie versus Debian 12, only to revisit those restrictions for Debian 14. In a reply to Mollier he wondered about what advice to give in Debian's documentation "once we have decided to officially allow UTF-8 login names".
On December 3, however, Haber said that he "finally understood" that UTF-8 support would require more than the ability to create a UTF-8-encoded username and write it to /etc/passwd. Homograph characters, such as U+00E9 (é) and U+0065 plus U+0301 (é), could be used with adduser to create two separate users with lookalike usernames:
At the least, adduser should reject creating étienne if étienne already exists - those are different user names but look the same, and if you don't cut-and-paste user names instead of typing them you're bound to hit the wrong user depending on HOW you type and what input medium you use. Not good.
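The kind of check Haber describes might look roughly like the following Python sketch. It is an assumption-laden illustration: adduser is written in Perl, Haber mentioned a different module for this job, and the names canonical and find_lookalike are hypothetical. The sketch only catches names that are canonically equivalent under NFC; it does not detect lookalikes drawn from different alphabets.

```python
import unicodedata


def canonical(name: str) -> str:
    """Map a username to a comparison key (here simply its NFC form)."""
    return unicodedata.normalize("NFC", name)


def find_lookalike(new_name: str, existing_names):
    """Return the first existing name whose NFC form matches that of
    new_name, or None if there is no such name."""
    key = canonical(new_name)
    for existing in existing_names:
        if canonical(existing) == key:
            return existing
    return None


if __name__ == "__main__":
    # Hypothetical account list; one name is stored in decomposed form.
    existing = ["bob", "e\u0301tienne"]
    clash = find_lookalike("\u00e9tienne", existing)
    if clash is not None:
        print("refusing to create user: looks the same as", repr(clash))
```

A tool using this approach would run the check before handing the name to the lower-level account-creation command.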
Haber said that he was the only active developer working on adduser and did not have time to implement a check against lookalike usernames in time for the trixie release. Worse, he said, the Perl module that he would use (Unicode::Precis) was not packaged for Debian and had not had a release in more than five years.
The next version of adduser, Haber said, would reject UTF-8 usernames by default. They would still be allowed when using the --allow-bad-names option, but he said he wanted to deprecate that option name in favor of something that doesn't use the word "bad". The --allow-all-names option will continue to pass everything verbatim to useradd.
Mollier thanked Haber for his work on the problem, and suggested some alternatives to the bad names option. Barabucci also thanked Haber for taking the time to research the issue, to which Haber replied dryly, "I have learned many things."
Haber's current course of action for adduser seems the most prudent. There may be a day when it is more practical to expand the allowed characters for usernames, but the work required to do so right now is far greater than the benefits that users would gain in the process.
Posted Dec 5, 2024 17:22 UTC (Thu)
by isotopp (subscriber, #99763)
[Link] (6 responses)
If you allow utf-8 here, and relax length restrictions, it is unclear and unknowable what will happen downstream with other applications.
If you want to login as 'Kristian Köhntopp', it is probably useful to have an LDAP like name canonicalization mechanism that does a lookup to get a unix username and then tries the password with that. Anything else is very likely to break unexpected things.
In my personal opinion, even a --badnames option is wrong.
Or you go, and actually perform the work to define a username format for Unix (not just Linux), catalog use-cases and make sure that they actually work with full UTF-8, and whatever relaxed length limit you define. And then be prepared to handle a login with クリス (kurisu) instead of kris.
Posted Dec 6, 2024 9:33 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (2 responses)
Though I've now checked the docs and apparently it's possible to change the allowed characters in the configuration file so maybe that's a better approach for ansible deployed machines.
Posted Dec 6, 2024 19:28 UTC (Fri)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 12, 2024 12:29 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link]
$ chown user.group somefile
OK, somewhere along the Red Hat line, chown started being more picky and insisting on user:group, but even so, might there be legacy boxes out there sharing usernames via some centralized system?
Posted Dec 6, 2024 10:21 UTC (Fri)
by smurf (subscriber, #17840)
[Link]
Anyway. A little bit of safety should be in everybody's interest, i.e. no mixed-charset names, and use some normal form to check for existing usernames.
Writing the above sentences is significantly easier than implementing them. While cyrillic vs. greek definitely is a problem, but latin vs. CJK? not so much IMHO. Normalize to exactly which normal form using which version of the Unicode standard? What do I do on the console, type \U4E52\U4E53 instead of 乒乓? what if my username is "🧪420"?
On the other hand … I never type my username anyway. When logging in on the GUI I click on my avatar, when connecting to a remote system with SSH or whatever it's the default, and a fresh text-only console login is easy because there the username is "root". 😎
Posted Dec 6, 2024 19:14 UTC (Fri)
by rgb (subscriber, #57129)
[Link] (1 responses)
Posted Dec 6, 2024 19:35 UTC (Fri)
by raven667 (subscriber, #5198)
[Link]
Posted Dec 5, 2024 18:52 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (14 responses)
Posted Dec 5, 2024 20:57 UTC (Thu)
by zeha (subscriber, #61580)
[Link] (13 responses)
Yes.
Posted Dec 5, 2024 22:00 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Dec 9, 2024 9:28 UTC (Mon)
by taladar (subscriber, #68407)
[Link] (2 responses)
Posted Dec 9, 2024 17:53 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Dec 10, 2024 6:39 UTC (Tue)
by pvaneynd (subscriber, #898)
[Link]
Posted Dec 6, 2024 14:32 UTC (Fri)
by khim (subscriber, #9252)
[Link] (8 responses)
I would say it's “yes” and “no”, simultaneously. I have meet a lot of people who simply don't know English well enough to type name in ASCII. Unfortunately the majority of them I have meet when they cried on various forums about how unfair it is that they “have only just used Cyrillic (Arabic, Farsi, etc) name” – and now have so many broken programs they couldn't even count them all. Yes, it's deeply anglophonic, yes, it's unfair, true, people genuinely suffer if your force that on them… But the experience says that it's still better for them to lean 1 (one) English world (their account name) once then suffer through innumerable programs that don't support any other names properly.
Posted Dec 6, 2024 21:46 UTC (Fri)
by epk (guest, #174765)
[Link]
And it's not as though a non-Latin-alphabet username would really help that much, since so much text - especially in path names and URLs - is in English. There is, however, the full name of each user, and I'm guessing that should be much easier to have non-Latin UTF-8 in. And for non-computer-literate users who need a lot of hand-holding, they might actually see mostly/only their full names.
Posted Dec 7, 2024 10:18 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Dec 7, 2024 10:59 UTC (Sat)
by khim (subscriber, #9252)
[Link]
You wouldn't. Obviously people who couldn't type ASCII wouldn't ever do (and don't plan to do) anything in the command line. That's fine, the majority of computer users don't ever use command line and are not interested in the command line (many don't even know it exists). But even for them using non-ASCII letters in the $HOME is PITA. Simply because programs stop working – and changing $HOME temporarily brings another layer of pain.
Posted Dec 18, 2024 21:34 UTC (Wed)
by ssmith32 (subscriber, #72404)
[Link]
Yes, what a shame they can't just lean one world. Or just learn to live in one English world.
(yeah, cheap shot, but come on, if you're gonna get on a soapbox about folks learning to spell one word, you really should double check that you spelled *word" correctly, and avoid inadvertently proclaiming that there is one English World - it's the kind of thing that could end up really getting under a Scot's skin).
Posted Dec 21, 2024 21:18 UTC (Sat)
by steffen780 (guest, #68142)
[Link] (3 responses)
Similarly, until 2010 or so I would not use äöüß in filenames. Ever. To this day I still only use my native languages properly for low-risk "user-only" files - so I might use it for a LibreOffice file or a video, but I would not use it for a login username, anything in /etc, and so on. I just don't want the extra hassle. But I'm fairly advanced with IT - how is a typical user supposed to know that some software still can't handle such things, many DECADES after the problem was partially solved with Unicode? Do we really expect children today to learn a 1950s (!) encoding just so they know what characters they can use in a username? Surely there's more useful things that can be taught instead. E.g. pretty much anything else ;)
That being said: I wouldn't hold my breath for non-ASCII login usernames to become reliably usable with the infamous "long tail" of software. But huge progress has been made, and I think it's important to keep going.
Posted Dec 22, 2024 12:26 UTC (Sun)
by NAR (subscriber, #1313)
[Link] (1 responses)
we really expect children today to learn a 1950s (!) encoding
What they need to know is the English alphabet. And as English is the international language nowadays, we can expect them to learn this while they learn English. Besides, we're using lot of stuff "hardcoded" in the previous centuries, from the metric system to normal gauge, the Latin alphabet itself, etc. the list of characters in the original ASCII charset is just one of them.
Posted Dec 22, 2024 13:55 UTC (Sun)
by zdzichu (subscriber, #17118)
[Link]
I wouldn't be surprised, given the number of Polish people in the UK.
Posted Dec 23, 2024 9:25 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
I wouldn't call that "solved it many decades ago".
Posted Dec 5, 2024 19:11 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (14 responses)
Posted Dec 5, 2024 20:25 UTC (Thu)
by storner (subscriber, #119)
[Link] (2 responses)
The local-part portion is a domain-dependent string. In addresses,
"Domain-dependent" means that there are really no rules as to which characters can be used. It can even be quoted to allow whitespace.
Posted Dec 5, 2024 21:02 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
local-part = dot-atom / quoted-string / obs-local-part
is at the beginning of the RFC page which contains the statement
The local-part portion is a domain-dependent string.
The claim that domain-dependent would mean "no requirements" is thus obviously wrong. dot-atom and quoted-string are defined in sections 3.2.3 ("Atom") and 3.2.4 ("Quoted Strings"). Drilling down to the actual character set specifications always ends with a subset of ASCII, the most liberal one being the one for quoted strings which includes whitespace and all printable characters, ie, codepoints 32 - 126.
Posted Dec 5, 2024 21:04 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
Posted Dec 6, 2024 4:38 UTC (Fri)
by jheiss (subscriber, #62556)
[Link] (10 responses)
Posted Dec 9, 2024 13:24 UTC (Mon)
by zdzichu (subscriber, #17118)
[Link] (9 responses)
I'll remove this alias next week. Until then you can email me at above address to check your email setup ;)
Posted Dec 9, 2024 14:42 UTC (Mon)
by geert (subscriber, #98403)
[Link]
| error: unable to extract a valid address from: 😒@pipebreaker.pl
Where's the "[f]orce" option? ;-)
Posted Dec 9, 2024 15:51 UTC (Mon)
by dskoll (subscriber, #1630)
[Link]
The Postfix server on my LAN had no problems with 😒@... but my Sendmail relay host rejected it.
Posted Dec 9, 2024 17:21 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (2 responses)
Posted Dec 9, 2024 18:24 UTC (Mon)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
I also got one from Alejandro, but did not manage to reply with fancy From:
SMTPUTF8 is required, but was not offered by host smtp3.kernel.org[44.230.10.245]
Posted Dec 9, 2024 20:05 UTC (Mon)
by raven667 (subscriber, #5198)
[Link]
The reply to Gmail with an ASCII address worked but the utf8 From reply was also filtered into Junk, but it did work.
I'm guessing that fake bounce messages are more commonly used for spam/phishing than real notifications which is why they are Junked repeatedly on totally different systems.
Posted Dec 9, 2024 18:24 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link]
Posted Dec 10, 2024 8:18 UTC (Tue)
by SiB (subscriber, #4048)
[Link] (1 responses)
Posted Dec 10, 2024 8:27 UTC (Tue)
by SiB (subscriber, #4048)
[Link]
=?utf-8?Q?=3C=F0=9F=98=92?=@pipebreaker.pl>: malformed address: >
Somehow, gnus was confused by the <> around the address. Without <>, exim4 did a graylisted delivery attempt.
Posted Dec 10, 2024 11:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
I've tried this with Exim and KMail; I can send just fine, but because I've not turned on Exim SMTPUTF8 support (in part because I need to test what Cyrus IMAPd thinks of UTF-8 in local parts), no reply comes through.
Posted Dec 5, 2024 19:19 UTC (Thu)
by rhowe (subscriber, #102862)
[Link] (12 responses)
Now, these users do not exist in the passwd file and therefore aren't created via useradd or adduser so this isn't directly relevant to the issue being discussed here, but it is certainly legitimate for usernames to contain "funky" characters and indeed potentially problematic ones. For example, if something were to treat the backslash as an escape character then all sorts of fun could occur from injecting of newlines into logs to injection of null terminators. Inadequate quoting in shell scripts being a prime example.
Also, both the domain and username portions are determined by the records in Windows' Active Directory and therefore need to follow the rules for that system. For the 'sAMAccountName' field, it's documented at https://learn.microsoft.com/en-us/windows/win32/adschema/... where interestingly it's defined as a Unicode string but not containing any of: "/ \ [ ] : ; | = , + * ? < >
Posted Dec 5, 2024 19:31 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (11 responses)
Posted Dec 5, 2024 19:40 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (10 responses)
The local-part of your email address doesn't have to be your UNIX user name, though. It often is for convenience, but while the local-part of my email address is dianne, that is not my UNIX login name.
So appealing to email as a reason to restrict UNIX login names is not a great argument. I think a better argument is simply to make life easier for programs that need to deal with login names and that don't want to worry about UTF-8 canonicalization, etc.
Posted Dec 5, 2024 19:51 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (9 responses)
Posted Dec 5, 2024 20:57 UTC (Thu)
by zeha (subscriber, #61580)
[Link] (2 responses)
Posted Dec 5, 2024 21:09 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
Posted Dec 6, 2024 19:43 UTC (Fri)
by raven667 (subscriber, #5198)
[Link]
I think it's worth the effort to identify and fix those programs so people can use their real name for display in the way they prefer to see it regardless of what language they use. If there is no one maintaining a particular MTA or MUA or whatever that breaks because of this, then you've learned that unmaintained software eventually breaks when the world changes around it, but this kind of change could be eased into over several release cycles by making it optional while bug reports and testing are done, before accepting it as the default and a blocker.
Posted Dec 5, 2024 21:23 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (5 responses)
No, that would not be fun, but still... appealing to email addresses as a reason to restrict usernames isn't a good argument. Some email systems store email in ways that don't necessarily depend on UNIX login names at all (for example, Cyrus IMAP.)
Posted Dec 5, 2024 21:59 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (4 responses)
Posted Dec 5, 2024 23:26 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
I think the birth of email actually predates the birth of Unix?
Cheers,
Posted Dec 5, 2024 23:47 UTC (Thu)
by KJ7RRV (subscriber, #153595)
[Link]
I think I'm misunderstanding this part? It seems to mean that all UNIX systems are email servers; is that correct?
Posted Dec 6, 2024 0:15 UTC (Fri)
by dvdeug (guest, #10998)
[Link] (1 responses)
A fully POSIX-compliant UNIX system has an email system, though in the modern world, very few UNIX systems are connected to Internet email. I wouldn't say it's not UNIX if it doesn't have an email system. I removed mailutils, mailx, and mailcap from my Debian unstable system, and nothing depended on them. The concept of open access to email via Internet has been lost, and system-wide email isn't very useful on a single-user system.
Posted Dec 6, 2024 4:42 UTC (Fri)
by KJ7RRV (subscriber, #153595)
[Link]
Posted Dec 6, 2024 4:55 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
Just ask these people to stop. Then, do as many other people do and simply treat _both_ usernames and uids as low-level implementation details; that's what they are.
Asking all programs in the universe to agree on some UTF-8 subset for usernames is totally unrealistic. This discussion and article barely scratch that surface.
> I see and type my username hundreds times a day
Not sure what the problem is here. Surely, anyone can find something in ASCII that's not unpleasant to look at?
The simple and reliable way forward is to allow UTF-8 in some non-key, free-form, pure display field like "gecos" or similar and pressure applications to display that in User Interfaces and as many places as possible - while still relying on portable, unique and bug-free ASCII usernames in code and other implementation details. Isn't it what's happening already?
Posted Dec 6, 2024 5:03 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (27 responses)
A French "fun fact" is that many French people wrongly believe that É, À, Ù etc. "do not exist" because... the default _Windows_ keyboard layout for France makes these incredibly hard to type! fr_FR layouts on Mac and Linux are not affected and neither are some other French-speaking countries.
é/É is one of the most common characters in French.
Note this is pure software issue: there's no relevant, physical difference between Windows and Macs keyboard.
https://www.google.com/search?q=majuscules+accentu%C3%A9es
Even more fun: you can tell whether newspapers and other editors use Windows or not by simply looking at their front page. Examples:
https://www.lemonde.fr/ -> Économie
https://www.liberation.fr/ -> Economie
Posted Dec 6, 2024 6:49 UTC (Fri)
by victrid (subscriber, #163116)
[Link] (3 responses)
In fact, you can type Japanese characters directly on the keyboard, but you cannot expect to see them in text mode. Supporting CJK characters included in UTF-8 is too complicated compared to supporting Latin-1.
Imagine desperate ops logging in to rescue via the console and nothing except blank diamond symbols can be displayed.
Posted Dec 6, 2024 11:30 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link] (2 responses)
Posted Dec 6, 2024 19:59 UTC (Fri)
by wahern (subscriber, #37304)
[Link]
Posted Dec 7, 2024 3:05 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Not in the pure text mode and even with graphical framebuffers it's hit-and-miss.
Posted Dec 7, 2024 13:09 UTC (Sat)
by geuder (subscriber, #62854)
[Link] (22 responses)
I studied French as a foreign language in school about a decade before Windows existed. We were taught that on capital letters accents are completely optional, leaving them out is not a mistake.
Last time I looked it up I found there is a strong recommendation to use accents on capital letters, too.
Of course computers should allow you to follow recommendations.
Unfortunately Unicode is a train weck when it comes to security / canonicalization issues, so in real life the whole world needs to restrict them to limitations anglophonic computer programmers imposed on them when ASCII was invented. For URLs I find this much worse than for user names.
Posted Dec 8, 2024 1:36 UTC (Sun)
by marcH (subscriber, #57642)
[Link] (21 responses)
> No idea whether the same has ever been told in French schools.
It was but it's a bit more complicated.
French schools have always taught _cursive_. It's faster when you know it. I think they still do. I've read that other countries with Roman languages tend to teach cursive too? I don't think upper cursive ever had accents for some reason and that's indeed what they use to teach in schools. BTW lower cursives are still standard in France but I think upper cursives are dying. You can still find samples easily; just search for "cursive majuscules".
But serious professionals in the _printing_ industry never stopped using accents on capitals. All dictionaries always had them and all professional guidelines always required them.
I've many times "stunned" French people who were arguing for the lack of accents on capitals by simply... opening a dictionary or book found on their own shelves. I think taht's bucaese fnleut rerades do not pay aoietttnn to ivainuddil cacrahetrs.
BTW Macs have been dominant in the publishing industry for a long time, not sure now.
There is also a subtle difference between "majuscule" (grammar) and "capitale" (typography) that only professionals tend to know. That difference shows when you write the initial character with a "big" capital and the rest in small caps. Most people in France make no difference and just say "majuscule" for both.
> Unfortunately Unicode is a train weck when it comes to security / canonicalization issues, so in real life the whole world needs to restrict them to limitations anglophonic computer programmers imposed on them when ASCII was invented.
+1, writing is everything but an exact science; don't let it pollute code and hard logic or you'll be in a world of pain.
It may feel good trying not to be an evil American dominating world culture once again but the cold truth is: ASCII is universal also because it's dead SIMPLE; not just because it came from the top superpower.
Posted Dec 8, 2024 3:38 UTC (Sun)
by KJ7RRV (subscriber, #153595)
[Link]
Interestingly, my phone puts the diacritics on Ἔ above the character before it; I'm not sure if that's correct formatting or a bug. It might be part of the reason why diacritics aren't used in all-caps text; when they're used on initial capitals, they're (almost, at least) always used after a space, whereas in all-caps, they would not be.
Posted Dec 8, 2024 15:44 UTC (Sun)
by ballombe (subscriber, #9523)
[Link] (19 responses)
Posted Dec 8, 2024 16:18 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (18 responses)
As someone only peripherally aware of the French 80s computer options; did the home grown systems (Minitel, Thomson MO5 and TO-7 plus other Nanoréseau machines, Groupe Bull minis and mainframes etc) provide convenient ways to enter accented majuscules, or were they also restricted to minuscules?
Posted Dec 8, 2024 19:39 UTC (Sun)
by marcH (subscriber, #57642)
[Link] (17 responses)
Pictures and documents clearly show that accents were available as dead keys (no dedicated key for é or other)
Section 3.3.2 of this scanned specification mentions accented capitals explicitly:
The encoding was apparently ISO 2022 G2
So unlike Windows, it looks like accented capitals were not missed by the Minitel!
Posted Dec 10, 2024 20:42 UTC (Tue)
by rschroev (subscriber, #4164)
[Link] (16 responses)
Or is the issue here that people often don't know how to use dead keys these days?
Posted Dec 11, 2024 9:58 UTC (Wed)
by taladar (subscriber, #68407)
[Link]
Posted Dec 11, 2024 12:30 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (10 responses)
That's almost certainly true for the Anglo-Saxon world. There's nothing on my (105-key UK keyboard) that looks like a "compose" key, and I wouldn't know where to start. I used to have a Cyrillic keyboard, but that had the old DIN connector, so is long gone ...
Cheers,
Posted Dec 11, 2024 12:49 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (9 responses)
A compose key is a mode switch for the keyboard; the next few (at least two) keypresses are combined into a single character. I've configured Shift-CapsLock as a compose key, so I typed è as Shift-CapsLock, e, `.
Posted Dec 11, 2024 12:55 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (8 responses)
Cheers,
Posted Dec 11, 2024 13:06 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
Exactly, and it results in a different set of tradeoffs. A compose key is strictly more flexible than dead keys, since I can type things like Compose o c, and get ©, and both Compose , c and Compose c , get me ç. A dead key is clearer to the user - you type accent, base character, and always get the combination of the accent with the base character.
Posted Dec 12, 2024 0:16 UTC (Thu)
by geuder (subscriber, #62854)
[Link] (1 responses)
Posted Dec 12, 2024 11:20 UTC (Thu)
by yaap (subscriber, #71398)
[Link]
Posted Dec 11, 2024 23:10 UTC (Wed)
by rschroev (subscriber, #4164)
[Link] (3 responses)
On those mechanical typewriters the vertical position of the dead key accents is fixed obviously, on the correct height for lowercase letters but not high enough for uppercase letters. The typewriter won't stop you (and can't stop you) from using a dead key in combination with uppercase letters, but the result will not be satisfactory.
Computers are smarter than mechanical typewriters and can produce the correct glyph, with the correct vertical position of the accent to match the letter it's combined with. On computers the user experience is different though: when you press a dead key, nothing seems to happen. No output appears. Only on the next key press is output generated. What that output looks like depends on the combination. For example, if I press ´ followed by a, I get á. But if I press ´ followed by space I simply get ´, and combined with z it gives me ´z (because z with an acute accent doesn't exist, I guess).
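To make the behaviour described above concrete, here is a minimal sketch of a dead-key state machine. The table is a hypothetical subset, not any real layout's data, and `type_keys` is a name made up for illustration. The point is only that the layout's table decides whether a pair combines, and that an unmapped pair falls back to the accent followed by the letter.

```python
# Hypothetical subset of a layout's dead-key table: (dead key, base) -> result.
DEAD_KEY_TABLE = {
    ("´", "a"): "á",
    ("´", "e"): "é",
    ("`", "e"): "è",
    ("^", "e"): "ê",
}
DEAD_KEYS = {"´", "`", "^", "¨", "~"}


def type_keys(keys):
    """Return the text produced by a sequence of key presses."""
    output = []
    pending = None  # dead key waiting for the next key press
    for key in keys:
        if pending is None:
            if key in DEAD_KEYS:
                pending = key  # nothing is emitted yet
            else:
                output.append(key)
            continue
        if key == " ":
            output.append(pending)  # dead key + space gives the bare accent
        elif (pending, key) in DEAD_KEY_TABLE:
            output.append(DEAD_KEY_TABLE[(pending, key)])
        else:
            output.append(pending + key)  # no mapping: accent, then the key
        pending = None
    return "".join(output)
```

With this table, `type_keys(["´", "a"])` combines the two presses into one character, while `type_keys(["´", "z"])` returns the accent followed by a plain z, because the table has no entry for that pair.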
Belgian azerty has a number of dedicated keys for the most common letters with accents, so you don't need to use dead keys for those (though you could, if you wanted to): é è à ù (ù doesn't seem all that common to me though; I would think ê is more common). All other accented characters require the use of one of the dead keys (^ ¨ ´ ` ~). ^ is a special case in that it appears twice on the keyboard: once as a dead key to produce e.g. ê , and once as a normal key to produce ^ (which I can also produce by first pressing the dead key ^ followed by space, just as I can with all the other dead keys). I don't know why ^ is special enough to get two appearances.
Side note: "Dedicated" is not entirely the correct term here: all those keys produce other letters when combined with Shift, and often also with AltGr. For example, é is on the same key as 2 (Shift) and @ (AltGr) (yes, azerty keyboards require Shift to type digits, which is why people using azerty have a somewhat higher tendency to use the numerical keypad for numerical entry).
Second side note: The term "dead key" is not exactly correct either, since being dead or not is not a feature of the key itself anymore like it was on mechanical typewriters. For example the key with ^ produces a very normal non-dead [ when used with AltGr. There are no fully dead keys anymore (on Belgian azerty, at least).
(That's probably more than you wanted to know about dead keys and accents on azerty keyboards)
Posted Dec 20, 2024 21:55 UTC (Fri)
by sammythesnake (guest, #17693)
[Link] (2 responses)
I imagine it would be optimistic to expect you to have such a typewriter on hand to experiment with, but the first thing I'd try would be to use the SHIFT key while pressing the accent key...
Posted Dec 21, 2024 12:05 UTC (Sat)
by johill (subscriber, #25196)
[Link] (1 responses)
Posted Dec 21, 2024 16:42 UTC (Sat)
by rschroev (subscriber, #4164)
[Link]
Posted Dec 12, 2024 9:06 UTC (Thu)
by MortenSickel (subscriber, #3238)
[Link]
But I have no idea if the first letters in the comments come out as an accented letter or as an accent+letter combo.
Posted Dec 12, 2024 5:23 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (3 responses)
Not for é (extremely common character) with the default Windows layout in France. é has its own key. It becomes É with caps lock on a mac but it becomes a number on Windows. There are some dead keys available but not for the acute accent.
There are apps that let you visualize any keyboard layout.
Posted Dec 20, 2024 22:05 UTC (Fri)
by sammythesnake (guest, #17693)
[Link] (2 responses)
E.g. pressing the "2" key will do the same with or without caps lock, but shift will change that to a double quote or @ symbol or whatever.
I don't ever use windows, but I'd like to see if the "é" key does the same thing with shift as with caps lock, or something different. My (possibly optimistic) guess is that caps lock would give you the Élusive character :-P
Posted Dec 21, 2024 15:10 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
I seem to remember one computer layout that had both, certainly shift-lock can be damn useful and its (apparent) lack on modern keyboards could be a pain. I don't feel that any more, it's been too long ago, but if I had it back I'd probably find a use for it :-)
Cheers,
Posted Dec 21, 2024 18:00 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
With a France keyboard, it depends whether you use mac or Windows.
> I don't ever use windows, but I'd like to see if the "é" key does the same thing with shift as with caps lock, or something different.
On Windows it does the same thing. That's the problem and one of the reasons why É is so hard to get on Windows.
> My (possibly optimistic) guess is that caps lock would give you the Élusive character :-P
Instead of wrongly guessing, you could just search the internet or open one of the references already mentioned above.
Posted Dec 6, 2024 19:41 UTC (Fri)
by rgb (subscriber, #57129)
[Link]
Posted Dec 6, 2024 20:48 UTC (Fri)
by gioele (subscriber, #61675)
[Link] (1 responses)
Later in the thread [1] Michal Politowski pointed out that RFC 8265 "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" and its sibling RFC 8264 "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols" do in fact describe which normalization forms should be used when comparing Unicode usernames (as well as a number of other low-level details).
[1] https://lists.debian.org/debian-devel/2024/11/msg00507.html
Posted Dec 7, 2024 13:19 UTC (Sat)
by geuder (subscriber, #62854)
[Link]
Posted Dec 6, 2024 23:56 UTC (Fri)
by peter-b (guest, #66996)
[Link] (1 responses)
UAX #31 Unicode Identifiers and Syntax
Please use it instead of endlessly arguing over a problem already solved by domain experts.
Posted Dec 7, 2024 5:40 UTC (Sat)
by gioele (subscriber, #61675)
[Link]
Annex 31 is not prescriptive enough. For example, when it comes to normalization forms:
> UAX31-R4. Equivalent Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any.
Instead, PRECIS and RFC 8265 "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" are more prescriptive and actionable. From <https://www.rfc-editor.org/rfc/rfc8265.html#section-3.3.1>:
> 4. Normalization Rule: Apply Unicode Normalization Form C (NFC) to all strings.
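As a rough illustration of why that rule matters, here is a sketch of comparing usernames after NFC normalization, using Python's standard `unicodedata` module. It covers only the normalization rule quoted above and does not attempt the other rules of the RFC 8265 profiles; `normalize_username` and `same_username` are names invented for this example.

```python
import unicodedata


def normalize_username(name: str) -> str:
    """Apply Unicode Normalization Form C (NFC) to a username."""
    return unicodedata.normalize("NFC", name)


def same_username(a: str, b: str) -> bool:
    """Compare two usernames after NFC normalization."""
    return normalize_username(a) == normalize_username(b)


# The same visible name, typed two ways:
composed = "andr\u00e9"      # é as a single code point (U+00E9)
decomposed = "andre\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)
```

Here `composed == decomposed` is false, since the code points differ, but `same_username(composed, decomposed)` is true. Without a normalization step, two accounts could have names that look identical on screen.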
Posted Dec 8, 2024 1:44 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
No, it's very easy! Look:
> Most Debian users don't work with useradd, or groupadd, directly. Instead, Debian has long supplied its own adduser (and addgroup) utilities. These act as simpler front ends to useradd.
Couldn't resist sorry.
Posted Dec 9, 2024 9:43 UTC (Mon)
by taladar (subscriber, #68407)
[Link] (18 responses)
I am thinking of examples like scripts breaking because systems unnecessarily auto-switch output of basic unix utilities to use commas instead of periods or to use column headers in a different language, imports failing because a locale set a different default character set, admins and users looking for error messages online but not finding any results because the results are partitioned by language of the error message or even ridiculous examples like the VBA keyword translations,...
Don't get me wrong, for user-facing input and output of course it should be supported to display and produce content in every language but some people just take it too far into the low level details without thinking about the question if the natural language used is even a significant barrier to the intended audience and which negative impacts translations can have.
Posted Dec 9, 2024 10:28 UTC (Mon)
by mbunkus (subscriber, #87248)
[Link] (16 responses)
Posted Dec 9, 2024 12:22 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
Like it or not, there is a reason a lingua franca is common in many fields among experts and that reason is that the only thing worse than content in a single language you don't speak is content split over dozens of languages where each speaker speaks none of the others.
Posted Dec 9, 2024 17:10 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (11 responses)
What might be useful in that case is to promote the use of unique ASCII identifiers for error messages that make them searchable across languages or text edits, eg %FOO-PLORT-12345: this is a well worn solution to this problem
Posted Dec 10, 2024 11:14 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (5 responses)
Posted Dec 10, 2024 11:46 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (4 responses)
The well-worn solution has messages that look like %FOO-PLORT-12345:"filename","example.com","2001:db8:1::42/64"%. The idea is that you look up %FOO-PLORT-12345 in your catalogue of possible messages, and get told that it's "could not download {1} over HTTP from https://{2}/ (resolved IP {3})". You can then fill in the parameters (by hand, back in the day, computer can do it now), and discover what the error meant.
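A small sketch of that lookup-and-fill step, assuming the message grammar is exactly what the single example above shows (an identifier, a colon, comma-separated quoted parameters, a closing `%`). The catalogue dictionary, the regular expression and the `expand` function are all made up for illustration and are not taken from any real system.

```python
import csv
import io
import re

# Hypothetical catalogue; in practice this is the part that gets translated.
CATALOGUE = {
    "%FOO-PLORT-12345": (
        "could not download {1} over HTTP from https://{2}/ (resolved IP {3})"
    ),
}

# Assumed grammar: %FACILITY-NAME-NUMBER:"param","param",...%
MESSAGE_RE = re.compile(r"^(%[A-Z]+-[A-Z]+-\d+):(.*)%$")


def expand(raw: str, catalogue=CATALOGUE) -> str:
    """Look up the message identifier and fill in its parameters."""
    match = MESSAGE_RE.match(raw)
    if match is None:
        raise ValueError(f"not a catalogue message: {raw!r}")
    code, params = match.groups()
    # csv handles the quoted, comma-separated parameter list.
    fields = next(csv.reader(io.StringIO(params)), [])
    # Placeholders are numbered from 1, so pad position 0.
    return catalogue[code].format(None, *fields)
```

The identifier stays the same whatever language the catalogue text is in, which is what keeps it searchable.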
Posted Dec 11, 2024 9:45 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (3 responses)
Posted Dec 11, 2024 10:50 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
The UNIX world never went this way; I encountered it interacting with mainframes and minicomputers, back 30-odd years ago.
Posted Dec 11, 2024 12:13 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Dec 11, 2024 18:42 UTC (Wed)
by raven667 (subscriber, #5198)
[Link]
https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/16_xe/s...
found an example from some IBM system that is this style where every log is numbered
https://publibz.boulder.ibm.com/epubs/pdf/ispzmc90.pdf
grabbing one at random ISRB0001 is searchable and leads to further docs https://www.ibm.com/docs/en/zos/2.4.0?topic=codes-ispf-me... which would be a searchable tag even if the text of the message was localized or changed between versions
Posted Dec 10, 2024 17:23 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link] (4 responses)
It's kind of hard to know how many people across the world do not speak English. There are several statistics out there that say that up to 1.45 billion people do speak English[1], but there are 8.2 billion people across the globe (or something like that). For whatever reason. Lack of education (or even educational possibilities), too young, too old, learning disabilities, socio-economic pressure & limitations etc. etc. "Just learn English" is not going to cut it just yet, maybe never.
For example, I started using computers when I was eight, I think. I was able to learn to program in it because the manual it came with was in German, even though the software itself was in English. I could not speak English at that point, but having documentation in my native language enabled me to at least associate several English words (PRINT, IF…) & short phrases (SYNTAX ERROR IN…) with their German counterparts, but only because I had the German stuff to learn from. If that hadn't been available, I might only have started doing stuff with computers years later if ever at the scale I'm doing it now. Having stuff available in your own language enables you to learn, to use, to create. Saying things like "everyone needs to learn English in our field" and "i18n has cost businesses a lot" is really thinking from inside a certain bubble, and it's really excluding & limiting.
All I'm asking for here is to be more open to make software, especially Open Source software, available and usable to all, not just the English-speaking system admin clique.
[1] https://www.statista.com/statistics/266808/the-most-spoke...
Posted Dec 11, 2024 9:57 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (3 responses)
I am absolutely in favor of translating interfaces used by laymen (but only those parts they want skip over anyway like error messages) and documentation.
I have the opposite experience to yours though, when I was younger, in the 1990s, a lot of computer books were translated by clueless translators so every publishing house had a different German version of the standardized English IT terminology and some of the coding examples in programming books were broken because the translators didn't understand how to translate e.g. a regex replacing part of a string.
Similarly, even in entertainment media, once I learned English I noticed how many of the German dubs contain English idioms that do not exist in German and were just translated word for word (presumably to make the lip-sync work).
I am also not talking about the cost to business here, I am talking about the cost i18n has to the communication itself by making that worse, not the financial cost.
Posted Dec 11, 2024 12:20 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
This! As someone whose German is passable, and whose French has mostly been forgotten (plus ancient smatterings of Russian and Khmer), so much information is passed *by reference* in conversation, that if you're not a native speaker it's extremely easy to miss what is actually being said. Or (as has happened to me) the "meaning as written" can be very different to the "meaning as understood", so you end up saying something completely different from what you thought you had said!
Cheers,
Posted Dec 11, 2024 16:50 UTC (Wed)
by mbunkus (subscriber, #87248)
[Link] (1 responses)
You're trying to enforce permanence on human language here. Error messages may change for a number of reasons, including them being unclear or even plain wrong, having to be extended to include additional information, include examples to the user how to fix the error/use the program correctly, or just stylistic changes. Even error messages written in English might contain non-ASCII characters if they include user-generated content, and that might not even be validly encoded (e.g. a file name). Note that all of those can happen with English as well.
If you want "I don't want to have to change my things, ever", then you're in well-trodden territory of e.g. REST APIs & similar. Argue for your low-level tools to implement best practices from those APIs, including:
- structured, versioned output
That gets you everything you want while also allowing the tools to be translated, their messages changed in whatever way, to be easier to use by more people. This is something that I would very much like to see as well.
As for two examples, the "ip" tool & the "restic" backup command have JSON output in addition to the well-known, default human-readable one. It's easy to handle. Unfortunately in both cases error messages (and in the case of Restic certain verbose status messages) are still printed as human-readable messages instead of using JSON for it as well, falling short of what I'd like to see.
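For illustration, here is a sketch of what structured, versioned output with a stable error code next to a translatable message could look like. The field names, the `SCHEMA_VERSION` constant and the `render_result` helper are assumptions made for this example; they are not the actual JSON format of "ip" or "restic".

```python
import json

SCHEMA_VERSION = 1  # hypothetical; bumped when the structure changes


def render_result(status, error_code=None, message=None, data=None):
    """Serialize one tool result as a JSON document.

    Keys and error codes stay fixed (and in English); only the
    human-readable message is meant to be translated or reworded.
    """
    doc = {"version": SCHEMA_VERSION, "status": status}
    if error_code is not None:
        doc["error"] = {"code": error_code, "message": message}
    if data is not None:
        doc["data"] = data
    return json.dumps(doc, ensure_ascii=False, sort_keys=True)


# Same stable code, two message languages:
english = render_result("error", "E_NO_SUCH_USER", "no such user")
german = render_result("error", "E_NO_SUCH_USER", "Benutzer nicht gefunden")
```

A script would match on `error.code` and ignore `error.message`, so rewording or translating the message does not break it.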
Posted Dec 16, 2024 10:28 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
Unfortunately as long as you have some sort of output that isn't fully pre-specified (like an enum) but a free form value you would then soon get the feature request to translate those parts of the output too because someone wants to build some sort of user-facing UI based on the machine-readable output.
My argument is more that certain messages should not be translated because translations are literally hurting communication when compared to the use of a single language.
Posted Dec 9, 2024 18:43 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Dec 10, 2024 11:18 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (1 responses)
Posted Dec 11, 2024 6:33 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I maintain a couple of small projects, and I actually communicated with Arabic speakers via a translator.
Posted Dec 9, 2024 14:47 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
And yet it should be so easy to fix ... I use lilypond, whose default language is DUTCH. Yet it works in English fine (except all the docu is in American :-)
You just need something like "#pragma English" or whatever, to say what language the keywords are. lily has "#include english.ly", which redefines all the notes as the American names (mostly the same as English). And you could redefine everything else if you chose - although it does help that most music terms are universal (and Italian!).
Cheers,
Posted Dec 9, 2024 18:30 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (6 responses)
With that said, I don't understand why some people want their *name* as a user name. Maybe just because that's the way it's presented. Would it be called "a system identifier", it wouldn't be a problem at all. I've had logins made of one letter and 6 digits for many years and nobody complained at all. It could be said that as a convenience, since the system supports this or that alphabet, you're free to select a system identifier that more or less looks like your name provided that it's available, and that would be fine.
The problem really seems to be how it was presented to users in the first place.
Also those trying hard to get access to "root" (or "administrator" in some other environments) suddenly love their new permissions regardless of the accepted character set to write these identifiers.
Maybe the situation could progressively be reversed by changing the way tools present these logins to call these "system identifiers" exclusively and recalling the list of allowed chars at creation time.
Posted Dec 10, 2024 7:13 UTC (Tue)
by micka (subscriber, #38720)
[Link]
Posted Dec 10, 2024 7:50 UTC (Tue)
by mb (subscriber, #50428)
[Link] (3 responses)
Well, because it's easy to remember.
What would help is if the system would never actually ask for the user name, if there's only a single (non-system) account.
Posted Dec 10, 2024 8:08 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
> Just automatically create the user "main" in the background while installing and never tell the user about it, unless a second account is created
I don't know how (un)usual my setup is, but my main system is a desktop. And because I hate people screwing up my defaults, it's a very firm policy that EVERYone has their own account on that system. I suspect that is pretty much a normal setup for geeks ...
Security may be *****, but as it's a home system we're not worried about family members.
So that policy wouldn't work for us, or for a lot of other people I suspect ...
Cheers,
Posted Dec 10, 2024 9:41 UTC (Tue)
by mb (subscriber, #50428)
[Link]
It's just not how normal people use computers, though. See Raspberry Pi OS. It just creates a "pi" user and doesn't bother the user with details. It would be even better, if it never showed this to the user, as long as pi is the only name.
Posted Dec 10, 2024 13:13 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link]
BTW, just look here on LWN: most logins are short (including yours which is among the shortest). When I registered I tried "willy" and it was already taken by Matthew so I switched back to something simple. In any case I needed to note it somewhere, so I could have used anything else, like people have on gmail for example.
Posted Dec 10, 2024 17:09 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link]
Easier to remember. Emotional attachment. The fact that outside of computers you usually do use your name & not some arbitrarily restricted identifier to refer to yourself. Far less understanding for technical quirks & anachronisms in the general, not-too-tech-savvy public.
When I see some of my relatives having problems remembering their PINs for their EC cards which they use several times a week, I'm sure they're not keen on having to remember arbitrary identifiers for websites on end.
There are plenty of reasons. I guess you wouldn't consider them valid or important enough. Others may disagree.
Posted Dec 10, 2024 13:08 UTC (Tue)
by alx.manpages (subscriber, #145117)
[Link] (3 responses)
- How many people are currently using --badname (or the equivalent in the wrapper programs)?
I suspect the number is low.
Those that understand computers most likely avoid it,
Even more so if they have ever used different keyboards for input.
There might be some people that do use UTF8 symbols in their usernames,
- How many of the people already using it would continue using it if you explain them the possible consequences?
Then, can we justify discussing support for one feature that has been available for a very long time (in Debian) with close to 0 users --especially those informed--, where we suspect it's quite dangerous?
Have we learnt something from allowing \n in file names?
Posted Dec 12, 2024 13:43 UTC (Thu)
by MortenSickel (subscriber, #3238)
[Link] (2 responses)
How many people are each and every day annoyed that there is a limit on allowable characters in your username?
Since usernames usually are closely connected to your given name, for my (Norwegian) friend Bjørn, the name bjorn is not his real name, and although bjoern to a certain degree is a correct spelling, it feels wrong, and for people using completely different character sets such as Cyrillic, Arabic or ... it is even worse.
So the answer is clearly that usernames should allow UTF-8. As this article has clearly shown, it is not an easy path to get there, but hopefully we will one day. The way it is today is more or less a user name standard saying "ASCII ought to be enough for anybody"
Posted Dec 16, 2024 10:33 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
Posted Dec 17, 2024 14:11 UTC (Tue)
by tao (subscriber, #17563)
[Link]
Anything but POSIX portable filename set with a conservative length restriction is dangerous
I think that says it all. Unicode is made to display text, not to create IDs.
Doesn't the GECOS field already cover some of this use case?
The main cause of this is the https://en.wikipedia.org/wiki/Han_unification in Unicode, which maps different Chinese, Korean, Japanese and Vietnamese characters to the same Unicode code point. So the whole "let's just use UTF-8" isn't remotely enough :(.
Or just lean on one word to avoid learning English.
Or something.
One of the restrictions I set when we chose our children's name was to avoid accented characters - for the very same reason, to avoid possible problems during travel. For various reasons I lifted this restriction for our third child - of course he was the one born abroad :-) I was very (and pleasantly) surprised when the British clerk managed to produce a proper ó for the birth certificate - I think she saved us quite a headache.
Once upon a time in the past ...
it is simply interpreted on the particular host as a name of a
particular mailbox.
| What to do with this address? ([q]uit|[d]rop|[e]dit):
<😒@pipebreaker.pl>: host 192.168.xx.yy[192.168.xx.yy] said: 501 5.1.3 8-bit
character in mailbox address "<p???@pipebreaker.pl>" (in reply to RCPT TO
command)
GNUS
Address ‘😒@pipebreaker.pl’ (=?utf-8?Q?=F0=9F=98=92?=@pipebreaker.pl) might be bogus. Continue? (y or n) y
Sending...
Sending via mail...
message-send-mail-with-sendmail: Sending...failed to 2024-12-10 09:17:47 1tKvRD-000000003Sv-2FNl bad addresses found in headers;
Exim4
may not follow =?utf-8?Q?=3C=F0=9F=98=92?=@pipebreaker.pl
Real-world non-alphanumeric usernames
The more modern userPrincipalName attribute is defined as following RFC822 which is not very helpful given the broad nature of that RFC: https://learn.microsoft.com/en-us/windows/win32/adschema/...
Wol
UNIX and email
usernames are a low-level implementation detail
French people who believe É does not exist
No idea whether the same has ever been told in French schools.
https://fr.wikipedia.org/wiki/Capitale_et_majuscule
Diacritical marks on capital letters
1. Usual cursive majuscules do not carry accents.
2. French keyboards (azerty) have labels for accented minuscules but not for accented majuscules, so as a result a lot of people do not know how to input them at all.
https://www.minitel-alcatel.fr/documents/M1_1983-1984/STU...
https://en.wikipedia.org/wiki/ISO/IEC_2022
That does not mean they were popular but it was clearly possible.
Compose is different to a dead key. A dead key is a key you can press that appears to do nothing, but where the next keypress is modified by the dead key - for example, if you have a dead key for `, then pressing it does nothing, but pressing ` followed by e gets you è.
How to enter weird characters under X11
The compose aka multi key is truly amazing, someone must have had too much time...
$ grep -c '^<Multi' /usr/share/X11/locale/en_US.UTF-8/Compose
3580
I don't even use en_US locale, but somehow the definitions seem to get included anyway.
Some of the more useless examples:
<Multi_key> <C> <C> <C> <P> : "☭" U262D # HAMMER AND SICKLE
<Multi_key> <p> <o> <o> : "💩" U1F4A9 # PILE OF POO
Not sure whether all of this is available under Wayland or is it another reason to postpone upgrading ☺
(the last character was typed as <Multi_key> <colon> <parenright>)
Dead keys
It's bad
RFC 8265 defines how to normalize and compare Unicode usernames
Read the Unicode standard
https://www.unicode.org/reports/tr31/
>
> UAX #31 Unicode Identifiers and Syntax
> https://www.unicode.org/reports/tr31/
>
> Please use it instead of endlessly arguing over a problem already solved by domain experts.
adduser and useradd
Damages i18n has done?
https://www.cisco.com/c/en/us/td/docs/ios/12_2/sem2/syste...
- a status indicator
- machine-parseable, stable error codes (that don't change) alongside human-readable error messages (that are subject to change & translation)
- one imposed language on all identifiers, most likely English (e.g. hash keys, status strings etc.)
It's the human
I don't want to remember a cryptic user name and a cryptic password.
Just automatically create the user "main" in the background while installing and never tell the user about it, unless a second account is created.
Just create any amount of users you want.
Is this a real problem?
knowing that it can trigger bugs in so many places.
I would say the number is exactly 0,
except maybe for a few cases just for fun testing a system.
Restricting oneself to [a-z] for passwords is a good recommendation for similar reasons.
You might get locked out of your own system if you can't type the symbol.
and I expect it's people that have no clue of how that works. So: