
Szorc: Mercurial's Journey to and Reflections on Python 3

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 19:17 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

The TLDR; version should be: "Use Rust or Go. They care about their users"



Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:03 UTC (Mon) by koh (subscriber, #101482) [Link] (92 responses)

There is no mentioning of "Go" in the blog post.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (91 responses)

That's why "should".

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:13 UTC (Tue) by koh (subscriber, #101482) [Link] (90 responses)

I suppose "the TLDR; version" should be about your upcoming blog post describing how Go cares about its users and not about the one this article talks about.

The TL;DR version for this one should be: "removal of u'...' literals, '%' on objects of bytes type, and **kwargs keys being str instead of bytes are backwards-incompatible changes that made the transition harder - all in all, a global change that for C as a language has basically had no effect (ascii -> utf8, probably by design) results in huge ramifications for Python 2/3."
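The three incompatibilities listed above can be sketched in a few lines. This is a minimal illustration (behavior as of Python 3.5+, where bytes %-formatting was restored by PEP 461; the function name f is purely illustrative):

```python
# u'...' literals: removed in Python 3.0, reinstated as a no-op in 3.3.
assert u'abc' == 'abc'

# '%' on bytes: a TypeError in Python 3.0-3.4, restored by PEP 461 in 3.5.
assert b'rev %d' % 42 == b'rev 42'

# **kwargs keys must be str; bytes keys raise TypeError, which bites
# Python 2 code that passed byte-string keys around internally.
def f(**kwargs):
    return kwargs

try:
    f(**{b'key': 1})
    raise AssertionError('bytes kwargs unexpectedly accepted')
except TypeError:
    pass  # CPython: "keywords must be strings"
```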

> [...] "if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3". As crazy as it initially sounded, I think I agree with that assessment.
Rust might also enter the picture, but Go is entirely out of scope.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:59 UTC (Tue) by flussence (guest, #85566) [Link] (32 responses)

Porting Python apps to Go is currently quite a lot more viable than Rust, which is handicapped both by LLVM's limited platform support and by its lack of included batteries.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 5:29 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (29 responses)

Just started with Rust, and liking the language a lot more than Go ... so far..

I haven't run into any issues with LLVM support, what have you seen?
Odd that this would be unique to Rust and not the plethora of other languages that use LLVM without complaint. Or is this something that crops up no matter the language, and I just haven't heard of the use case yet? Linux & Mac should definitely be well supported. And I assume Windows, if Mozilla is using Rust...

Also, the Rust language and standard library seem to have more features than Go - but that's a very subjective opinion on my part. It depends on what you're focused on, I think. What did you find in Go that was missing in Rust and that you found frustrating?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:11 UTC (Tue) by roc (subscriber, #30627) [Link] (18 responses)

"LLVM's limited platform support" is a reference to the fact that gcc supports a lot of obscure/obsolete architectures that LLVM doesn't, e.g. SH4, PA-RISC, m68k.

OTOH LLVM supports Qualcomm Hexagon while gcc doesn't - and Hexagon is an alive-and-well architecture that has shipped billions of units. For some reason Rust's detractors do not see this as a problem for gcc or Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:19 UTC (Tue) by dvdeug (guest, #10998) [Link] (17 responses)

There's an out of date GCC port to Qualcomm Hexagon. I'm sure that GCC 4.5 will do the job much of the time.

LLVM doesn't support a number of architectures that distributions/OSes like Debian and NetBSD support using GCC. Linux does in theory run on Qualcomm Hexagon, but as far as I can tell most of those shipments have been in Snapdragon SoCs running ARM chips, and the Hexagon chips are used for Qualcomm-proprietary reasons or specialized multimedia or AI purposes. Mercurial is never going to run on that for anything besides a demo.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:55 UTC (Tue) by roc (subscriber, #30627) [Link] (16 responses)

The existence of Debian/NetBSD ports targeting museum architectures should not influence anyone's choice of programming language.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:38 UTC (Tue) by dvdeug (guest, #10998) [Link] (15 responses)

That surely depends on whether you value the continued working of older computers - most realistically, for aesthetic reasons. I'm not sure there's any more value in the Hexagon architecture; if you can support ARM, you support basically every system that has a Hexagon chip.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 21:32 UTC (Tue) by roc (subscriber, #30627) [Link] (14 responses)

Sure.

The problem is that there aren't enough people who want to be able to run the latest and greatest software on museum architectures to actually support those architectures through just their own efforts, e.g. by maintaining an out-of-tree LLVM backend. Thus they try to influence other developers to do work to help them out. Sometimes they get their preferences enshrined as distro policy to compel other developers to work for them. Sometimes it's borderline dishonest, raising deliberately vague concerns like "LLVM's limited platform support" to discourage technology choices that would require them to do work.

This is usually just annoying, but when it means steering developers from more secure technologies to less-secure ones, treating museum-architecture support as more important than cleaning up the toxic ecosystem of insecure, unreliable code that everyone actually depends on, I think it's egregious.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 3:25 UTC (Wed) by dvdeug (guest, #10998) [Link] (13 responses)

To quote someone else on this thread, arguing against Python 3:

> Eventually one of them will try to use your program on a FAT-formatted USB stick with Shift-JIS filenames or whatever, .... As a responsible programmer you want to make your program work for that user...

> That experience will encourage you to start your next project in a different language (one whose designers have already considered this problem and solved it properly) so you won't have the same pains if your project becomes popular and needs to handle obscure edge cases.
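The quoted edge case is the one Python 3 eventually solved with the 'surrogateescape' error handler. A minimal sketch of how arbitrary non-UTF-8 filename bytes survive the round trip through str (assuming a typical Linux filesystem encoding; Windows behaves differently):

```python
import os

# A Shift-JIS-encoded filename that is not valid UTF-8:
raw = '日本語'.encode('shift_jis')

# os.fsdecode() applies the 'surrogateescape' error handler, smuggling
# undecodable bytes into str as lone surrogates; os.fsencode() turns
# them back into the original bytes, so the filename round-trips intact.
name = os.fsdecode(raw)
assert os.fsencode(name) == raw
```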

You know that getting these programs to work on these systems is a priority for some people. You can do the hard work like people did for C (and a number of other languages), and write a direct compiler. You can piggyback on GCC, like a half-dozen languages have. You can compile to one of the languages in the first two groups, like any number of languages have, or write an interpreter in a language in one of the three groups, like vastly more languages have. The Rust developers instead chose to handle this in a way that wouldn't support some of their userbase. That has nothing to do with being a "more secure technology"; that's "choosing to drop customer requirements that would take work to support".

I see where you're coming from, but on the other hand, if your competition supplies a feature that people want, perhaps it's on you to implement that feature, and perhaps developers will consider excors' advice above about using a language that won't have this problem.

(As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 10:23 UTC (Wed) by roc (subscriber, #30627) [Link] (12 responses)

> That has nothing to do with being a "more secure technology"; that's "choosing to drop customer requirements that would take work to support".

I think that's a fine way to look at it, as long we are clear about which "customers" are actually being dropped. "The Rust developers chose to handle this in a way that wouldn't support some of their userbase" sounds rather ominous, but when we clarify that "some of their userbase" means "a few obsolete and a few minor embedded-only architectures", it sounds more reasonable.

> (As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)

Oddly enough, this is not a theoretical question: https://lists.llvm.org/pipermail/llvm-dev/2018-August/125...
Looks like it was not rejected outright, though I'm dubious it would ever be accepted. Maintaining an out-of-tree fork is more work than having it upstream --- while it is out-of-tree, all the maintenance has to be done by the m68k community, but upstreaming the backend would shift a lot of the maintenance costs from the community fans to the rest of the LLVM developers. That is exactly why they would not accept it!

I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k. Everyone using m68k to run modern software for any real task could accomplish the same thing faster with lower power on modern hardware, therefore they are doing it strictly for fun. No-one *needs* to run a particular piece of modern software on m68k. I don't know why gcc plays along; I suspect it's inertia and a misguided sense of duty. Same goes for the other obsolete architectures.

I have more sympathy for potentially relevant new embedded architectures like C-Sky. But I suspect that sooner or later an LLVM port, upstream or not, is going to be just part of the cost of promulgating a viable architecture, especially for desktops. There are already a lot of LLVM-based languages, including Rust, Swift and Julia; clang-only applications like Firefox and Chromium (Firefox requires Rust too of course); and random other stuff like Gallium llvm-pipe.

I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc, and this particular issue will become moot.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 11:46 UTC (Wed) by dvdeug (guest, #10998) [Link] (10 responses)

> I think that's a fine way to look at it, as long we are clear about which "customers" are actually being dropped. "The Rust developers chose to handle this in a way that wouldn't support some of their userbase" sounds rather ominous, but when we clarify that "some of their userbase" means "a few obsolete and a few minor embedded-only architectures", it sounds more reasonable.

But somehow it doesn't sound reasonable that distributions that support those architectures don't support using Rust for core software? You're trying to have your cake and eat it too.

> I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k.

You don't see an argument for cooperating with your fellow open source developers on their projects, but you do see an argument for supporting a billion dollar company that produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.

We could start with community. If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.

> with lower power on modern hardware,

How much power does it take to make a modern computer? The m68k (e.g.) is a little extreme, but it's certainly a signpost that Linux, as an operating system, is not going to drop support for hardware that's not the latest and greatest. I've got a laptop that has 20 times the processor and eight times the memory of what I went to college with, that has Windows 10 on it, and response times are vastly worse than that laptop I used in college. I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.

> I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc,

Cool. What you're saying is that if you have your way, the programming language that has my heart strings, Ada, will get much harder to use on modern systems, and should I return to active work on Debian I have an interest in discouraging Rust and LLVM and firmly grounding the importance of GCC in the system. Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 14:54 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Cool. What you're saying is that if you have your way, the programming language that has my heart strings, Ada, will get much harder to use on modern systems
GNAT developers saw the writing on the wall and are working on adding Ada support to LLVM: https://blog.adacore.com/combining-gnat-with-llvm

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:02 UTC (Tue) by dvdeug (guest, #10998) [Link] (2 responses)

They also saw "the writing on the wall" when they wrote a port to the JVM, the last, mostly complete, release of which was almost two decades ago. The linked article calls it a "work-in-progress research project". There's a difference between usable code for serious projects and research projects.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:47 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

The JVM project could never have been a real product, because the JVM is simply too limited for a full Ada implementation.

LLVM can replace GCC completely.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 12:03 UTC (Tue) by dvdeug (guest, #10998) [Link]

One would have to carefully parse both the Ada standard and the JVM standard, but I do not believe that Ada has any required features that would make a JVM target impossible.

LLVM could in theory replace GCC completely. But why would a company that's been working with GCC for 25 years find it worth giving up all that expertise to do so? A research project is useful, and it's possible a useful tool will come of this, but there seems to be no evidence that GCC is in such a dire state that AdaCore needs to change the underlying platform of their core product.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 22:35 UTC (Wed) by roc (subscriber, #30627) [Link] (5 responses)

> But somehow it doesn't sound reasonable that distributions that support those architectures don't support using Rust for core software? You're trying to have your cake and eat it too.

That sounds reasonable in isolation, but when it's part of a causal chain that results in a few hobbyists holding back important improvements for the other 99.999% of users, it becomes unreasonable.

> produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.

While you're wielding "proprietary" as a slur, keep in mind that almost every architecture that you think it's important to support is also proprietary.

> We could start with community.

Can you elucidate the actual argument here?

> If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.

The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.

Even if you think that behavior should be tolerated, I doubt it would actually happen often enough in practice that it would be worth anyone taking it into account ahead of time.

> I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.

I'm not sure what arguments you're making here. I can imagine two:
1) Replacing an m68k machine with something more modern hurts the environment because of the manufacturing impact of the more modern machine.
In practice m68k is so obsolete you could replace it with something much faster and more efficient that someone else was going to throw away.
2) Keeping modern software running on m68k helps keep that software efficient for all users.
In practice I have not seen, and have a hard time imagining, developers saying "we really need to optimize this or it'll be slow on m68k, even though it's fast enough for the rest of our users". To the extent they care about obscure architectures, if it works at all, that's good enough.

> What you're saying is that if you have your way,

FWIW "CPU vendors will start choosing to just implement an LLVM" is my prediction, not my goal. I actually slightly prefer a world where CPU vendors implement both an LLVM backend and a gcc backend, because I personally like gcc licensing more than LLVM's. I just don't think that's the future.

> Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?

Not sure what you're trying to say here. If the cost of supporting old systems is very low, I certainly wouldn't gratuitously break them. For example, in rr we support quite old Intel CPUs because it's no trouble. OTOH once in a while we increase the minimum kernel requirements for rr because new kernel features can make rr better for everyone running non-ancient kernels and maintaining compatibility code paths for older kernels is costly in some cases.

If systemd developers are spending a lot of energy supporting old systems used by a tiny fraction of users, when they could be spending that energy making significant improvements for the other 99.999%, then yeah I'd question that decision.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 9:10 UTC (Fri) by dvdeug (guest, #10998) [Link] (4 responses)

> when it's part of a causal chain that results in a few hobbyists holding back important improvements for the other 99.999% of users, it becomes unreasonable.

That's dramatic and silly. Red Hat doesn't care about the m68k, and presumably it's not going to hold back important improvements from them. Nor Suse. If Debian cares about the m68k, more than 0.001% of their users care about it in theory, even if they don't use it, and Ubuntu and other Debian derivatives can work around that if they care about it.

It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.

> While you're wielding "proprietary" as a slur, / Can you elucidate the actual argument here?

Ah, see, I was a developer for Debian GNU/Linux. So the idea that we should be working on Free Software as a team is important to me.

> keep in mind that almost every architecture that you think it's important to support is also proprietary.

At different levels, maybe. But a patent only runs 20 years, so any old enough CPU can be reimplemented without license. And the uses aren't proprietary; it's a bunch of hobbyists who benefit, not one big company. There's a difference between an x86-64 chip that's mass-marketed and used in a vast array of devices, and a chip that's only used on Qualcomm's SoCs, and primarily running Qualcomm's code. If LLVM is worried about the cost of bringing an architecture in house, then why let Qualcomm take your developer's time?

> The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.

So you get to judge whether a feature is important or not, but a Linux maintainer can't? You can choose what features you work on, but a Linux maintainer can't? A Linux maintainer can certainly say "your patch causes the kernel to crash; here's the traceback", and leave it at that, even if it will take fifteen minutes of their time or a dozen hours of yours to find the bug. I don't know if they can say that LLVM support isn't worth it--it's probably down to Linus himself--but they can at the very least quit if they feel they have to deal with pointless LLVM patches instead of important patches.

> I'm not sure what arguments you're making here.

I said the m68k is an extreme case. But it is a bellwether; a system that is quick to drop old hardware is much more likely to drop my old hardware, and a system that supports the m68k is much less likely to go through and dump support for old hardware. It is something of a matter of pride that Linux works on old systems. Even passing the tests on many of these old systems puts a limit on how slow the software can be.

> once in a while we increase the minimum kernel requirements for rr

Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware. It would be a lot more frustrating if, every time you had to upgrade rr, you had to upgrade the kernel and seriously worry about the system not coming back up or important parts not working; perhaps many people would stop doing both.

Basically it comes down to this paragraph again: It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 11:10 UTC (Fri) by peter-b (guest, #66996) [Link] (1 responses)

> Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware.

Yes it does. https://lwn.net/Articles/769468/

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 13:16 UTC (Fri) by smurf (subscriber, #17840) [Link]

Well, if nobody uses that hardware any more, newer kernels won't get tested on it (assuming they even build, given that some aren't supported by mainline GCC), thus they are unlikely to work – but they still increase the load on other maintainers.

It's not the kernel [developers] that knock off hardware – it's the users who retired said hardware. If anybody had spoken up in favor of keeping (and maintaining …) the dropped architectures or drivers in question, they'd still be supported.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 14:59 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

> Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware.

The kernel dropped support for 386 years ago, despite it being the first CPU it ran on.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 22:36 UTC (Fri) by roc (subscriber, #30627) [Link]

> It's a free/Free market. Program developers can write what they want, and distributions can use what they want.

Of course. But we should still discuss the impact of those choices, which is not always obvious.

This subthread was triggered by just such a discussion:
> Porting Python apps to Go is currently quite a lot more viable than Rust, which is handicapped both by LLVM's limited platform support and by its lack of included batteries.

which led us into a discussion of what exactly "LLVM's limited platform support" means and how important that is relative to other considerations. I learned something and I guess other readers did too.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:34 UTC (Thu) by ndesaulniers (subscriber, #110768) [Link]

You can build Linux with clang. Android and CrOS do today and ship that. Check out clangbuildlinux.github.io.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:00 UTC (Tue) by dvdeug (guest, #10998) [Link] (2 responses)

There was a great argument in Debian when someone who ported LLVM to an architecture found that this was used as justification to transition one library to Rust, which broke that library for other architectures he was working on. It doesn't matter that another language can use LLVM; it matters that Rust uses only LLVM, at least on Linux, and I can't name another language that is only supported by LLVM on Linux.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:52 UTC (Tue) by roc (subscriber, #30627) [Link]

Swift of course.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:02 UTC (Tue) by joib (subscriber, #8541) [Link]

Julia, and as 'roc' said, Swift.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:41 UTC (Tue) by flussence (guest, #85566) [Link] (6 responses)

> I haven't run into any issues with LLVM support, what have you seen?

See the Debian librsvg problems for one example. A few lines of Rust code left entire CPU arches having to choose between losing their GUI and staying on an old library indefinitely.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 20:12 UTC (Tue) by foom (subscriber, #14868) [Link] (5 responses)

"Entire CPU arches" which could absolutely add an LLVM backend if people still cared enough about them to do so.

The problem with these obsolete architectures is that various groups of volunteers do care about them, but only just barely enough to keep them alive and operational while small amounts of work are required. But there's just not enough interest, ability, or willingness available to implement anything new for them.

And that's certainly fine and understandable.

But, don't then pretend that the lack of maintenance effort available for these obscure/historical architectures is some sort of problem with the newer compilers and languages. Or, try to convince people that languages which use LLVM should be avoided because it has "limited platform support".

If you mean "I wish enough people still cared about 68k enough for it to remain a viable architecture so I could keep using it", just say that, instead of saying "LLVM's limited platform support", as if that was some sort of actual problem with LLVM.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 14:28 UTC (Wed) by dvdeug (guest, #10998) [Link] (4 responses)

Well, no.

For one, that's a biased view of how software development works. If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that. If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64 yet consider that a missing feature. Yes, LLVM has limited platform support with respect to many of the older architectures people are trying to support in Debian or NetBSD.

For another, according to roc on this article, LLVM is not interested in adding backends for these architectures. If they're forcing people to try to maintain entire CPU arches out of tree, then that's adding quite a bit of trouble. And while you've been less dismissive than roc has here, it's still far from saying "we recognize that it would be nice to have these architectures, and if you're familiar with them, we're happy to help you implement them in LLVM/Rust." Offering an adversarial approach to people who want these architectures is hardly the way to convince them to put the work in on them.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 21:27 UTC (Wed) by roc (subscriber, #30627) [Link]

> For another, according to roc on this article, LLVM is not interested in adding backends for these architectures.

Don't quote me on that; I'm just speculating. All we actually know is that an m68k backend was proposed and not rejected, and the developer never got around to moving it forward.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 7:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that.

*IF* the creators or promoters care about this particular set of people. Maybe they don't? Try to imagine. It could be for any reason, good or bad. Logical or not.

> If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64 yet consider that a missing feature.

I don't remember reading so many ungrounded assumptions packed into such a small piece of text. Pure rhetoric; it's surreal. I spent an inordinate amount of time trying (and failing) to relate it to something real.

BTW: you keep misunderstanding that roc favors the reality that he merely tries to _describe_.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 8:09 UTC (Fri) by dvdeug (guest, #10998) [Link]

>> If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that.

> *IF* the creators or promoters care about this particular set people.

Rust promoters do care about this particular set of people, or else we wouldn't be having this discussion. Rust promoters are right here complaining that these users don't support the use of Rust because it would hurt portability to their systems. They're not saying "if Debian chooses to reject Rust in core packages over this, that's cool with me." They're telling people they're wrong for finding this particular feature important.

> I don't remember reading so many ungrounded assumptions packed in such a small piece of text.

And yet you don't name one. Implement the features people want or not, but don't get offended that they use alternatives if you don't.

> you keep misunderstanding that roc favors the reality that he merely tries to _describe_.

Roc:
>> I don't know why gcc plays along; I suspect it's inertia and a misguided sense of duty.
>> I have more sympathy for potentially relevant new embedded architectures

When you start describing something as "misguided" and saying "I have more sympathy for", you're not neutrally describing reality.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 2:40 UTC (Tue) by foom (subscriber, #14868) [Link]

I recognize it would be nice to support these architectures, and would be happy if there was support for anything people actively needed.

Speaking for myself, I'd welcome the addition of new backends upstream, even for somewhat fringe architectures. But I'd want some reassurance that such a backend has enough developer interest to actually maintain it, so that it's not just an abandoned code-dump. (I believe this is also the general sentiment in the LLVM developer community.)

And it is definitely a time commitment to get a backend accepted in the first place. Not only do you have to write it, you have to get the patch series into a reviewable shape, try to attract people to review such a large patch series, and follow up on review feedback throughout.

Anyways, I'm not sure what happened with the previous attempt to get an m68k backend added to LLVM. Looks like maybe the submitter just gave up after a suggestion to split the patch up for ease of review? Or maybe over the question of code owners? If so, someone else could pick it up where they left off... I'd be supportive if you wanted to do so. (But not supportive enough to actually spend significant time on it, since I don't really care about m68k.)

I'll just note here that Debian does _not_ actually support m68k or any of the other oddball architectures mentioned. Some are unofficial/unsupported ports (which means, amongst other things, that the distro will not hold back a change just because it doesn't work on one of the unsupported architectures...)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 8:56 UTC (Tue) by ehiggs (subscriber, #90713) [Link] (1 responses)

I haven't ported a full program from Python to Rust but the ability to link Rust via FFI means you can port the program piecemeal. This seems a good deal easier than porting to Go since you can stop halfway if you run out of budget and still get a huge amount of the performance benefits.
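The mechanism behind such piecemeal porting is an ordinary C-ABI call: Python's ctypes module can load any shared library with C-compatible exports, and a Rust crate built as a `cdylib` with `#[no_mangle] extern "C"` functions loads the same way. A minimal sketch of the Python side, using libc's strlen as a stand-in for a Rust-exported function (the Rust library name in the comment is hypothetical):

```python
import ctypes

# On Linux/macOS, CDLL(None) exposes symbols already linked into the
# process, including libc. A Rust cdylib would be loaded the same way,
# e.g. ctypes.CDLL("./librustpart.so")  (hypothetical name).
libc = ctypes.CDLL(None)

# Declare the foreign function's signature before calling it
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Hot paths can be moved behind calls like this one at a time,
# leaving the rest of the Python program untouched
print(libc.strlen(b"mercurial"))  # 9
```

The same pattern works in reverse for stopping halfway: anything not yet ported simply stays on the Python side of the boundary.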

Also, batteries being included in Python was useful, but that model seems obsolete now. Who writes anything in Python without leveraging PyPI/pip?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:54 UTC (Tue) by flussence (guest, #85566) [Link]

There is value in installing a language and immediately having the ability to be productive in it without having to set up a web of trust and always-ready internet connection. Perl does the same as Python; CPAN may be the core selling point of the language but it's almost always installed with hundreds of modules.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 3:25 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (56 responses)

But it wasn't a no-op for C, at least if you include Microsoft.* They have two entirely separate copies of their API (the *W functions and the *A functions) just to handle that transition, *and* a huge pile of preprocessor hacks to dynamically switch between those two APIs depending on #defines etc. And Raymond Chen is *still* regularly writing blog posts about compatibility problems with this approach.

* You should include Microsoft, because Microsoft was (probably) a significant factor in Python's decision to use Unicode-encoded filenames and paths rather than the "bags of bytes" model that Unix favors.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:45 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (55 responses)

The most significant factor was the web and usb keys, that caused all the islands of incompatible 8bit encodings to collide (badly).

Anyone who had to salvage data on a system where every app felt the "bag of bytes" model entitled it to use a different encoding than other apps will agree with the Python decision to go UTF-8 (there were lots of those in the 00’s; a lot fewer now, thanks to the Python authors and other courageous Unicode implementers).

Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools. Relying on the "bag of bytes" model to perform all kinds of unicode incompatible tricks is as bad as when compiler authors rely on "undefined behaviour" to implement optimizations that break apps.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:54 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> Python decision to go UTF-8
Python doesn't use UTF-8. Its "Unicode" strings are also not actually guaranteed to be Unicode.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:12 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (5 responses)

UTF-8 is a Unicode representation. You can convert from one to the other; you can’t do that with a bag of bytes.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:50 UTC (Tue) by excors (subscriber, #95769) [Link] (3 responses)

You can't reliably convert Python strings to UTF-8 either. The standard library will happily give you strings that throw UnicodeEncodeError when you try.

(I like Unicode, and I agree bags of bytes are bad. But I don't like things that pretend to be Unicode and actually aren't quite, because that leads to obscure bugs and security vulnerabilities.)
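A short sketch of how such not-quite-Unicode strings arise and fail, using the `surrogateescape` error handler that the standard library applies to filenames:

```python
# A filename byte string that is not valid UTF-8 (Latin-1 "café")
raw = b"caf\xe9"

# Decoding with surrogateescape (as os.fsdecode does) never fails:
# the stray byte becomes the lone surrogate U+DCE9
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9"

# The result pretends to be Unicode, but a plain encode blows up
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("not really Unicode:", e.reason)

# Round-tripping works only if you remember the same error handler
assert s.encode("utf-8", "surrogateescape") == raw
```

Any code path that forgets the error handler (logging, JSON serialization, writing to a strict-UTF-8 stream) hits the exception instead.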

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:43 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (2 responses)

I guess that just shows that the Python team was right to decide that transitioning to Unicode required forcing devs to use Unicode. And that, despite all the grief they got about it over the years (and continue to get in this article), they didn’t go far enough.

People still managed to find loopholes and other ways to sabotage the migration.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:56 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> People still managed to find loopholes and other ways to sabotage the migration.
Perhaps Python should have also required rewriting all the code backwards? It would have been just as useful!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:44 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

Yes, right, say the people who are quite happy to use filesystems with working filenames, but see no reason to make the effort to generate working filenames themselves.

That’s called the tragedy of the commons.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:46 UTC (Wed) by jamesh (guest, #1159) [Link]

Python's strings don't use UTF-8 as their Unicode representation. Strings use fixed size characters (either 8-bit, 16-bit, or 32-bit depending on the contents) so that indexing is fast.

As far as file system encoding goes, it will depend on the locale on Linux. If you've got a UTF-8 locale, then it defaults to UTF-8 file names.
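The width-switching can be observed directly with sys.getsizeof (CPython 3.3+, PEP 393; the exact byte counts vary across versions, but the relative growth does not):

```python
import sys

ascii_s = "a" * 100            # 1 byte per character
bmp_s = "\u20ac" * 100         # euro sign: 2 bytes per character
astral_s = "\U0001F600" * 100  # emoji: 4 bytes per character

# Storage grows with the widest character present in the string
print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```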

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 20:04 UTC (Wed) by HelloWorld (guest, #56129) [Link] (44 responses)

I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned, and every decent programming language should be able to handle whatever the kernel throws at it. Pretending that everything is Unicode on the file system doesn't make it true, and it causes problems the moment you run into an FS where that is not the case.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 7:56 UTC (Thu) by marcH (subscriber, #57642) [Link] (42 responses)

> I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned...

You didn't go far enough and missed that bit:

> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools

Users don't care who's in charge of _their_ files, kernel or whatever else.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:01 UTC (Thu) by marcH (subscriber, #57642) [Link]

Déjà vu right after pressing "publish":
https://lwn.net/Articles/325304/ "Wheeler: Fixing Unix/Linux/POSIX Filenames" 2009

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:43 UTC (Thu) by HelloWorld (guest, #56129) [Link] (40 responses)

The point is that that doesn't matter at all. There are file systems that contain non-UTF-8 file names, and Python should be able to read and write these.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:53 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (39 responses)

The point is that it does not matter at all. Non-UTF-8 filenames will crash and burn in interesting ways in apps and scripts (and the crashing and burning *can* *not* be avoided, given that filename argument passing is widely used in all systems at all levels).

Therefore "being able to write these" means "being able to crash other apps". It’s an hostile behavior, not really on par with Python objectives.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:17 UTC (Thu) by roc (subscriber, #30627) [Link] (35 responses)

> the crash and burning *can* *not* be avoided given that filename argument passing is widely used in all systems at all levels

Depends on what you mean by "cannot be avoided". All platforms that I know of allow passing any filename as a command-line argument. On Linux, it is easy to write a C or Rust program that spawns another program, passing a non-UTF8 filename as a command line argument. It is easy to write the spawned program in C or Rust and have it open that file. In fact, the idiomatic C and Rust code will handle non-UTF8 filenames correctly.

That C code won't work on Windows, you'll have to use wmain() or something, but the Rust code would work on Windows too.

So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.
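Python itself does offer an analogous idiom: its os APIs accept and return bytes when given bytes, so a program that stays in bytes handles arbitrary names too. A minimal sketch, assuming a Linux filesystem that permits arbitrary bytes in names:

```python
import os
import tempfile

# A name that is deliberately not valid UTF-8
name = b"report-\xff\xfe.txt"

d = tempfile.mkdtemp().encode()  # bytes in, bytes out everywhere below
with open(os.path.join(d, name), "wb") as f:
    f.write(b"data")

# Bytes paths round-trip exactly; no decoding step can fail
assert name in os.listdir(d)
```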

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:49 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

Your app does not own the filesystem. It’s a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system has installed and is using.

Do not feed them time bombs.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:17 UTC (Thu) by roc (subscriber, #30627) [Link]

> Your app does not own the filesystem. It’s a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system has installed and is using.

That's exactly why your app needs to be able to cope with any garbage filenames it finds there.

> Do not feed them time bombs.

I'm not arguing that non-Unicode filenames are a good thing or that apps should create them gratuitously. But widely-deployed apps and tools should not break when they encounter them.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:51 UTC (Thu) by anselm (subscriber, #2796) [Link] (27 responses)

So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

I personally would like my file names to work with the shell and standard utilities. I'm not about to write a C or Rust program just to copy a bunch of files, because their names are in a weird encoding that can't be typed in or will otherwise mess up my command lines.

In the 2020s, it's a reasonable assumption that file names will be encoded in UTF-8. We've had a few decades to get used to the idea, after all. If there are outlandish legacy file systems that insist on doing something else, then as far as I'm concerned these file systems are the problem and they ought to be fixed.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

There are plenty of other examples where Py3 falls flat because of its string insistence. For example, we had a problem with a transparent proxy that needed to work with non-UTF-8 headers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:57 UTC (Fri) by cortana (subscriber, #24596) [Link] (3 responses)

I'm honestly not sealioning but: I thought HTTP headers were Latin-1. So they should be bytestrings in Python, not strings?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

The reality is that nobody uses the RFC-specified method of encoding non-Latin-1 characters for HTTP headers. So in reality there are tons of agents sending headers in local codepages or with characters outside US-ASCII.

This actually works fine with most servers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:12 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

For a real-world example of handling HTTP headers, see XMLHttpRequest.getResponseHeader(). That's defined to return a ByteString (https://xhr.spec.whatwg.org/#ref-for-dom-xmlhttprequest-g...), which is converted to a JavaScript String by effectively decoding as Latin-1 (i.e. each byte is translated directly into a single 16-bit JS character) (https://heycam.github.io/webidl/#es-ByteString). When setting a header, you should get a TypeError exception if the JS String contains any character above U+00FF.

The only restrictions on a header value (https://fetch.spec.whatwg.org/#concept-header-value) are that it can't contain 0x00, 0x0D or 0x0A, and can't have leading/trailing 0x20 or 0x09. (And browsers only agreed on rejecting 0x00 quite recently.)

So it's pretty much just bytes, and if you want to try interpreting it as Unicode then that's at your own risk.
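The reason Latin-1 works as the carrier here is that it is a bijection between byte values and the first 256 code points, so the conversion can never fail or lose information. A quick sketch:

```python
# Every byte value maps to exactly one code point U+0000..U+00FF
raw = bytes(range(256))
s = raw.decode("latin-1")          # never raises, unlike UTF-8
assert s.encode("latin-1") == raw  # lossless round trip

# A header value captured off the wire in some local codepage
# survives the trip through the string type intact
header = b"X-Name: \xcf\xf0\xe8\xe2\xe5\xf2"  # cp1251 bytes
assert header.decode("latin-1").encode("latin-1") == header
```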

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Hah. I thought that HTTP/2 fixed this, but apparently not: https://tools.ietf.org/html/rfc7230#section-3.2 - still allows "obs-text", which is basically any character.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 20:21 UTC (Thu) by rodgerd (guest, #58896) [Link]

It's unfortunate - to put it mildly - how many people seem wedded to "Speak ASCII or die" colonialism in their code, and then pivot to "but what about esoteric nonsense?" to try to block any progress, no?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:28 UTC (Thu) by roc (subscriber, #30627) [Link] (20 responses)

> I'm not about to write a C or Rust program just to copy a bunch of files

`cp` and many other utilities handle non-Unicode filenames correctly. That's not surprising; C programs that accept filenames in argv[] and treat them as null-terminated char strings should work.

We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does. Apparently that is not viable.

> as far as I'm concerned these file systems are the problem and they ought to be fixed.

Sounds good to me, but reality disagrees.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:46 UTC (Fri) by anselm (subscriber, #2796) [Link] (4 responses)

`cp` and many other utilities handle non-Unicode filenames correctly.

True, but you need to feed them such names in the first place. Given that, these days, Linux systems normally use UTF-8-based locales, non-Unicode filenames aren't going to be a whole lot of fun on a shell command line, or in the output of ls, long before Python 3 even comes into play.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 9:04 UTC (Fri) by mbunkus (subscriber, #87248) [Link]

zsh with tab-completion works nicely (I would think bash with tab-completion does, too). It's my go-to solution for fixing file names with invalid UTF-8 encodings.

Just last week I had such a file name generated by my email program when saving an attachment from a mail created by yet another bloody email program that fucks up attachment file name encoding. And the week before by unzipping a ZIP created on a German Windows.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 21:46 UTC (Fri) by Jandar (subscriber, #85683) [Link] (2 responses)

I keep encountering Linux systems running applications that use 8-bit national encodings. The same appears in Samba-exported directories from decades-old software controlling equally old instruments.

Seeing systems with only UTF-8 filenames is a rarity for me.

Enforcing UTF-8-only filenames is a complete no-go; even considering it is crazy.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 22:47 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> I keep encountering Linux systems running applications that use 8-bit national encodings.

Interesting, how does software on these systems typically know how to decode, display, exchange and generally deal with these encodings?

I understand Python itself enforces explicit encodings, not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:01 UTC (Sun) by Jandar (subscriber, #85683) [Link]

I have no information about any Python programs on these systems. C programs using setlocale(3) seem to have no major problems. If once in a while mojibake occurs in filenames like "qwert�uiop", that is insignificant in comparison to being unable to handle these files at all.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:56 UTC (Fri) by anselm (subscriber, #2796) [Link] (14 responses)

We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does.

This is really a user discipline/hygiene issue more than a Linux file system issue. In the 1980s, the official recommendation was that portable file names should stick to ASCII letters, digits, and a few choice special characters such as the dot, dash, and underscore – this wasn't enforced by the file system, but reasonable people adhered to this and stayed out of trouble. I don't have a huge problem with a similar recommendation that in the 21st century, reasonable people should stick to UTF-8 for portable file names even if the file system doesn't enforce it. Sure, there are careless ignorant bozos who will sh*t all over a file system given half a chance, but they need to be taught manners in any case. Let them suffer instead of the reasonable people.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:07 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> ignorant bozos who will sh*t all over a file system
Like people using ShiftJIS and writing file names in Japanese?

Or Russian people using KOI-8 encoding on Linux?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:16 UTC (Fri) by anselm (subscriber, #2796) [Link] (12 responses)

If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.

It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately file names don't carry explicit encoding information and so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in an UTF-8-based locale.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 10:48 UTC (Fri) by farnz (subscriber, #17727) [Link] (11 responses)

The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 13:36 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by excors (subscriber, #95769) [Link]

> If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.

You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)
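One workaround (not Python's default behavior) is to re-encode with a replacing error handler before display, so the surrogates become visible escapes instead of an exception:

```python
name = "testfile\udcff\udcff"  # surrogate-escaped non-UTF-8 filename

# print(name) would raise UnicodeEncodeError on a strict UTF-8 stdout;
# backslashreplace renders the lone surrogates as literal \udcff escapes
printable = name.encode("utf-8", "backslashreplace").decode("ascii")
print(printable)
```

The cost is that every print-like call site has to remember to do this, which is exactly the kind of special handling the comment above argues filenames need.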

Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:05 UTC (Fri) by marcH (subscriber, #57642) [Link] (8 responses)

> However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:45 UTC (Fri) by anselm (subscriber, #2796) [Link] (7 responses)

These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.

Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on an UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
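A convmv-style rename can be sketched in a few lines using bytes paths. This is only a sketch under assumed encodings: it has no collision handling, and it will happily "convert" names that merely happen to decode in the source encoding:

```python
import os

def convert_tree_names(root: bytes, src: str = "koi8-r", dst: str = "utf-8") -> None:
    """Rename entries under root from the src encoding to dst (sketch only)."""
    # topdown=False so files are renamed before their parent directories
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            try:
                new = name.decode(src).encode(dst)
            except UnicodeDecodeError:
                continue  # leave names we cannot interpret alone
            if new != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new))
```

A real tool would also refuse to overwrite existing targets and let the user preview the renames first, as convmv does with its --notest flag inverted.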

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (6 responses)

> Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place.

How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.

And now case insensitivity even - much trickier than filename encoding.

Or NTFS maybe.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> How do you know they use Linux?
KOI-8 was the encoding widely used in Linux for Russian language. Win1251 was used in Windows.

There was also DOS (original and alternative) and ISO code pages, but they were rarely used.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (4 responses)

Interesting, thanks!

So how did Linux and Windows users exchange files in Russia? Not?

The question of which software layer should force users to make their encodings explicit has no obvious answer; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> So how did Linux and Windows users exchange files in Russia? Not?
Using codepage converters. But it was so bad that by the early 2000s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.

At the time, the most commonly used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.

This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.

However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 18:06 UTC (Fri) by marcH (subscriber, #57642) [Link] (2 responses)

> So it's pretty clear that something like Mercurial needs to support these niche users.

A cleanup flag day is IMHO the best trade off.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:40 UTC (Sat) by togga (subscriber, #53103) [Link] (1 responses)

"A cleanup flag day is IMHO the best trade off."

Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:48 UTC (Sat) by marcH (subscriber, #57642) [Link]

> Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

s/language/encodings/

This entire debate summarized in less than 25 characters. My pleasure.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:49 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Yeah, sure you can feed any random string to argv[], but the equally important case for file names is that somebody tries to type or paste them into their favorite editor (or its command line).

If you no longer have any way to type them because, surprise, your environment has been UTF8 for the last decade or so, then you'll need an otherwise-transparent encoding that can be pasted (or generated manually via \Udcxx), and that doesn't clash with the rest of your environment (your locale is utf-8 – and that's unlikely to change). Surrogateescape works for that. It should even be copy+paste-able.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 13:01 UTC (Wed) by niner (subscriber, #26151) [Link] (1 responses)

Create a file with a clearly non-UTF-8 name:
nine@sphinx:~> perl6 -e 'spurt ("testfile".encode ~ Buf.new(0xff, 0xff)).decode("utf8-c8"), "foo"'

The shell dutifully shows the name with surrogate characters:
nine@sphinx:~> ll testfile*
-rw-r--r-- 1 nine users 6 17. Sep 2014 testfile.latin-1
-rw-r--r-- 1 nine users 3 22. Jän 13:42 testfile??

Get that name from a directory listing, treating it like a string with a regex grep:
nine@sphinx:~> perl6 -e 'dir(".").grep(/testfile/).say'
("testfile.latin-1".IO "testfile􏿽xFF􏿽xFF".IO)

And just for fun: select+paste the file name in konsole:
nine@sphinx:~> cat testfile??
foo

Actually it looks like file names with "broken" encodings work pretty well. It's only Python 3 that stumbles:

nine@sphinx:~> python3
Python 3.7.3 (default, Apr 09 2019, 05:18:21) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for f in [f for f in os.listdir(".") if "testfile" in f]: print(f)
...
testfile.latin-1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 23:00 UTC (Wed) by Jandar (subscriber, #85683) [Link]

> nine@sphinx:~> cat testfile??
> foo

'?' is a special character for pattern matching in sh.

$ echo foo >testfilexx
$ cat testfile??
foo

So maybe this wasn't a correct test to see if your filename worked with copy&paste.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:34 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they are use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.

Sure. The entire software world is going to fix all its filename bugs and assumptions just because some people name their files on some filesystems in funny ways. The programs that don't get fixed will die. That plan is so much simpler and easier than renaming files. /s

Oh, and all the developers who were repeatedly told to "sanitize your input" to protect themselves and the buggy programs above are all going to relax their checks a bit too.

Best of luck!

If you can't be happy, be at least realistic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:49 UTC (Thu) by roc (subscriber, #30627) [Link]

Not the entire software world, no.

But it is realistic to expect that common utilities handle arbitrary filenames correctly (the most common ones do). And it realistic to expect that common languages and libraries make idiomatic filename-handling code handle arbitrary filenames correctly, because many do (including C, Go, Rust, Python2, and even some parts of Python3).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:05 UTC (Thu) by HelloWorld (guest, #56129) [Link] (2 responses)

So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Not caring about funky filenames because most other programs don't care either: seems perfectly logic to me. You're confusing likeliness and logic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:15 UTC (Thu) by marcH (subscriber, #57642) [Link]

Speaking of likeliness and happiness, let me share my personal preference. I'll stay brief or let's say briefer; seems doable.

I'm very happy that Python catches funky filenames at a relatively low-level with a clear, generic, usual, googlable and stackoverflowable exception rather than with some obscure crash and/or security issue specific to each Python program. These references about "garbage-in, garbage-out" surrogates that I don't have time to read scare me, I wish there were a way to turn them off.

I do not claim Python made all the right unicode decisions, I don't know what. I bet not, nothing's perfect. This comment is only about funky file names.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:20 UTC (Thu) by marcH (subscriber, #57642) [Link]

> I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned,

"were"? https://lwn.net/Articles/784041/ Case-insensitive ext4

Now _that_ (case sensitivity) really never belonged to a kernel IMHO. Realism?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:58 UTC (Thu) by dgm (subscriber, #49227) [Link] (2 responses)

> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans

Absolutely. And I would add "and **only** by humans". Language run-times should not mess with them, period.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 18:43 UTC (Tue) by Wol (subscriber, #4433) [Link]

I've used a database where file names were NOT supposed to be interpreted by humans. And the database deliberately messed with them to make them unreadable ... :-)

Cheers,
Wol

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 19:32 UTC (Tue) by Jandar (subscriber, #85683) [Link]

On nearly any computer I use there are much more files generated by programs to be consumed exclusively by programs without any human looking at the filenames. In most cases these file-names have no more meaning than a raw pointer-value in C.

Although in case of trouble-shooting readable file-names are a remedy.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:20 UTC (Mon) by pj (subscriber, #4506) [Link] (79 responses)

I think your statement begs the question: which users? IMO, Python obviously thought (insomuch as a large community has a single opinion on anything) it should care (wrt the 2-to-3 transition) more about _new_ users than maintainers of existing large codebases. I can't say their decision was wrong... or right. I suspect someone would complain no matter how it went down.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:32 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (62 responses)

> I think your statement begs the question: which users?
Indeed. Python maintainers decided to pick and choose them. Only good-behaving users who like eating their veggies ( https://snarky.ca/porting-to-python-3-is-like-eating-your... ) were allowed in.

As a result, Py3 has lost several huge codebases that started migrating to Go instead. Other projects like Mercurial or OpenStack started migration at the very last moment, because of 2.7 EoL.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 21:01 UTC (Mon) by vstinner (subscriber, #42675) [Link] (25 responses)

The OpenStack project is way larger than Mercurial and has way more dependencies. OpenStack is more than 2 million lines of Python code. The OpenStack migration started in 2013, and I contributed to porting about 90% of the unit tests of almost all major projects (all except Nova and Swift, which were less open to Python 3 changes), and I helped to port many 3rd-party dependencies to Python 3 as well. All OpenStack projects have had mandatory python3 CI since 2016 to, at least, not regress on what was already ported. See https://wiki.openstack.org/wiki/Python3 for more information. (I stopped working on OpenStack 2 years ago, so I don't know the current status.)

Like Mercurial, Twisted is heavily based on bytes (it's a networking framework), and it was ported successfully to Python 3 a few years ago. Twisted can now be used with asyncio.

I tried to help porting Mercurial to Python 3, but their maintainers were not really open to discuss Python 3 when I tried. Well, I wanted to use Unicode for filenames, they didn't want to hear this idea. I gave up ;-)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 22:16 UTC (Mon) by excors (subscriber, #95769) [Link] (23 responses)

> Well, I wanted to use Unicode for filenames, they didn't want to hear this idea.

The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode. (There's PEP 383 but that produces strings that are not quite Unicode, e.g. it becomes impossible to encode them as UTF-16, so that's not good). And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode. For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.

(The article also mentions the solution, as implemented in Rust: filenames are a platform-specific string type, with lossy conversions to Unicode if you really want that (e.g. to display to users).)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:19 UTC (Mon) by vstinner (subscriber, #42675) [Link] (12 responses)

> The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode.

On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.

Read my articles https://vstinner.github.io/python30-listdir-undecodable-f... and https://vstinner.github.io/pep-383.html for the history the Unicode usage for filenames in the early days of Python 3 (Python 3.0 and Python 3.1).

The problem is that the UTF-8 codec of Python 2 doesn't respect the Unicode standard: it does encode surrogate characters. The Python 3 codec doesn't encode them, which makes it possible to use the surrogateescape error handler with UTF-8.
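A small sketch of what the surrogateescape handler does (assuming a UTF-8 locale; the file name bytes are made up for illustration):

```python
# A Latin-1 file name that is not valid UTF-8:
raw = b"testfile.\xe9"

# Strict decoding fails...
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# ...but with PEP 383 the undecodable byte is smuggled through
# as a lone surrogate character:
name = raw.decode("utf-8", "surrogateescape")
assert name == "testfile.\udce9"

# Encoding with the same handler restores the original bytes exactly:
assert name.encode("utf-8", "surrogateescape") == raw
```

os.fsdecode() and os.fsencode() wrap exactly this round-trip using the filesystem encoding.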

> And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.

I'm not sure of which problem you're talking about.

If you care about getting the same character on Windows and Linux (ex: the é letter = U+00E9), you should encode the filename differently on each platform. Storing the filename as Unicode in the application is a convenient way to do that. That's why Python prefers Unicode for filenames. But it also accepts filenames as bytes.

> For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.

Well, it is where I gave up :-)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> I'm not sure of which problem you're talking about.
A VCS must be able to round-trip files on the same FS. Even if they are not encoded correctly.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:37 UTC (Mon) by roc (subscriber, #30627) [Link] (3 responses)

It sounds to me like on Windows you can round-trip arbitrary native filenames through Python "Unicode" strings because in both systems the strings are simply a list of 16-bit code units (which are normally interpreted as UTF-16 but may not be valid UTF-16). So maybe that 'surrogateescape' hack is enough. (But only because Python3 Unicode strings don't have to be valid Unicode after all.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:22 UTC (Tue) by excors (subscriber, #95769) [Link] (2 responses)

Python strings aren't 16-bit code units. b'\xf0\x92\x8d\x85'.decode('utf-8') is '\U00012345' with length 1, which is sensible.

You can't create a string like '\U00123456' (SyntaxError) or chr(0x123456) (ValueError); it's limited to the 21-bit range. But you *can* create a string like '\udccc' and Python will happily process it, at least until you try to encode it. '\udccc'.encode('utf-8') throws UnicodeEncodeError.

If you use the special decoding mode, b'\xcc'.decode('utf-8', 'surrogateescape') gives '\udccc'. If you (or some library) does that, now your application is tainted with not-really-Unicode strings, and I think if you ever try to encode without surrogateescape then you'll risk getting an exception.

If you tried to decode Windows filenames as round-trippable UCS-2, like

>>> ''.join(chr(c) for c, in struct.iter_unpack(b'>H', b'\xd8\x08\xdf\x45'))
'\ud808\udf45'

then you'd be introducing a third type of string (after Unicode and Unicode-plus-surrogate-escapes) which seems likely to make things even worse.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:44 UTC (Tue) by excors (subscriber, #95769) [Link]

> I think if you ever try to encode without surrogateescape then you'll risk getting an exception

Incidentally, that seems to include the default encoding performed by print() (at least in Python 3.6 on my system):

>>> for f in os.listdir('.'): print(f)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 4: surrogates not allowed

os.listdir() will surrogateescape-decode and functions like open() will surrogateescape-encode the filenames, but that doesn't help if you've got e.g. logging code that touches the filenames too.
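That failure mode, and the usual escape hatch of writing the original bytes to sys.stdout.buffer, can be sketched like this (the byte file name is invented for illustration):

```python
import sys

# What os.listdir() would hand back for the on-disk name b'file.\xcc'
# on a UTF-8 system:
name = b"file.\xcc".decode("utf-8", "surrogateescape")

# print(name) does a strict encode, which is what blows up:
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    pass  # 'surrogates not allowed'

# Workaround: re-encode with the same handler and write the raw bytes:
sys.stdout.buffer.write(name.encode("utf-8", "surrogateescape") + b"\n")
```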

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 4:47 UTC (Tue) by roc (subscriber, #30627) [Link]

Thanks for clearing that up.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:08 UTC (Thu) by marcH (subscriber, #57642) [Link]

> A VCS must be able to round-trip files on the same FS

Yet all VCS provide some sort of auto.crlf insanity, go figure.

Just in case someone wants to use Notepad-- from the last decade.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:32 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

Huh, so Python3 "Unicode" strings aren't even necessarily valid Unicode :-(.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:40 UTC (Thu) by kjp (guest, #39639) [Link]

And the zen of python was forgotten long ago. Explicit is better than implicit? Errors should not pass silently? Nah. Let's just add math operators to dictionaries. Python has no direction, no stewardship, and I think it's been taken over by windows and perl folks.

Python: It's a [unstable] scripting language. NOT a systems or application language.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:37 UTC (Tue) by excors (subscriber, #95769) [Link] (2 responses)

> On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.

But then you end up with a "Unicode" string in memory which can't be safely encoded as UTF-8 or UTF-16, so it's not really a Unicode string at all. (As far as I can see, the specifications are very clear that UTF-* can't encode U+D800..U+DFFF. An implementation that does encode/decode them is wrong or is not Unicode.)

That means Python applications that assume 'str' is Unicode are liable to get random exceptions when encoding properly (i.e. without surrogateescape).

> > And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
>
> I'm not sure of which problem you're talking about.

Windows (with NTFS) lets you create a file whose name is e.g. "\ud800". The APIs all handle filenames as strings of wchar_t (equivalent to uint16_t), so they're perfectly happy with that file. But it's clearly not a valid string of UTF-16 code units (because it would be an unpaired surrogate) so it can't be decoded, and it's not a valid string of Unicode scalar values so it can't be directly encoded as UTF-8 or UTF-16. It's simply not Unicode.

In practice most native Windows applications and APIs treat filenames as effectively UCS-2, and they never try to encode or decode so they don't care about surrogates, though the text rendering APIs try to decode as UTF-16 and go a bit weird if that fails. Python strings aren't UCS-2 so it has to convert to 'str' somehow, but there's no correct way to do that conversion.
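Python itself illustrates the problem: a lone surrogate like the one in such a file name is rejected by every real UTF codec (a minimal sketch; the "\ud800" name is hypothetical):

```python
# An unpaired surrogate, as NTFS will happily store in a file name:
bad = "\ud800"

# No UTF encoding will accept it:
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    try:
        bad.encode(codec)
        print(codec, "accepted it")  # never reached
    except UnicodeEncodeError:
        print(codec, "rejects the unpaired surrogate")
```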

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:04 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Microsoft refers to it as an "extended character set":

https://docs.microsoft.com/en-us/windows/win32/fileio/nam...

Also, whatever your complaints are about whatever language, with respect to filenames, the win32 api is worse.

It's amazingly inconsistent. The level of insanity is just astonishing, especially if you're going across files created with the win api, and the .net libs.

You *have* to p/invoke to read some files, and use the long-filepath prefix, which doesn't support relative paths. And that's just the start.

Admittedly, I haven't touched it for almost a decade in any serious fashion, but, based on the docs linked above, it doesn't seem much has changed.

It's remarkable how easy they make it to write files that are quite hard to open..

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> It's amazingly inconsistent. The level of insanity is just astonishing
Just wait until you see POSIX!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 0:26 UTC (Wed) by gracinet (guest, #89400) [Link]

Hey Victor,

don't forget that Mercurial has to cope with filenames in its history that are 25 years old. Yes, that predates Mercurial but some of the older repos have had a long history as CVS then SVN.

Factor in the very strong stability requirements and the fact that any risk of changing a hash value is to be avoided, and it's no wonder a VCS is one of the last to take the plunge. It's really not a matter of the size of the codebase in this case.

Note: I wasn't directly involved in Mercurial at the time you were engaging with the project about that, I hope some good came out of it anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:18 UTC (Tue) by flussence (guest, #85566) [Link]

This was a sore point in Perl 6 too for many years due to its over-eagerness to destructively normalise everything on read. It fixed it eventually by adding a special encoding, similar to how Java has Modified UTF-8. It's not perfect, but without mandating a charset and normalisation at the filesystem level (something only Apple's dared to do) nothing is.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 7:57 UTC (Tue) by epa (subscriber, #39769) [Link] (2 responses)

How many Mercurial users store non-Unicode file names in a repository? Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames, their port to Python 3 would have gone much smoother.

If you do want a truly arbitrary ‘bag of bytes’ not just for file contents but for names too, I have the feeling you’d probably be using a different tool anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:39 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames

Losing the ability to read history of when the tool did not have such a restriction would not be a good thing. Losing the ability to manipulate those files (even to rename them to something valid) would also be tricky if it failed up front about bad filenames.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 18:58 UTC (Wed) by hkario (subscriber, #94864) [Link]

it's easy to end up with malformed names in file system

just unzip a file from non-UTF-8 system, you're almost guaranteed to get mojibake as a result; then blindly commit files to the VCS and bam, you're set

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:35 UTC (Tue) by dvdeug (guest, #10998) [Link] (5 responses)

Which means there's no way to reliably share a Mercurial repository between Windows and Unix. You can either accept all filenames or make repositories portable between Windows and Unix, not both. Note that even pretending that you can support both systems ignores those whole "arbitrary byte strings" and "arbitrary uint16_t strings" issues. I'd certainly feel comfortable with Mercurial and other tools rejecting junk file names, though I can see where people with old 8-bit charset filenames in their history could have problems.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:59 UTC (Tue) by roc (subscriber, #30627) [Link] (4 responses)

> You can either accept all filenames or make repositories portable between Windows and Unix, not both.

You can accept all filenames and make repositories portable between Windows and Unix if they have valid Unicode filenames. AFAIK that's what Mercurial does, and I hope it's what git does.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 12:33 UTC (Tue) by dezgeg (subscriber, #92243) [Link] (3 responses)

Not quite enough... Let's not forget the portability troubles of Mac, where the filesystem does Unicode (de)normalization behind the application's back.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:21 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

OK sure. The point is: you can preserve native filenames, and also ensure that repos are portable to any OS/filesystem that can represent the repo filenames correctly. That's what I want any VCS to do.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:51 UTC (Tue) by Wol (subscriber, #4433) [Link]

Do what git does with line endings, maybe?

They had a load of grief with mixed Windows/linux repos, so there's now a switch that says "convert cr/lf on checkout/checkin".

Add a switch that says "enforce valid utf-8/utf-16/Apple filenames, and sort out the mess at checkout/checkin".

If that's off by default, or on by default for new repos, or whatever, then at least NEW stuff will be sane, even if older stuff isn't.

Cheers,
Wol

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:42 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

There are also the invalid path components on Windows. Other than the reserved names and characters, space and `.` are not allowed to be the end of a path component. All the gritty details: https://docs.microsoft.com/en-us/windows/win32/fileio/nam...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:57 UTC (Mon) by prometheanfire (subscriber, #65683) [Link]

OpenStack is working on dropping Python 2 support this cycle. The problem is going to be ongoing support for older versions that still support Python 2. Just over the weekend the gate crashed on the newest setuptools being installed under Python 2 when it's Python-3-only. It's gonna be rough, and we at least semi-prepared for this.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 5:40 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (35 responses)

So confused by this (but I don't really follow projects in either language.. well some Go ones that were always Go).

Python to Go seems like a weird switch. I tend to use them for very different tasks.

Unless you're bound to GCP as a platform or something similar.

But you're not the only one mentioning this: what projects have I missed that made the switch?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:02 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (34 responses)

> Python to Go seems like a weird switch. I tend to use them for very different tasks.
It actually is not, if you're writing something that is not a Jupyter notebook.

Stuff like command-line utilities and servers works really well in Go.

Several huge Python projects are migrating to Go as a result.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:06 UTC (Tue) by mgedmin (subscriber, #34497) [Link] (33 responses)

> Several huge Python projects are migrating to Go as a result.

Can you name them?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:10 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (32 responses)

YouTube is one high-profile example. Salesforce also did a lot of rewriting internally.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:09 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (31 responses)

Youtube migrating to the Google programming language is not surprising.

As for the rest, a lot of infra-related things are being rewritten in Go just because containers (k8s and docker both use Go). That has little to do with the benefits offered by the language. It’s good old network effects. When you’re the container language, and everyone wants to do containers, being decent is sufficient to carry the day.

No one will argue that Go is less than decent. Many will argue it’s more than decent, but that’s irrelevant for its adoption curve.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Rewriting a project the scope of Youtube is not a small thing. And from my sources, Py2.7->3 migration was one of the motivating factors. After all, if you're rewriting everything then why not switch a language as well?

Mind you, Google actually tried to fix some of the Python issues by trying JIT compilation with unladen-swallow project before that.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 19:05 UTC (Tue) by rra (subscriber, #99804) [Link]

Go is way, way faster than Python, consumes a lot less memory, and doesn't have the global interpreter lock so has much better multithreading. That's why you see a lot of infrastructure code move to Go.

For most applications, the speed changes don't matter and other concerns should dominate. But for core infrastructure code for large cloud providers, they absolutely matter in key places, and Python is not a good programming language for high-performance networking code.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 13:26 UTC (Sat) by HelloWorld (guest, #56129) [Link] (28 responses)

> No one will argue that Go is less than decent.
I will. Go is the single worst programming language design that achieved any kind of popularity in the last 10 years at least. It is archaic and outdated in pretty much every imaginable way. It puts stuff into the language that doesn't belong there, like containers and concurrency, and doesn't provide you with the tools that are needed to implement these where they belong, which is in a library. The designers of this programming language are actively pushing us back into the 1970s, and many people appear to be applauding that. It's nothing short of appalling.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 14:22 UTC (Sat) by smurf (subscriber, #17840) [Link] (27 responses)

Hmm. I agree with much of that. However: concurrency definitely does belong in a modern language. Just not the unstructured way Go does it. Cf. https://en.wikipedia.org/wiki/Structured_concurrency – which notes that the reasonable way to do it is via some cancellation mechanism, which also needs to be built into the language to be effective – but Go doesn't have one.

The other major gripe with Go which you missed, IMHO, is its appalling error handling; the requirement to return a "(result, err)" pair and check "err" *everywhere* (as opposed to returning a plain "result" and propagating exceptions via catch/throw or try/raise or however you'd call it) causes a "yay unreadable code" LOC explosion and can't protect against non-functional errors (division by zero, anybody?).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 0:09 UTC (Sun) by HelloWorld (guest, #56129) [Link] (26 responses)

> Hmm. I agree with much of that. However: concurrency definitely does belong in a modern language.
It absolutely does not. Concurrency is an ever-evolving, complex topic, and if you bake any particular approach into the language, it's impossible to change it when we discover better ways of doing it. Java tried this and failed miserably (synchronized keyword). Scala didn't put it into the language. Instead, what happened is that people came up with better and better libraries. First you had Scala standard library Futures, which was a vast improvement over anything Java had to offer at the time. But they were inefficient (many context switches), had no way to interrupt a concurrent computation or safely handle resources (open file handles etc.) and made stack traces useless. Over the years, a series of better and better libraries (Monix, cats-effect) were developed, and now the ZIO library solves every single one of these and a bunch more. And you know what? Two years from now, ZIO will be better still, or we'll have a new library that is even better.

By contrast, Scala does have language support for exceptions. It's pretty much the same as Java's try/catch/finally, how did that hold up? It's a steaming pile of crap. It interacts poorly with concurrency, it easily leads to resource leaks, it's hard to compose, it doesn't tell you which errors can occur where, and everybody who knows what they're doing is using a library instead, because libraries like ZIO don't have *any* of these problems.

So based on that experience, you're going to have a hard time convincing me that concurrency needs language support. Feel free to try anyway, but try ZIO first.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 1:03 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (25 responses)

The best and most maintainable way to write servers is still thread-per-request. Go makes that easy with its lightweight threads. Much better than any async library I've tried.

It really is that simple.

Plus, Go has a VERY practical runtime with zero dependency executables and a good interactive GC. It's amazing how much better Golang's simple mark&sweep is when compared to Java's neverending morass of CMS or G1GC (that constantly require 'tuning').

Sure, I would like a bit more structured concurrency in Go, but this can come later once Go team rolls out generics.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 5:47 UTC (Sun) by HelloWorld (guest, #56129) [Link] (24 responses)

> The best and most maintainable way to write servers is still thread-per-request. Go makes that easy with its lightweight threads. Much better than any async library I've tried.

Apparently you haven't tried ZIO, because it beats the pants off anything Go can do.

It really is that simple.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 6:01 UTC (Sun) by HelloWorld (guest, #56129) [Link] (23 responses)

I'll give just one example. With ZIO it is possible to terminate a fiber without writing custom code (e. g. checking a shared flag) and without leaking resources that the fiber may have acquired. This is simply not possible in Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 7:58 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (22 responses)

ZIO is not even in contention, since it's built by pure functional erm... how to say it politely... adherents.

Meanwhile, Go is written by practical engineers. Cancellation and timeouts are done through an explicitly passed context.Context; resource cleanups are done through deferred blocks.

These two simple methods in practice allow complicated systems comprising hundreds of thousands of LOC to work reliably, while being easy to develop and iterate on, not requiring multi-minute waits for one compile/run cycle.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:25 UTC (Sun) by smurf (subscriber, #17840) [Link] (16 responses)

Passing a Context around, checking every function return for errors, and manually terminating no-longer-needed goroutines isn't exactly free. It bloats the code, it's error-prone, too easy to get wrong accidentally, and it makes the code much less readable.

If you come across a better paradigm sometime in the future, then bake it into a new version of the language and/or its libraries, and add interoperability features. Python3 is doing this, incidentally: asyncio is a heap of unstructured callbacks that evolved from somebody noticing that you can use "yield from" to build a coroutine runner, then Trio came along with a much better concept that actually enforces structure. Today the "anyio" module affords the same structured concept on top of asyncio, and in some probably-somewhat-distant future asyncio will support all that natively.

Languages, and their standard libraries, evolve.

With Go, this transition to Structured Concurrency is not going to happen any time soon because contexts and structure are nice-to-have features which are entirely optional and not supported by most libraries, thus it's much easier to simply ignore all that fancy structured stuff (another boilerplate argument you need to pass to every goroutine and another clause to add to every "select" because, surprise, there's no built-in cancellation? get real) and plod along as usual. The people in charge of Go do not want to change that. Their choice, just as it's my choice not to touch Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 12:35 UTC (Sun) by HelloWorld (guest, #56129) [Link] (15 responses)

> If you come across a better paradigm sometime in the future, then bake it into a new version of the language
You haven't yet demonstrated a single advantage of putting this into the language rather than a library, which is much more flexible and easier to evolve. Your thinking that this needs to be done in the language is probably a result of too much exposure to crippled languages like Python.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 14:35 UTC (Sun) by smurf (subscriber, #17840) [Link] (14 responses)

Go doesn't have automatic cleanup. Each call to "open file" requires a "defer close file". The point of structured code is that it's basically impossible, or at least a lot harder, to violate the structural requirements.

NB, Python also has the whole thing in a library. This is not primarily about language features. The problem is that it is simply impossible to add this to Go without either changing the language, or forcing people to write even more convoluted code.

Python basically transforms "result = foo(); return result" into what Go would call "err, result = foo(Context); if (err) return err, nil; return nil,result" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.
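The contrast described here can be sketched in Python itself (hypothetical foo functions; Go's (err, value) calling convention is approximated with tuples):

```python
# Implicit propagation: Python's exception machinery forwards the
# error for us; the happy path is all we write.
def foo_raises():
    raise RuntimeError("boom")

def caller_implicit():
    result = foo_raises()
    return result

# Explicit propagation, Go-style: every call site checks and forwards
# the (err, value) pair by hand.
def foo_returns():
    return RuntimeError("boom"), None

def caller_explicit():
    err, result = foo_returns()
    if err:
        return err, None
    return None, result
```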

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:17 UTC (Sun) by HelloWorld (guest, #56129) [Link] (13 responses)

NB, Python also has the whole thing in a library. This is not primarily about language features.
It very much is about language features. Python has dedicated language support for list comprehensions, concurrency and error handling. But there is no need for that. Consider these:
x = await getX()
y = await getY(x)
return x + y

[ x + y
  for x in getX()
  for y in getY(x)
]
The structure is the same: we obtain an x, then we obtain a y that depends on x (expressed by the fact that getY takes x as a parameter), then we return x + y. The details are of course different, because in one case we obtain x from an async task, and in the other we obtain x from a list, but there's nevertheless a common structure. Hence, Scala offers syntax that covers both of these use cases:
for {
  x <- getX()
  y <- getY(x)
} yield x + y
And this is a much better solution than what Python does, because now you get to write generic code that works in a wide variety of contexts including error handling, concurrency, optionality, nondeterminism, statefulness and many, many others that we can't even imagine today.
Python basically transforms "result = foo(); return result" into what Go would call "err, result = foo(Context); if (err) return err, nil; return nil,result" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.
This is a false dichotomy. Not having error handling built into the language doesn't mean you have to check for errors on every call.
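The "common structure" point above can be approximated in Python with a hand-rolled flat_map (getX/getY are hypothetical helpers; Python has no for-comprehension over arbitrary monads, so this only sketches the idea):

```python
# One flat_map shape covering two "contexts": lists (nondeterminism)
# and optional values (failure).
def flat_map_list(xs, f):
    return [y for x in xs for y in f(x)]

def flat_map_opt(x, f):
    return None if x is None else f(x)

# Hypothetical getX/getY in the list context:
def getX():
    return [1, 2]

def getY(x):
    return [x, x + 10]

# "obtain x, obtain a y depending on x, combine them" -- list context:
pairs = flat_map_list(getX(), lambda x: [x + y for y in getY(x)])
# The same wiring in the optional context:
maybe_sum = flat_map_opt(3, lambda x: flat_map_opt(4, lambda y: x + y))
```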

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 10:33 UTC (Mon) by smurf (subscriber, #17840) [Link] (12 responses)

> It very much is about language features.

Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.

But while Python is not such a language, neither is Go, so in effect you're shifting the playing ground here.

> Not having error handling built into the language doesn't mean you have to check for errors on every call.

No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 12:51 UTC (Mon) by HelloWorld (guest, #56129) [Link] (1 responses)

Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.
You don't need lazy evaluation for this to work. Scala is not lazily evaluated and it works great there.
No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.
You can factor out the error checking code into a function, so you don't need to write it more than once. After all, this is what we do as programmers: we detect common patterns, like “call a function, fail if it failed and proceed if it didn't” and factor them out into functions. This function is called flatMap in Scala, and it can be used like so:
getX().flatMap { x =>
  getY(x).map { y =>
    x + y
  }
}
But this is arguably hard to read, which is why we have for comprehensions. The following is equivalent to the above code:
for {
  x <- getX()
  y <- getY(x)
} yield x + y
I would argue that if you write it like this, it is no harder to read than what Python gives you:
x = getX()
y = getY(x)
return x + y
But the Scala version is much more informative. Every function now tells you in its type how it might fail (if at all), which is a huge boon to maintainability. You can also easily see which function calls might return an error, because you use <- instead of = to obtain their result. And it is much more flexible, because it's not limited to error handling but can be used for things like concurrency and other things as well. It's also compositional, meaning that if your function is concurrent and can fail, that works too.
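One way to make the "errors visible, propagation written once" idea concrete without Scala is a small Result sketch in Python (the names Ok/Err/flat_map are invented for illustration, not a real library):

```python
# A minimal Result type: a value tagged as success or failure.
# flat_map does the "fail if it failed, proceed if it didn't"
# plumbing once, instead of at every call site.
class Result:
    def __init__(self, ok, value):
        self.ok, self.value = ok, value

    def flat_map(self, f):
        # On failure, short-circuit: pass the error along unchanged.
        return f(self.value) if self.ok else self

def Ok(v):
    return Result(True, v)

def Err(e):
    return Result(False, e)

def get_x():
    return Ok(3)

def get_y(x):
    return Ok(x + 1) if x > 0 else Err("negative")

# Mirrors the flatMap chain in the Scala example above.
res = get_x().flat_map(lambda x: get_y(x).flat_map(lambda y: Ok(x + y)))
```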

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 18:30 UTC (Mon) by darwi (subscriber, #131202) [Link]

> But the Scala version is much more informative. Every function now tells you in its type how it might fail (if at all), which is a huge boon to maintainability

Long time ago (~2013), I worked as a backend SW engineer. We transformed our code from Java (~50K lines) to Scala (~7K lines, same functionality).

After the transition was complete, not a single NullPointerException was seen anywhere in the system, thanks to the Option[T] generics and pattern matching on Some()/None. It really made a huge difference.

NULL is a mistake in computing that no modern language should imitate :-( After my Scala experience, I dread using any language that openly accepts NULLs (python3, when used in a large 20k+ code-base, included!).
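For comparison, Python can approximate Option[T] with type hints (find_user is a hypothetical example; a checker such as mypy, not run here, is what actually enforces the None handling):

```python
# Optional[str] makes "may be absent" explicit in the signature;
# a type checker will complain if the None case is ignored.
from typing import Optional

USERS = {1: "alice"}

def find_user(uid: int) -> Optional[str]:
    return USERS.get(uid)

def greet(uid: int) -> str:
    name = find_user(uid)
    if name is None:  # the None branch cannot be silently forgotten
        return "no such user"
    return f"hello {name}"
```

Unlike Scala's Option, nothing stops untyped Python code from passing None around unchecked, which is the weakness being lamented above.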

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 15:46 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (9 responses)

> Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.

Yes, but with these types, *ignoring* (or passing on in Python) the error takes explicit steps rather than being implicit. IMO, that's a *far* better default. I would think the Zen of Python agrees…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 17:21 UTC (Mon) by HelloWorld (guest, #56129) [Link] (8 responses)

> Yes, but with these types, *ignoring* (or passing on in Python) the error takes explicit steps rather than being implicit. IMO, that's a *far* better default.

No, passing the error on does not take explicit steps, because the monadic bind operator (>>=) takes care of that for us. And that's a Good Thing, because in the vast majority of cases that is what you want to do. The problem with exceptions isn't that error propagation is implicit, that is actually a feature, but that it interacts poorly with the type system, resources that need to be closed, concurrency etc..

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 18:28 UTC (Mon) by smurf (subscriber, #17840) [Link] (6 responses)

Python doesn't have a problem with resources to be closed (that's what "with foo() as bar"-style context managers are for), nor concurrency (assuming that you use Trio or anyio).

Typing exceptions is an unsolved problem; conceivably it could be handled by a type checker like mypy. However, in actual practice most code is documented as possibly-raising a couple of well-known "special" exceptions derived from some base type ("HTTPError"), but might actually raise a couple of others (network error, cancellation, character encoding …). Neither listing them all separately (much too tedious) nor using a catch-all BaseException (defeats the purpose) is a reasonable solution.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 22:44 UTC (Mon) by HelloWorld (guest, #56129) [Link] (2 responses)

Python doesn't have a problem with resources to be closed (that's what "with foo() as bar"-style context managers are for), nor concurrency (assuming that you use Trio or anyio).
Sure, you can solve every problem that arises from adding magic features to the language by adding yet more magic. First, they added exceptions. But that interacted poorly with resource cleanup, so they added with to fix that. Then they realized that this fix interacts poorly with asynchronous code, and they added async with to cope with that. So yes, you can do it that way, because given enough thrust, pigs fly just fine. But you have yet to demonstrate a single advantage that comes from doing so.

On the other hand, there are trivial things that can't be done with with. For instance, if you want to acquire two resources, do stuff and then release them, you can just nest two with statements. But what if you want to acquire one resource for each element in a list? You can't, because that would require you to nest with statements as many times as there are elements in the list. In Scala with a decent library (ZIO or cats-effect), resources are a Monad, and lists have a traverse method that works with ALL monads, including the one for resources and the one for asynchronous tasks. But while asyncio.gather (which is basically the traverse equivalent for asynchronous code) does exist, there's no such thing in contextlib, which proves my point exactly: you end up with code that is constrained to specific use cases when it could be generic and thus much easier to learn because it works the same for _all_ monads.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 6:51 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

> But what if you want to acquire one resource for each element in a list?

You use an [Async]ExitStack. It's even in contextlib.

Yes, functional languages with Monads and all that stuff in them are super cool. No question. They're also super hard to learn compared to, say, Python.
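For reference, the ExitStack pattern pointed to above looks roughly like this (resource() is a stand-in for whatever needs closing; a log list replaces real side effects):

```python
from contextlib import ExitStack, contextmanager

log = []

@contextmanager
def resource(name):
    # stand-in for a file, socket, lock, ...
    log.append(f"open {name}")
    try:
        yield name
    finally:
        log.append(f"close {name}")

# Acquire one resource per list element; ExitStack releases them all
# (in reverse order) when the block exits, even on error.
with ExitStack() as stack:
    handles = [stack.enter_context(resource(n)) for n in "abc"]
```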

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 14:32 UTC (Tue) by HelloWorld (guest, #56129) [Link]

> You use an [Async]ExitStack. It's even in contextlib.
You can *always* write more code to fix any problem. That isn't the issue here, it's about code reuse. ExitStack shouldn't be needed, and neither should AsyncExitStack. These aren't solutions but symptoms.

> They're also super hard to learn compared to, say, Python.
For the first time, you're actually making an argument for putting the things in the language. But I'm not buying it, because I see how much more stuff I need to learn about in Python that just isn't necessary in fp. There's no ExitStack or AsyncExitStack in ZIO. There's no `with` statement. There's no try/except/finally, there's no ternary operator, no async/await, no assignment expressions, none of that nonsense. It's all just functions and equational reasoning. And equational reasoning is awesome _because_ it is so simple that we can teach it to high school students.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 2:32 UTC (Tue) by HelloWorld (guest, #56129) [Link] (2 responses)

I also think you're conflating two separate issues when it comes to error handling: language and library design. On the language side, this is mostly a solved problem. All you need is sum types, because they allow you to express that a computation either succeeded with a certain type or failed with another. The rest can be done in libraries.

If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem, because if you can't even list the errors that might happen in your function, you can't reasonably expect people to handle them either. You need to constrain the number of ways that a function can fail in, normally by categorising them in some way (e. g. technical errors vs. business domain errors). I think this is actually yet another way in which strongly typed functional programming pushes you towards better API design.

Unfortunately Scala hasn't proceeded along this path as far as I would like, because much of the ecosystem is based on cats-effect where type-safe error handling isn't the default. ZIO does much better, which is actually a good example of how innovation can happen when you implement these things in libraries as opposed to the language. Java has checked exceptions, and they're utterly useless now that everything is async...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:13 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

> Java has checked exceptions, and they're utterly useless now that everything is async

… and unstructured.

The Java people have indicated that they're going to migrate their async concepts towards Structured Concurrency, at which point they'll again be (somewhat) useful.

> If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem

That's one side of the coin. The other is that IMHO a library which insists on re-packaging every kind of error under the sun in its own exception type is intensely annoying, because that loses or hides information.

There's not much commonality between a Cancellation, a JSON syntax error, a character encoding problem, or a HTTP 50x error, yet an HTTP client library might conceivably raise any one of those. And personally I have no problem with that – I teach my code to retry any 40x errors with exponential back-off and leave the rest to "retry *much* later and alert a human", thus the next-higher error handler is the Exception superclass anyway.
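The retry policy described here might be sketched as a generic helper (the HTTP status classification is left out; the attempt count and delays are made-up defaults):

```python
import time

def retry(op, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run op(), retrying with exponential back-off; after the last
    attempt, re-raise so a higher-level handler (or a human) takes over."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # "retry *much* later and alert a human"
            sleep(base_delay * 2 ** i)
```

Note that, as the comment says, the top-level handler here ends up catching the Exception superclass anyway, since the transient failures share no narrower common type.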

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Mar 19, 2020 16:52 UTC (Thu) by bjartur (guest, #67801) [Link]

Nesting result types explicitly is helpful because it makes you wonder when an exponential backoff is appropriate.

How about getWeather :: String → DateTime → IO (DnsResponse (TcpSession (HttpResponse (Json Weather)))), where each layer can fail? Of course, there's some leeway in choosing how to layer the types (although handling e.g. out-of-memory errors this way would be unreasonable IMO).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 22:42 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

Even `>>=` is explicit error handling here (as it is with all the syntactic sugar that boils down to it). Using the convenience operators like >>= or Rust's Result::and_then or other similar methods are explicitly handling error conditions. Because the compiler knows about them it can clean up all the resources and such in a known way versus the unwinder figuring out what to do.

As a code reviewer, implicit codepaths are harder to reason about and don't make me as confident when reviewing such code (though the bar may also be lower in these cases because error reporting of escaping exceptions may be louder ignoring the `except BaseException: pass` anti-pattern instances).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 11:23 UTC (Sun) by HelloWorld (guest, #56129) [Link] (4 responses)

> Zio is not even in contention, since it's built by pure functional erm... how to say it politly... adherents.

You're free to stick with purely dysfunctional programming then. Have fun!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 18:49 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Indeed. It's way superior because it's actually used in practice.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:19 UTC (Sun) by HelloWorld (guest, #56129) [Link] (2 responses)

Well, so is pure FP that you have obviously no clue about and reject for purely ideological reasons. Oh well, your loss.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:25 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I actually worked on a rather large project in Haskell (a CPU circuit simulator) and I don't have many fond memories about it. I also spent probably several months in aggregate waiting for Scala code to compile.

My verdict is that pure FP languages are used only for ideological reasons and are totally impractical otherwise.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 22:42 UTC (Sun) by HelloWorld (guest, #56129) [Link]

So you worked on a bad Haskell project and that somehow makes functional programming impractical? That's not how logic works, but it does explain your unsubstantiated knee-jerk reactions to everything fp.

> I also spent probably several months in aggregate waiting for Scala code to compile.
There is some truth to this, it would be nice if the compiler were faster. That said, it has become significantly faster over the years and it's not nearly slow enough to make programming in Scala “totally impractical”. And the fact that I was able to name a very simple problem (“make an asynchronous operation interruptible without writing (error-prone) custom code and without leaking resources”) that has a trivial solution with ZIO and no solution at all in Go proves that pure fp has nothing to do with ideology. It solves real-world problems. There's a reason why React took over in the frontend space: it works better than anything else because it's functional.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 0:02 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (15 responses)

One of the points made in the blog post, though, is that the creators of Python 3 did some really stupid stuff that made it needlessly difficult to write code that worked in both versions. The specific example that stood out to me was the use of identifiers to specify whether a string literal was a string of bytes or of unicode points. In Python 2, it was possible to specify b'' to say it was a byte string and u'' to say it was a unicode string. Python 3 kept the b'' syntax but initially eliminated the u'' for unicode strings, and only brought it back when users complained. That hurt people trying to move from Python 2 to Python 3 without providing any benefit to people starting with Python 3.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 10:12 UTC (Tue) by smurf (subscriber, #17840) [Link] (14 responses)

The u'' syntax was removed because the initial idea was that people would use 2to3 and similar tools to convert their code base to Python3 once, and they'd be done. Given the initial goal of quickly converting the whole infrastructure to Python3 that could even have worked.

What happened instead was an intense period of slowly converting to Py3, heaps of code that use "import six", and modules that ran, and run, with both 2 and 3 once some of those nits were reverted. And they were.

Thus, IMHO accusations of Python core developers not listening to (some of) their users are for the most part really unfounded. Hindsight is 20/20, yes they could have done some things better, but frankly my compassion for people who take their own sweet time to port their code to Python3 and complain *now*, when there's definitely no more chance to add anything to Py2 to make the transition easier, is severely limited.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:54 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> The u'' syntax was removed because the initial idea was that people would use 2to3 and similar tools to convert their code base to Python3 once, and they'd be done.
That wouldn't have worked because Python 3.0 lacked many required features, like being able to use format strings with bytes. They got re-added only in Python 3.5, released in late 2015 ( https://www.python.org/dev/peps/pep-0461/ ). So for many projects realistic porting could begin around 2016 when it trickled down to major distros.

These concerns were raised back in 2008, but Py3 developers ignored them because it was clear (to them) that only bad code needed it and developers should shut up and eat their veggies.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 18:47 UTC (Wed) by togga (subscriber, #53103) [Link] (12 responses)

> "like being able to use format strings with bytes. They got re-added only in Python 3.5 released in late 2014"

And then obviously removed later on.

Python 2.7.17 >>> b'{x}'.format(x=10)
'10'

Python 3.7.5 >>> b'{x}'.format(x=10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 19:47 UTC (Wed) by foom (subscriber, #14868) [Link] (11 responses)

In python 3.5, the "legacy" % formatting,
>>> b'%d' % (55,)
was supported again, but *not* the new and recommended format function.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 19:57 UTC (Wed) by togga (subscriber, #53103) [Link] (10 responses)

What an irony that the new and recommended format function does not work with the latest Python 3.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 8:50 UTC (Thu) by smurf (subscriber, #17840) [Link] (9 responses)

That's not an irony, it actually makes sense. %- and .format-formatting are typically used in different contexts.
.format was never "recommended" on bytestrings, in fact it was initially proposed for Python3. Neither was %, but lots of older code uses it in contexts which end up byte-ish when you migrate to Py3. That usage never was prevalent for .format, so why should the Python devs incur the additional complexity of adding it to bytes?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 14:18 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

> That usage never was prevalent for .format, so why should the Python devs incur the additional complexity of adding it to bytes?

So instead the burden is put on the coder to have to think about whether bytes or strings will be threaded through their code and can't use the newer API if they might have bytes floating about?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 16:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

You're not supposed to use bytes. Bytes are unhealthy and bad for you. Fake Unicode all the way!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 2:24 UTC (Sat) by togga (subscriber, #53103) [Link] (5 responses)

I think not being able to migrate developers to python3 for 10 years took a toll on pride, resulting in politics and statements rather than sane language development. A number of weird decisions (some described in the article) point to this.

There is no reason for not
* allowing byte strings as attributes
* being consistent with types and syntax for byte strings and strings
* being consistent with format options for strings and byte strings
* etc.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 11:03 UTC (Sat) by smurf (subscriber, #17840) [Link] (4 responses)

Python3 never had bytestrings as attributes, so I don't know how that would follow.

Python3 source code is Unicode. Python attribute access is written as "object.attr". This "attr" part therefore must be Unicode. Why would you want to use anything else? If you need bytestrings as keys, or integers for that matter, use a dict.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 21:44 UTC (Sat) by togga (subscriber, #53103) [Link] (3 responses)

Python source code doesn't have to be unicode. The encoding of the source code has nothing to do with attributes.

>Why would you want to use anything else?
Mostly due to library APIs requiring attributes for many things. This is a big source of py3 encode/decode errors.

>"use a dict."
This is what attributes do:

>>> a=type('A', (object,), {})()
>>> setattr(a, 'b', 22)
>>> setattr(a, b'c', 12)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: attribute name must be string, not 'bytes'
>>> a.__dict__
{'b': 22}
>>> type(a.__dict__)
<class 'dict'>
>>> a.__dict__[b'c']=12
>>> a.__dict__
{'b': 22, b'c': 12}

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 29, 2020 12:32 UTC (Wed) by smurf (subscriber, #17840) [Link] (2 responses)

> Python source code doesn't have to be unicode

Sure, if you want to be pedantic you can use "-*- coding: iso-8859-1 -*-" (or whatever) in the first two lines and write your code in Latin1 (or whatever), but that's just the codec Python uses to read the source. It's still decoded to Unicode internally.

> >"use a dict."
> This is what attributes does:

Currently. In CPython. Other Python implementations, or future versions of CPython, may or may not use what is, or looks like, a generic dict to do that.

Yes, I do question why anybody would want to use attributes which then can't be accessed with `obj.attr` syntax. That's useless.
Also, it's not just bytes: arbitrary strings frequently contain hyphens or dots, or even start with digits.

Use a (real) dict.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Feb 12, 2020 20:40 UTC (Wed) by togga (subscriber, #53103) [Link] (1 responses)

>> " That's useless."

As I said above, it's a necessity due to the design of library APIs. Examples of needed, otherwise unnecessary, encode/decode calls are plentiful (and error-prone). The article mentions a few; I've already mentioned ctypes, where for instance structure field names (often read from binary APIs such as C strings) are required to be attributes.

This thread has become a bit off topic. The interesting question for me is Python 2to3 language regressions, or which migrations are feasible; that stage was done in roughly 2010 to 2013 with several failed Python3 migration attempts. Nothing of value has changed since. Half of my Python2 use-cases are not suited for Python3 due to its design choices, and I do not intend to squeeze any problem into a tool not suited for it. That's more of a fact.

The question in the back of my head is about the other half of my use-cases, the ones that do fit Python3. Given the experience with Python leadership attitudes, decisions, migration pains, etc., of which the article is one example: is Python3 a sound language choice for new projects?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Feb 12, 2020 20:47 UTC (Wed) by togga (subscriber, #53103) [Link]

>> interesting question for me is Python 2to3 language regressions

Oops.. it should read the opposite. "is not" Python 2to3 language regressions

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 13:26 UTC (Sat) by foom (subscriber, #14868) [Link]

The new format method was added because it was thought to be a better syntax for doing formatting. The % formatting was only kept in python3 (for strings) because it didn't seem feasible to migrate everyone's existing format strings, which might even be stored externally in config files.

Given the invention of better format syntax, forcing the continued use of the worse/legacy % format syntax for bytestrings seems a somewhat mystifying decision.

It's not as if the only use of bytestrings is in code converted from python 2...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:23 UTC (Tue) by atai (subscriber, #10977) [Link]

It is time for Python to stop caring about its users so other languages can get their chance

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 8:02 UTC (Tue) by edomaur (subscriber, #14520) [Link]

RiiR is in progress, at Facebook :-)

Facebook is using Mercurial internally because it works better than git for a monorepo, but they had to rewrite many hot paths, and they are currently working on Mononoke, a full-Rust implementation of the Mercurial server. Still, "The version that we provide on GitHub does not build yet.", but I think it's an interesting project.

https://github.com/facebookexperimental/mononoke

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 0:43 UTC (Wed) by gracinet (guest, #89400) [Link]

Well, here's a nice coincidence: Greg Szorc is also one of the first proponents of Rust in Mercurial. He wrote the "OxidationPlan" (https://www.mercurial-scm.org/wiki/OxidationPlan) a while ago to that effect.

That got me on board, and if you're interested, my colleague Raphaël Gomès will give two talks on that subject in two weeks at FOSDEM.

I know from your posting history here what you think of Python3 and unicode strings, but even though Rust has dedicated types for filesystem paths, we still have some issues. For instance regex patterns are `&str`. It can be overcome by escaping, but that's less direct than reading them straight from the `.hgignore` file.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds