
Szorc: Mercurial's Journey to and Reflections on Python 3

Here is a longish blog entry from Mercurial maintainer Gregory Szorc on the painful process of converting Mercurial to Python 3. "I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years."


Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 19:17 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (176 responses)

The TL;DR version should be: "Use Rust or Go. They care about their users."

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:03 UTC (Mon) by koh (subscriber, #101482) [Link] (92 responses)

There is no mention of "Go" in the blog post.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (91 responses)

That's why "should".

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:13 UTC (Tue) by koh (subscriber, #101482) [Link] (90 responses)

I suppose "the TL;DR version" should be about your upcoming blog post describing how Go cares about its users and not about the one this article talks about.

The TL;DR version for this one here should be "removal of u'...' literals, '%' on objects of bytes type, and **kwargs accepting only str instead of bytes are backwards-incompatible changes making the transition harder - all in all: a global change (ascii -> utf8, probably by design) that had basically no effect on C as a language results in huge ramifications in Python 2/3."
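
For anyone who has not hit these first-hand, here is a rough sketch of the three incompatibilities listed above as they appear from early Python 3 (illustrative only, not Mercurial code; version notes per my understanding of the changelogs):

    u'changeset'             # SyntaxError on Python 3.0-3.2; the u'' prefix
                             # was only reinstated in 3.3 (PEP 414)

    b'rev %d\n' % 42         # TypeError on Python 3.0-3.4; %-formatting on
                             # bytes only came back in 3.5 (PEP 461)

    def f(**kwargs):
        return kwargs

    f(**{b'rev': 42})        # TypeError: keywords must be strings -- this
                             # worked on Python 2, where str *was* bytes

Each of these is trivial in isolation; the pain comes from how pervasive such idioms are in a large Python 2 codebase.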

> [...] "if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3". As crazy as it initially sounded, I think I agree with that assessment.
Rust might also enter the picture, but Go is entirely out of scope.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:59 UTC (Tue) by flussence (guest, #85566) [Link] (32 responses)

Porting Python apps to Go is currently quite a lot more viable than Rust, which is both handicapped by LLVM's limited platform support and doesn't have any batteries included.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 5:29 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (29 responses)

Just started with Rust, and liking the language a lot more than Go ... so far..

I haven't run into any issues with LLVM support; what have you seen?
It would be odd for this to be unique to Rust and not affect the plethora of other languages that use LLVM, which seem to draw no complaints. Or is this something that crops up no matter the language, and I just haven't heard of the use case yet? Linux & Mac should definitely be well supported. And I assume Windows, if Mozilla is using Rust...

Also, the Rust language and standard library seem to have more features than Go, but that would be a very subjective opinion on my part. Depends on what you're focused on, I think - what did you find in Go that was missing in Rust and that you found frustrating?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:11 UTC (Tue) by roc (subscriber, #30627) [Link] (18 responses)

"LLVM's limited platform support" is a reference to the fact that gcc supports a lot of obscure/obsolete architectures that LLVM doesn't, e.g. SH4, PA-RISC, m68k.

OTOH LLVM supports Qualcomm Hexagon while gcc doesn't, and Hexagon is an alive-and-well architecture that has shipped billions of units. For some reason Rust's detractors do not see this as a problem for gcc or Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:19 UTC (Tue) by dvdeug (guest, #10998) [Link] (17 responses)

There's an out of date GCC port to Qualcomm Hexagon. I'm sure that GCC 4.5 will do the job much of the time.

LLVM doesn't support a number of architectures that distributions/OSes like Debian and NetBSD support using GCC. Linux does in theory run on Qualcomm Hexagon, but as far as I can tell most of those shipments have been in Snapdragon SoCs whose main CPUs are ARM cores, and the Hexagon cores are used for Qualcomm-proprietary reasons or specialized multimedia or AI purposes. Mercurial is never going to run on that for anything besides a demo.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:55 UTC (Tue) by roc (subscriber, #30627) [Link] (16 responses)

The existence of Debian/NetBSD ports targeting museum architectures should not influence anyone's choice of programming language.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:38 UTC (Tue) by dvdeug (guest, #10998) [Link] (15 responses)

That surely depends on whether you believe the continued working of older computers has value, most realistically for aesthetic reasons. I'm not sure there's any more value in the Hexagon architecture; if you can support ARM, you support basically every system that has a Hexagon chip.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 21:32 UTC (Tue) by roc (subscriber, #30627) [Link] (14 responses)

Sure.

The problem is that there aren't enough people who want to be able to run the latest and greatest software on museum architectures to actually support those architectures through just their own efforts, e.g. by maintaining an out-of-tree LLVM backend. Thus they try to influence other developers to do work to help them out. Sometimes they get their preferences enshrined as distro policy to compel other developers to work for them. Sometimes it's borderline dishonest, raising deliberately vague concerns like "LLVM's limited platform support" to discourage technology choices that would require them to do work.

This is usually just annoying, but when it means steering developers from more secure technologies to less-secure ones, treating museum-architecture support as more important than cleaning up the toxic ecosystem of insecure, unreliable code that everyone actually depends on, I think it's egregious.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 3:25 UTC (Wed) by dvdeug (guest, #10998) [Link] (13 responses)

To quote someone else on this thread, arguing against Python 3:

> Eventually one of them will try to use your program on a FAT-formatted USB stick with Shift-JIS filenames or whatever, .... As a responsible programmer you want to make your program work for that user...

>That experience will encourage you to start your next project in a different language (one whose designers have already considered this problem and solved it properly) so you won't have the same pains if your project becomes popular and needs to handle obscure edge cases.

You know that getting these programs to work on these systems is a priority for some people. You can do the hard work like people did for C (and a number of other languages), and write a direct compiler. You can piggyback on GCC, like a half-dozen languages have. You can compile to one of the languages in the first two groups, like any number of languages have, or write an interpreter in a language in one of the three groups, like vastly more languages have. The Rust developers instead chose to handle this in a way that wouldn't support some of their userbase. That has nothing to do with being a "more secure technology"; that's "choosing to drop customer requirements that would take work to support".

I see where you're coming from, but on the other hand, if your competition supplies a feature that people want, perhaps it's on you to implement that feature, and perhaps developers will consider excors' advice above about using a language that won't have this problem.

(As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 10:23 UTC (Wed) by roc (subscriber, #30627) [Link] (12 responses)

> That has nothing to do with being a "more secure technology"; that's "choosing to drop customer requirements that would take work to support".

I think that's a fine way to look at it, as long we are clear about which "customers" are actually being dropped. "The Rust developers chose to handle this in a way that wouldn't support some of their userbase" sounds rather ominous, but when we clarify that "some of their userbase" means "a few obsolete and a few minor embedded-only architectures", it sounds more reasonable.

> (As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)

Oddly enough, this is not a theoretical question: https://lists.llvm.org/pipermail/llvm-dev/2018-August/125...
Looks like it was not rejected outright, though I'm dubious it would ever be accepted. Maintaining an out-of-tree fork is more work than having it upstream --- while it is out-of-tree, all the maintenance has to be done by the m68k community, but upstreaming the backend would shift a lot of the maintenance costs from the community fans to the rest of the LLVM developers. That is exactly why they would not accept it!

I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k. Everyone using m68k to run modern software for any real task could accomplish the same thing faster with lower power on modern hardware; therefore they are doing it strictly for fun. No-one *needs* to run a particular piece of modern software on m68k. I don't know why gcc plays along; I suspect it's inertia and a misguided sense of duty. Same goes for the other obsolete architectures.

I have more sympathy for potentially relevant new embedded architectures like C-Sky. But I suspect that sooner or later an LLVM port, upstream or not, is going to be just part of the cost of promulgating a viable architecture, especially for desktops. There are already a lot of LLVM-based languages, including Rust, Swift and Julia; clang-only applications like Firefox and Chromium (Firefox requires Rust too of course); and random other stuff like Gallium llvm-pipe.

I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc, and this particular issue will become moot.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 11:46 UTC (Wed) by dvdeug (guest, #10998) [Link] (10 responses)

> I think that's a fine way to look at it, as long we are clear about which "customers" are actually being dropped. "The Rust developers chose to handle this in a way that wouldn't support some of their userbase" sounds rather ominous, but when we clarify that "some of their userbase" means "a few obsolete and a few minor embedded-only architectures", it sounds more reasonable.

But somehow it doesn't sound reasonable that distributions that support those architectures don't support using Rust for core software? You're trying to have your cake and eat it too.

> I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k.

You don't see an argument for cooperating with your fellow open source developers on their projects, but you do see an argument for supporting a billion dollar company that produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.

We could start with community. If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.

> with lower power on modern hardware,

How much power does it take to make a modern computer? The m68k (e.g.) is a little extreme, but it's certainly a signpost that Linux, as an operating system, is not going to drop support for hardware that's not the latest and greatest. I've got a laptop with 20 times the processor and eight times the memory of the one I went to college with; it has Windows 10 on it, and response times are vastly worse than on that college laptop. I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.

> I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc,

Cool. What you're saying is that if you have your way, the programming language that has my heart, Ada, will get much harder to use on modern systems, and should I return to active work on Debian I have an interest in discouraging Rust and LLVM and firmly grounding the importance of GCC in the system. Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 14:54 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Cool. What you're saying is that if you have your way, the programming language that has my heart, Ada, will get much harder to use on modern systems
GNAT developers saw the writing on the wall and are working on adding Ada support to LLVM: https://blog.adacore.com/combining-gnat-with-llvm

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:02 UTC (Tue) by dvdeug (guest, #10998) [Link] (2 responses)

They also saw "the writing on the wall" when they wrote a port to the JVM, the last, mostly complete, release of which was almost two decades ago. The linked article calls it a "work-in-progress research project". There's a difference between usable code for serious projects and research projects.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:47 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

The JVM project could never have been a real product, because the JVM is simply too limited for a full Ada implementation.

LLVM can replace GCC completely.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 12:03 UTC (Tue) by dvdeug (guest, #10998) [Link]

One would have to carefully parse both the Ada standard and the JVM standard, but I do not believe that Ada has any required features that would make a JVM target impossible.

LLVM could in theory replace GCC completely. And why would a company that's been working with GCC for 25 years find it worth giving up all that expertise to do so? A research project is useful, and it's possible a useful tool will come of this, but there seems to be no evidence that GCC is in such a dire state that AdaCore would change the underlying platform of their core product.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 22:35 UTC (Wed) by roc (subscriber, #30627) [Link] (5 responses)

> But somehow it doesn't sound reasonable that distributions that support those architectures don't support using Rust for core software? You're trying to have your cake and eat it too.

That sounds reasonable in isolation, but when it's part of a causal chain that results in a few hobbyists holding back important improvements for the other 99.999% of users, it becomes unreasonable.

> produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.

While you're wielding "proprietary" as a slur, keep in mind that almost every architecture that you think it's important to support is also proprietary.

> We could start with community.

Can you elucidate the actual argument here?

> If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.

The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.

Even if you think that behavior should be tolerated, I doubt it would actually happen often enough in practice that it would be worth anyone taking it into account ahead of time.

> I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.

I'm not sure what arguments you're making here. I can imagine two:
1) Replacing an m68k machine with something more modern hurts the environment because of the manufacturing impact of the more modern machine.
In practice m68k is so obsolete you could replace it with something much faster and more efficient that someone else was going to throw away.
2) Keeping modern software running on m68k helps keep that software efficient for all users.
In practice I have not seen, and have a hard time imagining, developers saying "we really need to optimize this or it'll be slow on m68k, even though it's fast enough for the rest of our users". To the extent they care about obscure architectures, if it works at all, that's good enough.

> What you're saying is that if you have your way,

FWIW "CPU vendors will start choosing to just implement an LLVM" is my prediction, not my goal. I actually slightly prefer a world where CPU vendors implement both an LLVM backend and a gcc backend, because I personally like gcc licensing more than LLVM's. I just don't think that's the future.

> Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?

Not sure what you're trying to say here. If the cost of supporting old systems is very low, I certainly wouldn't gratuitously break them. For example, in rr we support quite old Intel CPUs because it's no trouble. OTOH once in a while we increase the minimum kernel requirements for rr because new kernel features can make rr better for everyone running non-ancient kernels and maintaining compatibility code paths for older kernels is costly in some cases.

If systemd developers are spending a lot of energy supporting old systems used by a tiny fraction of users, when they could be spending that energy making significant improvements for the other 99.999%, then yeah I'd question that decision.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 9:10 UTC (Fri) by dvdeug (guest, #10998) [Link] (4 responses)

> when it's part of a causal chain that results in a few hobbyists holding back important improvements for the other 99.999% of users, it becomes unreasonable.

That's dramatic and silly. Red Hat doesn't care about the m68k, and presumably it's not going to hold back important improvements from them. Nor SUSE. If Debian cares about the m68k, more than 0.001% of their users care about it in theory, even if they don't use it, and Ubuntu and other Debian derivatives can work around that if they care about it.

It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.

> While you're wielding "proprietary" as a slur, / Can you elucidate the actual argument here?

Ah, see, I was a developer for Debian GNU/Linux. So the idea that we should be working on Free Software as a team is important to me.

> keep in mind that almost every architecture that you think it's important to support is also proprietary.

At different levels, maybe. But a patent only runs 20 years, so any old enough CPU can be reimplemented without a license. And the uses aren't proprietary; it's a bunch of hobbyists who benefit, not one big company. There's a difference between an x86-64 chip that's mass-marketed and used in a vast array of devices, and a chip that's only used on Qualcomm's SoCs, primarily running Qualcomm's code. If LLVM is worried about the cost of bringing an architecture in-house, then why let Qualcomm take your developers' time?

> The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.

So you get to judge whether a feature is important or not, but a Linux maintainer can't? You can choose what features you work on, but a Linux maintainer can't? A Linux maintainer can certainly say "your patch causes the kernel to crash; here's the traceback", and leave it at that, even if it will take fifteen minutes of their time or a dozen hours of yours to find the bug. I don't know if they can say that LLVM support isn't worth it--it's probably down to Linus himself--but they can at the very least quit if they feel they have to deal with pointless LLVM patches instead of important patches.

> I'm not sure what arguments you're making here.

I said the m68k is an extreme case. But it is a bellwether; a system that is quick to drop old hardware is much more likely to drop my old hardware, and a system that supports the m68k is much less likely to go through and dump support for old hardware. It is something of a matter of pride that Linux works on old systems. Even passing the tests on many of these old systems puts a limit on how slow the software can be.

> once in a while we increase the minimum kernel requirements for rr

Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware. It would be a lot more frustrating if every time you had to upgrade rr, you had to upgrade the kernel and seriously worry about the system not coming back up or important parts not working; perhaps many people would stop doing both.

Basically it comes down to this paragraph again: It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 11:10 UTC (Fri) by peter-b (subscriber, #66996) [Link] (1 responses)

> Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware.

Yes it does. https://lwn.net/Articles/769468/

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 13:16 UTC (Fri) by smurf (subscriber, #17840) [Link]

Well, if nobody uses that hardware any more, newer kernels won't get tested on it (assuming they even build, given that some aren't supported by mainline GCC), thus they are unlikely to work – but they still increase the load on other maintainers.

It's not the kernel [developers] that knock off hardware – it's the users who retired said hardware. If anybody had spoken up in favor of keeping (and maintaining …) the dropped architectures or drivers in question, they'd still be supported.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 14:59 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

> Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware.

The kernel dropped support for the 386 years ago, despite it being the first CPU it ran on.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 22:36 UTC (Fri) by roc (subscriber, #30627) [Link]

> It's a free/Free market. Program developers can write what they want, and distributions can use what they want.

Of course. But we should still discuss the impact of those choices, which is not always obvious.

This subthread was triggered by just such a discussion:
> Porting Python apps to Go is currently quite a lot more viable than Rust, which is both handicapped by LLVM's limited platform support and doesn't have any batteries included.

which led us into a discussion of what exactly "LLVM's limited platform support" means and how important that is relative to other considerations. I learned something and I guess other readers did too.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:34 UTC (Thu) by ndesaulniers (subscriber, #110768) [Link]

You can build Linux with clang. Android and CrOS do today and ship that. Check out clangbuiltlinux.github.io.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:00 UTC (Tue) by dvdeug (guest, #10998) [Link] (2 responses)

There was a great argument in Debian when someone who ported LLVM to an architecture found that this was used as justification to transition one library to Rust, which broke that library for other architectures he was working on. It doesn't matter that another language can use LLVM; it matters that it relies only on LLVM, at least on Linux, and I can't name another language that is only supported by LLVM on Linux.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:52 UTC (Tue) by roc (subscriber, #30627) [Link]

Swift of course.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:02 UTC (Tue) by joib (subscriber, #8541) [Link]

Julia, and as 'roc' said, Swift.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:41 UTC (Tue) by flussence (guest, #85566) [Link] (6 responses)

> I haven't run into any issues with LLVM support, what have you seen?

See the Debian librsvg problems for one example. A few lines of Rust code left entire CPU arches having to choose between losing their GUI stack and risking staying on an old library indefinitely.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 20:12 UTC (Tue) by foom (subscriber, #14868) [Link] (5 responses)

"Entire CPU arches" which could absolutely add an LLVM backend if people still cared enough about them to do so.

The problem with these obsolete architectures is that various groups of volunteers do care about them, but only just barely enough to keep them alive and operational while small amounts of work are required. But there's just not enough interest, ability, or willingness available to implement anything new for them.

And that's certainly fine and understandable.

But, don't then pretend that the lack of maintenance effort available for these obscure/historical architectures is some sort of problem with the newer compilers and languages. Or, try to convince people that languages which use LLVM should be avoided because it has "limited platform support".

If you mean "I wish enough people still cared about 68k enough for it to remain a viable architecture so I could keep using it", just say that, instead of saying "LLVM's limited platform support", as if that was some sort of actual problem with LLVM.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 14:28 UTC (Wed) by dvdeug (guest, #10998) [Link] (4 responses)

Well, no.

For one, that's a biased view of how software development works. If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that. If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64, yet consider that a missing feature. Yes, LLVM has limited platform support with respect to many of the older architectures people are trying to support in Debian or NetBSD.

For another, according to roc on this article, LLVM is not interested in adding backends for these architectures. If they're forcing people to try and maintain entire CPU arches out of tree, then that's adding quite a bit of trouble. And while you've been less dismissive than roc has here, it's still far from saying "we recognize that it would be nice to have these architectures, and if you're familiar with them, we're happy to help you implement them in LLVM/Rust." Offering an adversarial approach to people who want these architectures is hardly the way to convince them to put the work in on them.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 21:27 UTC (Wed) by roc (subscriber, #30627) [Link]

> For another, according to roc on this article, LLVM is not interested in adding backends for these architectures.

Don't quote me on that; I'm just speculating. All we actually know is that an m68k backend was proposed and not rejected, and the developer never got around to moving it forward.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 7:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that.

*IF* the creators or promoters care about this particular set of people. Maybe they don't? Try to imagine. It could be for any reason, good or bad. Logical or not.

> If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64, yet consider that a missing feature.

I don't remember reading so many ungrounded assumptions packed into such a small piece of text. Pure rhetoric; it's surreal. I spent an inordinate amount of time trying (and failing) to relate it to something real.

BTW: you keep assuming that roc favors the reality that he merely tries to _describe_.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 8:09 UTC (Fri) by dvdeug (guest, #10998) [Link]

>> If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that.

> *IF* the creators or promoters care about this particular set of people.

Rust promoters do care about this particular set of people, or else we wouldn't be having this discussion. Rust promoters are right here complaining that these users don't support the use of Rust because it would hurt portability to their systems. They're not saying "if Debian chooses to reject Rust in core packages over this, that's cool with me." They're telling people they're wrong for finding this particular feature important.

> I don't remember reading so many ungrounded assumptions packed in such a small piece of text.

And yet you don't name one. Implement the features people want or not, but don't get offended that they use alternatives if you don't.

> you keep misunderstanding that roc favors the reality that he merely tries to _describe_.

Roc:
>> I don't know why gcc plays along; I suspect it's inertia and a misguided sense of duty.
>> I have more sympathy for potentially relevant new embedded architectures

When you start describing something as "misguided" and saying "I have more sympathy for", you're not neutrally describing reality.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 2:40 UTC (Tue) by foom (subscriber, #14868) [Link]

I recognize it would be nice to support these architectures, and would be happy if there was support for anything people actively needed.

Speaking for myself, I'd welcome the addition of new backends upstream, even for somewhat fringe architectures. But I'd want some reassurance that such a backend has enough developer interest to actually maintain it, so that it's not just an abandoned code-dump. (I believe this is also the general sentiment in the LLVM developer community.)

And it is definitely a time commitment to get a backend accepted in the first place. Not only do you have to write it, you have to get the patch series into a reviewable shape, try to attract people to review such a large patch series, and follow up on review requests throughout.

Anyways, I'm not sure what happened with the previous attempt to get an m68k backend added to LLVM. Looks like maybe the submitter just gave up upon a suggestion to split the patch up for ease of review? Or maybe due to a question of code owners? If so, someone else could pick it up from where they left off... I'd be supportive if you wanted to do so. (But not supportive enough to actually spend significant time on it, since I don't really care about m68k.)

I'll just note here that Debian does _not_ actually support m68k or any of the other oddball architectures mentioned. Some are unofficial/unsupported ports (which means, amongst other things, that the distro will not hold back a change just because it doesn't work on one of the unsupported architectures...)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 8:56 UTC (Tue) by ehiggs (subscriber, #90713) [Link] (1 responses)

I haven't ported a full program from Python to Rust but the ability to link Rust via FFI means you can port the program piecemeal. This seems a good deal easier than porting to Go since you can stop halfway if you run out of budget and still get a huge amount of the performance benefits.

Also, batteries being included in Python was useful but seems deprecated. Who writes anything in Python without leveraging PyPI/pip?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:54 UTC (Tue) by flussence (guest, #85566) [Link]

There is value in installing a language and immediately having the ability to be productive in it without having to set up a web of trust and an always-ready internet connection. Perl does the same as Python; CPAN may be the core selling point of the language but it's almost always installed with hundreds of modules.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 3:25 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (56 responses)

But it wasn't a no-op for C, at least if you include Microsoft.* They have two entirely separate copies of their API (the *W functions and the *A functions) just to handle that transition, *and* a huge pile of preprocessor hacks to dynamically switch between those two APIs depending on #defines etc. And Raymond Chen is *still* regularly writing blog posts about compatibility problems with this approach.

* You should include Microsoft, because Microsoft was (probably) a significant factor in Python's decision to use Unicode-encoded filenames and paths rather than the "bags of bytes" model that Unix favors.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:45 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (55 responses)

The most significant factor was the web and USB keys, which caused all the islands of incompatible 8-bit encodings to collide (badly).

Anyone who had to salvage data on a system where every app felt the "bag of bytes" model entitled it to use a different encoding than other apps will agree with the Python decision to go UTF-8 (there were lots of those in the 00s; a lot fewer now thanks to the Python authors and other courageous Unicode implementers).

Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools. Relying on the "bag of bytes" model to perform all kinds of Unicode-incompatible tricks is as bad as when compiler authors rely on "undefined behaviour" to implement optimizations that break apps.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:54 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> Python decision to go UTF-8
Python doesn't use UTF-8. Its "Unicode" strings are also not actually guaranteed to be Unicode.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:12 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (5 responses)

UTF-8 is a Unicode representation. You can convert from one to the other; you can't with a bag of bytes.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:50 UTC (Tue) by excors (subscriber, #95769) [Link] (3 responses)

You can't reliably convert Python strings to UTF-8 either. The standard library will happily give you strings that throw UnicodeEncodeError when you try.

(I like Unicode, and I agree bags of bytes are bad. But I don't like things that pretend to be Unicode and actually aren't quite, because that leads to obscure bugs and security vulnerabilities.)
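
To make that concrete, a minimal sketch (assuming a UTF-8 filesystem encoding, and using os.fsdecode() to stand in for what os.listdir() returns for such a name):

    import os

    # A directory entry whose name is the Latin-1 bytes b'caf\xe9' (not valid
    # UTF-8) comes back from the standard library as 'caf\udce9' via the
    # surrogateescape error handler.
    name = os.fsdecode(b'caf\xe9')

    name.encode('utf-8')    # UnicodeEncodeError: surrogates not allowed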

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:43 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (2 responses)

I guess that just shows that the Python team was right to decide that transitioning to Unicode required forcing devs to use Unicode. And that, despite all the grief they got about it over the years (and continue to get in this article), they didn't go far enough.

People still managed to find loopholes and other ways to sabotage the migration.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:56 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> People still managed to find loopholes and other ways to sabotage the migration.
Perhaps Python should have also required rewriting all the code backwards? It would have been just as useful!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:44 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

Yes, right, say the people who are quite happy to use filesystems with working filenames, but see no reason to make the effort to generate working filenames themselves.

That’s called the tragedy of the commons.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:46 UTC (Wed) by jamesh (guest, #1159) [Link]

Python's strings don't use UTF-8 as their Unicode representation. Strings use fixed size characters (either 8-bit, 16-bit, or 32-bit depending on the contents) so that indexing is fast.

As far as file system encoding goes, it will depend on the locale on Linux. If you've got a UTF-8 locale, then it defaults to UTF-8 file names.
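
A quick way to see the locale dependence (a small sketch; the exact output depends on the Python version and the active locale):

    import os
    import sys

    # The codec Python uses for file names; under a UTF-8 locale this is
    # 'utf-8', under the C/POSIX locale on older 3.x releases it was 'ascii'.
    print(sys.getfilesystemencoding())

    # Round-tripping a file name between its str and bytes views:
    print(os.fsencode('héllo.txt'))          # b'h\xc3\xa9llo.txt' under UTF-8
    print(os.fsdecode(b'h\xc3\xa9llo.txt'))  # 'héllo.txt'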

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 20:04 UTC (Wed) by HelloWorld (guest, #56129) [Link] (44 responses)

I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned, and every decent programming language should be able to handle whatever the kernel throws at it. Pretending that everything is Unicode on the file system doesn't make it true, and it causes problems the moment you run into an FS where that is not the case.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 7:56 UTC (Thu) by marcH (subscriber, #57642) [Link] (42 responses)

> I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned...

You didn't go far enough and missed that bit:

> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools

Users don't care who's in charge of _their_ files, kernel or whatever else.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:01 UTC (Thu) by marcH (subscriber, #57642) [Link]

Déjà vu right after pressing "publish":
https://lwn.net/Articles/325304/ "Wheeler: Fixing Unix/Linux/POSIX Filenames" 2009

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:43 UTC (Thu) by HelloWorld (guest, #56129) [Link] (40 responses)

The point is that that doesn't matter at all. There are file systems that contain non-UTF-8 file names, and Python should be able to read and write these.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:53 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (39 responses)

The point is that it does not matter at all. Non-UTF-8 filenames will crash and burn in interesting ways in apps and scripts (and the crash and burning *can* *not* be avoided given that filename argument passing is widely used in all systems at all levels).

Therefore "being able to write these" means "being able to crash other apps". It’s an hostile behavior, not really on par with Python objectives.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:17 UTC (Thu) by roc (subscriber, #30627) [Link] (35 responses)

> the crash and burning *can* *not* be avoided given that filename argument passing is widely used in all systems at all levels

Depends on what you mean by "cannot be avoided". All platforms that I know of allow passing any filename as a command-line argument. On Linux, it is easy to write a C or Rust program that spawns another program, passing a non-UTF8 filename as a command line argument. It is easy to write the spawned program in C or Rust and have it open that file. In fact, the idiomatic C and Rust code will handle non-UTF8 filenames correctly.

That C code won't work on Windows (you'll have to use wmain() or something), but the Rust code would work on Windows too.

So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.
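
As an aside, the same OS-level fact is visible even from Python, because the kernel interfaces underneath are all bytes. A throwaway sketch (Linux/POSIX only; not a recommendation to create such names):

    import subprocess

    raw = b'caf\xe9.txt'               # not valid UTF-8

    with open(raw, 'wb') as f:         # the kernel accepts any bytes except NUL and '/'
        f.write(b'hello\n')

    # argv reaches the child as raw bytes; cat (a plain C program) opens the
    # file without ever caring about encodings.
    subprocess.run([b'cat', raw], check=True)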

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:49 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

Your app does not own the filesystem. It's a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system has installed and is using.

Do not feed them time bombs.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:17 UTC (Thu) by roc (subscriber, #30627) [Link]

> Your app does not own the filesystem. It's a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system has installed and is using.

That's exactly why your app needs to be able to cope with any garbage filenames it finds there.

> Do not feed them time bombs.

I'm not arguing that non-Unicode filenames are a good thing or that apps should create them gratuitously. But widely-deployed apps and tools should not break when they encounter them.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:51 UTC (Thu) by anselm (subscriber, #2796) [Link] (27 responses)

> So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

I personally would like my file names to work with the shell and standard utilities. I'm not about to write a C or Rust program just to copy a bunch of files, because their names are in a weird encoding that can't be typed in or will otherwise mess up my command lines.

In the 2020s, it's a reasonable assumption that file names will be encoded in UTF-8. We've had a few decades to get used to the idea, after all. If there are outlandish legacy file systems that insist on doing something else, then as far as I'm concerned these file systems are the problem and they ought to be fixed.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

There are plenty of other examples where Py3 falls flat because of its string insistence. For example, we had a problem with a transparent proxy that needed to work with non-UTF-8 headers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:57 UTC (Fri) by cortana (subscriber, #24596) [Link] (3 responses)

I'm honestly not sealioning but: I thought HTTP headers were Latin-1. So they should be bytestrings in Python, not strings?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

The reality is that nobody uses the RFC-specified method of encoding non-Latin-1 characters for HTTP headers. So in reality there are tons of agents sending headers in local codepages or with US ASCII characters.

This actually works fine with most servers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:12 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

For a real-world example of handling HTTP headers, see XMLHttpRequest.getResponseHeader(). That's defined to return a ByteString (https://xhr.spec.whatwg.org/#ref-for-dom-xmlhttprequest-g...), which is converted to a JavaScript String by effectively decoding as Latin-1 (i.e. each byte is translated directly into a single 16-bit JS character) (https://heycam.github.io/webidl/#es-ByteString). When setting a header, you should get a TypeError exception if the JS String contains any character above U+00FF.

The only restrictions on a header value (https://fetch.spec.whatwg.org/#concept-header-value) are that it can't contain 0x00, 0x0D or 0x0A, and can't have leading/trailing 0x20 or 0x09. (And browsers only agreed on rejecting 0x00 quite recently.)

So it's pretty much just bytes, and if you want to try interpreting it as Unicode then that's at your own risk.
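
In Python terms, the equivalent of that ByteString-to-String conversion is a latin-1 decode, which is lossless for arbitrary bytes; a small sketch:

    # latin-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF, so any
    # header value survives a decode/encode round trip unchanged.
    raw = bytes(range(0x20, 0x7f)) + b'\x80\xfe\xff'
    text = raw.decode('latin-1')
    assert text.encode('latin-1') == raw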

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Hah. I thought that HTTP/2 fixed this, but apparently not: https://tools.ietf.org/html/rfc7230#section-3.2 - still allows "obs-text", which is basically any character.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 20:21 UTC (Thu) by rodgerd (guest, #58896) [Link]

It's unfortunate - to put it mildly - how many people seem wedded to "Speak ASCII or die" colonialism in their code, and then pivot to "but what about esoteric nonsense?" to try to block any progress, no?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:28 UTC (Thu) by roc (subscriber, #30627) [Link] (20 responses)

> I'm not about to write a C or Rust program just to copy a bunch of files

`cp` and many other utilities handle non-Unicode filenames correctly. That's not surprising; C programs that accept filenames in argv[] and treat them as null-terminated char strings should work.

We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does. Apparently that is not viable.

> as far as I'm concerned these file systems are the problem and they ought to be fixed.

Sounds good to me, but reality disagrees.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:46 UTC (Fri) by anselm (subscriber, #2796) [Link] (4 responses)

> `cp` and many other utilities handle non-Unicode filenames correctly.

True, but you need to feed them such names in the first place. Given that, these days, Linux systems normally use UTF-8-based locales, non-Unicode filenames aren't going to be a whole lot of fun on a shell command line, or in the output of ls, long before Python 3 even comes into play.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 9:04 UTC (Fri) by mbunkus (subscriber, #87248) [Link]

zsh with tab-completion works nicely (I would think bash with tab-completion, too). It's my go to solution for fixing file names with invalid UTF-8 encodings.

Just last week I had such a file name generated by my email program when saving an attachment from a mail created by yet another bloody email program that fucks up attachment file name encoding. And the week before by unzipping a ZIP created on a German Windows.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 21:46 UTC (Fri) by Jandar (subscriber, #85683) [Link] (2 responses)

I keep encountering Linux systems running applications that use 8-bit national encodings. The same appears in Samba-exported directories from decades-old software controlling equally old instruments.

Seeing systems with only UTF-8 filenames is a rarity for me.

Enforcing UTF-8-only filenames is a complete no-go; even considering it is crazy.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 22:47 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> I keep encountering Linux systems running applications that use 8-bit national encodings.

Interesting, how does software on these systems typically know how to decode, display, exchange and generally deal with these encodings?

I understand Python itself enforces explicit encodings, not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:01 UTC (Sun) by Jandar (subscriber, #85683) [Link]

I have no information about any Python programs on these systems. C programs using setlocale(3) seem to have no major problems. If once in a while mojibake occurs in filenames like "qwert�uiop", this is insignificant in comparison to being unable to handle these files at all.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:56 UTC (Fri) by anselm (subscriber, #2796) [Link] (14 responses)

> We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does.

This is really a user discipline/hygiene issue more than a Linux file system issue. In the 1980s, the official recommendation was that portable file names should stick to ASCII letters, digits, and a few choice special characters such as the dot, dash, and underscore – this wasn't enforced by the file system, but reasonable people adhered to this and stayed out of trouble. I don't have a huge problem with a similar recommendation that in the 21st century, reasonable people should stick to UTF-8 for portable file names even if the file system doesn't enforce it. Sure, there are careless ignorant bozos who will sh*t all over a file system given half a chance, but they need to be taught manners in any case. Let them suffer instead of the reasonable people.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:07 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> ignorant bozos who will sh*t all over a file system
Like people using ShiftJIS and writing file names in Japanese?

Or Russian people using KOI-8 encoding on Linux?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:16 UTC (Fri) by anselm (subscriber, #2796) [Link] (12 responses)

If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.

It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately file names don't carry explicit encoding information and so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in an UTF-8-based locale.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 10:48 UTC (Fri) by farnz (subscriber, #17727) [Link] (11 responses)

The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 13:36 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

> When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by excors (subscriber, #95769) [Link]

> If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.

You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)
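
The failure mode is easy to reproduce; a minimal sketch (assuming a UTF-8 locale and default stdout settings):

    import os

    open(b'caf\xe9', 'w').close()      # a file whose name is Latin-1 bytes

    name = os.fsdecode(b'caf\xe9')     # what os.listdir() returns: 'caf\udce9'
    open(name).close()                 # open() round-trips it back -- fine

    print("Opening file %s" % name)    # UnicodeEncodeError: sys.stdout encodes
                                       # strictly by default, and lone
                                       # surrogates cannot be encoded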

Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:05 UTC (Fri) by marcH (subscriber, #57642) [Link] (8 responses)

> However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:45 UTC (Fri) by anselm (subscriber, #2796) [Link] (7 responses)

> These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.

Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on an UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
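For what it's worth, the rename itself is trivial; here is a rough sketch of what convmv automates, assuming the names really are KOI8-R and that you run it from the directory in question:

import os

for name in os.listdir(b'.'):                      # bytes in, bytes out
    fixed = name.decode('koi8-r').encode('utf-8')  # KOI8-R assigns a character to every byte
    if fixed != name:
        os.rename(name, fixed)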

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (6 responses)

> Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place.

How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.

And now case insensitivity even - much trickier than filename encoding.

Or NTFS maybe.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> How do you know they use Linux?
KOI-8 was the encoding widely used in Linux for Russian language. Win1251 was used in Windows.

There was also DOS (original and alternative) and ISO code pages, but they were rarely used.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (4 responses)

Interesting, thanks!

So how did Linux and Windows users exchange files in Russia? Not?

The question of which software layer should force users to be explicit about the encodings they use has no obvious answer; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> So how did Linux and Windows users exchange files in Russia? Not?
Using codepage converters. But it was so bad that by the early 2000s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.

At the time, the most commonly used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.

This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.

However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 18:06 UTC (Fri) by marcH (subscriber, #57642) [Link] (2 responses)

> So it's pretty clear that something like Mercurial needs to support these niche users.

A cleanup flag day is IMHO the best trade off.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:40 UTC (Sat) by togga (guest, #53103) [Link] (1 responses)

"A cleanup flag day is IMHO the best trade off."

Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:48 UTC (Sat) by marcH (subscriber, #57642) [Link]

> Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

s/language/encodings/

This entire debate summarized in less than 25 characters. My pleasure.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:49 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Yeah, sure you can feed any random string to argv[], but the equally important case for file names is that somebody tries to type or paste them into their favorite editor (or its command line).

If you no longer have any way to type them because, surprise, your environment has been UTF-8 for the last decade or so, then you'll need an otherwise-transparent encoding that can be pasted (or generated manually via \udcxx), and that doesn't clash with the rest of your environment (your locale is utf-8 – and that's unlikely to change). Surrogateescape works for that. It should even be copy+paste-able.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 13:01 UTC (Wed) by niner (subscriber, #26151) [Link] (1 responses)

Create a file with a clearly non-UTF-8 name:
nine@sphinx:~> perl6 -e 'spurt ("testfile".encode ~ Buf.new(0xff, 0xff)).decode("utf8-c8"), "foo"'

The shell dutifully shows the name with surrogate characters:
nine@sphinx:~> ll testfile*
-rw-r--r-- 1 nine users 6 17. Sep 2014 testfile.latin-1
-rw-r--r-- 1 nine users 3 22. Jän 13:42 testfile??

Get that name from a directory listing, treating it like a string with a regex grep:
nine@sphinx:~> perl6 -e 'dir(".").grep(/testfile/).say'
("testfile.latin-1".IO "testfile􏿽xFF􏿽xFF".IO)

And just for fun: select+paste the file name in konsole:
nine@sphinx:~> cat testfile??
foo

Actually it looks like file names with "broken" encodings work pretty well. It's only Python 3 that stumbles:

nine@sphinx:~> python3
Python 3.7.3 (default, Apr 09 2019, 05:18:21) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for f in [f for f in os.listdir(".") if "testfile" in f]: print(f)
...
testfile.latin-1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 23:00 UTC (Wed) by Jandar (subscriber, #85683) [Link]

> nine@sphinx:~> cat testfile??
> foo

'?' is a special character for pattern matching in sh.

$ echo foo >testfilexx
$ cat testfile??
foo

So maybe this wasn't a correct test to see if your filename worked with copy&paste.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:34 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.

Sure. The entire software world is going to fix all its filename bugs and assumptions just because some people name their files on some filesystems in funny ways. The programs that don't get fixed will die. That plan is so much simpler and easier than renaming files. /s

Oh, and all the developers who were repeatedly told to "sanitize your input" to protect themselves and the buggy programs above are all going to relax their checks a bit too.

Best of luck!

If you can't be happy, be at least realistic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:49 UTC (Thu) by roc (subscriber, #30627) [Link]

Not the entire software world, no.

But it is realistic to expect that common utilities handle arbitrary filenames correctly (the most common ones do). And it is realistic to expect that common languages and libraries make idiomatic filename-handling code handle arbitrary filenames correctly, because many do (including C, Go, Rust, Python2, and even some parts of Python3).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:05 UTC (Thu) by HelloWorld (guest, #56129) [Link] (2 responses)

So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Not caring about funky filenames because most other programs don't care either: seems perfectly logical to me. You're confusing likeliness and logic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:15 UTC (Thu) by marcH (subscriber, #57642) [Link]

Speaking of likeliness and happiness, let me share my personal preference. I'll stay brief or let's say briefer; seems doable.

I'm very happy that Python catches funky filenames at a relatively low level with a clear, generic, usual, googlable and stackoverflowable exception rather than with some obscure crash and/or security issue specific to each Python program. These references about "garbage-in, garbage-out" surrogates that I don't have time to read scare me; I wish there were a way to turn them off.

I do not claim Python made all the right Unicode decisions; I don't know enough to say. I bet not, nothing's perfect. This comment is only about funky file names.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:20 UTC (Thu) by marcH (subscriber, #57642) [Link]

> I call bullshit. Whether you like it or not, file names are bags of bytes as far as the kernel is concerned,

"were"? https://lwn.net/Articles/784041/ Case-insensitive ext4

Now _that_ (case sensitivity) really never belonged to a kernel IMHO. Realism?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:58 UTC (Thu) by dgm (subscriber, #49227) [Link] (2 responses)

> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans

Absolutely. And I would add "and **only** by humans". Language run-times should not mess with them, period.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 18:43 UTC (Tue) by Wol (subscriber, #4433) [Link]

I've used a database where file names were NOT supposed to be interpreted by humans. And the database deliberately messed with them to make them unreadable ... :-)

Cheers,
Wol

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 19:32 UTC (Tue) by Jandar (subscriber, #85683) [Link]

On nearly any computer I use there are far more files generated by programs, to be consumed exclusively by programs, without any human ever looking at the filenames. In most cases these filenames have no more meaning than a raw pointer value in C.

Although, when troubleshooting, readable filenames are a help.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:20 UTC (Mon) by pj (subscriber, #4506) [Link] (79 responses)

I think your statement begs the question: which users? IMO, Python obviously thought (insomuch as a large community has a single opinion on anything) it should care (wrt the 2-to-3 transition) more about _new_ users than maintainers of existing large codebases. I can't say their decision was wrong... or right. I suspect someone would complain no matter how it went down.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:32 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (62 responses)

> I think your statement begs the question: which users?
Indeed. Python maintainers decided to pick and choose them. Only good-behaving users who like eating their veggies ( https://snarky.ca/porting-to-python-3-is-like-eating-your... ) were allowed in.

As a result, Py3 has lost several huge codebases that started migrating to Go instead. Other projects like Mercurial or OpenStack started migration at the very last moment, because of 2.7 EoL.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 21:01 UTC (Mon) by vstinner (subscriber, #42675) [Link] (25 responses)

The OpenStack project is way larger than Mercurial and has way more dependencies. OpenStack is more than 2 million lines of Python code. The OpenStack migration started in 2013, and I contributed to porting something like 90% of the unit tests of almost all major projects (all except Nova and Swift, which were less open to Python 3 changes), and I helped to port many third-party dependencies to Python 3 as well. All OpenStack projects have had mandatory Python 3 CI since 2016 so as, at least, not to regress on what was already ported. See https://wiki.openstack.org/wiki/Python3 for more information. (I stopped working on OpenStack 2 years ago, so I don't know the current status.)

Like Mercurial, Twisted (a networking framework) is heavily based on bytes, and it was ported successfully to Python 3 a few years ago. Twisted can now be used with asyncio.

I tried to help port Mercurial to Python 3, but its maintainers were not really open to discussing Python 3 when I tried. Well, I wanted to use Unicode for filenames; they didn't want to hear of this idea. I gave up ;-)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 22:16 UTC (Mon) by excors (subscriber, #95769) [Link] (23 responses)

> Well, I wanted to use Unicode for filenames, they didn't want to hear this idea.

The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode. (There's PEP 383 but that produces strings that are not quite Unicode, e.g. it becomes impossible to encode them as UTF-16, so that's not good). And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode. For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.
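To illustrate the "not quite Unicode" part (assuming a UTF-8 locale):

>>> s = b'\xff'.decode('utf-8', 'surrogateescape')
>>> s
'\udcff'
>>> s.encode('utf-8', 'surrogateescape')     # round-trips to the original bytes
b'\xff'
>>> s.encode('utf-16')
UnicodeEncodeError: 'utf-16' codec can't encode character '\udcff' in position 0: surrogates not allowed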

(The article also mentions the solution, as implemented in Rust: filenames are a platform-specific string type, with lossy conversions to Unicode if you really want that (e.g. to display to users).)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:19 UTC (Mon) by vstinner (subscriber, #42675) [Link] (12 responses)

> The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode.

On Python 3, there is a good practical solution for that: Python uses the surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.
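For example (the file name is made up; 0xe9 is 'é' in Latin-1, which is not valid UTF-8 on its own):

>>> import os
>>> os.listdir(b'.')          # ask with bytes, get the raw bytes back
[b'caf\xe9.txt']
>>> os.listdir('.')           # ask with str: the undecodable byte becomes a surrogate
['caf\udce9.txt']
>>> open('caf\udce9.txt').close()   # open() re-encodes it the same way, so the same file is found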

Read my articles https://vstinner.github.io/python30-listdir-undecodable-f... and https://vstinner.github.io/pep-383.html for the history of Unicode usage for filenames in the early days of Python 3 (Python 3.0 and Python 3.1).

The problem is that the UTF-8 codec of Python 2 doesn't respect the Unicode standard: it does encode surrogate characters. The Python 3 codec doesn't encode them, which makes it possible to use the surrogateescape error handler with UTF-8.

> And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.

I'm not sure of which problem you're talking about.

If you care about getting the same character on Windows and Linux (e.g. the é letter = U+00E9), you have to encode the filename differently on each system. Storing the filename as Unicode in the application is a convenient way to do that. That's why Python prefers Unicode for filenames. But it also accepts filenames as bytes.

> For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.

Well, it is where I gave up :-)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> I'm not sure of which problem you're talking about.
A VCS must be able to round-trip files on the same FS, even if their names are not encoded correctly.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:37 UTC (Mon) by roc (subscriber, #30627) [Link] (3 responses)

It sounds to me like on Windows you can round-trip arbitrary native filenames through Python "Unicode" strings because in both systems the strings are simply a list of 16-bit code units (which are normally interpreted as UTF-16 but may not be valid UTF-16). So maybe that 'surrogateescape' hack is enough. (But only because Python3 Unicode strings don't have to be valid Unicode after all.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:22 UTC (Tue) by excors (subscriber, #95769) [Link] (2 responses)

Python strings aren't 16-bit code units. b'\xf0\x92\x8d\x85'.decode('utf-8') is '\U00012345' with length 1, which is sensible.

You can't create a string like '\U00123456' (SyntaxError) or chr(0x123456) (ValueError); it's limited to the 21-bit range. But you *can* create a string like '\udccc' and Python will happily process it, at least until you try to encode it. '\udccc'.encode('utf-8') throws UnicodeEncodeError.

If you use the special decoding mode, b'\xcc'.decode('utf-8', 'surrogateescape') gives '\udccc'. If you (or some library) does that, now your application is tainted with not-really-Unicode strings, and I think if you ever try to encode without surrogateescape then you'll risk getting an exception.

If you tried to decode Windows filenames as round-trippable UCS-2, like

>>> ''.join(chr(c) for c, in struct.iter_unpack(b'>H', b'\xd8\x08\xdf\x45'))
'\ud808\udf45'

then you'd be introducing a third type of string (after Unicode and Unicode-plus-surrogate-escapes) which seems likely to make things even worse.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:44 UTC (Tue) by excors (subscriber, #95769) [Link]

> I think if you ever try to encode without surrogateescape then you'll risk getting an exception

Incidentally, that seems to include the default encoding performed by print() (at least in Python 3.6 on my system):

>>> for f in os.listdir('.'): print(f)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 4: surrogates not allowed

os.listdir() will surrogateescape-decode and functions like open() will surrogateescape-encode the filenames, but that doesn't help if you've got e.g. logging code that touches the filenames too.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 4:47 UTC (Tue) by roc (subscriber, #30627) [Link]

Thanks for clearing that up.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:08 UTC (Thu) by marcH (subscriber, #57642) [Link]

> A VCS must be able to round-trip files on the same FS

Yet all VCS provide some sort of auto.crlf insanity, go figure.

Just in case someone wants to use Notepad-- from the last decade.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:32 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

Huh, so Python3 "Unicode" strings aren't even necessarily valid Unicode :-(.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:40 UTC (Thu) by kjp (guest, #39639) [Link]

And the Zen of Python was forgotten long ago. Explicit is better than implicit? Errors should never pass silently? Nah. Let's just add math operators to dictionaries. Python has no direction, no stewardship, and I think it's been taken over by Windows and Perl folks.

Python: It's an [unstable] scripting language. NOT a systems or application language.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:37 UTC (Tue) by excors (subscriber, #95769) [Link] (2 responses)

> On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.

But then you end up with a "Unicode" string in memory which can't be safely encoded as UTF-8 or UTF-16, so it's not really a Unicode string at all. (As far as I can see, the specifications are very clear that UTF-* can't encode U+D800..U+DFFF. An implementation that does encode/decode them is wrong or is not Unicode.)

That means Python applications that assume 'str' is Unicode are liable to get random exceptions when encoding properly (i.e. without surrogateescape).

> > And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
>
> I'm not sure of which problem you're talking about.

Windows (with NTFS) lets you create a file whose name is e.g. "\ud800". The APIs all handle filenames as strings of wchar_t (equivalent to uint16_t), so they're perfectly happy with that file. But it's clearly not a valid string of UTF-16 code units (because it would be an unpaired surrogate) so it can't be decoded, and it's not a valid string of Unicode scalar values so it can't be directly encoded as UTF-8 or UTF-16. It's simply not Unicode.

In practice most native Windows applications and APIs treat filenames as effectively UCS-2, and they never try to encode or decode so they don't care about surrogates, though the text rendering APIs try to decode as UTF-16 and go a bit weird if that fails. Python strings aren't UCS-2 so it has to convert to 'str' somehow, but there's no correct way to do that conversion.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:04 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Microsoft refers to it as an "extended character set":

https://docs.microsoft.com/en-us/windows/win32/fileio/nam...

Also, whatever your complaints are about whatever language, with respect to filenames, the win32 api is worse.

It's amazingly inconsistent. The level of insanity is just astonishing, especially if you're going across files created with the win api, and the .net libs.

You *have* to P/Invoke to read some files, and use the long filepath prefix, which doesn't support relative paths. And that's just the start.

Admittedly, I haven't touched it for almost a decade in any serious fashion, but, based on the docs linked above, it doesn't seem much has changed.

It's remarkable how easy they make it to write files that are quite hard to open..

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> It's amazingly inconsistent. The level of insanity is just astonishing
Just wait until you see POSIX!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 0:26 UTC (Wed) by gracinet (guest, #89400) [Link]

Hey Victor,

don't forget that Mercurial has to cope with filenames in its history that are 25 years old. Yes, that predates Mercurial but some of the older repos have had a long history as CVS then SVN.

Factor in the very strong stability requirements and the fact that any risk of changing a hash value is to be avoided, and it's no wonder a VCS is one of the last to take the plunge. It's really not a matter of the size of the codebase in this case.

Note: I wasn't directly involved in Mercurial at the time you were engaging with the project about that; I hope some good came out of it anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:18 UTC (Tue) by flussence (guest, #85566) [Link]

This was a sore point in Perl 6 too for many years due to its over-eagerness to destructively normalise everything on read. It fixed it eventually by adding a special encoding, similar to how Java has Modified UTF-8. It's not perfect, but without mandating a charset and normalisation at the filesystem level (something only Apple's dared to do) nothing is.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 7:57 UTC (Tue) by epa (subscriber, #39769) [Link] (2 responses)

How many Mercurial users store non-Unicode file names in a repository? Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames, their port to Python 3 would have gone much smoother.

If you do want a truly arbitrary ‘bag of bytes’ not just for file contents but for names too, I have the feeling you’d probably be using a different tool anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:39 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames

Losing the ability to read history of when the tool did not have such a restriction would not be a good thing. Losing the ability to manipulate those files (even to rename them to something valid) would also be tricky if it failed up front about bad filenames.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 18:58 UTC (Wed) by hkario (subscriber, #94864) [Link]

it's easy to end up with malformed names in a file system

just unzip an archive from a non-UTF-8 system and you're almost guaranteed to get mojibake as a result; then blindly commit the files to the VCS and bam, you're set

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:35 UTC (Tue) by dvdeug (guest, #10998) [Link] (5 responses)

Which means there's no way to reliably share a Mercurial repository between Windows and Unix. You can either accept all filenames or make repositories portable between Windows and Unix, not both. Note that even pretending that you can support both systems ignores those whole "arbitrary byte strings" and "arbitrary uint16_t strings" issues. I'd certainly feel comfortable with Mercurial and other tools rejecting junk file names, though I can see where people with old 8-bit charset filenames in their history could have problems.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:59 UTC (Tue) by roc (subscriber, #30627) [Link] (4 responses)

> You can either accept all filenames or make repositories portable between Windows and Unix, not both.

You can accept all filenames and make repositories portable between Windows and Unix if they have valid Unicode filenames. AFAIK that's what Mercurial does, and I hope it's what git does.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 12:33 UTC (Tue) by dezgeg (subscriber, #92243) [Link] (3 responses)

Not quite enough... Let's not forget the portability troubles of Mac, where the filesystem does Unicode (de)normalization behind the application's back.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:21 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

OK sure. The point is: you can preserve native filenames, and also ensure that repos are portable to any OS/filesystem that can represent the repo filenames correctly. That's what I want any VCS to do.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:51 UTC (Tue) by Wol (subscriber, #4433) [Link]

Do what git does with line endings, maybe?

They had a load of grief with mixed Windows/linux repos, so there's now a switch that says "convert cr/lf on checkout/checkin".

Add a switch that says "enforce valid utf-8/utf-16/Apple filenames, and sort out the mess at checkout/checkin".

If that's off by default, or on by default for new repos, or whatever, then at least NEW stuff will be sane, even if older stuff isn't.

Cheers,
Wol

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:42 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

There are also the invalid path components on Windows. Other than the reserved names and characters, space and `.` are not allowed to be the end of a path component. All the gritty details: https://docs.microsoft.com/en-us/windows/win32/fileio/nam...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:57 UTC (Mon) by prometheanfire (subscriber, #65683) [Link]

OpenStack is working on dropping Python 2 support this cycle. The problem is going to be ongoing support for older releases that still support Python 2. Just over the weekend the gate crashed because the newest setuptools version, which is Python 3 only, was being installed under Python 2. It's gonna be rough, and we are at least semi-prepared for this.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 5:40 UTC (Tue) by ssmith32 (subscriber, #72404) [Link] (35 responses)

So confused by this (but I don't really follow projects in either language.. well some Go ones that were always Go).

Python to Go seems like a weird switch. I tend to use them for very different tasks.

Unless you're bound to GCP as a platform or something similar.

But you're not the only one mentioning this: what projects have I missed that made the switch?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:02 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (34 responses)

> Python to Go seems like a weird switch. I tend to use them for very different tasks.
It actually is not, if you're writing something that is not a Jupyter notebook.

Stuff like command-line utilities and servers works really well in Go.

Several huge Python projects are migrating to Go as a result.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:06 UTC (Tue) by mgedmin (subscriber, #34497) [Link] (33 responses)

> Several huge Python projects are migrating to Go as a result.

Can you name them?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:10 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (32 responses)

YouTube is one high-profile example. Salesforce also did a lot of rewriting internally.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:09 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (31 responses)

Youtube migrating to the Google programming language is not surprising.

As for the rest, a lot of infra-related things are being rewritten in Go just because containers (k8s and docker both use Go). That has little to do with the benefits offered by the language. It’s good old network effects. When you’re the container language, and everyone wants to do containers, being decent is sufficient to carry the day.

No one will argue that Go is less than decent. Many will argue it’s more than decent, but that’s irrelevant for its adoption curve.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 18:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Rewriting a project the scope of YouTube is not a small thing. And from my sources, the Py2.7->3 migration was one of the motivating factors. After all, if you're rewriting everything then why not switch languages as well?

Mind you, Google actually tried to fix some of the Python issues by trying JIT compilation with the unladen-swallow project before that.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 19:05 UTC (Tue) by rra (subscriber, #99804) [Link]

Go is way, way faster than Python, consumes a lot less memory, and doesn't have the global interpreter lock so has much better multithreading. That's why you see a lot of infrastructure code move to Go.

For most applications, the speed changes don't matter and other concerns should dominate. But for core infrastructure code for large cloud providers, they absolutely matter in key places, and Python is not a good programming language for high-performance networking code.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 13:26 UTC (Sat) by HelloWorld (guest, #56129) [Link] (28 responses)

> No one will argue that Go is less than decent.
I will. Go is the single worst programming language design that achieved any kind of popularity in the last 10 years at least. It is archaic and outdated in pretty much every imaginable way. It puts stuff into the language that doesn't belong there, like containers and concurrency, and doesn't provide you with the tools that are needed to implement these where they belong, which is in a library. The designers of this programming language are actively pushing us back into the 1970s, and many people appear to be applauding that. It's nothing short of appalling.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 14:22 UTC (Sat) by smurf (subscriber, #17840) [Link] (27 responses)

Hmm. I agree with much of that. However: concurrency definitely does belong in a modern language. Just not the unstructured way Go does it. Cf. https://en.wikipedia.org/wiki/Structured_concurrency – which notes that the reasonable way to do it is via some cancellation mechanism, which also needs to be built into the language to be effective – but Go doesn't have one.

The other major gripe with Go which you missed, IMHO, is its appalling error handling; the requirement to return an "(err,result)" tuple and checking "err" *everywhere* (as opposed to a plain "result" and propagating exceptions via catch/throw or try/raise or however you'd call it) causes a "yay unreadable code" LOC explosion and can't protect against non-functional errors (division by zero, anybody?).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 0:09 UTC (Sun) by HelloWorld (guest, #56129) [Link] (26 responses)

> Hmm. I agree with much of that. However: concurrency definitely does belong in a modern language.
It absolutely does not. Concurrency is an ever-evolving, complex topic, and if you bake any particular approach into the language, it's impossible to change it when we discover better ways of doing it. Java tried this and failed miserably (the synchronized keyword). Scala didn't put it into the language. Instead, what happened is that people came up with better and better libraries. First you had the Scala standard library's Futures, which were a vast improvement over anything Java had to offer at the time. But they were inefficient (many context switches), had no way to interrupt a concurrent computation or safely handle resources (open file handles etc.) and made stack traces useless. Over the years, a series of better and better libraries (Monix, cats-effect) were developed, and now the ZIO library solves every single one of these problems and a bunch more. And you know what? Two years from now, ZIO will be better still, or we'll have a new library that is even better.

By contrast, Scala does have language support for exceptions. It's pretty much the same as Java's try/catch/finally; how did that hold up? It's a steaming pile of crap. It interacts poorly with concurrency, it easily leads to resource leaks, it's hard to compose, it doesn't tell you which errors can occur where, and everybody who knows what they're doing is using a library instead, because libraries like ZIO don't have *any* of these problems.

So based on that experience, you're going to have a hard time convincing me that concurrency needs language support. Feel free to try anyway, but try ZIO first.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 1:03 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (25 responses)

The best and most maintainable way to write servers is still thread-per-request. Go makes that easy with its lightweight threads. Much better than any async library I've tried.

It really is that simple.

Plus, Go has a VERY practical runtime with zero-dependency executables and a good interactive GC. It's amazing how much better Golang's simple mark&sweep is when compared to Java's neverending morass of CMS or G1GC (which constantly require 'tuning').

Sure, I would like a bit more structured concurrency in Go, but this can come later, once the Go team rolls out generics.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 5:47 UTC (Sun) by HelloWorld (guest, #56129) [Link] (24 responses)

> The best and most maintainable way to write servers is still thread-per-request. Go makes that easy with its lightweight threads. Much better than any async library I've tried.

Apparently you haven't tried ZIO, because it beats the pants off anything Go can do.

It really is that simple.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 6:01 UTC (Sun) by HelloWorld (guest, #56129) [Link] (23 responses)

I'll give just one example. With ZIO it is possible to terminate a fiber without writing custom code (e. g. checking a shared flag) and without leaking resources that the fiber may have acquired. This is simply not possible in Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 7:58 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (22 responses)

ZIO is not even in contention, since it's built by pure functional erm... how to say it politely... adherents.

Meanwhile, Go is written by practical engineers. Cancellation and timeouts are done through an explicitly passed context.Context; resource cleanups are done through deferred blocks.

These two simple mechanisms in practice allow complicated systems comprising hundreds of thousands of lines of code to work reliably, while being easy to develop and iterate on, without multi-minute waits for each compile/run cycle.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:25 UTC (Sun) by smurf (subscriber, #17840) [Link] (16 responses)

Passing a Context around, checking every function return for errors, and manually terminating no-longer-needed goroutines isn't exactly free. It bloats the code, it's error-prone, too easy to get wrong accidentally, and it makes the code much less readable.

If you come across a better paradigm sometime in the future, then bake it into a new version of the language and/or its libraries, and add interoperability features. Python3 is doing this, incidentally: asyncio is a heap of unstructured callbacks that evolved from somebody noticing that you can use "yield from" to build a coroutine runner; then Trio came along with a much better concept that actually enforces structure. Today the "anyio" module affords the same structured concept on top of asyncio, and in some probably-somewhat-distant future asyncio will support all that natively.
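For the curious, a minimal sketch of what that structure looks like with Trio (the task bodies and the timeout are made up; anyio's API is very close to this):

import trio

async def worker(name):
    await trio.sleep(10)          # cancellation is delivered at await points
    print(name, "done")

async def main():
    with trio.move_on_after(1):   # a cancel scope: everything below gets at most one second
        async with trio.open_nursery() as nursery:
            nursery.start_soon(worker, "a")
            nursery.start_soon(worker, "b")
    # leaving the nursery guarantees both tasks have either finished or been cancelled

trio.run(main)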

Languages, and their standard libraries, evolve.

With Go, this transition to Structured Concurrency is not going to happen any time soon because contexts and structure are nice-to-have features which are entirely optional and not supported by most libraries, thus it's much easier to simply ignore all that fancy structured stuff (another boilerplate argument you need to pass to every goroutine and another clause to add to every "select" because, surprise, there's no built-in cancellation? get real) and plod along as usual. The people in charge of Go do not want to change that. Their choice, just as it's my choice not to touch Go.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 12:35 UTC (Sun) by HelloWorld (guest, #56129) [Link] (15 responses)

> If you come across a better paradigm sometime in the future, then bake it into a new version of the language
You haven't yet demonstrated a single advantage of putting this into the language rather than a library, which is much more flexible and easier to evolve. Your thinking that this needs to be done in the language is probably a result of too much exposure to crippled languages like Python.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 14:35 UTC (Sun) by smurf (subscriber, #17840) [Link] (14 responses)

Go doesn't have automatic cleanup. Each call to "open file" requires a "defer close file". The point of structured code is that it's basically impossible, or at least a lot harder, to violate the structural requirements.

NB, Python also has the whole thing in a library. This is not primarily about language features. The problem is that it is simply impossible to add this to Go without either changing the language, or forcing people to write even more convoluted code.

Python basically transforms "result = foo(); return result" into what Go would call "err, result = foo(Context); if (err) return err, nil; return nil,result" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:17 UTC (Sun) by HelloWorld (guest, #56129) [Link] (13 responses)

> NB, Python also has the whole thing in a library. This is not primarily about language features.
It very much is about language features. Python has dedicated language support for list comprehensions, concurrency and error handling. But there is no need for that. Consider these:
x = await getX()
y = await getY(x)
return x + y

[ x + y
  for x in getX()
  for y in getY(x)
]
The structure is the same: we obtain an x, then we obtain a y that depends on x (expressed by the fact that getY takes x as a parameter), then we return x + y. The details are of course different, because in one case we obtain x from an async task, and in the other we obtain x from a list, but there's nevertheless a common structure. Hence, Scala offers syntax that covers both of these use cases:
for {
  x <- getX()
  y <- getY(x)
} yield x + y
And this is a much better solution than what Python does, because now you get to write generic code that works in a wide variety of contexts including error handling, concurrency, optionality, nondeterminism, statefulness and many, many others that we can't even imagine today.
> Python basically transforms "result = foo(); return result" into what Go would call "err, result = foo(Context); if (err) return err, nil; return nil,result" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.
This is a false dichotomy. Not having error handling built into the language doesn't mean you have to check for errors on every call.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 10:33 UTC (Mon) by smurf (subscriber, #17840) [Link] (12 responses)

> It very much is about language features.

Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.

But while Python is not such a language, neither is Go, so in effect you're shifting the playing ground here.

> Not having error handling built into the language doesn't mean you have to check for errors on every call.

No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 12:51 UTC (Mon) by HelloWorld (guest, #56129) [Link] (1 responses)

> Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.
You don't need lazy evaluation for this to work. Scala is not lazily evaluated and it works great there.
> No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.
You can factor out the error checking code into a function, so you don't need to write it more than once. After all, this is what we do as programmers: we detect common patterns, like “call a function, fail if it failed and proceed if it didn't” and factor them out into functions. This function is called flatMap in Scala, and it can be used like so:
getX().flatMap { x =>
  getY(x).map { y =>
    x + y
  }
}
But this is arguably hard to read, which is why we have for comprehensions. The following is equivalent to the above code:
for {
  x <- getX()
  y <- getY(x)
} yield x + y
I would argue that if you write it like this, it is no harder to read than what Python gives you:
x = getX()
y = getY(x)
return x + y
But the Scala version is much more informative. Every function now tells you in its type how it might fail (if at all), which is a huge boon to maintainability. You can also easily see which function calls might return an error, because you use <- instead of = to obtain their result. And it is much more flexible, because it's not limited to error handling but can be used for things like concurrency and other things as well. It's also compositional, meaning that if your function is concurrent and can fail, that works too.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 18:30 UTC (Mon) by darwi (subscriber, #131202) [Link]

> But the Scala version is much more informative. Every function now tells you in its type how it might fail (if at all), which is a huge boon to maintainability

Long time ago (~2013), I worked as a backend SW engineer. We transformed our code from Java (~50K lines) to Scala (~7K lines, same functionality).

After the transition was complete, not a single NullPointerException was seen anywhere in the system, thanks to the Option[T] generics and pattern matching on Some()/None. It really made a huge difference.

NULL is a mistake in computing that no modern language should imitate :-( After my Scala experience, I dread using any language that openly accepts NULLs (python3, when used in a large 20k+ code-base, included!).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 15:46 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (9 responses)

> Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.

Yes, but with these types, *ignoring* (or passing on in Python) the error takes explicit steps rather than being implicit. IMO, that's a *far* better default. I would think the Zen of Python agrees…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 17:21 UTC (Mon) by HelloWorld (guest, #56129) [Link] (8 responses)

> Yes, but with these types, *ignoring* (or passing on in Python) the error takes explicit steps rather than being implicit. IMO, that's a *far* better default.

No, passing the error on does not take explicit steps, because the monadic bind operator (>>=) takes care of that for us. And that's a Good Thing, because in the vast majority of cases that is what you want to do. The problem with exceptions isn't that error propagation is implicit, that is actually a feature, but that it interacts poorly with the type system, resources that need to be closed, concurrency etc..

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 18:28 UTC (Mon) by smurf (subscriber, #17840) [Link] (6 responses)

Python doesn't have a problem with resources to be closed (that's what "with foo() as bar"-style context managers are for), nor concurrency (assuming that you use Trio or anyio).

Typing exceptions is an unsolved problem; conceivably it could be handled by a type checker like mypy. However, in actual practice most code is documented as possibly-raising a couple of well-known "special" exceptions derived from some base type ("HTTPError"), but might actually raise a couple of others (network error, cancellation, character encoding …). Neither listing them all separately (much too tedious) nor using a catch-all BaseException (defeats the purpose) is a reasonable solution.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 20, 2020 22:44 UTC (Mon) by HelloWorld (guest, #56129) [Link] (2 responses)

> Python doesn't have a problem with resources to be closed (that's what "with foo() as bar"-style context managers are for), nor concurrency (assuming that you use Trio or anyio).
Sure, you can solve every problem that arises from adding magic features to the language by adding yet more magic. First, they added exceptions. But that interacted poorly with resource cleanup, so they added with to fix that. Then they realized that this fix interacts poorly with asynchronous code, and they added async with to cope with that. So yes, you can do it that way, because given enough thrust, pigs fly just fine. But you have yet to demonstrate a single advantage that comes from doing so.

On the other hand, there are trivial things that can't be done with with. For instance, if you want to acquire two resources, do stuff and then release them, you can just nest two with statements. But what if you want to acquire one resource for each element in a list? You can't, because that would require you to nest with statements as many times as there are elements in the list. In Scala with a decent library (ZIO or cats-effect), resources are a Monad, and lists have a traverse method that works with ALL monads, including the one for resources and the one for asynchronous tasks. But while asyncio.gather (which is basically the traverse equivalent for asynchronous code) does exist, there's no such thing in contextlib, which proves my point exactly: you end up with code that is constrained to specific use cases when it could be generic and thus much easier to learn because it works the same for _all_ monads.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 6:51 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

> But what if you want to acquire one resource for each element in a list?

You use an [Async]ExitStack. It's even in contextlib.
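For instance, acquiring one resource per element of a list (the file list is hypothetical):

from contextlib import ExitStack

paths = ["a.txt", "b.txt", "c.txt"]
with ExitStack() as stack:
    files = [stack.enter_context(open(p)) for p in paths]
    # use the files here; all of them are closed when the block exits,
    # even if opening one of them in the middle raised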

Yes, functional languages with Monads and all that stuff in them are super cool. No question. They're also super hard to learn compared to, say, Python.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 14:32 UTC (Tue) by HelloWorld (guest, #56129) [Link]

> You use an [Async]ExitStack. It's even in contextlib.
You can *always* write more code to fix any problem. That isn't the issue here, it's about code reuse. ExitStack shouldn't be needed, and neither should AsyncExitStack. These aren't solutions but symptoms.

> They're also super hard to learn compared to, say, Python.
For the first time, you're actually making an argument for putting the things in the language. But I'm not buying it, because I see how much more stuff I need to learn about in Python that just isn't necessary in fp. There's no ExitStack or AsyncExitStack in ZIO. There's no `with` statement. There's no try/except/finally, there's no ternary operator, no async/await, no assignment expressions, none of that nonsense. It's all just functions and equational reasoning. And equational reasoning is awesome _because_ it is so simple that we can teach it to high school students.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 2:32 UTC (Tue) by HelloWorld (guest, #56129) [Link] (2 responses)

I also think you're conflating two separate issues when it comes to error handling: language and library design. On the language side, this is mostly a solved problem. All you need is sum types, because they allow you to express that a computation either succeeded with a certain type or failed with another. The rest can be done in libraries.

If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem, because if you can't even list the errors that might happen in your function, you can't reasonably expect people to handle them either. You need to constrain the number of ways that a function can fail in, normally by categorising them in some way (e. g. technical errors vs. business domain errors). I think this is actually yet another way in which strongly typed functional programming pushes you towards better API design.

Unfortunately Scala hasn't proceeded along this path as far as I would like, because much of the ecosystem is based on cats-effect where type-safe error handling isn't the default. ZIO does much better, which is actually a good example of how innovation can happen when you implement these things in libraries as opposed to the language. Java has checked exceptions, and they're utterly useless now that everything is async...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 7:13 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

> Java has checked exceptions, and they're utterly useless now that everything is async

… and unstructured.

The Java people have indicated that they're going to migrate their async concepts towards Structured Concurrency, at which point they'll again be (somewhat) useful.

> If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem

That's one side of the medal. The other is that IMHO a library which insists on re-packaging every kind of error under the sun in its own exception type is intensely annoying because that loses or hides information.

There's not much commonality between a Cancellation, a JSON syntax error, a character encoding problem, or a HTTP 50x error, yet an HTTP client library might conceivably raise any one of those. And personally I have no problem with that – I teach my code to retry any 40x errors with exponential back-off and leave the rest to "retry *much* later and alert a human", thus the next-higher error handler is the Exception superclass anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Mar 19, 2020 16:52 UTC (Thu) by bjartur (guest, #67801) [Link]

Nesting result types explicitly is helpful because it makes you wonder when an exponential backoff is appropriate.

How about getWeather :: String → DateTime → IO (DnsResponse (TcpSession (HttpResponse (Json Weather)))), where each layer can fail? Of course, there's some leeway in choosing how to layer the types (although handling e.g. out-of-memory errors this way would be unreasonable IMO).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 21, 2020 22:42 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

Even `>>=` is explicit error handling here (as it is with all the syntactic sugar that boils down to it). Using the convenience operators like >>= or Rust's Result::and_then or other similar methods are explicitly handling error conditions. Because the compiler knows about them it can clean up all the resources and such in a known way versus the unwinder figuring out what to do.

As a code reviewer, implicit codepaths are harder to reason about and don't make me as confident when reviewing such code (though the bar may also be lower in these cases because error reporting of escaping exceptions may be louder ignoring the `except BaseException: pass` anti-pattern instances).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 11:23 UTC (Sun) by HelloWorld (guest, #56129) [Link] (4 responses)

> Zio is not even in contention, since it's built by pure functional erm... how to say it politly... adherents.

You're free to stick with purely dysfunctional programming then. Have fun!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 18:49 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Indeed. It's way superior because it's actually used in practice.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:19 UTC (Sun) by HelloWorld (guest, #56129) [Link] (2 responses)

Well, so is pure FP that you have obviously no clue about and reject for purely ideological reasons. Oh well, your loss.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 21:25 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I actually worked on a rather large project in Haskell (a CPU circuit simulator) and I don't have many fond memories about it. I also spent probably several months in aggregate waiting for Scala code to compile.

My verdict is that pure FP languages are used only for ideological reasons and are totally impractical otherwise.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 22:42 UTC (Sun) by HelloWorld (guest, #56129) [Link]

So you worked on a bad Haskell project and that somehow makes functional programming impractical? That's not how logic works, but it does explain your unsubstantiated knee-jerk reactions to everything fp.

> I also spent probably several months in aggregate waiting for Scala code to compile.
There is some truth to this; it would be nice if the compiler were faster. That said, it has become significantly faster over the years and it's not nearly slow enough to make programming in Scala “totally impractical”. And the fact that I was able to name a very simple problem (“make an asynchronous operation interruptible without writing (error-prone) custom code and without leaking resources”) that has a trivial solution with ZIO and no solution at all in Go proves that pure fp has nothing to do with ideology. It solves real-world problems. There's a reason why React took over in the frontend space: it works better than anything else because it's functional.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 0:02 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (15 responses)

One of the points made in the blog post, though, is that the creators of Python 3 did some really stupid stuff that made it needlessly difficult to write code that worked in both versions. The specific example that stood out to me was the use of prefixes to specify whether a string literal was a string of bytes or of unicode code points. In Python 2, it was possible to specify b'' to say it was a byte string and u'' to say it was a unicode string. Python 3 kept the b'' syntax but initially eliminated the u'' for unicode strings, and only brought it back when users complained. That hurt people trying to move from Python 2 to Python 3 without providing any benefit to people starting with Python 3.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 10:12 UTC (Tue) by smurf (subscriber, #17840) [Link] (14 responses)

The u'' syntax was removed because the initial idea was that people would use 2to3 and similar tools to convert their code base to Python3 once, and they'd be done. Given the initial goal of quickly converting the whole infrastructure to Python3, that could even have worked.

What happened instead was an intense period of slowly converting to Py3, heaps of code that use "import six", and modules that ran, and run, with both 2 and 3 once some of those nits were reverted. And they were.

Thus, IMHO accusations of Python core developers not listening to (some of) their users are for the most part really unfounded. Hindsight is 20/20; yes, they could have done some things better, but frankly my compassion for people who take their own sweet time to port their code to Python3 and complain *now*, when there's definitely no more chance to add anything to Py2 to make the transition easier, is severely limited.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:54 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> The u'' syntax was removed because the initial idea was that people would use 2to3 and similar tools to convert their code base to Python3 once, and they'd be done.
That wouldn't have worked because Python 3.0 lacked many required features, like being able to use format strings with bytes. They got re-added only in Python 3.5, released in late 2015 ( https://www.python.org/dev/peps/pep-0461/ ). So for many projects realistic porting could begin only once that trickled down to major distros.

These concerns were raised back in 2008, but Py3 developers ignored them because it was clear (to them) that only bad code needed it and developers should shut up and eat their veggies.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 18:47 UTC (Wed) by togga (guest, #53103) [Link] (12 responses)

> "like being able to use format strings with bytes. They got re-added only in Python 3.5 released in late 2014"

And then obviously removed later on.

Python 2.7.17 >>> b'{x}'.format(x=10)
'10'

Python 3.7.5 >>> b'{x}'.format(x=10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 19:47 UTC (Wed) by foom (subscriber, #14868) [Link] (11 responses)

In python 3.5, the "legacy" % formatting,
>>> b'%d' % (55,)
was supported again, but *not* the new and recommended format function.
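
Roughly, a quick interpreter sketch of the difference on Python 3.5 or later (the byte strings here are just examples):

>>> b'%d items' % (55,)        # printf-style formatting works on bytes again (PEP 461)
b'55 items'
>>> b'{} items'.format(55)     # .format() was never added to bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'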

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 19:57 UTC (Wed) by togga (guest, #53103) [Link] (10 responses)

What an irony that the new and recommended format function doesn't work with the latest Python 3.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 8:50 UTC (Thu) by smurf (subscriber, #17840) [Link] (9 responses)

That's not an irony; it actually makes sense. %- and .format-formatting are typically used in different contexts.
.format was never "recommended" on bytestrings; in fact, it was initially proposed for Python3. Neither was % recommended there, but lots of older code uses it in contexts which end up byte-ish when you migrate to Py3. That usage was never prevalent for .format, so why should the Python devs incur the additional complexity of adding it to bytes?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 14:18 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

> That usage never was prevalent for .format, so why should the Python devs incur the additional complexity of adding it to bytes?

So instead the burden is put on the coder, who has to think about whether bytes or strings will be threaded through their code and can't use the newer API if bytes might be floating about?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 23, 2020 16:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

You're not supposed to use bytes. Bytes are unhealthy and bad for you. Fake Unicode all the way!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 2:24 UTC (Sat) by togga (guest, #53103) [Link] (5 responses)

I think not being able to migrate developers to python3 for 10 years took a toll on pride, resulting in politics and statements rather than sane language development. A number of weird decisions (some described in the article) point to that.

There is no reason for not
* allowing byte strings as attributes
* being consistent with types and syntax for byte strings and strings
* being consistent with format options for strings and byte strings
* etc.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 11:03 UTC (Sat) by smurf (subscriber, #17840) [Link] (4 responses)

Python3 never had bytestrings as attributes, so I don't know how that would follow.

Python3 source code is Unicode. Python attribute access is written as "object.attr". This "attr" part therefore must be Unicode. Why would you want to use anything else? If you need bytestrings as keys, or integers for that matter, use a dict.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 21:44 UTC (Sat) by togga (guest, #53103) [Link] (3 responses)

Python source code doesn't have to be unicode. The encoding of the source code has nothing to do with attributes.

>Why would you want to use anything else?
Mostly due to library APIs requiring attributes for many things. This is a big source of py3 encode/decode errors.

>"use a dict."
This is what attributes do:

>>> a=type('A', (object,), {})()
>>> setattr(a, 'b', 22)
>>> setattr(a, b'c', 12)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: attribute name must be string, not 'bytes'
>>> a.__dict__
{'b': 22}
>>> type(a.__dict__)
<class 'dict'>
>>> a.__dict__[b'c']=12
>>> a.__dict__
{'b': 22, b'c': 12}

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 29, 2020 12:32 UTC (Wed) by smurf (subscriber, #17840) [Link] (2 responses)

> Python source code doesn't have to be unicode

Sure, if you want to be pedantic you can use "-*- coding: iso-8859-1 -*-" (or whatever) in the first two lines and write your code in Latin1 (or whatever), but that's just the codec Python uses to read the source. It's still decoded to Unicode internally.

> >"use a dict."
> This is what attributes does:

Currently. In CPython. Other Python implementations, or future versions of CPython, may or may not use what is, or looks like, a generic dict to do that.

Yes, I do question why anybody would want to use attributes which then can't be accessed with `obj.attr` syntax. That's useless.
Also, it's not just bytes: arbitrary strings frequently contain hyphens, dots, or even start with digits.

Use a (real) dict.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Feb 12, 2020 20:40 UTC (Wed) by togga (guest, #53103) [Link] (1 responses)

>> " That's useless."

As I said above, it's a necessity due to the design of library APIs. Examples of such needed-but-otherwise-unnecessary encode/decode calls are plentiful (and error-prone). The article mentions a few; I've already mentioned ctypes, where for instance structure field names (often read from binary APIs such as C strings) are required to be attributes.

This thread has become a bit off topic. The interesting question for me is Python 2to3 language regressions, or which migrations are feasible; that stage was done in roughly 2010 to 2013, with several failed Python3 migration attempts. Nothing of value has changed since. Half of my Python2 use-cases are not suited for Python3 due to its design choices, and I do not intend to squeeze a problem into a tool not suited for it. That's more of a fact.

The question in the back of my head is about the other half of my use-cases, the ones that do fit Python3. Given the experience of python leadership attitudes, decisions, migration pains, etc. (of which the article is one example), is Python3 a sound language choice for new projects?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Feb 12, 2020 20:47 UTC (Wed) by togga (guest, #53103) [Link]

>> interesting question for me is Python 2to3 language regressions

Oops... it should read the opposite: the interesting question for me *is not* Python 2to3 language regressions.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 25, 2020 13:26 UTC (Sat) by foom (subscriber, #14868) [Link]

The new format method was added because it was thought to be a better syntax for doing formatting. The % formatting was only kept in python3 (for strings) because it didn't seem feasible to migrate everyone's existing format strings, which might even be stored externally in config files.

Given the invention of better format syntax, forcing the continued use of the worse/legacy % format syntax for bytestrings seems a somewhat mystifying decision.

It's not as if the only use of bytestrings is in code converted from python 2...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:23 UTC (Tue) by atai (subscriber, #10977) [Link]

It is time for Python to stop caring about its users so other languages can get their chances

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 8:02 UTC (Tue) by edomaur (subscriber, #14520) [Link]

RiiR is in progress, at Facebook :-)

Facebook is using Mercurial internally because it works better than git for a monorepository, but they had to rewrite many hot paths, and are currently working on Mononoke, a full-Rust implementation of the Mercurial server. Still, "The version that we provide on GitHub does not build yet.", but I think it's an interesting project.

https://github.com/facebookexperimental/mononoke

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 0:43 UTC (Wed) by gracinet (guest, #89400) [Link]

Well, here's a nice coincidence: Greg Szorc is also one of the first proponents of Rust in Mercurial. He wrote the "OxidationPlan" (https://www.mercurial-scm.org/wiki/OxidationPlan) a while ago to that effect.

That got me on board, and if you're interested, my colleague Raphaël Gomès will give two talks on that subject in two weeks at FOSDEM.

I know from your posting history here what you think of Python3 and unicode strings, but even though Rust has dedicated types for filesystem paths, we still have some issues. For instance regex patterns are `&str`. It can be overcome by escaping, but that's less direct than reading them straight from the `.hgignore` file.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 19:30 UTC (Mon) by HelloWorld (guest, #56129) [Link] (30 responses)

Just in case anybody needed yet more evidence that dynamic typing pretty much invariably leads to an unmaintainable mess…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 20:49 UTC (Mon) by ehiggs (subscriber, #90713) [Link] (28 responses)

Meanwhile much of the Java ecosystem is stuck on Java 8.

Python's migration problem was that it didn't allow Python3 code to load and run Python 2 bytecode or otherwise use Python2 files (and vice versa) until they were all ported. This meant that any project that wanted to migrate had to wait until 100% of its dependencies were on Python3 already. And any library had a huge window where it needed to maintain compatibility with both 2 and 3 (so no one could take advantage of new features).

Without this migration path, it meant developers needed to perform a big-bang migration. The article calls it a "flag day" migration.

This is discussed in the section labelled "Commentary on Python 3" which is well written and easy to follow.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 21:28 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (27 responses)

> Meanwhile much of the Java ecosystem is stuck on Java 8.
That's not quite true. Java 9 introduced modules, which many projects are cheerfully ignoring. But most Java 8 code works just fine on Java 9, without being module-aware.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:09 UTC (Mon) by ehiggs (subscriber, #90713) [Link] (9 responses)

>That's not quite true

It is absolutely true. Libraries target JDK 8 because that's what Android is stuck on. It's still a hassle and Java's type system didn't save it from the problems.

Java 9 was EOL in March 2018. Java 10 was EOL in September 2018. And if you're a commercial user of Java 8 and don't have a license with Oracle or anyone else, support was EOL in January 2019.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:12 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> It is absolutely true. Libraries target JDK 8 because that's what Android is stuck on. It's still a hassle and Java's type system didn't save it from the problems.
The thing is, it's easy to have a library targeting JDK 8 work on JDK 11. I have several packages that are doing that. You basically need to refrain from using JDK>8 features and you'll be fine.

This didn't work with Python, the transition from 2 to 3 required massive rewrites.

> And if you're a commercial user of Java 8 and don't have a license with Oracle or anyone else, support was EOL in January 2019.
Just use https://aws.amazon.com/corretto/ , it'll be supported for a loooong time.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:22 UTC (Mon) by ehiggs (subscriber, #90713) [Link] (1 responses)

>You basically need to refrain from using JDK>8 features and you'll be fine.

Indeed and this is not the desired state of affairs. And Java's type system did not save it.

> This didn't work with Python, the transition from 2 to 3 required massive rewrites.

Indeed and this is not the desired state of affairs. And Python's type system did not cause it.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:30 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Indeed and this is not the desired state of affairs. And Java's type system did not save it.
It did. I can run JDK 8 code in JDK 11 without any modifications, mixing and matching it freely with newer versions.

> Indeed and this is not the desired state of affairs. And Python's type system did not cause it.
Yes, it did. The string type was fundamentally changed, along with a significant chunk of the API.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 3:11 UTC (Tue) by cesarb (subscriber, #6266) [Link] (1 responses)

> The thing is, it's easy to have a library targeting JDK 8 to work on JDK 11. [...] You basically need to refrain from using JDK>8 features and you'll be fine.

Unless your package, or one of its dependencies, does bytecode manipulation and uses an old version of the bytecode manipulation library, which chokes on classes compiled for a newer JDK. Or your package depends on one of the several J2EE libraries which were removed by JDK 11 (some of them having no replacement outside of the JDK). Or your package, or one of its dependencies, chokes on the replacement of one of the several J2EE libraries which were removed by JDK 11, because it uses an old version of the bytecode manipulation library, and the replacement J2EE library was compiled for a newer JDK.

As late as the end of 2019, some packages were still announcing Java 9 compatibility fixes. For some reason, Java 9 had more compatibility issues than usual, and Java 11 made it worse by completely removing components first deprecated in the short-lived Java 9 release.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:12 UTC (Tue) by ssmith32 (subscriber, #72404) [Link]

Apache Beam is a reasonably popular library stuck on Java 8, for similar reasons.

Of course, it is doing some pretty wacky stuff. But it's the only option for some things (e.g. GCP Dataflow).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:52 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (3 responses)

And do you really think Android getting stuck has no relationship at all with Google getting sued and redirecting its investments elsewhere?

The Java leadership has been busy making itself irrelevant by alienating most of the rest of the IT world.

Though I wonder where that will leave all the Apache Foundation Java projects. They can't survive in a closed-circuit loop forever. Scala is not the solution; its adoption outside the existing Java world is nonexistent.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:54 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Google is moving towards Kotlin, which runs fine on JVM 8. There are also third parties maintaining JVM forks (Amazon is one, with its Corretto project). Java will be fine.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:09 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

That means some more years fighting on the governance of Java, on what is a real Java implementation, what is not, what APIs can/should be used or not, etc.

Who wants to deal with this crap forever? Easier to port to another language and let someone else fatten lawyers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Who cares? OpenJDK is under GPL so Amazon can freely maintain its fork. They just need to avoid calling it "Java" to avoid trademark issues.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:16 UTC (Mon) by HelloWorld (guest, #56129) [Link] (16 responses)

> That's not quite true. Java 9 introduced modules which many projects are cheerfully ignoring.
Well, that might be because they suck. They've been working on this stuff for years and yet it's still not possible to have multiple versions of the same library in a single program. So if you want to use two different libraries that both depend on a third library but in different versions, you lose. Unless of course you use OSGi which has been around for, what, 20 years now and already solved this problem when it first came out.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:22 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Well, that might be because they suck.
Indeed they do.

However, I can't fault the way Snoracle introduced them - none of my module-unaware code broke during JDK9 migration.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:44 UTC (Tue) by Conan_Kudo (subscriber, #103240) [Link] (14 responses)

> They've been working on this stuff for years and yet it's still not possible to have multiple versions of the same library in a single program.

Oh God, no. I like my sanity, thank you very much. I am *totally* OK with that restriction and I would rather nobody ever lifted it.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:58 UTC (Tue) by HelloWorld (guest, #56129) [Link]

I see, apparently you don't like modularity.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 19:02 UTC (Tue) by HelloWorld (guest, #56129) [Link] (12 responses)

If I use library X, and X uses library Y internally but doesn't expose its types in its API, then Y is an implementation detail of X. Now if I also happen to use Y for some unrelated purpose, then upgrading either X or Y might break my application. IOW, I am now affected by X's implementation details. That is the antithesis of modularity, because modularity means that the implementation details of a module don't affect its users. So, could you explain why you believe this breach of modularity is a good thing?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 8:26 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (11 responses)

It's a bad thing on the development side. But it's (arguably) a good thing on the packaging and deployment side. As an SRE, I want to deploy one, complete, and entirely self-contained artifact into production, and ideally, I want it to behave exactly the same every time (because I'm going to deploy it to N machines, and I don't want half of them to explode and the other half to melt). It'd also be swell if that artifact would produce logs such as "This version of Y is 1.2.3 and it was pulled in by X, don't confuse it with the freestanding Y that is version 1.3.2." And, of course, I want some kind of ironclad guarantee that the two versions of Y "can't see each other" and therefore won't interfere in unexpected ways at runtime.

Even given all of that, I'm still not thrilled with this idea, because I *know* that sooner or later, some part of X is going to somehow call the "wrong Y," and I'll get paged at 3 AM when it crashes in production. And I'm sure that someone will have written a comment somewhere long ago, assuring the reader that, no, of course it's "impossible" for X to call the wrong Y, don't be silly, you see, they are entirely separate, there's no possible way for them to interact. Except for that one obscure side channel the SWE forgot about, where on alternate Thursdays when the moon is full, the software briefly tries to bind TCP port 12345 at exactly the stroke of midnight, in order to practice speaking a profane and blasphemous protocol. A protocol defined only in an RFC that the IETF subsequently declared Librorum Prohibitorum, and which must now be obtained by special dispensation from Vint Cerf. Why does it do this? Because some client asked for it five years ago and everyone has now bleached that contract from their collective unconscious. Anyway, the second version of Y fails to bind the port on EADDRINUSE, and the error gets swallowed because don't you know, in a containerized setup, you're not supposed to get EADDRINUSE, so obviously it's a /* can't happen */ situation. Then X, blissfully unaware, connects to the port and talks to the wrong Y, and the wrong Y does something subtly different from what X wanted, and if you're very lucky, this merely causes the app to crash.

In theory, those are mostly solvable problems. In practice, the language is not actually in a position to solve them (Are you really going to stop the two versions of Y from interacting with any kind of global state, including the filesystem? Unless your language is Haskell, or perhaps an extremely locked down dialect of JavaScript, that isn't realistic as a language-level restriction.). They are institutional problems, and require institutional solutions. As it turns out, one of those solutions can be* "library versions are bumped on a fixed schedule, keep up or else your code stops building and we stop deploying it." A "one library version per process" rule is a straightforward way of enforcing that, but of course you could just as easily attach some kind of custom restriction to the build process instead. It's just a matter of convenience.

* "Can be," not "has to be." There are other solutions, with various advantages and tradeoffs, which are beyond the scope of this discussion.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:10 UTC (Wed) by roc (subscriber, #30627) [Link] (2 responses)

Are such environmental collisions any more of a problem in practice for multiple versions of the same library coexisting than for different libraries coexisting?

I'm familiar with libraries stomping on each other at run-time, e.g. with races around fork(). I'm familiar with C libraries stomping on each other with symbol collisions during linking. I've even had listening port collisions with two components in the same container. But my Rust project links multiple versions of some libraries (which libraries, and which versions changes over time), and I haven't had any problems with that so far in practice.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 21:23 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

It is much more likely that two versions of the same library will share symbols than different libraries. The latter can probably be avoided by not exporting all symbols by default or adding a few strategic `static` keywords. The former…well, Python2 and Python3 can't be mixed in the same process because they share symbol names. Same with glib and other projects that tend to be "good citizens" with their ABI management.

Rust avoids the symbol problems, but still has woes with libraries trying to control any global resource (signal handlers, environment variables, etc.).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 21:38 UTC (Wed) by roc (subscriber, #30627) [Link]

Yes I understand that for C-like linkage multiple versions of the same library are a disaster.

Signal handlers are just a massive problem in general --- for different libraries as well as for two versions of the same library. For that and other reasons I have not encountered any Rust crates (other than tokio-signal) that set signal handlers. Likewise setting environment variables is a minefield libraries should avoid under any circumstances.

So I agree that two versions of the same library are more likely to hit these issues than two different libraries, but I'm not convinced that *in practice* it's really worse, for Rust.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:48 UTC (Wed) by HelloWorld (guest, #56129) [Link] (2 responses)

There is no global state in Java, “static” variables are per-classloader, and of course you can't load two classes of the same name in a single class loader, so different versions of the same class will have their own state. And if they interact through the file system or a TCP port, then two separate processes using different versions of the same library would also be affected, so such a library would be broken whether or not you use two different versions in a single process.

Besides, it's not like Java will detect that there are two different versions of a library on the classpath unless you use special measures to prevent that. It'll just crash later when something tries to call a method that isn't there any more or something like that, so it's not like your approach of “let's just forbid it” prevents anything.

So yeah, I'm not buying it. Libraries should be isolated from each other, anything else just doesn't scale…

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:57 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

> There is no global state in Java, “static” variables are per-classloader, and of course you can't load two classes of the same name in a single class loader, so different versions of the same class will have their own state. And if they interact through the file system or a TCP port, then two processes separate using different versions of the same library would also be affected, so such a library would be broken whether or not you use two different versions in a single process.

You see, this kind of clever thinking on the SWE side is why the average SRE has a drinking problem. I told you that there would be a comment to that effect, and you *actually wrote it for me,* apparently in all seriousness believing it would change my mind.

Realistically, every major operating system has process-wide mutable parameters (the working directory, the umask, UID/GIDs, stdin/out/err redirection, etc.), and while Java may try very hard to sandbox those parameters, you can always call out to native code and manipulate them anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:01 UTC (Thu) by HelloWorld (guest, #56129) [Link]

> apparently in all seriousness believing it would change my mind.
Yes, I thought that arguments could convince someone to change their opinion, silly me. Apparently this is more of a religious thing for you...

Of course, 99% of libraries *don't* call out to native code, so what you're saying is that we can't have the solution for the 99%, because it might not work for the 1%, and of course you don't have a solution for the remaining cases either.

Besides, all this stuff about the cwd, the uid/gid, stdio redirection etc. is complete hogwash, because these are shared among *different* libraries as well, and therefore any library that relies on any of these to be in any particular state is broken to begin with, whether or not you allow multiple versions to be loaded. It's a completely unrelated problem.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 19:58 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (4 responses)

It’s not a good thing on the deployment side either. It leads to the slow accumulation of multiple versions of the same thing, which is not good resource- and performance-wise, and leads to death marches when a problem affecting a wide range of versions is found (as is always eventually the case).

At that point, the pyramid crumbles under the weight of its technical debt and wipes out all past “savings”.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:09 UTC (Thu) by HelloWorld (guest, #56129) [Link] (3 responses)

Now that you've explained all the problems, it's time to start talking about solutions. Say I depend on two different libraries that I don't have control over and that depend on incompatible versions of a third library. Now what?

The fact of the matter is that this problem doesn't go away on its own, and if the platform doesn't solve it, people come up with other solutions. For Java that is JarJar Links...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:33 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (2 responses)

Now, the correct software engineering solution is to help whichever part of the stack depends on a lagging version of the lib get ported to a common supported version (or help port it to something else, if the third-party lib is so broken that porting is more expensive than dropping it).

That’s the inherent cost of using third party code. Don’t like it? Write your own code.

Engineering means delivering reliable solutions. Not letting problems fester in dark places.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 14:19 UTC (Thu) by HelloWorld (guest, #56129) [Link]

Bullshit! What happens in practice in larger projects isn't that the whole stack is migrated to the new version, but that people stick with the status quo, because there's no way to do the migration piecemeal and it's too large a disruption to do it in one step. I've seen this happen even in relatively small projects (< 50,000 LOC), and it's bound to be much worse in larger ones.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 14:22 UTC (Thu) by HelloWorld (guest, #56129) [Link]

Besides, “letting problems fester in dark places” isn't a solid technical argument; rather, it's just rhetoric. Perhaps a new, incompatible release of a library only improves matters in a way that isn't relevant for my particular use case, and in that case, the correct response is “if it ain't broke don't fix it”.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:49 UTC (Tue) by KaiRo (subscriber, #1987) [Link]

When you e.g. deal with file names and interoperability between different locales, OS versions and even OSes, as well as a history of what your own code worked with, typing is not really the core issue. String types and encodings become a mess in any programming language (I always loved reading Mozilla C++ patches and the neverending discussions about which string types to use and how to convert them into each other without killing performance).
When you use non-unicode file names across different file systems or computers (or even OSes), all hell may break loose. I always said there be dragons when using non-ASCII file names, but in stark contrast to my 8.3 filename times, I nowadays marvel at how BMP characters tend to work for file names even across different locales and OSes...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 21:31 UTC (Mon) by roc (subscriber, #30627) [Link] (7 responses)

I kinda wonder why Mercurial was implemented in Python2 in the beginning. Reliability, maintainability and performance are key features for a VCS and a dynamically typed, interpreted language is not a natural fit for those requirements.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:15 UTC (Mon) by ehiggs (subscriber, #90713) [Link] (3 responses)

Having used Python from a shared network mount (NFS, GPFS, Panasas, Lustre) and waited for it to stat 10 thousand files over the network just to start up (thanks, pkg_resources), I also wonder why a VCS would be implemented in Python.

(In fairness, Python isn't the only culprit; Cargo also stores its index as directories and files and would probably be much snappier if it used SQLite.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:25 UTC (Tue) by KaiRo (subscriber, #1987) [Link] (1 responses)

side talk: AFAIK, cargo is only needed for compiling, not for running the application, right? Slightly different than the Python story...
(Disclaimer: I never wrote Rust code but use Python 3 in a few projects.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 4:49 UTC (Tue) by roc (subscriber, #30627) [Link]

That's right.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:43 UTC (Tue) by mgedmin (subscriber, #34497) [Link]

pkg_resources is infamous for this (there's an open bug, but no clear path to resolution). Luckily, not every Python script uses it.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 3:06 UTC (Tue) by foom (subscriber, #14868) [Link] (2 responses)

And yet Bazaar also chose python as its implementation language, at about the same time. So it would seem that python did look like a good match for that kind of software at that point.

And these days, maybe you'll see less python, but writing server software in JavaScript is now all the rage...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 16:43 UTC (Tue) by mgedmin (subscriber, #34497) [Link] (1 responses)

For Bazaar, the original plan was to write a prototype in Python, then rewrite in a lower level language for speed. A few years later the developers decided that Python was good enough after all and rewriting would not be worth the effort.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 1:03 UTC (Wed) by roc (subscriber, #30627) [Link]

That is interesting.

I guess when you write a prototype, you need to consider whether it's worth writing in a language that will let it grow into a performant, reliable production system.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:37 UTC (Mon) by togga (guest, #53103) [Link] (52 responses)

I can second almost all of the py2/py3 issues in numerous projects (along with a few more).

"Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment"

* the approach of assuming the world is Unicode is flat out wrong
* it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality
* Python 3 introduces a ton of problems and doesn't really solve many
* Python's pendulum has swung too far towards Unicode only.

Given the obvious mismatch, it had to be a big leap of faith for Gregory to even undertake this effort to begin with. We can now conclude that the ROI will never happen.

The last bullet above, in combination with ctypes, was the definitive turning point for me: being hit by exceptions in iovisor/bcc after they added ~130 encode/decode calls for py3 compatibility. This made me abandon the iovisor/bcc Python part altogether. Even if it tries to solve day-to-day issues, it creates a whole set of new issues.

I've been a python developer since 2003 and have seen it go from a kick-ass scripting language (former py2) to a subpar application language (py3). If the python community, instead of this mess and questionable feature creep, had during the wasted years focused on things like the GIL, the threading model and performance, python could have been better prepared for this:

https://www.theguardian.com/commentisfree/2020/jan/11/we-...

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 0:49 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (16 responses)

> I've been a python developer since 2003 and seen it go from a kick-ass scripting language (former py2) to a subpar application language (py3).

Meh, I wouldn't go that far. Python 2 or 3 are both perfectly serviceable "I would use bash except it's kinda terrible" languages (which is what I think of when I hear "scripting language"). For the most part, in new code, you don't have to think about encoding issues because everything "just works" (even when the way that it works is frankly terrifying - see surrogateescape for example).

Sure, if you are actually taking code points apart and playing around with the UTF-8 representation, it's a lot more annoying. But realistically, for basic sysadmin-ish scripting tasks, you're not actually doing that.

> * it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality

Can you please be more specific? The only "obvious" example I can think of is the filesystem encoding, but surrogateescape *does* abstract over that with no data loss or loss of functionality, and it's not "outright wrong" because the transformation is losslessly reversed on round-trip.
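
For example, a minimal sketch of the surrogateescape round-trip (assuming a UTF-8 locale; the byte value is arbitrary):

>>> raw = b'caf\xe9'      # Latin-1 bytes, not valid UTF-8
>>> name = raw.decode('utf-8', errors='surrogateescape')
>>> name                  # the undecodable byte became a lone surrogate
'caf\udce9'
>>> name.encode('utf-8', errors='surrogateescape') == raw
True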

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 4:50 UTC (Tue) by roc (subscriber, #30627) [Link] (9 responses)

See above for why there are still problems on Windows.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 10:18 UTC (Tue) by smurf (subscriber, #17840) [Link] (8 responses)

Yeah, but that's an implementation problem. Conceptually, the surrogateescape route is bidirectionally lossless and thus solves the problem.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 12:02 UTC (Tue) by roc (subscriber, #30627) [Link] (7 responses)

Read the comments more carefully. surrogateescape does not give you a way to encode a Windows filename containing a lone surrogate as a Python Unicode string.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 17:02 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (3 responses)

My read of https://docs.microsoft.com/en-us/windows/win32/fileio/nam... is that those filenames are illegal anyway:

> Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:

> [various exceptions]

In this context, "Unicode" means "UTF-16," because Microsoft. A lone surrogate is certainly not a "UTF-16 character." It's half a character.

I don't know if the various *W interfaces actually check for lone surrogates and error out (the documentation for CreateFileW does not explicitly call this case out), but they probably should.

(Microsoft's error checking of filenames is kinda terrible anyway, so I would not be too surprised if you could get lone surrogates through the API. For example, it says that you can't create a file whose name ends in a dot, but if you prefix the path with the \\?\ prefix that they also discuss on the same page, then that check is bypassed. And then your sysadmin has to figure out how to delete the damned thing, because none of the standard tools will even recognize that it exists. See also: https://bugs.python.org/issue22299)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 19:39 UTC (Tue) by foom (subscriber, #14868) [Link] (2 responses)

No, it doesn't really mean UTF-16, it really means "16-bit unicode". This wasn't some mistake -- the Windows unicode support was defined back when surrogates didn't exist, and "unicode characters" simply *were* only 16 bits wide, because people were REALLY REALLY trying to pretend that all the characters would fit into 65536 codepoints. It's the same in Java and JavaScript, too, fwiw...

So, yes, you can perfectly well put lone halves of a surrogate pair in windows unicode strings and filesystems.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 21:46 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

That was true once upon a couple of decades ago, but now "Unicode" means either "UTF-16" or "UTF-16LE" depending on context. See for example https://docs.microsoft.com/en-us/windows/win32/intl/surro..., which explicitly states:

> Standalone surrogate code points have either a high surrogate without an adjacent low surrogate, or vice versa. These code points are invalid and are not supported. Their behavior is undefined.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 22:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes, we know that. But there are real Windows filesystems that don't.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 19:47 UTC (Tue) by foom (subscriber, #14868) [Link] (2 responses)

You don't need a special way to store a lone surrogate in a python string -- python unicode strings are fine with holding lone surrogates.

However, you do need to have a utf-16 decoder/encoder variant which allows lone surrogates to be decoded/encoded without throwing an error. Python has the "surrogatepass" error handler for that. E.g.:

b'\x00\xD8'.decode('utf-16le', errors='surrogatepass').encode('utf-16le', errors='surrogatepass')
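
Roughly, what that looks like in an interactive session:

>>> s = b'\x00\xd8'.decode('utf-16le', errors='surrogatepass')
>>> s                     # a lone high surrogate, stored as-is in the str
'\ud800'
>>> s.encode('utf-16le', errors='surrogatepass')
b'\x00\xd8'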

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 21:55 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

Good point. That's interesting. So you can use the "lone surrogate" (mis)feature to represent both Linux and Windows filenames as Python3 "Unicode" strings... it's just that the method is different for Linux and Windows.

But do APIs like os.listdir() do that automatically on Windows like they do on Linux?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 4:18 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

According to (a comment in) the source code: https://github.com/python/cpython/blob/master/Modules/pos...

> On Windows, if we get a (Unicode) string we extract the wchar_t * and return it; if we get bytes we decode to wchar_t * and return that.

I believe this means that the Windows version of that module will never try to encode Unicode strings into UTF-16LE (or any other encoding), meaning it won't "catch" invalid surrogates. They should pass straight through to the Windows *W APIs.

This is also supported by the os.path docs, which say the following:

> Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

Since they only call out bytes objects, I think the implication is that str objects are not a problem on Windows. But the fact that it makes no mention of os.fsencode() and os.fsdecode(), nor the surrogateescape handler, makes me suspicious of whether this documentation is still up-to-date.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 20:13 UTC (Tue) by togga (guest, #53103) [Link]

I once chose Python as a "glue scripting language" with introspection, an interactive environment and easy integration with C. Other major features were "never mess with data", numpy/scipy/matplotlib and lots of batteries included. Data from any program (random file format, network, embedded device under test, web page, someone's Excel sheet / MATLAB program, algorithm with a binary interface, etc.) needed to be passed as-is or processed to another program.

The main point here is that scripting/glue languages should not mess with or set requirements on any data. The Python2 experience (batteries included) was seamless enough that you could overlook the workarounds needed for its limitations (threading, performance etc.) and some other plain ugly design/behaviour. In python3 this is just not the case anymore; the batteries come with constant glitches, etc. (a lose-lose situation).

It's awesome, though, what Python has achieved over time, and it may even be a good thing that py2 dies, as it gives an opportunity to migrate to a more modern language for these classes of use-cases / problem domains. Old python code is relatively easy to interface from (some) other languages if needed.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:55 UTC (Thu) by jezuch (subscriber, #52988) [Link] (4 responses)

The way I see it, for some reason people started insisting on writing large applications in Python even though it's basically a scripting language, with nowhere near enough safety nets to keep it from becoming brittle in the long run. But yeah, I'm very much in the "veggies are good for your health" camp and I love Haskell, the ultimate bondage-and-discipline language :)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 22:23 UTC (Thu) by togga (guest, #53103) [Link] (3 responses)

Yes. But this is valid for both Python2 and Python3. The main difference is that Python2 adapted to the outside world, while Python3 requires the outside world to adapt, making it less useful as a scripting language.

decode/encode may be lossless if the input is ASCII (0-127), but for random input it's very fragile, and the language is full of "inconsistencies" (aka features), for example:

>>> 'text'[0] == 't'
True
>>> b"text"[0] == b't'
False

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 22:28 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> >>> b"text"[0] == b't'
> False
WHAT?

How on Earth??

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:07 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

In Python, the individual elements of a bytes object (a sequence type) are of type int. This makes reasonable sense given what one is likely to want to do with individual elements of a bytes object. Python does not have a char type similar to that in C.

The b"text"[0] == b"t" comparison fails because it is comparing a single element of a bytes object (an int) to another bytes object, albeit one of length 1. This sort of thing probably won't work in Rust, either (it certainly won't in C). You will note that b"text"[0] == b"t"[0] is True, since you're comparing ints.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 19:14 UTC (Sat) by gracinet (guest, #89400) [Link]

In Rust, the compiler would refuse it right away; it's a type error, after all.

It's also generally more consistent because one does not expect a character to be of the same type as a string anyway.

In that case, the equivalent for bytes is &[u8] and you don't compare it with u8.
Something similar would happen with &str and char.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 12:28 UTC (Tue) by dvdeug (guest, #10998) [Link] (34 responses)

Since 2003, Python has gone from a runner-up to Perl to one of the most dominant languages in the world, so okay, you don't like what it's become.

The approach of assuming the world can be approximated by floating point numbers is flat out wrong. And yet we continue to do it. Unicode is sometimes problematic with filenames, but not in practice for most users. If you have to handle text, it's the only way to do it. Old school text handling usually trashed anything that wasn't in an 8-bit character set, unless the programmer put a lot of work in.

You link to an article that mentions code written in assembly. If you want extreme speed or the tightest of code, why were you writing in Python to begin with?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:56 UTC (Tue) by excors (subscriber, #95769) [Link] (33 responses)

> Unicode is sometimes problematic with filenames, but not in practice for most users.

That's true, but the difference between "most users" and "all users" becomes important when an application grows to have many users. Eventually one of them will try to use your program on a FAT-formatted USB stick with Shift-JIS filenames or whatever, and they will file a bug report when it crashes with a Unicode error. As a responsible programmer you want to make your program work for that user, but Python makes that difficult, and you will get annoyed at fighting with the language.

That experience will encourage you to start your next project in a different language (one whose designers have already considered this problem and solved it properly) so you won't have the same pains if your project becomes popular and needs to handle obscure edge cases. If a number of high-profile projects make the same decision to migrate, that seems likely to significantly damage Python's reputation and popularity.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:50 UTC (Tue) by dvdeug (guest, #10998) [Link] (32 responses)

I'm sure there are so many more people nowadays who want to take the time and trouble to work with SJIS than there were before Unicode. It wasn't until 1998 that Debian became eight-bit clean, and even then I think Japanese people still carried a bunch of patches around. It will be more practical to spend all that time working on the program than to tell them to set their LC_CTYPE to an SJIS locale.

A lot of Python code won't ever have that problem, for two reasons. First, a lot of Python code is for limited use or for use in closed environments. Second, it's not like Python 3 auto-crashes on non-Unicode filenames. I tried it with:

#!/usr/bin/python3
import os
import sys

if len(sys.argv) > 1:
    x = sys.argv[1]
else:
    x = os.listdir('.')[0]
f = open(x, "r")
read_data = f.read()
print(read_data)

And it had no problem opening a file with a filename that wasn't Unicode, whether read from the directory or given on the command line. It gave a Unicode error instead of printing out the filename, but like ls, you should probably be escaping a weird filename before dumping it to a terminal or the like.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 22:11 UTC (Tue) by roc (subscriber, #30627) [Link] (31 responses)

> you should probably be escaping a weird filename before dumping it to a terminal or the like.

Who does that escaping, though? What API would you even use to do it? In practice, no-one's going to do it until some user hits a "weird"-but-valid-UTF8 filename, then they're going to hack in some escaping that relies on UTF8 encoding not failing, then maybe one day some user hits a non-Unicode filename and then they hack in more escaping for strings containing lone surrogates.

That's not good if you want your software to be reliable. I don't think it makes sense to expect a dynamically-typed language like Python to be good for writing reliable software, but Python's specific choices make it unnecessarily unreliable.

As an example of a better way to do things: in Rust, the Path type is not a string and does not implement Display; if you want to print one, you call path.display() which does the necessary escaping. (It does not return a string, but does return something that implements Display, i.e. can be printed or converted to a string).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 1:01 UTC (Wed) by roc (subscriber, #30627) [Link] (9 responses)

And of course the compiler guarantees that your program, once successfully compiled, will never try to print a Path.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 20:27 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (8 responses)

That’s even more braindamaged than the python solution, since it makes sure rust is able to create and propagate paths that can not be displayed in an interoperable way.

*That* means any form of argument passing from rust to other software will fail in strange and underwhelming ways.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 20:30 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Why would it fail?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:12 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (5 responses)

Because the whole system relies on a rust-specific Display() method. That won’t be supported or compatible with all other apps out there that will encounter rust filenames and expect them to work as normal filename arguments.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:28 UTC (Thu) by smurf (subscriber, #17840) [Link] (4 responses)

Well, when you print anything you're expected to use the current locale. If you can't because the string is not displayable, tough luck.

The idea that file names are somehow privileged to not require that went out of the window a long time ago. It doesn't matter one whit whether the code printing said file name is written in Python, Rust, Go, C++, Z80 assembly, or LOLCODE.

If you want a standard way to carry non-UTF8 pseudo-printable data (e.g. Latin1 filenames from the stone ages), no problem, either use the surrogateescape method or do proper shell quoting. The "write the non-UTF8 data" method is fine only when limited to streams that are known to be binary. "find -print0" comes to mind.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:21 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Well, when you print anything you're expected to use the current locale. If you can't because the string is not displayable, tough luck.
And what if you need to write a transparent proxy that needs to cope with non-UTF-8 headers?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:43 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

You ask the tool's author to please add "surrogateescape" to their .en/decode("utf-8") calls, either unconditionally or via some special mode. Or to transparently pass unencodeable headers as bytes, again either unconditionally or via some special mode. Or you submit a patch to do that yourself. Or you fork the code.

None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard? Yes this could have been easier from the beginning, but that's why Python 3.8 is a whole lot better at this than 3.0.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:53 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard?
The reality (that stubborn thing that doesn't go away) has agents that don't obey the standard. So a transparent proxy must accommodate it.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 6:11 UTC (Fri) by smurf (subscriber, #17840) [Link]

I know that. But most of the world is not a transparent proxy.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:48 UTC (Thu) by roc (subscriber, #30627) [Link]

It just means you need to think when you transmit paths to other programs. You will not use Path::display(), instead you will do whatever those other programs require.

By far the most common way to pass a path to a program is on its command line. In Rust you pass a command-line argument by calling Command::arg(...) to add the argument to the command you're building, and Command::arg(...) accepts Paths. Each platform has a standard way to pass arbitrary filenames as command-line parameters, and Rust does what the platform requires.

A few programs accept arbitrary paths as strings on stdin; they need to define how those paths are encoded on stdin. On Linux the program could read null-terminated strings of bytes; then in Rust you would use std::os::unix::ffi::OsStrExt::as_bytes() to extract the raw bytes from a Path and write them to the pipe. That code wouldn't even compile on Windows, which makes sense because a null-terminated string of bytes is not a reasonable way to represent a Windows path. A Windows program accepting paths as strings on stdin needs to define a different encoding, e.g. null-terminated strings of WCHAR, in which case the Rust program would use std::os::windows::ffi::OsStrExt::encode_wide() to produce such strings.

Rust makes it about as easy as possible to reliably pass non-UTF8 strings to other programs.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 4:22 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (17 responses)

> Who does that escaping, though? What API would you even use to do it?

repr()?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:01 UTC (Wed) by roc (subscriber, #30627) [Link] (16 responses)

You probably don't want *every* filename to be quoted; you'd need some wrapper that quotes only "irregular" filenames.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 22:25 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (15 responses)

So use one of the other options from https://docs.python.org/3.8/library/codecs.html#error-han...

str.encode(..., errors='replace') # Replaces bad chars with '?' (U+FFFD is the decode-side marker).
str.encode(..., errors='ignore') # Silently drops bad chars.
str.encode(..., errors='xmlcharrefreplace') # Replaces with XML &-encoding
str.encode(..., errors='backslashreplace') # Replaces with \u.... syntax.
Or write your own and hook it into the standard system with https://docs.python.org/3.8/library/codecs.html#codecs.re...

Python is, after all, a "batteries included" language. This is a solved problem.
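
For example, with a made-up name that arrived via surrogateescape, those handlers behave roughly like this:

    name = b'report-\xff.txt'.decode('utf-8', 'surrogateescape')   # hypothetical non-UTF-8 name

    print(name.encode('utf-8', errors='backslashreplace'))   # b'report-\\udcff.txt'
    print(name.encode('utf-8', errors='replace'))            # b'report-?.txt'
    print(name.encode('utf-8', errors='ignore'))             # b'report-.txt'
    print(name.encode('utf-8', errors='xmlcharrefreplace'))  # b'report-&#56575;.txt'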

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 23:44 UTC (Wed) by togga (guest, #53103) [Link] (13 responses)

None of which gives back the original pathname when passed to another program that tries to find the file?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:17 UTC (Thu) by roc (subscriber, #30627) [Link] (10 responses)

To be fair, that's an unsolvable problem, because there's no way to know how any given program will unescape file names.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:17 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (6 responses)

It’s only an unsolvable problem if you let your code write malformed filenames.

Defining standard ways to process filenames (which are text) is the whole point of the Unicode standard. Remove standard compliance, and you remove the ability to safely process the result.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:51 UTC (Thu) by roc (subscriber, #30627) [Link] (5 responses)

It would be great if we could impose a requirement that all filenames are valid Unicode, but unfortunately that cat left the building a long time ago. Operating systems don't enforce that, and non-Unicode filenames exist.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:46 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (4 responses)

So what? Operating systems fail to enforce against all kinds of brokenness and malware; that's no reason to write broken files or malware.

Own up to the things your code does; do not hide behind the lack of OS enforcement.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:46 UTC (Thu) by smurf (subscriber, #17840) [Link]

The assumption is that some people created source archives with Latin1-or-worse-encoded file names. They're embedded in Mercurial archives, and you need to be able to reproduce them exactly when checking them out. Replacing the file name with its UTF-8 equivalent is not an option if you want to 1:1 reproduce these files.

That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".
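
A rough sketch of what such a one-shot fixer could look like (it assumes a POSIX filesystem and that the legacy names are Latin-1, which is itself a guess any real tool would have to make, and it only touches file names, not directories):

    import os

    for dirpath, dirnames, filenames in os.walk(b'.'):         # bytes in, bytes out
        for raw in filenames:
            try:
                raw.decode('utf-8')                             # already valid UTF-8: leave it
            except UnicodeDecodeError:
                fixed = raw.decode('latin-1').encode('utf-8')   # reinterpret and re-encode
                os.rename(os.path.join(dirpath, raw),
                          os.path.join(dirpath, fixed))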

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 14:54 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to be unable to list files created by programs that do treat filenames as a bag of bytes and deliberately make filenames invisible to common tools. Imagine malware hiding behind a filename of `\xff` on your filesystem. Should my Python tools be blind to it, or should they accept the reality that, in general, filenames suck and the status quo at least needs support (though not necessarily encouragement)?

System behaviors are dictated by the platform. Any tool that does an ostrich impression when faced with "broken" or "malformed" filenames loses a lot of usability in system recovery and introspection.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:17 UTC (Thu) by anselm (subscriber, #2796) [Link] (1 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools.

This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.

In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).
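
Something along these lines, say, which lists a directory and warns about names that aren't valid UTF-8 (the check works because surrogateescape turns raw bytes into lone surrogates, which a strict encode rejects):

    import os
    import sys

    for name in os.listdir('.'):
        try:
            name.encode('utf-8')                       # strict: fails on escaped raw bytes
            print(name)
        except UnicodeEncodeError:
            print('invalid encoding:', repr(name), file=sys.stderr)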

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:32 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Well, surrogateescape and the other modes should really only happen at the display boundary. Internally, storing paths as a bag of bytes is the way it should be done (and 16-bit units on Windows). It's only at the display side that things need to be munged for safety. Of course, sometimes the display (stdout) is also the communication medium, and an escaping strategy needs to exist there. Flags such as `-print0` and the like bypass that, but not everything likes to communicate that way.
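
In Python terms that's roughly the following (POSIX-flavoured) sketch: keep bytes internally, and escape only when producing output.

    import os

    for raw in os.listdir(b'.'):                                  # ask for bytes, get bytes back
        size = os.path.getsize(raw) if os.path.isfile(raw) else '-'
        print(raw.decode('utf-8', 'backslashreplace'), size)      # escape only at the display boundary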

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:40 UTC (Sun) by togga (guest, #53103) [Link] (2 responses)

Yet Python 2 solved this by treating everything as a bag of bytes. I didn't even know this was a problem until I tried Python 3 for the first time. I'd say the problem itself is imposed on users by Python 3. This, and many more problems like it, is why Python 3 became such a pain.

Python 2 (at least in earlier versions) let developers choose which transforms were valid for them, and provided those in the included batteries.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 11:28 UTC (Sun) by smurf (subscriber, #17840) [Link] (1 responses)

Well, you obviously never added some bytes to a string in some of your code (or worse had a library do it) which then splatted you with Mojibake in some other – almost, but not quite, entirely unrelated – procedure.

Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.

It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.
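
A small illustration of the difference (the Python 2 behaviour is described in the comments only):

    # Python 2 silently "promotes" bytes to unicode via ASCII, so mixing the two
    # often appears to work until non-ASCII data turns up much later -- the classic
    # source of Mojibake and UnicodeDecodeErrors far from the actual bug.
    # Python 3 refuses the mix at the point where it happens:
    data = 'café'.encode('utf-8')      # bytes
    try:
        data + '!'                     # mixing bytes and str
    except TypeError as e:
        print(e)                       # can't concat str to bytes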

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 18:36 UTC (Sun) by anselm (subscriber, #2796) [Link]

We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.

The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.
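
A sketch of how far that already goes today (the Latin-1 name is hypothetical, a POSIX filesystem that allows arbitrary bytes in names is assumed, and os.fsdecode() keeps the name lossless via surrogateescape):

    import os
    from pathlib import Path

    p = Path(os.fsdecode(b'caf\xe9.txt'))
    p.write_bytes(b'hello')                 # the original bytes go back to the OS
    print(p.read_bytes())                   # b'hello'
    print(os.fsencode(p))                   # b'caf\xe9.txt' again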

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 6:35 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

That is not a meaningful question, because encode() returns an object of a different type (str -> bytes), so you can't use the result with the same API that you started with, unless it's a polymorphic API.* You have to call str.decode() to undo the operation, and I don't think there's a built-in way to automatically reverse those error handlers (other than surrogateescape which was specifically designed for round-tripping). But you don't need to do that in the first place, because both str and bytes are immutable classes, so you still have the original object. If you lose the original object, I don't think you have the right to complain that it "got lost." Just don't lose it, and then you don't need to worry about reversing the operation.

(Regardless, xmlcharrefreplace and backslashreplace are both *mostly* lossless, and in the appropriate context, can be fully lossless, so long as you escape all ampersands or backslashes respectively. If you are outputting XML or HTML, then of course you have to escape ampersands anyway, which is obviously what xmlcharrefreplace was intended for. Similarly, backslash replacement is not a very sensible thing to do, unless you are working in a context where backslashes are normally escaped.)

* For example, Python's filesystem API calls os.fsencode() and os.fsdecode() automatically to translate between the operating system's preferred type and whatever type the user passes, but you can still call these manually to pass the errors argument or if you decide you actually wanted the other type.
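
A short sketch of that round trip and of the polymorphic filesystem API (the name is made up; a UTF-8 locale and a filesystem that allows arbitrary bytes in names are assumed):

    import os

    raw = b'data-\xa0.bin'                     # hypothetical name that is not valid UTF-8
    name = os.fsdecode(raw)                    # -> 'data-\udca0.bin' via surrogateescape
    assert os.fsencode(name) == raw            # the two forms convert losslessly
    open(raw, 'wb').close()                    # either spelling reaches the same file:
    assert os.path.exists(name)
    os.remove(name)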

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:51 UTC (Thu) by excors (subscriber, #95769) [Link]

If you are outputting filenames as XML, it looks like you can't safely use xmlcharrefreplace, because e.g. "\udccc".encode("utf-8", errors="xmlcharrefreplace") returns b'&#56524;' which is not well-formed XML and will cause the XML parser to reject the whole document. (Character references must refer to one of https://www.w3.org/TR/xml/#NT-Char). And in HTML the parser will convert &#56524; into U+FFFD, so it's not lossless. You'll need a different scheme to encode Python's filename strings as XML or HTML.
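
A quick way to see the rejection, using the standard-library parser:

    import xml.etree.ElementTree as ET

    bad = '\udccc'.encode('utf-8', errors='xmlcharrefreplace')   # b'&#56524;'
    try:
        ET.fromstring(b'<file name="' + bad + b'"/>')
    except ET.ParseError as e:
        print('rejected:', e)            # reference to invalid character number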

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:16 UTC (Thu) by roc (subscriber, #30627) [Link]

Thanks, that's helpful.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 5:54 UTC (Wed) by dvdeug (guest, #10998) [Link] (1 responses)

> Who does that escaping, though?

GNU Coreutils does.

> In practice, no-one's going to do it

Well, then, it really does seem like much ado about nothing. In a seriously engineered program, you basically never output an unfiltered string; every line has to be marked for translation. This has always included filenames, which have security issues when dumped to a terminal, in Python 2 or 3.

Sure, I'll believe Rust has better ways of doing it. But if you're comparing Python 2 versus Python 3, it's just not that big a difference, either in normal usage or best practices.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:03 UTC (Wed) by roc (subscriber, #30627) [Link]

> > Who does that escaping, though?
> GNU Coreutils does.

In Python, I meant.

> In a serious engineered program, you basically never output an unfiltered string

I guess I've never seen a seriously engineered Python program. I suppose that's not unexpected.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 18:58 UTC (Fri) by cortana (subscriber, #24596) [Link]

I make it a habit to rely on repr(filename) when printing filenames in log messages and the like. It's ugly, but it shouldn't risk any misunderstanding.
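
For instance (hypothetical filename), repr() keeps the lone surrogate visible and the log line printable:

    import logging
    import os

    name = os.fsdecode(b'backup-\xff.tar')          # non-UTF-8 name from the OS
    logging.warning('could not open %r', name)      # ... could not open 'backup-\udcff.tar'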

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 22:58 UTC (Tue) by kjp (guest, #39639) [Link]

Since no one else mentioned it: look at C# if you're disillusioned by Python and scripting languages. It supports things I like in Python, like exceptions and generic data structures. And it isn't run "by committee". Frankly, having systemd start a per-user socket-activated daemon for C# (due to slow startup), combined with a C++ CLI that just sends and receives data over the socket, sounds pretty good.

But yeah, I totally agree with this post, but it's even worse: even _after_ you are on Python 3, stuff still breaks. All the time. And it gets worse the more dependencies you have. So many things that could be caught by a compiler aren't. And if you try using mypy (which is still alpha), why are you using Python in the first place for a large project? The terrible startup time? The terrible module namespacing? The terrible performance and threading?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 8:37 UTC (Fri) by rhdxmr (guest, #44404) [Link]

I wrote a program that processed tons of text documents. The program's core logic treated data as Unicode strings. However, I found out too late that many third-party libraries (for ZooKeeper, Redis, and HTTP) handle only bytes, so there had to be lots of conversions between Unicode and bytes. I thought this was so inefficient that I tried to change the program's core logic to handle data as bytes, but I gave up because it was really hard to ensure there would be no errors or side effects once the work was done.

