Szorc: Mercurial's Journey to and Reflections on Python 3
"I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years."
Posted Jan 13, 2020 19:17 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (176 responses)
Posted Jan 13, 2020 20:03 UTC (Mon)
by koh (subscriber, #101482)
[Link] (92 responses)
Posted Jan 13, 2020 20:21 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (91 responses)
Posted Jan 14, 2020 1:13 UTC (Tue)
by koh (subscriber, #101482)
[Link] (90 responses)
The TL;DR version for this one here should be "removal of u'...' literals, '%' on objects of bytes type, and **kwargs working only on str instead of bytes are backwards-incompatible changes making the transition harder - all in all: a global change that for C as a language has basically had no effect (ascii -> utf8, probably by design) but in Python 2/3 results in huge ramifications."
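For what it's worth, the incompatibilities listed in that TL;DR can be sketched in a few lines of Python 3.5+ (this is an illustration, not Mercurial's code):

```python
# '%' formatting on bytes was removed in Python 3.0 and only restored
# in 3.5 (PEP 461); on 3.5+ this works again:
assert b'rev %s' % b'abc123' == b'rev abc123'

# **kwargs keys must be str in Python 3; bytes keys are rejected at
# call time, which broke code that carried bytes keys around on 2.x:
def log(**opts):
    return opts

try:
    log(**{b'verbose': True})
    bytes_keys_accepted = True
except TypeError:
    bytes_keys_accepted = False
assert not bytes_keys_accepted
```

(The u'...' literal removal isn't shown because it was a 3.0-only syntax error; 3.3 restored the literal precisely to ease porting.)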
> [...] "if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3". As crazy as it initially sounded, I think I agree with that assessment.
Posted Jan 14, 2020 1:59 UTC (Tue)
by flussence (guest, #85566)
[Link] (32 responses)
Posted Jan 14, 2020 5:29 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link] (29 responses)
I haven't run into any issues with LLVM support, what have you seen?
Also, the Rust language and standard library seem to have more features than Go's: but that would be a very subjective opinion on my part. Depends on what you're focused on, I think - what do you find in Go that was missing in Rust and that you found frustrating?
Posted Jan 14, 2020 6:11 UTC (Tue)
by roc (subscriber, #30627)
[Link] (18 responses)
OTOH LLVM supports Qualcomm Hexagon while gcc doesn't, and Hexagon is an alive-and-well architecture that has shipped billions of units. For some reason Rust's detractors do not see this as a problem for gcc or Go.
Posted Jan 14, 2020 11:19 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (17 responses)
LLVM doesn't support a number of architectures that distributions/OSes like Debian and NetBSD support using GCC. Linux does in theory run on Qualcomm Hexagon, but as far as I can tell most of those shipments have been in SnapDragon SoCs running ARM chips, and the Hexagon chips are used for Qualcomm-proprietary reasons or specialized multimedia or AI purposes. Mercurial is never going to run on that for anything besides a demo.
Posted Jan 14, 2020 11:55 UTC (Tue)
by roc (subscriber, #30627)
[Link] (16 responses)
Posted Jan 14, 2020 13:38 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (15 responses)
Posted Jan 14, 2020 21:32 UTC (Tue)
by roc (subscriber, #30627)
[Link] (14 responses)
The problem is that there aren't enough people who want to be able to run the latest and greatest software on museum architectures to actually support those architectures through just their own efforts, e.g. by maintaining an out-of-tree LLVM backend. Thus they try to influence other developers to do work to help them out. Sometimes they get their preferences enshrined as distro policy to compel other developers to work for them. Sometimes it's borderline dishonest, raising deliberately vague concerns like "LLVM's limited platform support" to discourage technology choices that would require them to do work.
This is usually just annoying, but when it means steering developers from more secure technologies to less-secure ones, treating museum-architecture support as more important than cleaning up the toxic ecosystem of insecure, unreliable code that everyone actually depends on, I think it's egregious.
Posted Jan 15, 2020 3:25 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (13 responses)
> Eventually one of them will try to use your program on a FAT-formatted USB stick with Shift-JIS filenames or whatever, .... As a responsible programmer you want to make your program work for that user...
>That experience will encourage you to start your next project in a different language (one whose designers have already considered this problem and solved it properly) so you won't have the same pains if your project becomes popular and needs to handle obscure edge cases.
You know that getting these programs to work on these systems is a priority for some people. You can do the hard work like people did for C (and a number of other languages), and write a direct compiler. You can piggyback on GCC, like a half-dozen languages have. You can compile to one of the languages in the first two groups, like any number of languages have, or write an interpreter in a language in one of the three groups, like vastly more languages have. The Rust developers instead chose to handle this in a way that wouldn't support some of their userbase. That has nothing to do with being a "more secure technology"; that's "choosing to drop customer requirements that would take work to support".
I see where you're coming from, but on the other hand, if your competition supplies a feature that people want, perhaps it's on you to implement that feature, and perhaps developers will consider excors' advice above about using a language that won't have this problem.
(As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)
Posted Jan 15, 2020 10:23 UTC (Wed)
by roc (subscriber, #30627)
[Link] (12 responses)
I think that's a fine way to look at it, as long we are clear about which "customers" are actually being dropped. "The Rust developers chose to handle this in a way that wouldn't support some of their userbase" sounds rather ominous, but when we clarify that "some of their userbase" means "a few obsolete and a few minor embedded-only architectures", it sounds more reasonable.
> (As a sidenote, when you say "maintaining an out-of-tree LLVM backend", do you mean that LLVM wouldn't accept a backend for m68k, etc.? Because I don't blame anyone for not wanting to maintain an unmergable fork of a program, and that simply makes the argument against using LLVM so much stronger.)
Oddly enough, this is not a theoretical question: https://lists.llvm.org/pipermail/llvm-dev/2018-August/125...
I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k. Everyone using m68k to run modern software for any real task could accomplish the same thing faster with lower power on modern hardware, therefore they are doing it strictly for fun. No-one *needs* to run a particular piece of modern software on m68k. I don't know why gcc plays along; I suspect it's inertia and a misguided sense of duty. Same goes for the other obsolete architectures.
I have more sympathy for potentially relevant new embedded architectures like C-Sky. But I suspect that sooner or later an LLVM port, upstream or not, is going to be just part of the cost of promulgating a viable architecture, especially for desktops. There are already a lot of LLVM-based languages, including Rust, Swift and Julia; clang-only applications like Firefox and Chromium (Firefox requires Rust too of course); and random other stuff like Gallium llvm-pipe.
I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc, and this particular issue will become moot.
Posted Jan 15, 2020 11:46 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (10 responses)
But somehow it doesn't sound reasonable that distributions that support those architectures don't support using Rust for core software? You're trying to have your cake and eat it too.
> I just don't see an argument that anyone other than the m68k community should bear the cost of supporting m68k.
You don't see an argument for cooperating with your fellow open source developers on their projects, but you do see an argument for supporting a billion dollar company that produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.
We could start with community. If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.
> with lower power on modern hardware,
How much power does it take to make a modern computer? The m68k (e.g.) is a little extreme, but it's certainly a signpost that Linux, as an operating system, is not going to drop support for hardware that's not the latest and greatest. I've got a laptop with 20 times the processor and eight times the memory of the one I went to college with; it runs Windows 10, and its response times are vastly worse than that college laptop's. I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.
> I suspect that once you can build Linux with clang, CPU vendors will start choosing to just implement an LLVM backend and not bother with gcc,
Cool. What you're saying is that if you have your way, the programming language that tugs at my heart strings, Ada, will get much harder to use on modern systems, and should I return to active work on Debian I'll have an interest in discouraging Rust and LLVM and firmly grounding the importance of GCC in the system. Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?
Posted Jan 15, 2020 14:54 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Jan 21, 2020 7:02 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (2 responses)
Posted Jan 21, 2020 7:47 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 response)
LLVM can replace GCC completely.
Posted Jan 21, 2020 12:03 UTC (Tue)
by dvdeug (guest, #10998)
[Link]
LLVM could in theory replace GCC completely. But why would a company that's been working with GCC for 25 years find it worth giving up all that expertise to do so? A research project is useful, and it's possible a useful tool will come of this, but there seems to be no evidence GCC is in such a dire state that AdaCore would change the underlying platform of their core product.
Posted Jan 15, 2020 22:35 UTC (Wed)
by roc (subscriber, #30627)
[Link] (5 responses)
That sounds reasonable in isolation, but when it's part of a causal chain that results in a few hobbyists holding back important improvements for the other 99.999% of users, it becomes unreasonable.
> produces proprietary software and hardware with their proprietary Qualcomm Hexagon ISA.
While you're wielding "proprietary" as a slur, keep in mind that almost every architecture that you think it's important to support is also proprietary.
> We could start with community.
Can you elucidate the actual argument here?
> If that doesn't move you, go with simple politics; free software has its own politics, and that guy who wrote the code you need to change to compile the kernel with LLVM turns out to be one of the guys who did the original Alpha port (which did all the work needed to make Linux portable beyond the 80386) and runs an Alpha in his basement, and funny, he's in a mood to be critical of your patches instead of helpful.
The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.
Even if you think that behavior should be tolerated, I doubt it would actually happen often enough in practice that it would be worth anyone taking it into account ahead of time.
> I don't want to see Linux go that way. There's an environmental cost in forcing perfectly good hardware to be replaced, as well as a financial one.
I'm not sure what arguments you're making here. I can imagine two:
> What you're saying is that if you have your way,
FWIW "CPU vendors will start choosing to just implement an LLVM" is my prediction, not my goal. I actually slightly prefer a world where CPU vendors implement both an LLVM backend and a gcc backend, because I personally like gcc licensing more than LLVM's. I just don't think that's the future.
> Systemd tries to support old software and systems; did you look at the arguments and decide that actively breaking old software and systems would have been better?
Not sure what you're trying to say here. If the cost of supporting old systems is very low, I certainly wouldn't gratuitously break them. For example, in rr we support quite old Intel CPUs because it's no trouble. OTOH once in a while we increase the minimum kernel requirements for rr because new kernel features can make rr better for everyone running non-ancient kernels and maintaining compatibility code paths for older kernels is costly in some cases.
If systemd developers are spending a lot of energy supporting old systems used by a tiny fraction of users, when they could be spending that energy making significant improvements for the other 99.999%, then yeah I'd question that decision.
Posted Jan 17, 2020 9:10 UTC (Fri)
by dvdeug (guest, #10998)
[Link] (4 responses)
That's dramatic and silly. Red Hat doesn't care about the m68k, and presumably it's not going to hold back important improvements from them. Nor Suse. If Debian cares about the m68k, more than 0.001% of their users care about it in theory, even if they don't use it, and Ubuntu and other Debian derivatives can work around that if they care about it.
It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.
> While you're wielding "proprietary" as a slur, / Can you elucidate the actual argument here?
Ah, see, I was a developer for Debian GNU/Linux. So the idea that we should be working on Free Software as a team is important to me.
> keep in mind that almost every architecture that you think it's important to support is also proprietary.
At different levels, maybe. But a patent only runs 20 years, so any old enough CPU can be reimplemented without license. And the uses aren't proprietary; it's a bunch of hobbyists who benefit, not one big company. There's a difference between an x86-64 chip that's mass-marketed and used in a vast array of devices, and a chip that's only used on Qualcomm's SoCs, and primarily running Qualcomm's code. If LLVM is worried about the cost of bringing an architecture in house, then why let Qualcomm take your developer's time?
> The Linux community expects maintainers to evaluate patches on their merits, not taking into account that sort of quid pro quo.
So you get to judge whether a feature is important or not, but a Linux maintainer can't? You can choose what features you work on, but a Linux maintainer can't? A Linux maintainer can certainly say "your patch causes the kernel to crash; here's the traceback", and leave it at that, even if it will take fifteen minutes of their time or a dozen hours of yours to find the bug. I don't know if they can say that LLVM support isn't worth it--it's probably down to Linus himself--but they can at the very least quit if they feel they have to deal with pointless LLVM patches instead of important patches.
> I'm not sure what arguments you're making here.
I said the m68k is an extreme case. But it is a bellwether; a system that is quick to drop old hardware is much more likely to drop my old hardware, and a system that supports the m68k is much less likely to go through and dump support for old hardware. It is something of a matter of pride that Linux works on old systems. Even passing the tests on many of these old systems puts a limit on how slow the software can be.
> once in a while we increase the minimum kernel requirements for rr
Which isn't much of a problem because the kernel cares about backward compatibility and doesn't go around knocking off old hardware. It would be a lot more frustrating if every time you had to upgrade rr, you had to upgrade the kernel and seriously worry about the system not coming back up or important parts not working; perhaps many people would stop doing both.
Basically it comes down this paragraph again: It's a free/Free market. Program developers can write what they want, and distributions can use what they want. If a distribution's priorities aren't in line with yours, you can go somewhere else. If they don't want to include your program without some features, you can include those features or not, and if not, they can patch in those features or go without. Don't gripe about those distributions; just add the features or not.
Posted Jan 17, 2020 11:10 UTC (Fri)
by peter-b (subscriber, #66996)
[Link] (1 response)
Yes it does. https://lwn.net/Articles/769468/
Posted Jan 17, 2020 13:16 UTC (Fri)
by smurf (subscriber, #17840)
[Link]
It's not the kernel [developers] that knock off hardware – it's the users who retired said hardware. If anybody had spoken up in favor of keeping (and maintaining …) the dropped architectures or drivers in question, they'd still be supported.
Posted Jan 17, 2020 14:59 UTC (Fri)
by mjg59 (subscriber, #23239)
[Link]
The kernel dropped 386 support years ago, despite the 386 being the first CPU Linux ran on.
Posted Jan 17, 2020 22:36 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Of course. But we should still discuss the impact of those choices, which is not always obvious.
This subthread was triggered by just such a discussion:
which led us into a discussion of what exactly "LLVM's limited platform support" means and how important that is relative to other considerations. I learned something and I guess other readers did too.
Posted Jan 16, 2020 17:34 UTC (Thu)
by ndesaulniers (subscriber, #110768)
[Link]
Posted Jan 14, 2020 11:00 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (2 responses)
Posted Jan 14, 2020 11:52 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jan 14, 2020 13:02 UTC (Tue)
by joib (subscriber, #8541)
[Link]
Posted Jan 14, 2020 16:41 UTC (Tue)
by flussence (guest, #85566)
[Link] (6 responses)
See the Debian librsvg problems for one example. A few lines of Rust code left entire CPU arches having to choose between losing their GUI and staying on an old library indefinitely.
Posted Jan 14, 2020 20:12 UTC (Tue)
by foom (subscriber, #14868)
[Link] (5 responses)
The problem with these obsolete architectures is that various groups of volunteers do care about them, but only just barely enough to keep them alive and operational while small amounts of work are required. But there's just not enough interest, ability, or willingness available to implement anything new for them.
And that's certainly fine and understandable.
But, don't then pretend that the lack of maintenance effort available for these obscure/historical architectures is some sort of problem with the newer compilers and languages. Or, try to convince people that languages which use LLVM should be avoided because it has "limited platform support".
If you mean "I wish enough people still cared about 68k enough for it to remain a viable architecture so I could keep using it", just say that, instead of saying "LLVM's limited platform support", as if that was some sort of actual problem with LLVM.
Posted Jan 15, 2020 14:28 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (4 responses)
For one, that's a biased view of how software development works. If your program doesn't do something people need, then the onus is generally on its creators and promoters to fix that. If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64 yet consider that a missing feature. Yes, LLVM has limited platform support with respect to many of the older architectures people are trying to support in Debian or NetBSD.
For another, according to roc in this article, LLVM is not interested in adding backends for these architectures. If they're forcing people to try to maintain entire CPU arches out of tree, then that's adding quite a bit of trouble. And while you've been less dismissive than roc has here, it's still far from saying "we recognize that it would be nice to have these architectures, and if you're familiar with them, we're happy to help you implement them in LLVM/Rust." Taking an adversarial approach to people who want these architectures is hardly the way to convince them to put the work in on them.
Posted Jan 15, 2020 21:27 UTC (Wed)
by roc (subscriber, #30627)
[Link]
Don't quote me on that; I'm just speculating. All we actually know is that an m68k backend was proposed and not rejected, and the developer never got around to moving it forward.
Posted Jan 16, 2020 7:36 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
*IF* the creators or promoters care about this particular set people. Maybe they don't? Try to imagine. It could be for any reason, good or bad. Logical or not.
> If my new compiler only targets ARM64, I don't get to complain at all the people who aren't rushing to retarget it to x86-64 yet consider that a missing feature.
I don't remember reading so many ungrounded assumptions packed into such a small piece of text. Pure rhetoric, it's surreal. I spent an inordinate amount of time trying (and failing) to relate it to something real.
BTW: you keep misreading roc as favoring the reality that he merely tries to _describe_.
Posted Jan 17, 2020 8:09 UTC (Fri)
by dvdeug (guest, #10998)
[Link]
> *IF* the creators or promoters care about this particular set people.
Rust promoters do care about this particular set of people, or else we wouldn't be having this discussion. Rust promoters are right here complaining that these users don't support the use of Rust because it would hurt portability to their systems. They're not saying "if Debian chooses to reject Rust in core packages over this, that's cool with me." They're telling people they're wrong for finding this particular feature important.
> I don't remember reading so many ungrounded assumptions packed in such a small piece of text.
And yet you don't name one. Implement the features people want or not, but don't get offended that they use alternatives if you don't.
> you keep misunderstanding that roc favors the reality that he merely tries to _describe_.
Roc:
When you start describing something as "misguided" and saying "I have more sympathy for", you're not neutrally describing reality.
Posted Jan 21, 2020 2:40 UTC (Tue)
by foom (subscriber, #14868)
[Link]
Speaking for myself, I'd welcome the addition of new backends upstream, even for somewhat fringe architectures. But I'd want some reassurance that such a backend has enough developer interest to actually maintain it, so that it's not just an abandoned code-dump. (I believe this is also the general sentiment in the LLVM developer community.)
And it is definitely a time commitment to get a backend accepted in the first place. Not only do you have to write it, you have to get the patch series into reviewable shape, try to attract people to review such a large patch series, and follow up on reviewers' requests throughout.
Anyways, I'm not sure what happened with the previous attempt to get a m68k backend added to LLVM. Looks like maybe the submitter just gave up upon a suggestion to split the patch up for ease of review? Or maybe due to question of code owners? If so, someone else could pick it up from where they left off...I'd be supportive if you wanted to do so. (But not supportive enough to actually spend significant time on it, since I don't really care about m68k.)
I'll just note here that Debian does _not_ actually support m68k or any of the other oddball architectures mentioned. Some are unofficial/unsupported ports (which means, amongst other things, that the distro will not hold back a change just because it doesn't work on one of the unsupported architectures...)
Posted Jan 14, 2020 8:56 UTC (Tue)
by ehiggs (subscriber, #90713)
[Link] (1 responses)
Also, batteries being included in Python was useful but seems deprecated. Who writes anything in Python without leveraging PyPI/pip?
Posted Jan 14, 2020 16:54 UTC (Tue)
by flussence (guest, #85566)
[Link]
Posted Jan 14, 2020 3:25 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (56 responses)
* You should include Microsoft, because Microsoft was (probably) a significant factor in Python's decision to use Unicode-encoded filenames and paths rather than the "bags of bytes" model that Unix favors.
Posted Jan 14, 2020 17:45 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (55 responses)
Anyone who had to salvage data on a system when every app felt the "bag of bytes" model entitled it to use a different encoding than other apps will agree with the Python decision to go UTF-8 (there were lots of those in the 00’s; a lot less now thanks to Python authors and other courageous unicode implementers).
Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools. Relying on the "bag of bytes" model to perform all kinds of unicode incompatible tricks is as bad as when compiler authors rely on "undefined behaviour" to implement optimizations that break apps.
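For what it's worth, Python 3 doesn't entirely force the choice between the two models: PEP 383's surrogateescape error handler lets undecodable bytes round-trip losslessly through str. A minimal sketch (the filename is a made-up Latin-1 example):

```python
# Undecodable bytes survive a decode/encode round trip as lone
# surrogates (PEP 383), so a tool can display what it understands
# and still hand back the exact original bytes to the kernel.
raw = b'caf\xe9.txt'                        # Latin-1 bytes, invalid as UTF-8
name = raw.decode('utf-8', 'surrogateescape')
assert name == 'caf\udce9.txt'              # the 0xE9 byte becomes U+DCE9
assert name.encode('utf-8', 'surrogateescape') == raw   # lossless
```

This is what Python 3 itself does for filenames received from the OS, via os.fsdecode()/os.fsencode().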
Posted Jan 14, 2020 17:54 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Jan 14, 2020 18:12 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (5 responses)
Posted Jan 14, 2020 18:50 UTC (Tue)
by excors (subscriber, #95769)
[Link] (3 responses)
(I like Unicode, and I agree bags of bytes are bad. But I don't like things that pretend to be Unicode and actually aren't quite, because that leads to obscure bugs and security vulnerabilities.)
Posted Jan 15, 2020 19:43 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
People still managed to find loopholes and other ways to sabotage the migration.
Posted Jan 15, 2020 19:56 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jan 16, 2020 8:44 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
That’s called the tragedy of the commons.
Posted Jan 15, 2020 9:46 UTC (Wed)
by jamesh (guest, #1159)
[Link]
As far as file system encoding goes, it will depend on the locale on Linux. If you've got a UTF-8 locale, then it defaults to UTF-8 file names.
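This is easy to check from Python itself (the exact value depends on your locale and Python version; since 3.7, locale coercion and UTF-8 mode usually make it 'utf-8'):

```python
import os
import sys

# The codec Python uses to translate between str paths and the bytes
# the kernel actually sees:
print(sys.getfilesystemencoding())   # e.g. 'utf-8' under a UTF-8 locale

# os.fsencode()/os.fsdecode() apply that codec (with surrogateescape),
# so str paths round-trip through the bytes form:
assert os.fsdecode(os.fsencode('naïve.txt')) == 'naïve.txt'
```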
Posted Jan 15, 2020 20:04 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (44 responses)
Posted Jan 16, 2020 7:56 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (42 responses)
You didn't go far enough and missed that bit:
> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools
Users don't care who's in charge of _their_ files, kernel or whatever else.
Posted Jan 16, 2020 8:01 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Posted Jan 16, 2020 8:43 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (40 responses)
Posted Jan 16, 2020 8:53 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (39 responses)
Therefore "being able to write these" means "being able to crash other apps". It’s an hostile behavior, not really on par with Python objectives.
Posted Jan 16, 2020 10:17 UTC (Thu)
by roc (subscriber, #30627)
[Link] (35 responses)
Depends on what you mean by "cannot be avoided". All platforms that I know of allow passing any filename as a command-line argument. On Linux, it is easy to write a C or Rust program that spawns another program, passing a non-UTF8 filename as a command line argument. It is easy to write the spawned program in C or Rust and have it open that file. In fact, the idiomatic C and Rust code will handle non-UTF8 filenames correctly.
That C code won't work on Windows, you'll have to use wmain() or something, but the Rust code would work on Windows too.
So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.
If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they are use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.
Posted Jan 16, 2020 12:49 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Do not feed them time bombs.
Posted Jan 16, 2020 21:17 UTC (Thu)
by roc (subscriber, #30627)
[Link]
That's exactly why your app needs to be able to cope with any garbage filenames it finds there.
> Do not feed them time bombs.
I'm not arguing that non-Unicode filenames are a good thing or that apps should create them gratuitously. But widely-deployed apps and tools should not break when they encounter them.
Posted Jan 16, 2020 12:51 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (27 responses)
I personally would like my file names to work with the shell and standard utilities. I'm not about to write a C or Rust program just to copy a bunch of files, because their names are in a weird encoding that can't be typed in or will otherwise mess up my command lines.
In the 2020s, it's a reasonable assumption that file names will be encoded in UTF-8. We've had a few decades to get used to the idea, after all. If there are outlandish legacy file systems that insist on doing something else, then as far as I'm concerned these file systems are the problem and they ought to be fixed.
Posted Jan 16, 2020 16:17 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Jan 17, 2020 16:57 UTC (Fri)
by cortana (subscriber, #24596)
[Link] (3 responses)
Posted Jan 17, 2020 17:05 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
This actually works fine with most servers.
Posted Jan 17, 2020 19:12 UTC (Fri)
by excors (subscriber, #95769)
[Link] (1 responses)
The only restrictions on a header value (https://fetch.spec.whatwg.org/#concept-header-value) are that it can't contain 0x00, 0x0D or 0x0A, and can't have leading/trailing 0x20 or 0x09. (And browsers only agreed on rejecting 0x00 quite recently.)
So it's pretty much just bytes, and if you want to try interpreting it as Unicode then that's at your own risk.
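A hypothetical checker for the rule quoted above (the function name and shape are mine, not from the fetch spec):

```python
def is_valid_header_value(value: bytes) -> bool:
    """True iff `value` meets the fetch-spec constraints quoted above:
    no 0x00/0x0D/0x0A anywhere, and no leading or trailing
    0x20 (space) or 0x09 (tab). Everything else is just bytes."""
    if any(forbidden in value for forbidden in (0x00, 0x0D, 0x0A)):
        return False
    return value == value.strip(b' \t')

assert is_valid_header_value(b'caf\xe9')       # arbitrary non-ASCII bytes: fine
assert not is_valid_header_value(b' padded ')  # leading/trailing space
assert not is_valid_header_value(b'a\r\nb')    # CR/LF forbidden
```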
Posted Jan 17, 2020 19:27 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 16, 2020 20:21 UTC (Thu)
by rodgerd (guest, #58896)
[Link]
Posted Jan 16, 2020 21:28 UTC (Thu)
by roc (subscriber, #30627)
[Link] (20 responses)
`cp` and many other utilities handle non-Unicode filenames correctly. That's not surprising; C programs that accept filenames in argv[] and treat them as null-terminated char strings should work.
We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does. Apparently that is not viable.
> as far as I'm concerned these file systems are the problem and they ought to be fixed.
Sounds good to me, but reality disagrees.
Posted Jan 17, 2020 1:46 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (4 responses)
True, but you need to feed them such names in the first place. Given that, these days, Linux systems normally use UTF-8-based locales, non-Unicode filenames aren't going to be a whole lot of fun on a shell command line, or in the output of ls, long before Python 3 even comes into play.
Posted Jan 17, 2020 9:04 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link]
Just last week I had such a file name generated by my email program when saving an attachment from a mail created by yet another bloody email program that fucks up attachment file name encoding. And the week before by unzipping a ZIP created on a German Windows.
Posted Jan 17, 2020 21:46 UTC (Fri)
by Jandar (subscriber, #85683)
[Link] (2 responses)
Seeing systems with only UTF-8 filenames is a rarity for me.
Enforcing UTF-8-only filenames is a complete no-go; even considering it is crazy.
Posted Jan 17, 2020 22:47 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Interesting, how does software on these systems typically know how to decode, display, exchange and generally deal with these encodings?
I understand Python itself enforces explicit encodings, not UTF-8.
Posted Jan 19, 2020 10:01 UTC (Sun)
by Jandar (subscriber, #85683)
[Link]
Posted Jan 17, 2020 1:56 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (14 responses)
This is really a user discipline/hygiene issue more than a Linux file system issue. In the 1980s, the official recommendation was that portable file names should stick to ASCII letters, digits, and a few choice special characters such as the dot, dash, and underscore – this wasn't enforced by the file system, but reasonable people adhered to this and stayed out of trouble. I don't have a huge problem with a similar recommendation that in the 21st century, reasonable people should stick to UTF-8 for portable file names even if the file system doesn't enforce it. Sure, there are careless ignorant bozos who will sh*t all over a file system given half a chance, but they need to be taught manners in any case. Let them suffer instead of the reasonable people.
Posted Jan 17, 2020 2:07 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (13 responses)
Or Russian people using KOI-8 encoding on Linux?
Posted Jan 17, 2020 2:16 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (12 responses)
If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.
It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately file names don't carry explicit encoding information, so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in a UTF-8-based locale.
Posted Jan 17, 2020 10:48 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (11 responses)
The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.
Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
Posted Jan 17, 2020 13:36 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (1 responses)
If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.
Posted Jan 17, 2020 16:51 UTC (Fri)
by excors (subscriber, #95769)
[Link]
You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.
You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)
Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.
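The failure mode described above can be demonstrated directly. This sketch assumes a Linux filesystem (where names are arbitrary bytes except '/' and NUL) and Python's filesystem encoding being UTF-8; the filename is invented for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    raw = b'report-\xff.txt'            # not valid UTF-8
    with open(os.path.join(os.fsencode(d), raw), 'wb') as f:
        f.write(b'data')

    # os.listdir() decodes with surrogateescape: byte 0xFF -> U+DCFF.
    name = os.listdir(d)[0]
    assert name == 'report-\udcff.txt'

    # open() encodes it back the same way, so file access round-trips...
    with open(os.path.join(d, name), 'rb') as f:
        contents = f.read()
    assert contents == b'data'

    # ...but strict encoding, as print() attempts on a UTF-8 stream,
    # raises instead of printing something that looks nearly correct:
    try:
        name.encode('utf-8')
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False

assert not strict_ok
```

So the name works everywhere surrogateescape is applied implicitly, and blows up at the first API that encodes strictly.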
Posted Jan 17, 2020 15:05 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (8 responses)
These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?
I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.
Posted Jan 17, 2020 15:45 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (7 responses)
Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.
Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on a UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
Posted Jan 17, 2020 16:35 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (6 responses)
How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.
And now case insensitivity even - much trickier than filename encoding.
Or NTFS maybe.
Posted Jan 17, 2020 16:51 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
There was also DOS (original and alternative) and ISO code pages, but they were rarely used.
Posted Jan 17, 2020 17:35 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (4 responses)
So how did Linux and Windows users exchange files in Russia? Not at all?
The question of which software layer should force users to make the encodings they use explicit is not an obvious one; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.
Posted Jan 17, 2020 17:49 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
At this time most often used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.
This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.
However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.
Posted Jan 17, 2020 18:06 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (2 responses)
A cleanup flag day is IMHO the best trade off.
Posted Jan 18, 2020 22:40 UTC (Sat)
by togga (guest, #53103)
[Link] (1 responses)
Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?
Posted Jan 18, 2020 22:48 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
s/language/encodings/
This entire debate summarized in less than 25 characters. My pleasure.
Posted Jan 16, 2020 13:49 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (2 responses)
If you no longer have any way to type them because, surprise, your environment has been UTF8 for the last decade or so, then you'll need an otherwise-transparent encoding that can be pasted (or generated manually via \Udcxx), and that doesn't clash with the rest of your environment (your locale is utf-8 – and that's unlikely to change). Surrogateescape works for that. It should even be copy+paste-able.
Posted Jan 22, 2020 13:01 UTC (Wed)
by niner (subscriber, #26151)
[Link] (1 responses)
The shell dutifully shows the name with surrogate characters:
Get that name from a directory listing, treating it like a string with a regex grep:
And just for fun: select+paste the file name in konsole:
Actually it looks like file names with "broken" encodings work pretty well. It's only Python 3 that stumbles:
nine@sphinx:~> python3
Posted Jan 22, 2020 23:00 UTC (Wed)
by Jandar (subscriber, #85683)
[Link]
'?' is a special character for pattern matching in sh.
$ echo foo >testfilexx
So maybe this wasn't a correct test to see if your filename worked with copy&paste.
Posted Jan 16, 2020 16:34 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
Sure. The entire software world is going to fix all its filename bugs and assumptions just because some people name their files on some filesystems in funny ways. The programs that don't get fixed will die. That plan is so much simpler and easier than renaming files. /s
Oh, and all the developers who were repeatedly told to "sanitize your input" to protect themselves and the buggy programs above are all going to relax their checks a bit too.
Best of luck!
If you can't be happy, be at least realistic.
Posted Jan 16, 2020 21:49 UTC (Thu)
by roc (subscriber, #30627)
[Link]
But it is realistic to expect that common utilities handle arbitrary filenames correctly (the most common ones do). And it is realistic to expect that common languages and libraries make idiomatic filename-handling code handle arbitrary filenames correctly, because many do (including C, Go, Rust, Python 2, and even some parts of Python 3).
Posted Jan 16, 2020 16:05 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (2 responses)
Posted Jan 16, 2020 16:36 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
Not caring about funky filenames because most other programs don't care either: seems perfectly logical to me. You're confusing likelihood and logic.
Posted Jan 16, 2020 17:15 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
I'm very happy that Python catches funky filenames at a relatively low level with a clear, generic, usual, googlable and stackoverflowable exception rather than with some obscure crash and/or security issue specific to each Python program. These references about "garbage-in, garbage-out" surrogates that I don't have time to read scare me; I wish there were a way to turn them off.
I do not claim Python made all the right Unicode decisions; I don't know. I bet not, nothing's perfect. This comment is only about funky file names.
Posted Jan 16, 2020 17:20 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
"were"? https://lwn.net/Articles/784041/ Case-insensitive ext4
Now _that_ (case sensitivity) really never belonged to a kernel IMHO. Realism?
Posted Jan 16, 2020 15:58 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (2 responses)
Absolutely. And I would add "and **only** by humans". Language run-times should not mess with them, period.
Posted Jan 21, 2020 18:43 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Jan 21, 2020 19:32 UTC (Tue)
by Jandar (subscriber, #85683)
[Link]
Although in case of trouble-shooting readable file-names are a remedy.
Posted Jan 13, 2020 20:20 UTC (Mon)
by pj (subscriber, #4506)
[Link] (79 responses)
Posted Jan 13, 2020 20:32 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (62 responses)
As a result, Py3 has lost several huge codebases that started migrating to Go instead. Other projects like Mercurial or OpenStack started migration at the very last moment, because of 2.7 EoL.
Posted Jan 13, 2020 21:01 UTC (Mon)
by vstinner (subscriber, #42675)
[Link] (25 responses)
Like Mercurial, Twisted is heavily based on bytes (it's a networking framework), and it was ported successfully to Python 3 a few years ago. Twisted can now be used with asyncio.
I tried to help port Mercurial to Python 3, but its maintainers were not really open to discussing Python 3 when I tried. Well, I wanted to use Unicode for filenames, and they didn't want to hear of this idea. I gave up ;-)
Posted Jan 13, 2020 22:16 UTC (Mon)
by excors (subscriber, #95769)
[Link] (23 responses)
The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode. (There's PEP 383 but that produces strings that are not quite Unicode, e.g. it becomes impossible to encode them as UTF-16, so that's not good). And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode. For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.
(The article also mentions the solution, as implemented in Rust: filenames are a platform-specific string type, with lossy conversions to Unicode if you really want that (e.g. to display to users).)
Posted Jan 13, 2020 23:19 UTC (Mon)
by vstinner (subscriber, #42675)
[Link] (12 responses)
On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.
Read my articles https://vstinner.github.io/python30-listdir-undecodable-f... and https://vstinner.github.io/pep-383.html for the history the Unicode usage for filenames in the early days of Python 3 (Python 3.0 and Python 3.1).
The problem is that the UTF-8 codec of Python 2 doesn't respect the Unicode standard: it does encode surrogate characters. The Python 3 codec doesn't encode them, which makes it possible to use the surrogateescape error handler with UTF-8.
> And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
I'm not sure of which problem you're talking about.
If you care about getting the same character on Windows and Linux (e.g. the é letter, U+00E9), you should encode the filename differently. Storing the filename as Unicode in the application is a convenient way to do that. That's why Python prefers Unicode for filenames. But it also accepts filenames as bytes.
> For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.
Well, it is where I gave up :-)
Posted Jan 13, 2020 23:29 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Posted Jan 13, 2020 23:37 UTC (Mon)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Jan 14, 2020 2:22 UTC (Tue)
by excors (subscriber, #95769)
[Link] (2 responses)
You can't create a string like '\U00123456' (SyntaxError) or chr(0x123456) (ValueError); it's limited to the 21-bit range. But you *can* create a string like '\udccc' and Python will happily process it, at least until you try to encode it. '\udccc'.encode('utf-8') throws UnicodeEncodeError.
If you use the special decoding mode, b'\xcc'.decode('utf-8', 'surrogateescape') gives '\udccc'. If you (or some library) do that, your application is now tainted with not-really-Unicode strings, and I think if you ever try to encode without surrogateescape then you'll risk getting an exception.
If you tried to decode Windows filenames as round-trippable UCS-2, like
>>> ''.join(chr(c) for c, in struct.iter_unpack(b'>H', b'\xd8\x08\xdf\x45'))
then you'd be introducing a third type of string (after Unicode and Unicode-plus-surrogate-escapes) which seems likely to make things even worse.
Posted Jan 14, 2020 2:44 UTC (Tue)
by excors (subscriber, #95769)
[Link]
Incidentally, that seems to include the default encoding performed by print() (at least in Python 3.6 on my system):
>>> for f in os.listdir('.'): print(f)
os.listdir() will surrogateescape-decode and functions like open() will surrogateescape-encode the filenames, but that doesn't help if you've got e.g. logging code that touches the filenames too.
Posted Jan 14, 2020 4:47 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jan 16, 2020 8:08 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Yet all VCS provide some sort of auto.crlf insanity, go figure.
Just in case someone wants to use Notepad-- from the last decade.
Posted Jan 13, 2020 23:32 UTC (Mon)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jan 16, 2020 17:40 UTC (Thu)
by kjp (guest, #39639)
[Link]
Python: It's a [unstable] scripting language. NOT a systems or application language.
Posted Jan 14, 2020 1:37 UTC (Tue)
by excors (subscriber, #95769)
[Link] (2 responses)
But then you end up with a "Unicode" string in memory which can't be safely encoded as UTF-8 or UTF-16, so it's not really a Unicode string at all. (As far as I can see, the specifications are very clear that UTF-* can't encode U+D800..U+DFFF. An implementation that does encode/decode them is wrong or is not Unicode.)
That means Python applications that assume 'str' is Unicode are liable to get random exceptions when encoding properly (i.e. without surrogateescape).
> > And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
Windows (with NTFS) lets you create a file whose name is e.g. "\ud800". The APIs all handle filenames as strings of wchar_t (equivalent to uint16_t), so they're perfectly happy with that file. But it's clearly not a valid string of UTF-16 code units (because it would be an unpaired surrogate) so it can't be decoded, and it's not a valid string of Unicode scalar values so it can't be directly encoded as UTF-8 or UTF-16. It's simply not Unicode.
In practice most native Windows applications and APIs treat filenames as effectively UCS-2, and they never try to encode or decode so they don't care about surrogates, though the text rendering APIs try to decode as UTF-16 and go a bit weird if that fails. Python strings aren't UCS-2 so it has to convert to 'str' somehow, but there's no correct way to do that conversion.
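A small Python sketch of the "it's simply not Unicode" point: a str holding an unpaired surrogate (as an NTFS filename may contain) is carried around fine, but cannot be encoded as standard UTF-8 or UTF-16. Python's non-standard `surrogatepass` error handler is roughly what lossless round-tripping of such names would require:

```python
# An unpaired surrogate, as can appear in a Windows (NTFS) filename.
name = '\ud800'

# Standard UTF-8 and UTF-16 both refuse to encode it:
failures = 0
for codec in ('utf-8', 'utf-16-le'):
    try:
        name.encode(codec)
    except UnicodeEncodeError:
        failures += 1
assert failures == 2

# The non-standard 'surrogatepass' handler smuggles it through anyway,
# producing the (invalid-as-UTF-8) byte sequence for U+D800:
assert name.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80'
```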
Posted Jan 14, 2020 6:04 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link] (1 responses)
https://docs.microsoft.com/en-us/windows/win32/fileio/nam...
Also, whatever your complaints are about whatever language, with respect to filenames, the win32 api is worse.
It's amazingly inconsistent. The level of insanity is just astonishing, especially if you're going across files created with the win api, and the .net libs.
You *have* to p/invoke to read some files, and use the long-filepath prefix, which doesn't support relative paths. And that's just the start.
Admittedly, I haven't touched it for almost a decade in any serious fashion, but, based on the docs linked above, it doesn't seem much has changed.
It's remarkable how easy they make it to write files that are quite hard to open.
Posted Jan 14, 2020 15:35 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 15, 2020 0:26 UTC (Wed)
by gracinet (guest, #89400)
[Link]
don't forget that Mercurial has to cope with filenames in its history that are 25 years old. Yes, that predates Mercurial but some of the older repos have had a long history as CVS then SVN.
Factor in the very strong stability requirements and the fact that any risk of changing a hash value is to be avoided, and it's no wonder a VCS is one of the last to take the plunge. It's really not a matter of the size of the codebase in this case.
Note: I wasn't directly involved in Mercurial at the time you were engaging with the project about that, I hope some good came out of it anyway.
Posted Jan 14, 2020 2:18 UTC (Tue)
by flussence (guest, #85566)
[Link]
Posted Jan 14, 2020 7:57 UTC (Tue)
by epa (subscriber, #39769)
[Link] (2 responses)
If you do want a truly arbitrary ‘bag of bytes’ not just for file contents but for names too, I have the feeling you’d probably be using a different tool anyway.
Posted Jan 14, 2020 15:39 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
Losing the ability to read history of when the tool did not have such a restriction would not be a good thing. Losing the ability to manipulate those files (even to rename them to something valid) would also be tricky if it failed up front about bad filenames.
Posted Jan 15, 2020 18:58 UTC (Wed)
by hkario (subscriber, #94864)
[Link]
just unzip a file from non-UTF-8 system, you're almost guaranteed to get mojibake as a result; then blindly commit files to the VCS and bam, you're set
Posted Jan 14, 2020 11:35 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (5 responses)
Posted Jan 14, 2020 11:59 UTC (Tue)
by roc (subscriber, #30627)
[Link] (4 responses)
You can accept all filenames and make repositories portable between Windows and Unix if they have valid Unicode filenames. AFAIK that's what Mercurial does, and I hope it's what git does.
Posted Jan 14, 2020 12:33 UTC (Tue)
by dezgeg (subscriber, #92243)
[Link] (3 responses)
Posted Jan 14, 2020 13:21 UTC (Tue)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jan 14, 2020 15:51 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
They had a load of grief with mixed Windows/linux repos, so there's now a switch that says "convert cr/lf on checkout/checkin".
Add a switch that says "enforce valid utf-8/utf-16/Apple filenames, and sort out the mess at checkout/checkin".
If that's off by default, or on by default for new repos, or whatever, then at least NEW stuff will be sane, even if older stuff isn't.
Cheers,
Posted Jan 14, 2020 15:42 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 13, 2020 23:57 UTC (Mon)
by prometheanfire (subscriber, #65683)
[Link]
Posted Jan 14, 2020 5:40 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link] (35 responses)
Python to Go seems like a weird switch. I tend to use them for very different tasks.
Unless you're bound to GCP as a platform or something similar.
But you're not the only one mentioning this: what projects have I missed that made the switch?
Posted Jan 14, 2020 16:02 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (34 responses)
Stuff like command-line utilities and servers works really well in Go.
Several huge Python projects are migrating to Go as a result.
Posted Jan 14, 2020 17:06 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link] (33 responses)
Can you name them?
Posted Jan 14, 2020 17:10 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (32 responses)
Posted Jan 14, 2020 18:09 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link] (31 responses)
As for the rest, a lot of infra-related things are being rewritten in Go just because containers (k8s and docker both use Go). That has little to do with the benefits offered by the language. It’s good old network effects. When you’re the container language, and everyone wants to do containers, being decent is sufficient to carry the day.
No one will argue that Go is less than decent. Many will argue it’s more than decent, but that’s irrelevant for its adoption curve.
Posted Jan 14, 2020 18:25 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Mind you, Google actually tried to fix some of the Python issues by trying JIT compilation with unladen-swallow project before that.
Posted Jan 14, 2020 19:05 UTC (Tue)
by rra (subscriber, #99804)
[Link]
For most applications, the speed changes don't matter and other concerns should dominate. But for core infrastructure code for large cloud providers, they absolutely matter in key places, and Python is not a good programming language for high-performance networking code.
Posted Jan 18, 2020 13:26 UTC (Sat)
by HelloWorld (guest, #56129)
[Link] (28 responses)
Posted Jan 18, 2020 14:22 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (27 responses)
The other major gripe with Go which you missed, IMHO, is its appalling error handling; the requirement to return a "(result, err)" pair and to check "err" *everywhere* (as opposed to returning a plain "result" and propagating exceptions via catch/throw or try/raise or however you'd call it) causes a "yay unreadable code" LOC explosion and can't protect against non-functional errors (division by zero, anybody?).
Posted Jan 19, 2020 0:09 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (26 responses)
By contrast, Scala does have language support for exceptions. It's pretty much the same as Java's try/catch/finally, how did that hold up? It's a steaming pile of crap. It interacts poorly with concurrency, it easily leads to resource leaks, it's hard to compose, it doesn't tell you which errors can occur where, and everybody who knows what they're doing is using a library instead, because libraries like ZIO don't have *any* of these problems.
So based on that experience, you're going to have a hard time convincing me that concurrency needs language support. Feel free to try anyway, but try ZIO first.
Posted Jan 19, 2020 1:03 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (25 responses)
It really is that simple.
Plus, Go has a VERY practical runtime with zero dependency executables and a good interactive GC. It's amazing how much better Golang's simple mark&sweep is when compared to Java's neverending morass of CMS or G1GC (that constantly require 'tuning').
Sure, I would like a bit more structured concurrency in Go, but this can come later once Go team rolls out generics.
Posted Jan 19, 2020 5:47 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (24 responses)
Apparently you haven't tried ZIO, because it beats the pants off anything Go can do.
It really is that simple.
Posted Jan 19, 2020 6:01 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (23 responses)
Posted Jan 19, 2020 7:58 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (22 responses)
Meanwhile, Go is written by practical engineers. Cancellation and timeouts are done through an explicitly passed context.Context; resource cleanups are done through deferred blocks.
These two simple mechanisms in practice allow complicated systems comprising hundreds of thousands of LOC to work reliably, while being easy to develop and iterate on, without requiring multi-minute waits for each compile/run cycle.
Posted Jan 19, 2020 10:25 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (16 responses)
If you come across a better paradigm sometime in the future, then bake it into a new version of the language and/or its libraries, and add interoperability features. Python3 is doing this, incidentally: asyncio is a heap of unstructured callbacks that evolved from somebody noticing that you can use "yield from" to build a coroutine runner, then Trio came along with a much better concept that actually enforces structure. Today the "anyio" module affords the same structured concept on top of asyncio, and in some probably-somewhat-distant future asyncio will support all that natively.
Languages, and their standard libraries, evolve.
With Go, this transition to Structured Concurrency is not going to happen any time soon because contexts and structure are nice-to-have features which are entirely optional and not supported by most libraries, thus it's much easier to simply ignore all that fancy structured stuff (another boilerplate argument you need to pass to every goroutine and another clause to add to every "select" because, surprise, there's no built-in cancellation? get real) and plod along as usual. The people in charge of Go do not want to change that. Their choice, just as it's my choice not to touch Go.
Posted Jan 19, 2020 12:35 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (15 responses)
Posted Jan 19, 2020 14:35 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (14 responses)
NB, Python also has the whole thing in a library. This is not primarily about language features. The problem is that it is simply impossible to add this to Go without either changing the language, or forcing people to write even more convoluted code.
Python basically transforms "result = foo(); return result" into what Go would spell as "result, err := foo(ctx); if err != nil { return nil, err }; return result, nil" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.
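To make that contrast concrete, a minimal Python sketch (the function names are invented for illustration): with exceptions, only the frame that can actually handle a failure mentions it, while the Go-style version would need explicit err plumbing in both functions.

```python
def read_config(path):
    # May raise OSError; no per-call error check needed here.
    with open(path) as f:
        return f.read()

def startup(path):
    # The error propagates implicitly through this frame.
    return read_config(path)

# Only the outermost caller, which can actually act on the failure,
# mentions the error at all:
try:
    startup('/nonexistent/config')
    handled = False
except OSError:
    handled = True

assert handled
```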
Posted Jan 19, 2020 21:17 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (13 responses)
Posted Jan 20, 2020 10:33 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (12 responses)
Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.
But while Python is not such a language, neither is Go, so in effect you're shifting the playing ground here.
> Not having error handling built into the language doesn't mean you have to check for errors on every call.
No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.
Posted Jan 20, 2020 12:51 UTC (Mon)
by HelloWorld (guest, #56129)
[Link] (1 responses)
Posted Jan 20, 2020 18:30 UTC (Mon)
by darwi (subscriber, #131202)
[Link]
Long time ago (~2013), I worked as a backend SW engineer. We transformed our code from Java (~50K lines) to Scala (~7K lines, same functionality).
After the transition was complete, not a single NullPointerException was seen anywhere in the system, thanks to the Option[T] generics and pattern matching on Some()/None. It really made a huge difference.
NULL is a mistake in computing that no modern language should imitate :-( After my Scala experience, I dread using any language that openly accepts NULLs (python3, when used in a large 20k+ code-base, included!).
Posted Jan 20, 2020 15:46 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (9 responses)
Yes, but with these types, *ignoring* (or passing on in Python) the error takes explicit steps rather than being implicit. IMO, that's a *far* better default. I would think the Zen of Python agrees…
Posted Jan 20, 2020 17:21 UTC (Mon)
by HelloWorld (guest, #56129)
[Link] (8 responses)
No, passing the error on does not take explicit steps, because the monadic bind operator (>>=) takes care of that for us. And that's a Good Thing, because in the vast majority of cases that is what you want to do. The problem with exceptions isn't that error propagation is implicit, that is actually a feature, but that it interacts poorly with the type system, resources that need to be closed, concurrency etc..
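A minimal Python sketch of the bind-based propagation described above (Ok, Err and bind are invented names for this illustration; real libraries such as ZIO, or Haskell's Either, are far richer):

```python
class Ok:
    """Successful result: bind applies the next step to the value."""
    def __init__(self, value):
        self.value = value
    def bind(self, f):
        return f(self.value)

class Err:
    """Failed result: bind short-circuits, propagating the error."""
    def __init__(self, error):
        self.error = error
    def bind(self, f):
        return self

def parse(s):
    return Ok(int(s)) if s.isdigit() else Err('not a number: ' + s)

def halve(n):
    return Ok(n // 2) if n % 2 == 0 else Err('odd number: %d' % n)

# The happy path chains steps with no explicit error checks in between:
assert parse('42').bind(halve).value == 21
# A failure anywhere short-circuits the rest of the chain:
assert parse('x').bind(halve).error == 'not a number: x'
```

Propagation is implicit (bind handles it), yet ignoring an error still requires an explicit choice at the end of the chain.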
Posted Jan 20, 2020 18:28 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (6 responses)
Typing exceptions is an unsolved problem; conceivably it could be handled by a type checker like mypy. However, in actual practice most code is documented as possibly-raising a couple of well-known "special" exceptions derived from some base type ("HTTPError"), but might actually raise a couple of others (network error, cancellation, character encoding …). Neither listing them all separately (much too tedious) nor using a catch-all BaseException (defeats the purpose) is a reasonable solution.
Posted Jan 20, 2020 22:44 UTC (Mon)
by HelloWorld (guest, #56129)
[Link] (2 responses)
On the other hand, there are trivial things that can't be done with
Posted Jan 21, 2020 6:51 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (1 responses)
You use an [Async]ExitStack. It's even in contextlib.
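For reference, a minimal sketch of `contextlib.ExitStack` closing an arbitrary number of resources; the temporary files here are created only for the demonstration:

```python
import os
import tempfile
from contextlib import ExitStack

# Create a few scratch files to open (hypothetical stand-ins for real resources).
paths = []
for _ in range(3):
    fd, p = tempfile.mkstemp()
    os.close(fd)
    paths.append(p)

with ExitStack() as stack:
    files = [stack.enter_context(open(p)) for p in paths]
    # All files are open inside the block; ExitStack closes each of them
    # (in reverse order) on exit, even if an exception is raised.
    assert all(not f.closed for f in files)

assert all(f.closed for f in files)

for p in paths:
    os.remove(p)
```

The number of managed resources need not be known statically, which is the case a plain nested `with` cannot express.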
Yes, functional languages with Monads and all that stuff in them are super cool. No question. They're also super hard to learn compared to, say, Python.
Posted Jan 21, 2020 14:32 UTC (Tue)
by HelloWorld (guest, #56129)
[Link]
> They're also super hard to learn compared to, say, Python.
Posted Jan 21, 2020 2:32 UTC (Tue)
by HelloWorld (guest, #56129)
[Link] (2 responses)
If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem: if you can't even list the errors that might happen in your function, you can't reasonably expect people to handle them either. You need to constrain the number of ways a function can fail, normally by categorising them in some way (e.g. technical errors vs. business-domain errors). I think this is actually yet another way in which strongly typed functional programming pushes you towards better API design.
Unfortunately Scala hasn't proceeded along this path as far as I would like, because much of the ecosystem is based on cats-effect where type-safe error handling isn't the default. ZIO does much better, which is actually a good example of how innovation can happen when you implement these things in libraries as opposed to the language. Java has checked exceptions, and they're utterly useless now that everything is async...
Posted Jan 21, 2020 7:13 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (1 responses)
… and unstructured.
The Java people have indicated that they're going to migrate their async concepts towards Structured Concurrency, at which point they'll again be (somewhat) useful.
> If listing the errors that an operation can throw is too tedious, I would argue that that is not a language problem but a library design problem
That's one side of the coin. The other is that, IMHO, a library which insists on re-packaging every kind of error under the sun in its own exception type is intensely annoying, because that loses or hides information.
There's not much commonality between a Cancellation, a JSON syntax error, a character encoding problem, or a HTTP 50x error, yet an HTTP client library might conceivably raise any one of those. And personally I have no problem with that – I teach my code to retry any 40x errors with exponential back-off and leave the rest to "retry *much* later and alert a human", thus the next-higher error handler is the Exception superclass anyway.
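A sketch of that retry-with-exponential-back-off pattern; the `TransientError` class and the tiny delays are illustrative, not taken from any real HTTP library:

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable HTTP-level failure."""

def with_backoff(fn, attempts=5, base=0.01):
    # Exponential back-off with a little jitter; the base delay is tiny
    # here only to keep the example fast.
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise  # give up: let the next-higher handler alert a human
            time.sleep(base * 2 ** i + random.uniform(0, base))

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("try again")
    return "ok"

print(with_backoff(flaky))  # succeeds on the third attempt
```

Anything that is not a `TransientError` escapes immediately, which matches the "retry *much* later and alert a human" split described above.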
Posted Mar 19, 2020 16:52 UTC (Thu)
by bjartur (guest, #67801)
[Link]
Nesting result types explicitly is helpful because it makes you wonder when an exponential backoff is appropriate.
How about
Posted Jan 21, 2020 22:42 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
As a code reviewer, implicit codepaths are harder to reason about and don't make me as confident when reviewing such code (though the bar may also be lower in these cases because error reporting of escaping exceptions may be louder ignoring the `except BaseException: pass` anti-pattern instances).
Posted Jan 19, 2020 11:23 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (4 responses)
You're free to stick with purely dysfunctional programming then. Have fun!
Posted Jan 19, 2020 18:49 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Jan 19, 2020 21:19 UTC (Sun)
by HelloWorld (guest, #56129)
[Link] (2 responses)
Posted Jan 19, 2020 21:25 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
My verdict is that pure FP languages are used only for ideological reasons and are totally impractical otherwise.
Posted Jan 19, 2020 22:42 UTC (Sun)
by HelloWorld (guest, #56129)
[Link]
> I also spent probably several months in aggregate waiting for Scala code to compile.
Posted Jan 14, 2020 0:02 UTC (Tue)
by rgmoore (✭ supporter ✭, #75)
[Link] (15 responses)
One of the points made in the blog post, though, is that the creators of Python 3 did some really stupid stuff that made it needlessly difficult to write code that worked in both versions. The specific example that stood out to me was the use of identifiers to specify whether a string literal was a string of bytes or of unicode points. In Python 2, it was possible to specify
Posted Jan 14, 2020 10:12 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (14 responses)
What happened instead was an intense period of slowly converting to Py3, heaps of code that use "import six", and modules that ran, and still run, with both 2 and 3 once some of those nits were reverted. And they were.
Thus, IMHO, accusations of Python core developers not listening to (some of) their users are for the most part really unfounded. Hindsight is 20/20; yes, they could have done some things better. But frankly, my compassion for people who took their own sweet time porting their code to Python 3 and complain *now*, when there's definitely no more chance to add anything to Py2 to ease the transition, is severely limited.
Posted Jan 14, 2020 15:54 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (13 responses)
These concerns were raised back in 2008, but Py3 developers ignored them because it was clear (to them) that only bad code needed it and developers should shut up and eat their veggies.
Posted Jan 22, 2020 18:47 UTC (Wed)
by togga (guest, #53103)
[Link] (12 responses)
And then obviously removed later on.
Python 2.7.17 >>> b'{x}'.format(x=10)
Python 3.7.5 >>> b'{x}'.format(x=10)
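The outputs are elided above; on Python 2.7 the call succeeds because b'...' is simply a str, while on Python 3 it raises AttributeError, since bytes never grew a .format() method. What Python 3.5 did restore is %-formatting for bytes (PEP 461), shown in this small check:

```python
# bytes has no .format() in Python 3, but PEP 461 (Python 3.5+)
# brought back %-style formatting for bytes and bytearray.
assert not hasattr(b'', 'format')
assert b'%d' % 10 == b'10'
assert b'%s %s' % (b'hello', b'bytes') == b'hello bytes'
```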
Posted Jan 22, 2020 19:47 UTC (Wed)
by foom (subscriber, #14868)
[Link] (11 responses)
Posted Jan 22, 2020 19:57 UTC (Wed)
by togga (guest, #53103)
[Link] (10 responses)
Posted Jan 23, 2020 8:50 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (9 responses)
Posted Jan 23, 2020 14:18 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
So instead the burden is put on the coder to have to think about whether bytes or strings will be threaded through their code and can't use the newer API if they might have bytes floating about?
Posted Jan 23, 2020 16:48 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Jan 25, 2020 2:24 UTC (Sat)
by togga (guest, #53103)
[Link] (5 responses)
There is no reason for not
Posted Jan 25, 2020 11:03 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (4 responses)
Python3 source code is Unicode. Python attribute access is written as "object.attr". This "attr" part therefore must be Unicode. Why would you want to use anything else? If you need bytestrings as keys, or integers for that matter, use a dict.
Posted Jan 25, 2020 21:44 UTC (Sat)
by togga (guest, #53103)
[Link] (3 responses)
>Why would you want to use anything else?
>"use a dict."
>>> a=type('A', (object,), {})()
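To illustrate the point under discussion (attribute names must be str in Python 3, while a plain dict accepts bytes keys), a small runnable sketch; the attribute names here are made up:

```python
# Dynamic attribute access requires str names in Python 3.
a = type('A', (object,), {})()
setattr(a, 'x', 1)          # fine: str attribute name
assert a.x == 1

try:
    setattr(a, b'y', 2)     # worked in Python 2; TypeError in Python 3
except TypeError as e:
    print(e)                # attribute name must be string, not 'bytes'

d = {b'y': 2}               # a dict, by contrast, takes bytes keys directly
assert d[b'y'] == 2
```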
Posted Jan 29, 2020 12:32 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (2 responses)
Sure, if you want to be pedantic you can use "-*- coding: iso-8859-1 -*-" (or whatever) in the first two lines and write your code in Latin1 (or whatever), but that's just the codec Python uses to read the source. It's still decoded to Unicode internally.
> >"use a dict."
Currently. In CPython. Other Python implementations, or future versions of CPython, may or may not use what is, or looks like, a generic dict to do that.
Yes, I do question why anybody would want to use attributes which then can't be accessed with `obj.attr` syntax. That's useless.
Use a (real) dict.
Posted Feb 12, 2020 20:40 UTC (Wed)
by togga (guest, #53103)
[Link] (1 responses)
As I said above, it's a necessity due to the design of library APIs. Examples of needed, otherwise unnecessary, encode/decode calls are plenty (and error-prone). The article mentions a few; I've already mentioned ctypes, where for instance structure field names (often read from binary APIs such as C strings) are required to be attributes.
This thread has become a bit off topic. The interesting question for me is Python 2to3 language regressions, or which migrations are feasible; that stage was done in roughly 2010 to 2013, with several failed Python 3 migration attempts. Nothing of value has changed since. Half of my Python 2 use-cases are not suited for Python 3 due to its design choices, and I do not intend to squeeze a problem into a tool not suited for it. That's more of a fact.
The question in the back of my head is about the other half of my use-cases, the ones that do fit Python 3. Given the experience of Python leadership attitudes, decisions, migration pains, etc., of which the article is one example: is Python 3 a sound language choice for new projects?
Posted Feb 12, 2020 20:47 UTC (Wed)
by togga (guest, #53103)
[Link]
Oops.. it should read the opposite. "is not" Python 2to3 language regressions
Posted Jan 25, 2020 13:26 UTC (Sat)
by foom (subscriber, #14868)
[Link]
Given the invention of better format syntax, forcing the continued use of the worse/legacy % format syntax for bytestrings seems a somewhat mystifying decision.
It's not as if the only use of bytestrings is in code converted from python 2...
Posted Jan 14, 2020 1:23 UTC (Tue)
by atai (subscriber, #10977)
[Link]
Posted Jan 14, 2020 8:02 UTC (Tue)
by edomaur (subscriber, #14520)
[Link]
Facebook is using Mercurial internally, because it works better than git as a monorepository, but they had to rewrite many hot paths, and are currently working on Mononoke, a full-Rust implementation of the Mercurial server. Still "The version that we provide on GitHub does not build yet." but I think it's an interesting project.
https://github.com/facebookexperimental/mononoke
Posted Jan 15, 2020 0:43 UTC (Wed)
by gracinet (guest, #89400)
[Link]
That got me on board, and if you're interested, my colleague Raphaël Gomès will give two talks on that subject in two weeks at FOSDEM.
I know from your posting history here what you think of Python3 and unicode strings, but even though Rust has dedicated types for filesystem paths, we still have some issues. For instance regex patterns are `&str`. It can be overcome by escaping, but that's less direct than reading them straight from the `.hgignore` file.
Posted Jan 13, 2020 19:30 UTC (Mon)
by HelloWorld (guest, #56129)
[Link] (30 responses)
Posted Jan 13, 2020 20:49 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (28 responses)
Python's migration problem was that it didn't allow Python 3 code to load and run Python 2 bytecode (or otherwise use Python 2 files) until everything was ported, and vice versa. This meant that any project that wanted to migrate had to wait until 100% of its dependencies were on Python 3 already. And any library had a huge window where it needed to maintain compatibility with both 2 and 3 (so no one could take advantage of new features).
Without this migration path, developers needed to perform a big-bang migration. The article calls it a "flag day" migration.
This is discussed in the section labelled "Commentary on Python 3" which is well written and easy to follow.
Posted Jan 13, 2020 21:28 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (27 responses)
Posted Jan 13, 2020 23:09 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (9 responses)
It is absolutely true. Libraries target JDK 8 because that's what Android is stuck on. It's still a hassle and Java's type system didn't save it from the problems.
Java 9 was EOL in March 2018. Java 10 was EOL in September 2018. And if you're a commercial user of Java 8 and don't have a license with Oracle or anyone else, support was EOL in January 2019.
Posted Jan 13, 2020 23:12 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
This didn't work with Python, the transition from 2 to 3 required massive rewrites.
> And if you're a commercial user of Java 8 and don't have a license with Oracle or anyone else, support was EOL in January 2019.
Posted Jan 13, 2020 23:22 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (1 responses)
Indeed and this is not the desired state of affairs. And Java's type system did not save it.
> This didn't work with Python, the transition from 2 to 3 required massive rewrites.
Indeed and this is not the desired state of affairs. And Python's type system did not cause it.
Posted Jan 13, 2020 23:30 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Indeed and this is not the desired state of affairs. And Python's type system did not cause it.
Posted Jan 14, 2020 3:11 UTC (Tue)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Unless your package, or one of its dependencies, does bytecode manipulation and uses an old version of the bytecode manipulation library, which chokes on classes compiled for a newer JDK. Or your package depends on one of the several J2EE libraries which were removed by JDK 11 (some of them having no replacement outside of the JDK). Or your package, or one of its dependencies, chokes on the replacement of one of the several J2EE libraries which were removed by JDK 11, because it uses an old version of the bytecode manipulation library, and the replacement J2EE library was compiled for a newer JDK.
As late as the end of 2019, some packages were still announcing Java 9 compatibility fixes. For some reason, Java 9 had more compatibility issues than usual, and Java 11 made it worse by completely removing components first deprecated in the short-lived Java 9 release.
Posted Jan 14, 2020 6:12 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link]
Of course, it is doing some pretty wacky stuff. But it's the only option for some things (e.g. GCP Dataflow )
Posted Jan 15, 2020 19:52 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
The Java leadership has been busy making itself irrelevant by alienating most of the rest of the IT world.
Though I wonder where that will leave all the Apache foundation Java projects. They can’t survive in a closed circuit loop forever. Scala is not the solution, its adoption outside the existing Java world is nonexistent.
Posted Jan 15, 2020 19:54 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Jan 16, 2020 9:09 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Who wants to deal with this crap forever? Easier to port to another language and let someone else fatten lawyers.
Posted Jan 16, 2020 16:20 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 13, 2020 23:16 UTC (Mon)
by HelloWorld (guest, #56129)
[Link] (16 responses)
Posted Jan 13, 2020 23:22 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
However, I can't fault the way Snoracle introduced them - none of my module-unaware code broke during JDK9 migration.
Posted Jan 14, 2020 1:44 UTC (Tue)
by Conan_Kudo (subscriber, #103240)
[Link] (14 responses)
Oh God, no. I like my sanity, thank you very much. I am *totally* OK with that restriction and I would rather nobody ever lifted it.
Posted Jan 14, 2020 6:58 UTC (Tue)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 14, 2020 19:02 UTC (Tue)
by HelloWorld (guest, #56129)
[Link] (12 responses)
Posted Jan 15, 2020 8:26 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
Even given all of that, I'm still not thrilled with this idea, because I *know* that sooner or later, some part of X is going to somehow call the "wrong Y," and I'll get paged at 3 AM when it crashes in production. And I'm sure that someone will have written a comment somewhere long ago, assuring the reader that, no, of course it's "impossible" for X to call the wrong Y, don't be silly, you see, they are entirely separate, there's no possible way for them to interact. Except for that one obscure side channel the SWE forgot about, where on alternate Thursdays when the moon is full, the software briefly tries to bind TCP port 12345 at exactly the stroke of midnight, in order to practice speaking a profane and blasphemous protocol. A protocol defined only in an RFC that the IETF subsequently declared Librorum Prohibitorum, and which must now be obtained by special dispensation from Vint Cerf. Why does it do this? Because some client asked for it five years ago and everyone has now bleached that contract from their collective unconscious. Anyway, the second version of Y fails to bind the port on EADDRINUSE, and the error gets swallowed because don't you know, in a containerized setup, you're not supposed to get EADDRINUSE, so obviously it's a /* can't happen */ situation. Then X, blissfully unaware, connects to the port and talks to the wrong Y, and the wrong Y does something subtly different from what X wanted, and if you're very lucky, this merely causes the app to crash.
In theory, those are mostly solvable problems. In practice, the language is not actually in a position to solve them (Are you really going to stop the two versions of Y from interacting with any kind of global state, including the filesystem? Unless your language is Haskell, or perhaps an extremely locked down dialect of JavaScript, that isn't realistic as a language-level restriction.). They are institutional problems, and require institutional solutions. As it turns out, one of those solutions can be* "library versions are bumped on a fixed schedule, keep up or else your code stops building and we stop deploying it." A "one library version per process" rule is a straightforward way of enforcing that, but of course you could just as easily attach some kind of custom restriction to the build process instead. It's just a matter of convenience.
* "Can be," not "has to be." There are other solutions, with various advantages and tradeoffs, which are beyond the scope of this discussion.
Posted Jan 15, 2020 9:10 UTC (Wed)
by roc (subscriber, #30627)
[Link] (2 responses)
I'm familiar with libraries stomping on each other at run-time, e.g. with races around fork(). I'm familiar with C libraries stomping on each other with symbol collisions during linking. I've even had listening port collisions with two components in the same container. But my Rust project links multiple versions of some libraries (which libraries, and which versions changes over time), and I haven't had any problems with that so far in practice.
Posted Jan 15, 2020 21:23 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Rust avoids the symbol problems, but still has woes with libraries trying to control any global resource (signal handlers, environment variables, etc.).
Posted Jan 15, 2020 21:38 UTC (Wed)
by roc (subscriber, #30627)
[Link]
Signal handlers are just a massive problem in general --- for different libraries as well as for two versions of the same library. For that and other reasons I have not encountered any Rust crates (other than tokio-signal) that set signal handlers. Likewise setting environment variables is a minefield libraries should avoid under any circumstances.
So I agree that two versions of the same library are more likely to hit these issues than two different libraries, but I'm not convinced that *in practice* it's really worse, for Rust.
Posted Jan 15, 2020 9:48 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (2 responses)
Besides, it's not like Java will detect that there are two different versions of a library on the classpath unless you use special measures to prevent that. It'll just crash later when something tries to call a method that isn't there any more or something like that, so it's not like your approach of “let's just forbid it” prevents anything.
So yeah, I'm not buying it. Libraries should be isolated from each other, anything else just doesn't scale…
Posted Jan 16, 2020 1:57 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
You see, this kind of clever thinking on the SWE side is why the average SRE has a drinking problem. I told you that there would be a comment to that effect, and you *actually wrote it for me,* apparently in all seriousness believing it would change my mind.
Realistically, every major operating system has process-wide mutable parameters (the working directory, the umask, UID/GIDs, stdin/out/err redirection, etc.), and while Java may try very hard to sandbox those parameters, you can always call out to native code and manipulate them anyway.
Posted Jan 16, 2020 9:01 UTC (Thu)
by HelloWorld (guest, #56129)
[Link]
Of course, 99 % of libraries *don't* call out to native code, so what you're saying is that we can't have the solution for the 99 %, because it might not work for the 1%, and of course you don't have a solution for the remaining cases either.
Besides, all this stuff about the cwd, the uid/gid, stdio redirection etc. is complete hogwash, because these are shared among *different* libraries as well, and therefore any library that relies on any of these to be in any particular state is broken to begin with, whether or not you allow multiple versions to be loaded. It's a completely unrelated problem.
Posted Jan 15, 2020 19:58 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (4 responses)
At that point, the pyramid crumbles under the weight of its technical debt and wipes out all past “savings”.
Posted Jan 16, 2020 9:09 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (3 responses)
The fact of the matter is that this problem doesn't go away on its own, and if the platform doesn't solve it, people come up with other solutions. For Java that is JarJar Links...
Posted Jan 16, 2020 9:33 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (2 responses)
That’s the inherent cost of using third party code. Don’t like it? Write your own code.
Engineering means delivering reliable solutions. Not letting problems fester in dark places.
Posted Jan 16, 2020 14:19 UTC (Thu)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 16, 2020 14:22 UTC (Thu)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 14, 2020 1:49 UTC (Tue)
by KaiRo (subscriber, #1987)
[Link]
Posted Jan 13, 2020 21:31 UTC (Mon)
by roc (subscriber, #30627)
[Link] (7 responses)
Posted Jan 13, 2020 23:15 UTC (Mon)
by ehiggs (subscriber, #90713)
[Link] (3 responses)
(In fairness, Python isn't the only culprit: Cargo also stores its index as directories and files, and would probably be much snappier if it used SQLite.)
Posted Jan 14, 2020 1:25 UTC (Tue)
by KaiRo (subscriber, #1987)
[Link] (1 responses)
Posted Jan 14, 2020 4:49 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jan 14, 2020 16:43 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link]
Posted Jan 14, 2020 3:06 UTC (Tue)
by foom (subscriber, #14868)
[Link] (2 responses)
And these days, maybe you'll see less python, but writing server software in JavaScript is now all the rage...
Posted Jan 14, 2020 16:43 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
Posted Jan 15, 2020 1:03 UTC (Wed)
by roc (subscriber, #30627)
[Link]
I guess when you write a prototype, you need to consider whether it's worth writing in a language that will let it grow into a performant, reliable production system.
Posted Jan 13, 2020 23:37 UTC (Mon)
by togga (guest, #53103)
[Link] (52 responses)
"Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment"
* the approach of assuming the world is Unicode is flat out wrong
Given the obvious mismatch, it had to be a big leap of faith for Gregory to even undertake this effort. We can now conclude that the ROI will never happen.
The last bullet above, in combination with ctypes, was the definitive turning point for me: I was hit by exceptions in iovisor/bcc after they added ~130 encode/decode calls for py3 compatibility. This made me abandon the iovisor/bcc Python part altogether. Even if it tries to solve day-to-day issues, it creates a whole set of new ones.
I've been a Python developer since 2003 and have seen it go from a kick-ass scripting language (py2) to a subpar application language (py3). If, instead of this mess and questionable feature creep, the Python community had spent those wasted years focusing on things like the GIL, the threading model, and performance, Python could have been better prepared for this:
https://www.theguardian.com/commentisfree/2020/jan/11/we-...
Posted Jan 14, 2020 0:49 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (16 responses)
Meh, I wouldn't go that far. Python 2 and 3 are both perfectly serviceable "I would use bash except it's kinda terrible" languages (which is what I think of when I hear "scripting language"). For the most part, in new code, you don't have to think about encoding issues because everything "just works" (even when the way that it works is frankly terrifying - see surrogateescape for example).
Sure, if you are actually taking code points apart and playing around with the UTF-8 representation, it's a lot more annoying. But realistically, for basic sysadmin-ish scripting tasks, you're not actually doing that.
> * it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality
Can you please be more specific? The only "obvious" example I can think of is the filesystem encoding, but surrogateescape *does* abstract over that with no data loss or loss of functionality, and it's not "outright wrong" because the transformation is losslessly reversed on round-trip.
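A minimal demonstration of that lossless round-trip, assuming UTF-8 as the filesystem encoding (this is the mechanism behind os.fsdecode/os.fsencode on POSIX):

```python
# A byte sequence that is not valid UTF-8: \xe9 is Latin-1 "é".
raw = b'caf\xe9'

# surrogateescape smuggles the undecodable byte through as a lone
# surrogate (U+DC00 + byte value) instead of raising UnicodeDecodeError.
s = raw.decode('utf-8', errors='surrogateescape')
assert s == 'caf\udce9'

# Encoding with the same handler reverses the transformation exactly.
assert s.encode('utf-8', errors='surrogateescape') == raw
```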
Posted Jan 14, 2020 4:50 UTC (Tue)
by roc (subscriber, #30627)
[Link] (9 responses)
Posted Jan 14, 2020 10:18 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (8 responses)
Posted Jan 14, 2020 12:02 UTC (Tue)
by roc (subscriber, #30627)
[Link] (7 responses)
Posted Jan 14, 2020 17:02 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
> Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
> [various exceptions]
In this context, "Unicode" means "UTF-16," because Microsoft. A lone surrogate is certainly not a "UTF-16 character." It's half a character.
I don't know if the various *W interfaces actually check for lone surrogates and error out (the documentation for CreateFileW does not explicitly call this case out), but they probably should.
(Microsoft's error checking of filenames is kinda terrible anyway, so I would not be too surprised if you could get lone surrogates through the API. For example, it says that you can't create a file whose name ends in a dot, but if you prefix the path with the \\?\ prefix that they also discuss on the same page, then that check is bypassed. And then your sysadmin has to figure out how to delete the damned thing, because none of the standard tools will even recognize that it exists. See also: https://bugs.python.org/issue22299)
Posted Jan 14, 2020 19:39 UTC (Tue)
by foom (subscriber, #14868)
[Link] (2 responses)
So, yes, you can perfectly well put lone halves of a surrogate pair in windows unicode strings and filesystems.
Posted Jan 14, 2020 21:46 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
> Standalone surrogate code points have either a high surrogate without an adjacent low surrogate, or vice versa. These code points are invalid and are not supported. Their behavior is undefined.
Posted Jan 14, 2020 22:45 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 14, 2020 19:47 UTC (Tue)
by foom (subscriber, #14868)
[Link] (2 responses)
However, you do need to have a utf-16 decoder/encoder variant which allows lone surrogates to be decoded/encoded without throwing an error. Python has the "surrogatepass" error handler for that. E.g.:
b'\x00\xD8'.decode('utf-16le', errors='surrogatepass').encode('utf-16le', errors='surrogatepass')
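Expanded into a runnable check (same codec and error handler as the one-liner above):

```python
# UTF-16-LE encoding of U+D800, a lone high surrogate.
raw = b'\x00\xd8'

# 'surrogatepass' lets the lone surrogate through in both directions.
s = raw.decode('utf-16le', errors='surrogatepass')
assert s == '\ud800'
assert s.encode('utf-16le', errors='surrogatepass') == raw

# The strict codec refuses it.
try:
    raw.decode('utf-16le')
except UnicodeDecodeError:
    print('strict decode rejects lone surrogates')
```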
Posted Jan 14, 2020 21:55 UTC (Tue)
by roc (subscriber, #30627)
[Link] (1 responses)
But do APIs like os.listdir() do that automatically on Windows like they do on Linux?
Posted Jan 15, 2020 4:18 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
> On Windows, if we get a (Unicode) string we extract the wchar_t * and return it; if we get bytes we decode to wchar_t * and return that.
I believe this means that the Windows version of that module will never try to encode Unicode strings into UTF-16LE (or any other encoding), meaning it won't "catch" invalid surrogates. They should pass straight through to the Windows *W APIs.
This is also supported by the os.path docs, which say the following:
> Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.
Since they only call out bytes objects, I think the implication is that str objects are not a problem on Windows. But the fact that it makes no mention of os.fsencode() and os.fsdecode(), nor the surrogateescape handler, makes me suspicious of whether this documentation is still up-to-date.
Posted Jan 14, 2020 20:13 UTC (Tue)
by togga (guest, #53103)
[Link]
The main point here is that scripting/glue languages should not mess with or set requirements on any data. The Python 2 experience (batteries included) was seamless enough that you could overlook the workarounds needed for its limitations (threading, performance, etc.) and some other plain ugly design/behaviour. In Python 3 this is just not the case anymore; the batteries come with constant glitches (a lose-lose situation).
It's awesome, though, what Python has achieved over time, and it may even be a good thing that py2 dies, as it gives an opportunity to migrate to a more modern language for these classes of use-cases/problem domains. Old Python code is relatively easy to interface with from (some) other languages if needed.
Posted Jan 16, 2020 10:55 UTC (Thu)
by jezuch (subscriber, #52988)
[Link] (4 responses)
Posted Jan 16, 2020 22:23 UTC (Thu)
by togga (guest, #53103)
[Link] (3 responses)
decode/encode may be lossless if the input is ASCII (bytes 0-127), but for arbitrary input it's very fragile, and the language is full of "inconsistencies" (aka features), for example:
>>> 'text'[0] == 't'
Posted Jan 16, 2020 22:28 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
How on Earth??
Posted Jan 17, 2020 2:07 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (1 responses)
In Python, the individual elements of a bytes object (a sequence type) are of type int. This makes reasonable sense given what one is likely to want to do with individual elements of a bytes object. Python does not have a char type similar to that in C.
The b"text"[0] == b"t" comparison fails because it is comparing a single element of a bytes object (an int) to another bytes object, albeit one of length 1. This sort of thing probably won't work in Rust, either (it certainly won't in C). You will note that b"text"[0] == b"t"[0] is True, since you're comparing ints.
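A few runnable checks summarising the behaviour described above:

```python
assert 'text'[0] == 't'           # str indexing yields a length-1 str
assert b'text'[0] == 116          # bytes indexing yields an int (ord('t'))
assert not (b'text'[0] == b't')   # int never equals bytes, even of length 1
assert b'text'[0] == b't'[0]      # comparing two ints works
assert b'text'[0:1] == b't'       # slicing keeps the bytes type, so this is True
```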
Posted Jan 18, 2020 19:14 UTC (Sat)
by gracinet (guest, #89400)
[Link]
It's also generally more consistent because one does not expect a character to be of the same type as a string anyway.
In that case, the equivalent for bytes is &[u8] and you don't compare it with u8.
Posted Jan 14, 2020 12:28 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (34 responses)
The approach of assuming the world can be approximated by floating point numbers is flat out wrong. And yet we continue to do it. Unicode is sometimes problematic with filenames, but not in practice for most users. If you have to handle text, it's the only way to do it. Old school text handling usually trashed anything that wasn't in an 8-bit character set, unless the programmer put a lot of work in.
You link to an article that mentions code written in assembly. If you want extreme speed or the tightest of code, why were you writing in Python to begin with?
Posted Jan 14, 2020 13:56 UTC (Tue)
by excors (subscriber, #95769)
[Link] (33 responses)
That's true, but the difference between "most users" and "all users" becomes important when an application grows to have many users. Eventually one of them will try to use your program on a FAT-formatted USB stick with Shift-JIS filenames or whatever, and they will file a bug report when it crashes with a Unicode error. As a responsible programmer you want to make your program work for that user, but Python makes that difficult, and you will get annoyed at fighting with the language.
That experience will encourage you to start your next project in a different language (one whose designers have already considered this problem and solved it properly) so you won't have the same pains if your project becomes popular and needs to handle obscure edge cases. If a number of high-profile projects make the same decision to migrate, that seems likely to significantly damage Python's reputation and popularity.
Posted Jan 14, 2020 15:50 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (32 responses)
I'm sure there are so many more people nowadays who want to take the time and trouble to work with SJIS than there were before Unicode. It wasn't until 1998 that Debian became eight-bit clean, and even then I think Japanese users still carried a bunch of patches around. It will be more practical to spend all that time working on the program than to tell them to set their LC_CTYPE to a SJIS locale. A lot of Python code won't ever have that problem, for two reasons. First, a lot of Python code is for limited use or for use in closed environments. Secondly, it's not like Python 3 autocrashes with non-Unicode filenames. I tried it with:
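[The original snippet was not preserved in this copy of the comment; a minimal reconstruction of the experiment might look like the following, where the filename bytes are our own example, run on a POSIX system.]

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    name = os.path.join(os.fsencode(d), b"testfile-\xff\xff")  # not valid UTF-8
    with open(name, "wb") as f:
        f.write(b"foo")

    # bytes in, bytes out: os.listdir(b"...") returns the raw name unchanged
    assert b"testfile-\xff\xff" in os.listdir(os.fsencode(d))

    # str API: the invalid bytes come back as lone surrogates (PEP 383)
    decoded = os.fsdecode(name)
    assert os.fsencode(decoded) == name   # lossless round-trip

    # and the file can be reopened through either representation
    with open(name, "rb") as f:
        assert f.read() == b"foo"
```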
And it had no problem opening a file with a filename that wasn't Unicode, whether read from the directory or given on the command line. It gave a Unicode error instead of printing out the filename, but like ls, you should probably be escaping a weird filename before dumping it to a terminal or the like.
Posted Jan 14, 2020 22:11 UTC (Tue)
by roc (subscriber, #30627)
[Link] (31 responses)
Who does that escaping, though? What API would you even use to do it? In practice, no-one's going to do it until some user hits a "weird"-but-valid-UTF8 filename, then they're going to hack in some escaping that relies on UTF8 encoding not failing, then maybe one day some user hits a non-Unicode filename and then they hack in more escaping for strings containing lone surrogates.
That's not good if you want your software to be reliable. I don't think it makes sense to expect a dynamically-typed language like Python to be good for writing reliable software, but Python's specific choices make it unnecessarily unreliable.
As an example of a better way to do things: in Rust, the Path type is not a string and does not implement Display; if you want to print one, you call path.display() which does the necessary escaping. (It does not return a string, but does return something that implements Display, i.e. can be printed or converted to a string).
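In Python the escaping has to be done by hand; a rough analogue of path.display() might look like this sketch (the helper name is ours, not a stdlib API):

```python
def display(name: bytes) -> str:
    # Decode with surrogateescape (what os.fsdecode does on POSIX), then
    # make any lone surrogates visible as backslash escapes instead of
    # letting a later encode step raise UnicodeEncodeError.
    s = name.decode("utf-8", "surrogateescape")
    return s.encode("utf-8", "backslashreplace").decode("utf-8")

assert display("café.txt".encode("utf-8")) == "café.txt"   # valid UTF-8 untouched
assert display(b"caf\xe9.txt") == "caf\\udce9.txt"         # Latin-1 byte made visible
```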
Posted Jan 15, 2020 1:01 UTC (Wed)
by roc (subscriber, #30627)
[Link] (9 responses)
Posted Jan 15, 2020 20:27 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (8 responses)
*That* means any form of argument passing from Rust to other software will fail in strange and underwhelming ways.
Posted Jan 15, 2020 20:30 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Jan 16, 2020 9:12 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (5 responses)
Posted Jan 16, 2020 10:28 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (4 responses)
The idea that file names are somehow privileged to not require that went out of the window a long time ago. It doesn't matter one whit whether the code printing said file name is written in Python, Rust, Go, C++, Z80 assembly, or LOLCODE.
If you want a standard way to carry non-UTF8 pseudo-printable data (e.g. Latin1 filenames from the stone ages), no problem, either use the surrogateescape method or do proper shell quoting. The "write the non-UTF8 data" method is fine only when limited to streams that are known to be binary. "find -print0" comes to mind.
Posted Jan 16, 2020 16:21 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Jan 16, 2020 17:43 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (2 responses)
None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard? Yes this could have been easier from the beginning, but that's why Python 3.8 is a whole lot better at this than 3.0.
Posted Jan 16, 2020 17:53 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jan 17, 2020 6:11 UTC (Fri)
by smurf (subscriber, #17840)
[Link]
Posted Jan 16, 2020 10:48 UTC (Thu)
by roc (subscriber, #30627)
[Link]
By far the most common way to pass a path to a program is on its command line. In Rust you pass a command-line argument by calling Command::arg(...) to add the argument to the command you're building, and Command::arg(...) accepts Paths. Each platform has a standard way to pass arbitrary filenames as command-line parameters, and Rust does what the platform requires.
A few programs accept arbitrary paths as strings on stdin; they need to define how those paths are encoded on stdin. On Linux the program could read null-terminated strings of bytes; then in Rust you would use std::os::unix::ffi::OsStrExt::as_bytes() to extract the raw bytes from a Path and write them to the pipe. That code wouldn't even compile on Windows, which makes sense because a null-terminated string of bytes is not a reasonable way to represent a Windows path. A Windows program accepting paths as strings on stdin needs to define a different encoding, e.g. null-terminated strings of WCHAR, in which case the Rust program would use std::os::windows::ffi::OsStrExt::encode_wide() to produce such strings.
Rust makes it about as easy as possible to reliably pass non-UTF8 strings to other programs.
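For comparison, Python can also round-trip a non-UTF-8 name through a child process's argv on POSIX, since subprocess accepts bytes arguments there (the filename bytes below are a made-up example):

```python
import subprocess

name = b"weird-\xff.txt"   # hypothetical filename, not valid UTF-8
# On POSIX, argv entries may be bytes, so the raw name survives intact.
out = subprocess.run(["printf", "%s", name], capture_output=True).stdout
assert out == name
```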
Posted Jan 15, 2020 4:22 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (17 responses)
repr()?
Posted Jan 15, 2020 9:01 UTC (Wed)
by roc (subscriber, #30627)
[Link] (16 responses)
Posted Jan 15, 2020 22:25 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (15 responses)
str.encode(..., errors='replace') # Replaces unencodable characters with '?' (it's bytes.decode(..., errors='replace') that substitutes U+FFFD).
Python is, after all, a "batteries included" language. This is a solved problem.
Posted Jan 15, 2020 23:44 UTC (Wed)
by togga (guest, #53103)
[Link] (13 responses)
Posted Jan 16, 2020 1:17 UTC (Thu)
by roc (subscriber, #30627)
[Link] (10 responses)
Posted Jan 16, 2020 9:17 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (6 responses)
Defining standard ways to process filenames (text) is the whole point of the Unicode standard. Remove standard compliance, and you remove the ability to safely process the result.
Posted Jan 16, 2020 10:51 UTC (Thu)
by roc (subscriber, #30627)
[Link] (5 responses)
Posted Jan 16, 2020 12:46 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (4 responses)
Own up to the things your code does, do not hide behind lack of OS enforcement.
Posted Jan 16, 2020 13:46 UTC (Thu)
by smurf (subscriber, #17840)
[Link]
That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".
Posted Jan 16, 2020 14:54 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
System behaviors are dictated by the platform. Any tools doing an ostrich impression related to "broken" or "malformed" filenames loses a lot of usability in system recovery and introspection.
Posted Jan 16, 2020 15:17 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (1 responses)
This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.
In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).
Posted Jan 16, 2020 15:32 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 19, 2020 10:40 UTC (Sun)
by togga (guest, #53103)
[Link] (2 responses)
Like Python 2 (at least in earlier versions) did: let developers choose which transforms are valid for them, and provide those in the included batteries.
Posted Jan 19, 2020 11:28 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (1 responses)
Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.
It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.
Posted Jan 19, 2020 18:36 UTC (Sun)
by anselm (subscriber, #2796)
[Link]
Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.
The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.
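A small sketch of staying inside pathlib without ever converting names yourself, along the lines suggested above:

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "note.txt"
    p.write_text("hello")                 # open/write without str conversion
    assert p.read_text() == "hello"
    # listing a directory yields more Path objects
    assert [c.name for c in Path(d).iterdir()] == ["note.txt"]
```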
Posted Jan 16, 2020 6:35 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
(Regardless, xmlcharrefreplace and backslashreplace are both *mostly* lossless, and in the appropriate context, can be fully lossless, so long as you escape all ampersands or backslashes respectively. If you are outputting XML or HTML, then of course you have to escape ampersands anyway, which is obviously what xmlcharrefreplace was intended for. Similarly, backslash replacement is not a very sensible thing to do, unless you are working in a context where backslashes are normally escaped.)
* For example, Python's filesystem API calls os.fsencode() and os.fsdecode() automatically to translate between the operating system's preferred type and whatever type the user passes, but you can still call these manually to pass the errors argument or if you decide you actually wanted the other type.
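The os.fsencode()/os.fsdecode() round-trip mentioned above, with a made-up non-UTF-8 name (POSIX, UTF-8 filesystem encoding assumed):

```python
import os

raw = b"photo-\xff.jpg"          # hypothetical name, not valid UTF-8
s = os.fsdecode(raw)             # bytes -> str; invalid bytes become lone surrogates
assert os.fsencode(s) == raw     # and back, losslessly
assert os.fsdecode("photo.jpg") == "photo.jpg"   # str input passes through
```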
Posted Jan 16, 2020 12:51 UTC (Thu)
by excors (subscriber, #95769)
[Link]
Posted Jan 16, 2020 1:16 UTC (Thu)
by roc (subscriber, #30627)
[Link]
Posted Jan 15, 2020 5:54 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (1 responses)
GNU Coreutils does.
> In practice, no-one's going to do it
Well, then, it really does seem like much ado about nothing. In a serious engineered program, you basically never output an unfiltered string; every line has to be marked for translation. This has always included filenames, which have security issues when dumped to a terminal, in Python 2 or 3.
Sure, I'll believe Rust has better ways of doing it. But if you're comparing Python 2 versus Python 3, it's just not that big a difference, either in normal usage or best practices.
Posted Jan 15, 2020 9:03 UTC (Wed)
by roc (subscriber, #30627)
[Link]
In Python, I meant.
> In a serious engineered program, you basically never output an unfiltered string
I guess I've never seen a seriously engineered Python program. I suppose that's not unexpected.
Posted Jan 17, 2020 18:58 UTC (Fri)
by cortana (subscriber, #24596)
[Link]
Posted Jan 14, 2020 22:58 UTC (Tue)
by kjp (guest, #39639)
[Link]
But yeah, I totally agree with this post, but it's even worse: even _after_ you are on python 3, stuff still breaks. All the time. And it gets worse the more dependencies you have. So many things that could be caught by a compiler aren't. And if you try using mypy (which is still alpha), why are you using python in the first place for a large project. The terrible startup time? The terrible module namespacing? The terrible performance and threading?
Posted Jan 17, 2020 8:37 UTC (Fri)
by rhdxmr (guest, #44404)
[Link]
Szorc: Mercurial's Journey to and Reflections on Python 3
Might also enter, but Go is entirely out of scope.
Odd that this would be unique to Rust, and not the plethora of other languages that use LLVM, with no complaints. Or is this something that crops up, no matter the language, and I just haven't heard of the use case yet? Linux & Mac should definitely be well supported. And I assume windows, if Mozilla is using Rust...
Looks like it was not rejected outright, though I'm dubious it would ever be accepted. Maintaining an out-of-tree fork is more work than having it upstream --- while it is out-of-tree, all the maintenance has to be done by the m68k community, but upstreaming the backend would shift a lot of the maintenance costs from the community fans to the rest of the LLVM developers. That is exactly why they would not accept it!
GNAT developers saw the writing on the wall and are working on adding Ada support to LLVM: https://blog.adacore.com/combining-gnat-with-llvm
1) Replacing an m68k machine with something more modern hurts the environment because of the manufacturing impact of the more modern machine.
In practice m68k is so obsolete you could replace it with something much faster and more efficient that someone else was going to throw away.
2) Keeping modern software running on m68k helps keep that software efficient for all users.
In practice I have not seen, and have a hard time imagining, developers saying "we really need to optimize this or it'll be slow on m68k, even though it's fast enough for the rest of our users". To the extent they care about obscure architectures, if it works at all, that's good enough.
> Porting Python apps to Go is currently quite a lot more viable than Rust, which is both handicapped by LLVM's limited platform support and doesn't have any batteries included.
>> I don't know why gcc play along; I suspect it's inertia and a misguided sense of duty.
>> I have more sympathy for potentially relevant new embedded architectures
Python doesn't use UTF-8. Its "Unicode" strings are also not actually guaranteed to be Unicode.
Perhaps Python should have also required rewriting all the code backwards? It would have been just as useful!
https://lwn.net/Articles/325304/ "Wheeler: Fixing Unix/Linux/POSIX Filenames" 2009
So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.
`cp` and many other utilities handle non-Unicode filenames correctly.
We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does.
Like people using ShiftJIS and writing file names in Japanese?
When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?
KOI-8 was the encoding widely used in Linux for Russian language. Win1251 was used in Windows.
Using codepage converters. But it was so bad that by early 2000-s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.
nine@sphinx:~> perl6 -e 'spurt ("testfile".encode ~ Buf.new(0xff, 0xff)).decode("utf8-c8"), "foo"'
nine@sphinx:~> ll testfile*
-rw-r--r-- 1 nine users 6 17. Sep 2014 testfile.latin-1
-rw-r--r-- 1 nine users 3 22. Jän 13:42 testfile??
nine@sphinx:~> perl6 -e 'dir(".").grep(/testfile/).say'
("testfile.latin-1".IO "testfile\xFF\xFF".IO)
nine@sphinx:~> cat testfile??
foo
Python 3.7.3 (default, Apr 09 2019, 05:18:21) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for f in [f for f in os.listdir(".") if "testfile" in f]: print(f)
...
testfile.latin-1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
> foo
$ cat testfile??
foo
Wol
Indeed. Python maintainers decided to pick and choose them. Only good-behaving users who like eating their veggies ( https://snarky.ca/porting-to-python-3-is-like-eating-your... ) were allowed in.
A VCS must be able to round-trip files on the same FS. Even if they are not encoded correctly.
'\ud808\udf45'
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 4: surrogates not allowed
>
> I'm not sure of which problem you're talking about.
Just wait until you see POSIX!
Wol
It actually is not, if you're writing something that is not a Jupyter notebook.
I will. Go is the single worst programming language design that achieved any kind of popularity in the last 10 years at least. It is archaic and outdated in pretty much every imaginable way. It puts stuff into the language that doesn't belong there, like containers and concurrency, and doesn't provide you with the tools that are needed to implement these where they belong, which is in a library. The designers of this programming language are actively pushing us back into the 1970s, and many people appear to be applauding that. It's nothing short of appalling.
It absolutely does not. Concurrency is an ever-evolving, complex topic, and if you bake any particular approach into the language, it's impossible to change it when we discover better ways of doing it. Java tried this and failed miserably (synchronized keyword). Scala didn't put it into the language. Instead, what happened is that people came up with better and better libraries. First you had Scala standard library Futures, which was a vast improvement over anything Java had to offer at the time. But they were inefficient (many context switches), had no way to interrupt a concurrent computation or safely handle resources (open file handles etc.) and made stack traces useless. Over the years, a series of better and better libraries (Monix, cats-effect) were developed, and now the ZIO library solves every single one of these and a bunch more. And you know what? Two years from now, ZIO will be better still, or we'll have a new library that is even better.
You haven't yet demonstrated a single advantage of putting this into the language rather than a library, which is much more flexible and easier to evolve. Your thinking that this needs to be done in the language is probably a result of too much exposure to crippled languages like Python.
NB, Python also has the whole thing in a library. This is not primarily about language features.
It very much is about language features. Python has dedicated language support for list comprehensions, concurrency and error handling. But there is no need for that. Consider these:
x = await getX()
y = await getY(x)
return x + y
[ x + y
for x in getX()
for y in getY(x)
]
The structure is the same: we obtain an x, then we obtain a y that depends on x (expressed by the fact that getY takes x as a parameter), then we return x + y. The details are of course different, because in one case we obtain x from an async task, and in the other we obtain x from a list, but there's nevertheless a common structure. Hence, Scala offers syntax that covers both of these use cases:
for {
x <- getX()
y <- getY(x)
} yield x + y
And this is a much better solution than what Python does, because now you get to write generic code that works in a wide variety of contexts including error handling, concurrency, optionality, nondeterminism, statefulness and many, many others that we can't even imagine today.
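A hedged Python sketch of that shared structure: one helper (the name flat_map is ours, not any library's) gives the same "obtain x, then a y depending on x" shape in two different contexts, lists (nondeterminism) and Optionals (short-circuiting):

```python
def flat_map(m, f):
    if m is None:                 # Optional context: short-circuit
        return None
    if isinstance(m, list):       # List context: nondeterminism
        return [y for x in m for y in f(x)]
    return f(m)                   # a bare value acts as a present Optional

# x <- [1, 2]; y <- [10, 20]; yield x + y
assert flat_map([1, 2], lambda x:
       flat_map([10, 20], lambda y: [x + y])) == [11, 21, 12, 22]

# x <- None; the rest never runs
assert flat_map(None, lambda x: flat_map(3, lambda y: x + y)) is None
```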
Python basically transforms "result = foo(); return result" into what Go would call "err, result = foo(Context); if (err) return err, nil; return nil,result" behind the scenes. (If you also want to handle cancellations, it gets even worse – and handling cancellation is not optional if you want a correct program.) I happen to believe that forcing each and every programmer to explicitly write the latter code instead of the former, for pretty much every function call whatsoever, is an unproductive waste of everybody's time. So don't talk to me about Python being crippled, please.
This is a false dichotomy. Not having error handling built into the language doesn't mean you have to check for errors on every call.
Well, sure, if you have a nice functional language where everything is lazily evaluated then of course you can write generic code that doesn't care whether the evaluation involves a loop or a context switch or whatever.
You don't need lazy evaluation for this to work. Scala is not lazily evaluated and it works great there.
No? then what else do you do? Pass along a Haskell-style "Maybe" or "Either"? that's just error checking by a different name.
You can factor out the error checking code into a function, so you don't need to write it more than once. After all, this is what we do as programmers: we detect common patterns, like “call a function, fail if it failed and proceed if it didn't”, and factor them out into functions. This function is called flatMap in Scala, and it can be used like so:
getX().flatMap { x =>
getY(x).map { y =>
x + y
}
}
But this is arguably hard to read, which is why we have for comprehensions. The following is equivalent to the above code:
for {
x <- getX()
y <- getY(x)
} yield x + y
I would argue that if you write it like this, it is no harder to read than what Python gives you:
x = getX()
y = getY(x)
return x + y
But the Scala version is much more informative. Every function now tells you in its type how it might fail (if at all), which is a huge boon to maintainability. You can also easily see which function calls might return an error, because you use <- instead of = to obtain their result. And it is much more flexible, because it's not limited to error handling but can be used for things like concurrency as well. It's also compositional, meaning that if your function is concurrent and can fail, that works too.
Python doesn't have a problem with resources to be closed (that's what "with foo() as bar"-style context managers are for), nor concurrency (assuming that you use Trio or anyio).
Sure, you can solve every problem that arises from adding magic features to the language by adding yet more magic. First, they added exceptions. But that interacted poorly with resource cleanup, so they added with to fix that. Then they realized that this fix interacts poorly with asynchronous code, and they added async with to cope with that. So yes, you can do it that way, because given enough thrust, pigs fly just fine. But you have yet to demonstrate a single advantage that comes from doing so.
with. For instance, if you want to acquire two resources, do stuff and then release them, you can just nest two with statements. But what if you want to acquire one resource for each element in a list? You can't, because that would require you to nest with statements as many times as there are elements in the list. In Scala with a decent library (ZIO or cats-effect), resources are a Monad, and lists have a traverse method that works with ALL monads, including the one for resources and the one for asynchronous tasks. But while asyncio.gather (which is basically the traverse equivalent for asynchronous code) does exist, there's no such thing in contextlib, which proves my point exactly: you end up with code that is constrained to specific use cases when it could be generic and thus much easier to learn because it works the same for _all_ monads.
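For what it's worth, CPython's contextlib does cover the acquire-one-resource-per-list-element case via ExitStack, though it is special-cased to context managers rather than being a generic traverse. A minimal sketch, where Resource is a made-up stand-in for a real resource:

```python
from contextlib import ExitStack

class Resource:
    """Toy context manager standing in for a real resource (hypothetical)."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False  # don't swallow exceptions

names = ["a", "b", "c"]
with ExitStack() as stack:
    # One enter_context() call per list element; everything entered so far
    # is released in reverse order when the with block exits, even on error.
    resources = [stack.enter_context(Resource(n)) for n in names]
    entered = [r.name for r in resources]
```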
You can *always* write more code to fix any problem. That isn't the issue here, it's about code reuse. ExitStack shouldn't be needed, and neither should AsyncExitStack. These aren't solutions but symptoms.
For the first time, you're actually making an argument for putting the things in the language. But I'm not buying it, because I see how much more stuff I need to learn about in Python that just isn't necessary in fp. There's no ExitStack or AsyncExitStack in ZIO. There's no `with` statement. There's no try/except/finally, there's no ternary operator, no async/await, no assignment expressions, none of that nonsense. It's all just functions and equational reasoning. And equational reasoning is awesome _because_ it is so simple that we can teach it to high school students.
getWeather :: String → DateTime → IO (DnsResponse (TcpSession (HttpResponse (Json Weather))))
where each layer can fail? Of course, there's some leeway in choosing how to layer the types (although handling e.g. out-of memory errors this way would be unreasonable IMO).
There is some truth to this; it would be nice if the compiler were faster. That said, it has become significantly faster over the years and it's not nearly slow enough to make programming in Scala “totally impractical”. And the fact that I was able to name a very simple problem (“make an asynchronous operation interruptible without writing (error-prone) custom code and without leaking resources”) that has a trivial solution with ZIO and no solution at all in Go proves that pure fp has nothing to do with ideology. It solves real-world problems. There's a reason why React took over in the frontend space: it works better than anything else because it's functional.
b'' to say it was a byte string and u'' to say it was a unicode string. Python 3 kept the b'' syntax but initially eliminated u'' for unicode strings, and only brought it back when users complained. That hurt people trying to move from Python 2 to Python 3 without providing any benefit to people starting with Python 3.
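Concretely, the u'' prefix was restored by PEP 414 in Python 3.3, purely as a porting aid; a quick check on any modern Python 3:

```python
# u'' was restored in Python 3.3 (PEP 414) and is a no-op today.
u_equals_plain = (u"abc" == "abc")          # True: same str object semantics
u_is_str = isinstance(u"abc", str)          # True
# bytes remain a separate type and never compare equal to str in Python 3.
b_is_bytes = isinstance(b"abc", bytes)      # True
b_differs = (b"abc" != "abc")               # True
```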
That wouldn't have worked because Python 3.0 lacked many required features, like being able to use format strings with bytes. They were only re-added in Python 3.5 ( https://www.python.org/dev/peps/pep-0461/ ). So for many projects realistic porting could only begin once that trickled down to major distros.
>>> '{}'.format(10)
'10'
>>> b'{}'.format(10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
>>> b'%d' % (55,)
was supported again, but *not* the new and recommended format function.
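What PEP 461 restored in Python 3.5 is exactly the %-operator surface, not .format(); a quick demonstration:

```python
# PEP 461 (Python 3.5) brought %-formatting back for bytes...
percent_int = b"%d" % (55,)                     # b"55"
percent_bytes = b"%s on %s" % (b"hg", b"py3")   # b"hg on py3"
# ...but .format() remains str-only: bytes has no such method.
bytes_has_format = hasattr(bytes, "format")     # False
```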
.format was never "recommended" on bytestrings; in fact, it was initially proposed for Python 3. Neither was %, but lots of older code uses it in contexts which end up byte-ish when you migrate to Py3. That usage never was prevalent for .format, so why should the Python devs incur the additional complexity of adding it to bytes?
* allowing byte strings as attributes
* being consistent with types and syntax for byte strings and strings
* being consistent with format options for strings and byte strings
* etc.
Mostly due to library APIs requiring attributes for many things. This is a big source of py3 encode/decode errors.
This is what attributes do:
>>> class A: pass
...
>>> a = A()
>>> setattr(a, 'b', 22)
>>> setattr(a, b'c', 12)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: attribute name must be string, not 'bytes'
>>> a.__dict__
{'b': 22}
>>> type(a.__dict__)
<class 'dict'>
>>> a.__dict__[b'c']=12
>>> a.__dict__
{'b': 22, b'c': 12}
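Attribute names aren't the only place where str is mandatory; keyword arguments have the same restriction, which bites code that carries bytes keys around:

```python
def f(**kwargs):
    return kwargs

# str keys are accepted as keyword arguments...
ok = f(**{"b": 22})
# ...but bytes keys are rejected at call time with a TypeError.
try:
    f(**{b"c": 12})
    bytes_key_rejected = False
except TypeError:
    bytes_key_rejected = True
```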
> This is what attributes does:
Also, it's not just bytes; arbitrary strings frequently contain hyphens, dots, or even start with digits.
Just in case anybody needed yet more evidence that dynamic typing pretty much invariably leads to an unmaintainable mess…
That's not quite true. Java 9 introduced modules which many projects are cheerfully ignoring. But most of Java 8 code works just fine in Java 9, without being module-aware.
The thing is, it's easy to have a library targeting JDK 8 work on JDK 11. I have several packages that are doing that. You basically need to refrain from using JDK>8 features and you'll be fine.
Just use https://aws.amazon.com/corretto/ , it'll be supported for a loooong time.
It did. I can run JDK 8 code in JDK 11 without any modifications, mixing and matching it freely with newer versions.
Yes, they did. The string type was fundamentally changed, along with a significant chunk of the API.
Well, that might be because they suck. They've been working on this stuff for years and yet it's still not possible to have multiple versions of the same library in a single program. So if you want to use two different libraries that both depend on a third library but in different versions, you lose. Unless of course you use OSGi which has been around for, what, 20 years now and already solved this problem when it first came out.
Indeed they do.
Yes, I thought that arguments could convince someone to change their opinion, silly me. Apparently this is more of a religious thing for you...
When you use non-unicode file names across different file systems or computers (or even OSes), all hell may break loose. I always said there be dragons when using non-ASCII file names, but in stark contrast to my 8.3 filename times, I nowadays marvel at how BMP characters tend to work for file names even across different locales and OSes...
(Disclaimer: I never wrote Rust code but use Python 3 in a few projects.)
* it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality
* Python 3 introduces a ton of problems and doesn't really solve many
* Python's pendulum has swung too far towards Unicode only.
>>> "text"[0] == 't'
True
>>> b"text"[0] == b't'
False
> False
WHAT?
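The surprise comes from Python 3's indexing semantics: indexing a bytes object yields the integer byte value, while slicing preserves the bytes type:

```python
s = b"text"
# Indexing a bytes object returns an int (the byte value)...
index_result = s[0]          # 116, i.e. ord('t')
# ...so it can never compare equal to the one-byte bytes literal b't'.
equals_bytes_literal = (s[0] == b"t")   # False
# Slicing, by contrast, returns a length-1 bytes object.
slice_result = s[0:1]        # b"t"
```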
Something similar would happen with &str and char.
#!/usr/bin/python3
import os
import sys

if len(sys.argv) > 1:
    x = sys.argv[1]
else:
    x = os.listdir('.')[0]

with open(x, "r") as f:
    read_data = f.read()
print(read_data)
And what if you need to write a transparent proxy that needs to cope with non-UTF-8 headers?
The reality (that stubborn thing that doesn't go away) has agents that don't obey the standard. So a transparent proxy must accommodate it.
str.encode(..., errors='ignore') # Silently drops bad chars.
str.encode(..., errors='xmlcharrefreplace') # Replaces with XML &-encoding
str.encode(..., errors='backslashreplace') # Replaces with \u.... syntax.
Or write your own and hook it into the standard system with https://docs.python.org/3.8/library/codecs.html#codecs.re...
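The three built-in handlers listed above behave as follows on a string with a non-ASCII character ('é', U+00E9):

```python
s = "h\u00e9llo"  # "héllo"
# 'ignore' silently drops the undecodable character.
ignored = s.encode("ascii", errors="ignore")
# 'xmlcharrefreplace' substitutes a decimal XML character reference.
xml = s.encode("ascii", errors="xmlcharrefreplace")
# 'backslashreplace' substitutes a Python-style \xNN escape.
backslashed = s.encode("ascii", errors="backslashreplace")
```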
No, but if `ls` were written in Python (2 or 3), I wouldn't want it to be unable to list files created by programs that do treat filenames as a bag of bytes; that would make such filenames invisible to common tools.
We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.
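For reference, the surrogateescape handler round-trips arbitrary bytes through str losslessly, at the cost of smuggling undecodable bytes as lone surrogates:

```python
raw = b"caf\xe9"  # 0xE9 is not valid UTF-8 in this position
name = raw.decode("utf-8", errors="surrogateescape")
# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9...
decoded_matches = (name == "caf\udce9")
# ...and encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", errors="surrogateescape")
```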
> GNU Coreutils does.