Watson: Launchpad now runs on Python 3 [LWN.net]

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 5:23 UTC (Tue) by swilmet (subscriber, #98424) [Link] (34 responses)

Some text editors classify Python as a "Scripting language", alongside Shell and Perl, in their menus to choose the syntax highlighting.

For writing large programs, Python (or any other interpreted language) is a huge mistake in my opinion.

Of course this has already been debated many times…

What really helps is: a good compiler, plus a syntax where it's trivial to know the type of any variable (for both humans and the compiler).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 8:43 UTC (Tue) by jafd (subscriber, #129642) [Link] (5 responses)

How does knowing the type of every variable help exactly when the syntax suddenly changes under you (in a way that things now mean other, incompatible things), or when the types of the same name change semantics slightly but also incompatibly?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 9:42 UTC (Tue) by epa (subscriber, #39769) [Link] (4 responses)

Whatever the merits of static typing in general, I think that in the particular case of disentangling a 'string of bytes' from a 'string of characters' it would have helped a great deal. Cases where you pass one but the other is required will be spotted at compile time. And it could even have allowed a soft changeover, where you gradually migrate your code from the old 'string of whatever' type to one of the two more specific variants.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:02 UTC (Tue) by jafd (subscriber, #129642) [Link] (3 responses)

Well, you could never have two Python runtimes coexisting in one Python process.

Theoretically, you could have had them, it would not be easy, true, but the developers didn't bother at all. Better be "courageous", as they say at one fruit company, and waste 10 years on migration, then just shoot Python 2 in the head and make sure no one ever picks it up, with trademark infringement threats and all.

This is not cool in the least.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:12 UTC (Tue) by niner (subscriber, #26151) [Link] (2 responses)

Coexisting in the same process would not even have been necessary. Interpreters could have run in different processes communicating with each other. This could have worked and would have made migrating tremendously easier. Of course it wouldn't have been good for performance critical code, but the majority would have been helped.

But then, Python has never been optimized for its users but instead been optimized for the developers of Python compilers and runtimes. E.g. maintainability of the runtime's code has always been preferred over performance. The reason why one could not name a method "not" or "and" was not that this would have conflicted with builtin syntax. The reason was that prohibiting those made writing the parser easier. And this is not conjecture. It was even documented officially.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 9:10 UTC (Wed) by ehiggs (subscriber, #90713) [Link] (1 responses)

> Coexisting in the same process would not even have been necessary.

I think it would. Otherwise how would you import python2 code from python3?

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 9:27 UTC (Thu) by niner (subscriber, #26151) [Link]

Short answer: proxies all the way.

Like Inline::Python in Perl https://metacpan.org/dist/Inline-Python/view/Python.pod or Inline::Perl5 in Raku: https://github.com/niner/Inline-Perl5
Disclaimer: I'm the maintainer of both and the author of the latter.

Once you realize that inter-operation between languages (and Python 2 and 3 can be considered different languages) is a communications problem, it quickly follows that distance is only a factor for performance, not for the possibilities. Being in the same process just makes it faster, but everything you do can be done just as well with multiple processes and even over the network. That's why I mentioned performance.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 9:29 UTC (Tue) by rsidd (subscriber, #2582) [Link] (17 responses)

Compiled languages are better if raw execution speed is necessary. Many tasks do not require that. Examples: web frameworks (eg Django), front-end interfaces (eg Keras), and, yes, things like Launchpad. Python is a great choice for those.

The issue is not that it is an interpreted language, it is the incompatible change in language. This happened previously with C++ and other compiled languages too.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 10:27 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (4 responses)

Changes in C++ were trivial to fix and were not affected the majority of the existing code. Moreover, it was possible to make the code forward compatible by some renames and #if checks. I have a C++ source from 1997 that still compiles as is albeit with warnings and then runs.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:14 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

Sadly that was different world where C++ developers still cared about users.

Today they adopted schizophreniac stance that it's Ok to break perfectly working programs without warnings but it's a big no-no to make them non-compileable.

Thus you get almost the same issue as with Python 3: after each update you hit some random issues which you are supposed to fix without any guidance from the compiler, because, essentially “according to our lawyers this code was always incorrect and worked for 30 years on billions computers by pure accident”.

Watson: Launchpad now runs on Python 3

Posted Aug 12, 2021 17:21 UTC (Thu) by codewiz (subscriber, #63050) [Link]

Upgrading from C++11 to C++14 to C++17 to C++20 has always been a very smooth experience for me, and I maintain very large code bases. In most cases, it's just a matter of changing the compiler flag and recompile with no changes to the code at all.

Features that have been deprecated and later removed from C++, such as trigraphs and auto_ptr, were never very popular in the first place.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:36 UTC (Wed) by gdt (subscriber, #6284) [Link] (1 responses)

I've had to maintain a 43KLOC program in C++ from 1997 and this has not been my experience. Recently, I was bitten again when Fedora moved to GCC 11. The program compiles but then segfaults due to a change in C++ semantics between 1997 and now. In that respect my experience isn't far from if the program had been written in an interpreted language.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:20 UTC (Wed) by pizza (subscriber, #46) [Link]

Question -- did your program specify which C++ version (ie --std=c98 or whatever) it needed or did it it rely on the compiler implicitly choosing the one your program needed?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:09 UTC (Tue) by HelloWorld (guest, #56129) [Link] (11 responses)

> Compiled languages are better if raw execution speed is necessary. Many tasks do not require that. Examples: web frameworks (eg Django), front-end interfaces (eg Keras), and, yes, things like Launchpad. Python is a great choice for those.

Compilation vs. interpretation are implementation techniques, and not properties of the languages themselves. Virtually all language implementations use compilation these days, including CPython, which will internally compile your source code into bytecode. It can also write that bytecode to disk (e. g. using the py_compile module).

Presumably what you meant by compiled languages is statically typed languages, and you suggest that statically typed languages trade programmer productivity for performance while dynamically typed languages do the reverse. This is a myth that is based on outdated and underpowered type systems such as those found in languages like C. Modern, statically typed languages are superior in basically every way, and for several different reasons. Static types are a form of machine-checked documentation. This helps human programmers, who now know precisely what they can pass to a function and what they can do with the function's return value. But it also helps when creating tools, such as compilers or editors. These can now point out problems with your code and even make suggestions to fix them. And this is why people bring up static typing when confronted with incompatible changes: they're just much, much easier to deal with when the compiler tells you exactly which pieces of the code need to be fixed. Scala 3 was released not long ago, and we can already see how the migration is going much smoother than the Python 3 migration did.

There may be good reasons to use Python (such as the large library ecosystem), but dynamic typing isn't one of them. Even the Python community is starting to recognize this with the support for type hints and tools like mypy. And this isn't limited to Python, every dynamic language is doing it: Ruby is growing support for type signatures: https://github.com/ruby/rbs. TypeScript as a type-safe JavaScript variant is gaining more and more traction. It's pretty unambiguous at this point. “dynamic typing” is a failed experiment.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:53 UTC (Tue) by rsidd (subscriber, #2582) [Link] (10 responses)

> Presumably what you meant by compiled languages is statically typed languages, and you suggest that statically typed languages trade programmer productivity for performance while dynamically typed languages do the reverse.

No, by compiled languages I meant languages that compile to machine code. I use Julia and know how good it is at type inference, how readily it catches certain bugs at compile time, and how fast it is. It is also as easy to write as python. But that doesn't mean one should use Julia everywhere.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:15 UTC (Tue) by HelloWorld (guest, #56129) [Link] (9 responses)

> No, by compiled languages I meant languages that compile to machine code.
Languages don't compile to machine code, language implementations do.

> I use Julia and know how good it is at type inference, how readily it catches certain bugs at compile time, and how fast it is. It is also as easy to write as python. But that doesn't mean one should use Julia everywhere.
And yet you haven't named a single reason why Python would be preferrable. Again, there may be such reasons, but it's unlikely that they have anything to do with the fact that CPython doesn't compile to machine code.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 22:11 UTC (Tue) by rschroev (subscriber, #4164) [Link] (8 responses)

> > No, by compiled languages I meant languages that compile to machine code.
> Languages don't compile to machine code, language implementations do.

In theory, yes. In practice, not so much. Python, for example, makes it very hard to compile to machine code, because of its extremely dynamic nature. Have you ever looked at the byte code that CPython compiles to? It's still very dynamic. Its BINARY_ADD instruction to name a simple example necessarily does dynamic dispatch resulting in arbitrarily complex operations. Yes, it's compiled, but it's worlds apart from "lea eax, [rdi + rsi]".

But Python is not CPython ... in theory. When people speak about Python, they almost always refer to CPython. www.python.org is all about CPython. There are other implementations, but they are hopelessly outdated, or are not complete, or have poor support for external libraries. PyPy is the exception: only slightly behind CPython, mostly compatible. None of those compile Python to machine code (some of the implementations use a JIT compiler, which is not the same thing). In practice, Python is almost always CPython and other implementations don't (yet?) bring all that much difference to the table conceptually.

Consider the difference with languages like C or C++. There exist interpreters for both, both those are pretty exotic. The normal way of using C or C++ is with a compiler that compiles to machine code.

Language design has a profound effect on the feasibility of compiling to machine code vs byte code.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 22:42 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> But Python is not CPython ... in theory. When people speak about Python, they almost always refer to CPython. www.python.org is all about CPython. There are other implementations, but they are hopelessly outdated, or are not complete, or have poor support for external libraries. PyPy is the exception: only slightly behind CPython, mostly compatible. None of those compile Python to machine code (some of the implementations use a JIT compiler, which is not the same thing). In practice, Python is almost always CPython and other implementations don't (yet?) bring all that much difference to the table conceptually.

Well, there's also Cython, which can transpile nearly any Python into C. But it has its own drawbacks, since what it really "wants to" do is let you write code in a mixture of Python and Python-with-C-type-annotations, and selectively speed up the latter on a line-by-line basis. If you try to read the C that it produces, you will find that it's almost completely impenetrable (each line of pure Python translates into 4-5 lines of calls to the C/Python API, error handling, dispatch, etc., and the whole thing also has a huge wall of #defines and other preprocessor garbage). So in practice, it's more like compiling to machine code, but in two steps.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 11:20 UTC (Wed) by HelloWorld (guest, #56129) [Link] (6 responses)

What you're describing doesn't make the language more difficult to compile, it just makes it hard to run efficiently, and that is true regardless of whether compilation or interpretation is used. That doesn't change the fact that the latter in an implementation detail and not a property of the language per se.

And besides, you've missed the point of the thread. rsidd implied in his original post that Python had some advantage over other languages due to the fact that the most commonly used implementation is an interpreter, and I have yet to see any evidence for that claim. Even more confusingly, he even stated that Julia is faster, just as easy to write and catches more bugs at compile time, so at that point one has to ask what advantages Python as a language could possibly bring to the table, and how they relate to the fact that CPython doesn't compile to machine code (and obviously my presumption is that they don't, but I'd like to hear rsidd's arguments).

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:18 UTC (Wed) by rsidd (subscriber, #2582) [Link] (4 responses)

> one has to ask what advantages Python as a language could possibly bring to the table

Python is a good language, with a large ecosystem, and wide expertise. Those things matter. Julia's a very good language and its ecosystem is getting there. Python currently has one other advantage over Julia: the latter has very large start-up time (reduced recently but still significant) because of JIT compilation. Great throughput, terrible latency.

If Julia can get over the JIT delays, and especially if it can produce machine-executable binaries, I don't see why it can't be used for large projects (generally, not just in science) instead of python. Especially given its ability to import python packages (increasingly not needed, IMO).

But, even in that case, it makes no sense to rewrite an existing large python codebase, from a time when Julia didn't exist, in Julia.

Many physicists still use Fortran 77. The code exists and it works.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:32 UTC (Wed) by HelloWorld (guest, #56129) [Link] (3 responses)

None of what you mentioned has anything to do with your previous point about “compiled languages”. “Good language” is entirely subjective, large ecosystem and wide expertise has nothing to do with whether it's compiled or interpreted. Large start-up time is also not an issue with compiled languages in general. In fact, it's usually the other way around. So yeah, your point about “compiled languages”: I'm not seeing it. It's a mostly irrelevant implementation detail.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 16:12 UTC (Wed) by rsidd (subscriber, #2582) [Link] (2 responses)

At the time python started becoming popular, the options were:

C/C++ (compiled)
Perl (unsuitable for large projects)
and a bunch of niche languages.

As I said, things are different now, Julia may be a great choice for a new project, so (more likely, for general purpose projects) may Go or Rust or Scala.

"Good language" is not "entirely subjective" any more than taste in music is "entirely subjective". Experts agree that Bach and Beethoven were great.

Paul Graham wrote this about Python in 2004. It is largely true 17 years later, even though the entire landscape has changed. Python may not necessarily be the best choice today for new projects, but it's still a very good choice if they aren't CPU-bound.

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 9:22 UTC (Thu) by niner (subscriber, #26151) [Link] (1 responses)

Everything Paul Graham says about Python in your linked article equally applies to Perl. Indeed, he even mentions Perl and that his main point applies to it. Yet you claim that Perl is unsuitable for large projects while claiming that Python is somehow objectively a good language. Looks to me like you first picked your conclusion and then looked for arguments supporting this conclusion rather than following the data.

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 9:36 UTC (Thu) by rsidd (subscriber, #2582) [Link]

Yes, he likes perl except that it looks like a "cartoon character swearing". Superficially but not fundamentally ugly. But he likes python better. So did many, many others. Perl basically fell out of favour because python was easier and better for almost everything.

Watson: Launchpad now runs on Python 3

Posted Aug 17, 2021 17:57 UTC (Tue) by JanC_ (guest, #34940) [Link]

Launchpad development started in 2004, Julia didn’t exist until 2012, so all this theorizing about whether Launchpad should have been written in Julia is pointless…

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 11:57 UTC (Tue) by Paf (subscriber, #91811) [Link] (2 responses)

Interpretation and weak/aggressively dynamic typing do not have to go together.

Microsoft’s power shell shows how easy it is to have *optional* type specification in an interpreted language with strong type inference. You literally just stick a type on a variable and it’s enforced with type errors. Or don’t do it and quickly write your easy script.

Because, yes: unresrictable dynamic typing is bad for large programs. I actually don’t think it’s much debated any more - it’s a positive for smaller stuff and scripts and a negative for lager stuff. It’s not a fatal negative, but it’s absolutely a negative.

Watson: Launchpad now runs on Python 3

Posted Aug 9, 2021 11:54 UTC (Mon) by swilmet (subscriber, #98424) [Link] (1 responses)

Good to hear that it's evolving.

A compiler (or a static-analysis tool) permits to discover lots of problems in the code without the need to run the specific code path to trigger the error that would be required with an interpreter. In case of a migration to a new major version of the language or a library, it's up to the language/library's developers to trigger compilation errors with not-yet-migrated code.

And explicitly writing types alongside variable declarations, it's primarily useful for us humans, to have a better code understanding (what does this variable represent? What functions can I call on it?). It seems that all modern languages present type inference as a good thing (e.g., Rust), where it's possible to omit type information in many places…

Anyway, I wrote the following essay some time ago, with solutions for code migration (see section 3.3: Build system support to have a smooth repercussion of an API break):

A DAG of components – for an internal architecture too

Watson: Launchpad now runs on Python 3

Posted Aug 10, 2021 10:19 UTC (Tue) by HelloWorld (guest, #56129) [Link]

> And explicitly writing types alongside variable declarations, it's primarily useful for us humans, to have a better code understanding (what does this variable represent? What functions can I call on it?). It seems that all modern languages present type inference as a good thing (e.g., Rust), where it's possible to omit type information in many places…
Type inference is a good thing, because annotating each and every variable just gets tedious, especially in lambda expressions. And yes, it's very valuable to know what functions you can call on a variable, but that doesn't preclude type inference, it just means you need a decent editor to display the inferred types as you're editing the source code. And you know what? If that means people are going to have a hard time using vi on a VT100 to edit their source code, that's fine. I'm sick and tired of being told that I can't have nice things because other people insist on using 1985-style tooling.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:25 UTC (Tue) by juliank (guest, #45896) [Link] (6 responses)

Just statically type your Python. mypy --strict is great

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:00 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (5 responses)

It is indeed great, but bear in mind that it has a runtime cost. Because you end up having to import a lot of stuff just for the type, and that slows things down.

Types are mostly not used at runtime, so at runtime they make stuff slower.

Of course people (including me) started to use types at runtime too, so the plan to make them into strings didn't really work out and is to be discussed more before changes being made. (https://lwn.net/Articles/858576/)

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 18:08 UTC (Tue) by juliank (guest, #45896) [Link] (1 responses)

I somewhat disagree with that assessment.

Import time is only relevant for short-lived scripts, not the large applications that swilmet said Python is unsuited for.

For big applications, and especially web applications not launched by CGI (but WSGI or FastCGI or just doing HTTP themselves) or other daemons, imports are a non-issue.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 18:51 UTC (Wed) by LtWorf (subscriber, #124958) [Link]

Short lived applications can also be very complicated and benefit from using mypy...

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 11:05 UTC (Thu) by sammythesnake (guest, #17693) [Link] (2 responses)

I happen to be up to my eyeballs in a hobby project in python where I'm trying to be really disciplined about type hints. Sadly, mypy has a lot of false positives so it's not as clean an aspiration as I'd like. The options to turn off specific checks mean the risk of missing real positives, so that isn't a silver bullet, either...

The easy way around this particular problem, however, is to do something like this:

]from typing import TYPE_CHECKING
]if TYPE_CHECKING:
] from typing import Any, Optional, Mapping # etc.

This way, lots of imports can be completely skipped at runtime.

Personally, I wish there were more runtime type checking support, such as being able to do things like:

]if not isinstance (foo, Mapping[int, str]):
] raise Type error(" Bad foo!")

(At runtime, isinstance(foo, Mapping) has to suffice)

Better still, not have to add that boilerplate in the first place (perhaps with a @validate_parameter_types decorator to activate it where appropriate) I could write that decorator, but I'd be reimplementing a bunch of esoteric stuff that isn't really the meat of my current project...

Given the choice, I'd prefer to have comprehensive type safety before runtime (running mypy or the like as a faux "compilation" step would be sufficient of the checks were solid enough) Catching (and fixing) unexpected type problems at runtime in embedded code is troublesome!

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 11:22 UTC (Thu) by juliank (guest, #45896) [Link]

Well, I just do "import typing" and write "typing." in the couple of places it's needed.

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 15:55 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

I happen to have written a thing called typedload.

You can do like

typedload.load([1,2,3,4,5], List[int])

and it will return a List[int] or raise an exception.

It has a few options about casting and so.

load(["1",1], Tuple[int, int])

will return (1,1), but if casting is not wanted basiccast=False can be passed.

It doesn't support generic mapping, it must be a dict or frozendict as output.

load({1: 'a', 2: 'b', 3: 3}, Dict[int, Union[str, int]])

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 10:54 UTC (Tue) by scientes (guest, #83068) [Link] (40 responses)

What Python 3 does with NOT just using UTF-8 is, frankly, stupid.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 11:24 UTC (Tue) by Wol (subscriber, #4433) [Link]

What do you mean?

And what if the user does not want a UTF-8 string?

Cheers,
Wol

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 11:42 UTC (Tue) by willy (subscriber, #9762) [Link] (36 responses)

UTF8 is a (and clearly the best) interchange format. It's not clearly the winner if, eg, you're writing a word processor. A wchar_t in-memory format is probably best for that case.

I do not presume to tell the Python people how they should represent strings in memory.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 12:44 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (34 responses)

What kind of advantages do you see in using wchat_t over UTF-8 encoded, char-based strings in memory?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 13:31 UTC (Tue) by dskoll (subscriber, #1630) [Link] (22 responses)

Having each character the same size makes a lot of things much simpler, such as finding the length of a string (in characters, not bytes), indexing to a specific string position, copying substrings around, etc. UTF-8 is a great interchange format, but not the best internal representation format.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 13:45 UTC (Tue) by Sesse (subscriber, #53779) [Link] (3 responses)

ICU uses UTF-16, are you going to claim the Unicode consortium is incompetent as well? :-) The length of a string in code points is nearly never interesting in a system designed for Unicode, and neither is indexing a given numbered code point. What you want is either the size in bytes, the length in number of grapheme clusters, or the width in pixels (given a font). Copying substrings around is no harder in UTF-8 than in UCS-4; you need to be equally careful about not breaking grapheme clusters for both.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:23 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> ICU uses UTF-16, are you going to claim the Unicode consortium is incompetent as well? :-)

They are using UTF-16 because of backward compatibility requirements. When Unicode 1.0 was developed and Windows API was developed and Java was developed UCS-2 (not UTF-16!) made sense.

It stopped making sense in 1996, when Unicode 2.0 was released, but it took industry decades to finally switch to UTF-8.

But python's strings were broken in 2008, by that time choosing anything else but UTF-8 if you weren't forced by compatibility reasons was just stupid.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:49 UTC (Tue) by Sesse (subscriber, #53779) [Link] (1 responses)

> They are using UTF-16 because of backward compatibility requirements.

Citation needed. You can compile it for UTF-8, so clearly they are not API-constrained.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:00 UTC (Tue) by khim (subscriber, #9252) [Link]

Sigh. Can you open Wikipedia and read? It's almost in the beginning: Java internationalization classes were then ported to C++ and C as part of a library known as ICU4C ("ICU for C").

Java was developed before Unicode 2.0 and first version was even released before Unicode 2.0. Unicode 1.0 created huge mess with CJK specifically to try to fit into 2 bytes. And if you have everything as two bytes then yes, it simplifies things and memory waste is not that big (you couldn't fit into one byte anyway and even with UTF-8 lots of characters require 2 bytes or more).

When Java classes were ported to C++ they, naturally, retained UCS-2 representation — even if it was obvious by then that that's not the best choice… they were, essentially, stuck because they wanted to do what Java did. And what Windows NT did. And what browsers did, too.

So yes, there absolutely were API-constrained.

True, they could have introduced translation layers, but the more they dealt with UTF-16 the harder it was to switch.

In some alternate world, where Windows would have won on servers, not just on client, this would have been the end of the story, but in our world Windows lost to BSD and Linux on servers. And these were UTF-8 based thus eventually ICU4C got UTF-8 mode, too.

But objectively speaking if you are not API-bound, UTF-8 is the best option today: UCS-4 overhead is just too big (CPUs are fast today, memory is slow, don't forget) and UTF-16, essentially, combines issues from UTF-8 and issues from UCS-4 (too much overhead for ASCII-mostly texts like XML and you still have to deal with surrogate pairs and all that mess).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 14:03 UTC (Tue) by excors (subscriber, #95769) [Link] (10 responses)

> Having each character the same size makes a lot of things much simpler, such as finding the length of a string (in characters, not bytes), indexing to a specific string position, copying substrings around, etc.

That depends on what you mean by "character". If you're dealing with human-readable text, and you want to extract substrings that look like sensible fragments of the original string, you probably need to split on grapheme cluster boundaries. If you just split by code point, "🤦🏼‍♂️ " might turn into the two substrings " 🤦🏼" and " ‍♂️ " which (depending on font) looks very different and will confuse the user. (Example from https://hsivonen.fi/string-length/)

Grapheme clusters can contain an arbitrary number of code points, so you can't have a fixed-sized representation of them. And once you're forced to use a variable-sized representation, there's no extra harm in using a variable number of UTF-8 code units rather than a variable number of UTF-32 code units.

Are there many situations where splitting by code point is an actually useful thing to do, apart from situations where you really ought to split by grapheme cluster and splitting by code point is wrong and buggy but it works well enough for typical Western text that you don't care enough to implement it properly? (The latter seems very common, but is not something that should be encouraged, for the same reason we now encourage people to write Unicode-aware code instead of pretending everything is ASCII and hoping it won't mangle UTF-8 inputs too badly. I find it hard to think of many valid use cases for splitting by code point, so that doesn't feel like something that programming languages should be optimised for.)

Brilliant explanation

Posted Aug 3, 2021 15:17 UTC (Tue) by HelloWorld (guest, #56129) [Link] (8 responses)

Unfortunately almost no programming language gets this right, one exception being Raku.

Brilliant explanation

Posted Aug 3, 2021 18:34 UTC (Tue) by excors (subscriber, #95769) [Link] (7 responses)

> Unfortunately almost no programming language gets this right, one exception being Raku.

Hmm, I wondered how Raku compares to Swift. Swift seems relatively simple - it uses UTF-8 internally (since Swift 5: https://swift.org/blog/utf8-string/) while the API is based around grapheme clusters (https://docs.swift.org/swift-book/LanguageGuide/StringsAn...), and you can't use integer indexes into strings (though you can construct an Index type with an "offsetBy" integer, with O(n) cost), and String.count is O(n). Strings are not stored or processed in normalised form, though comparison operators are based on canonical representation.

(https://hsivonen.fi/string-length/ explains some drawbacks with that: "🤦🏼‍♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database. That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)

I can't find any detailed documentation of Raku's approach, but it sounds like a similar grapheme-centric API to Swift, except that it allows random access to graphemes and the operations are O(1). The representation is basically UTF-32, but first it converts all strings to NFC, and then for every grapheme cluster that's >1 code point it dynamically allocates a new integer (outside of Unicode's 21-bit code point range) and adds that to a global dictionary, so the string is stored as a sequence of 32-bit integers that are either single-grapheme code points or indexes into the grapheme dictionary.

That sounds interesting, but kind of problematic (if I'm understanding it correctly). A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings. And if you try to write a carefully RAM-limited DOS-resistant Raku network server that only accepts small strings from each client, an attacker could just send lots of small strings with unique grapheme combinations, until eventually your server is bloated with many gigabytes of grapheme dictionary (because MoarVM doesn't attempt to garbage-collect old unused graphemes). Then there's other problems like being unable to roundtrip Unicode text (it will all get converted to NFC), and worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses), and worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8). And it's not supported by the JVM backend for Raku.

hsivonen argues that random access to code points is almost never actually useful, and I think the same applies to grapheme clusters. You only need iterators. Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.

Brilliant explanation

Posted Aug 4, 2021 8:10 UTC (Wed) by niner (subscriber, #26151) [Link] (3 responses)

All in all your analysis of Raku and MoarVM is pretty good. Just a small addition: MoarVM doesn't actually force all strings into the 32 bit representation. Pure ASCII strings for example remain in their 7 bit version. Bugs like the potential overflow of synthetics id space will of course be addressed in time.

Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.

Brilliant explanation

Posted Aug 4, 2021 11:36 UTC (Wed) by excors (subscriber, #95769) [Link] (2 responses)

> Bugs like the potential overflow of synthetics id space will of course be addressed in time.

Are there ideas on how that would be addressed? 2^32 graphemes seems like an unfortunately low limit that might be hit under plausible use cases, now that we've got enough RAM for programs to easily slurp a multi-gigabyte file into memory, so it seems bad if the language can't handle those strings correctly. But I don't see how it can handle those strings without changing the basic NFG idea of representing each grapheme with a unique 32-bit integer.

> Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.

You shouldn't need to re-iterate forwards through the whole string to jump backwards, you can just iterate backwards over code points from the current position until you find the next grapheme cluster boundary. (Unicode says the grapheme cluster rules can be implemented as a DFA (https://unicode.org/reports/tr29/#State_Machines), and it's always possible to reverse a DFA. That's defined over code points but it's easy to layer it over a backwards UTF-8 byte iterator.)

But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later? If you're doing grapheme-based iteration over the string, you don't need to push grapheme indexes onto the stack, you can just push an opaque iterator corresponding to the current position (which may be implemented internally as a byte index so it's trivial to dereference the iterator, and the implementation can guarantee it's always the index of the first byte of a grapheme). So there's still no need for random access or even for reverse iterators.

Brilliant explanation

Posted Aug 4, 2021 12:32 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later?

So you've just confirmed the previous poster's point! To do this, you need random byte-count access into the string - which is exactly what is allegedly impossible (or not necessary, I think we're all losing track of the arguments :-)

Cheers,
Wol

Brilliant explanation

Posted Aug 4, 2021 13:48 UTC (Wed) by excors (subscriber, #95769) [Link]

No, in this context "random access" is about whether you can access the Nth element (grapheme cluster, code point, code unit, whatever) in O(1) time for any N.

An array of code points (as used by Python) supports random access to code points but not to grapheme clusters. Raku's NFG supports random access to grapheme clusters but not to code points. An array of UTF-8 bytes (as used by Rust and Swift) or an array of UTF-16 code units (as used by Java) support neither.

In all of them, given an iterator (i.e. a pointer to an existing element) you can move forward/backward by one element in O(1) time, and you can save the iterator in some other data structure (like a stack) and go back to it later.

I think the interesting question around Raku is whether random access to grapheme clusters is actually useful in practice, or whether iterators are almost always sufficient (in which case you can implement strings as UTF-8 and avoid the complexity/performance sacrifices that Raku makes).

Have you misunderstood NFG?

Posted Aug 6, 2021 0:09 UTC (Fri) by raiph (guest, #89283) [Link] (2 responses)

I think you've misunderstood NFG. But I'm not an expert so hope you will reply to this comment to clear things up as far as my comments below are concerned. TIA.

I'll start by noting where Raku has a problem you didn't mention, and one I think that's shared by Raku, Swift, and Elixir:

> "🤦🏼‍♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database.

Raku doesn't rely on a user's system's ICU. Instead it bundles (automatically extracted and massaged) relevant data with the compiler.

Thus "🤦🏼‍♂️".chars, and more generally operations on Unicode strings, give the same results when using the same compiler version.

So Rakudo has the opposite problem: when and how to update the Unicode data bundled with the compiler.

For example, Unicode 9, released in 2016, introduced prepend combiners, with consequences for grapheme clustering. Afaik that wasn't introduced into a Rakudo compiler until the end of 2016.

I note the result for "🤦🏼‍♂️".chars is 3 for Rakudo versions until the end of 2016, and then 1 thereafter. I don't know why, but I presume it was Unicode 9.

Unicode evolution is a blessing (dealing with real problems that need to be solved) and a curse.

I think it'll be a decade or so before grapheme handling becomes a truly mature aspect of Unicode, Swift, Raku, Elixir, and whichever other new PLs are brave enough to join them in making their standard string type be a sequence of elements, each of which is "what a user thinks of as a character" (quoting the Unicode standard).

> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)

Right.

Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.

> I can't find any detailed documentation of Raku's approach

The best *overview* I know of of the MoarVM implementation is https://github.com/MoarVM/MoarVM/blob/master/docs/strings...

Beyond that I think one has to read the comments in the relevant MoarVM source code.

> The representation is basically UTF-32

No. NFG is a new internal storage form that uses strands (cf ropes).

None of the strands are UTF-32, though some will be sequences of 32 bit integers.

> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.

Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/...

Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?

As for the DoS vulnerability, let me pick two exchanges that I think distil the Raku position.

The first focuses on the technical. See core devs discussing it one day in 2017: https://logs.liz.nl/moarvm/2017-01-12.html (Note that this log service is in its early days following the freenode --> libera transition, and the search functionality is buggy.) A key comment by jnthn, MoarVM's lead dev:

> All the negative numbers = space for over 2 billion synthetics. ... there's easily 100 bytes of data stored per synthetic, perhaps more. If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before the run out of synthetic graphemes. :)

The second focuses on the need to eventually deal with it -- folk will want to use Rakudo with more than 200GB RAM used for strings. From an exchange in 2018 (TimToady is Larry Wall):

timotimo: we need a --fast flag for rakudo that just throws out all checks for invariants :P
TimToady: then we'll make that the default, and make people use a --slow flag when the security flaws emerge :P
jnthn: Worked for speculative execution :P
TimToady: still worries about dos attacks on nfg...
TimToady: but yeah, we need to get things into widespread use before we slow them down again :)

Short term, Rakudo users need to accept that you don't run Rakudo in a production setting and let it go over 200GB RAM used for strings. Longer term, something better has to be done.

> unable to roundtrip Unicode text (it will all get converted to NFC)

Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8

> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)

Aiui NFG iterates the encoded data *once* when a string is first encountered, and thereafter it is a rope of fixed width strands, and that's mostly it. What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"? Was that a guess?

> worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8).

As per several earlier notes, that's not the case.

> it's not supported by the JVM backend for Raku.

True, but the JVM backend is experimental status, with several problems, not just NFG. It's about lack of tuits of those working on the JVM backend, not the merits of NFG.

> hsivonen argues that random access to code points is almost never actually useful

Fwiw I fully agree. One could say it's almost always *dangerous*. Why bother with random access to code points when their correct interpretation as text absolutely requires iteration?

> I think the same applies to grapheme clusters.

Why? To be clear, a grapheme cluster is (the Unicode standard's abstraction of) what the Unicode standard defines as "what a user thinks of as a character".

A character, something an ordinary human understands, is not remotely comparable to a code point, which is a computing implementation detail.

And there are string processing applications where you really need O(1) character handling.

> You only need iterators.

I don't think that's true. I'll return to that in another comment.

> Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.

Larry Wall committed to NFG, fully aware Rakoons would have to deal with that vulnerability. It requires something like 15GBs worth of carefully constructed attack strings to overflow the grapheme space, which, in 2017, would allocate more than 200GB of RAM for string related storage. It's a theoretical risk that must be addressed longer term; in the near term one must constrain a Rakudo instance to reject allocating string space beyond 200GB or so.

The UTF-32 memory usage expense you thought existed does not. The strand implementation can be expanded as necessary to include forms other than the 8 bit width strand that was designed for efficiently handling mostly ASCII strings.

Does the iteration performance expense you describe exist? I'd appreciate it if you could link to a line or lines in the source code I linked, indicating what you mean. TIA.

In the meantime, aiui there is great value to NFG's O(1) character handling. In short, it's about meeting the performance needs of a string processing PL that, like Swift and Elixir, takes Unicode seriously, but, in Raku's case, also takes seriously a need for directly PL integrated performant Unicode era string pattern matching / regexing / parsing. That's not Swift's focus, which is fair enough, but is fundamental to Raku and its features.

Have you misunderstood NFG?

Posted Aug 6, 2021 16:43 UTC (Fri) by mbunkus (subscriber, #87248) [Link]

That was very insightful & interesting. Thank you very much.

Have you misunderstood NFG?

Posted Aug 6, 2021 18:15 UTC (Fri) by excors (subscriber, #95769) [Link]

>> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
> Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.

I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku's built-in definition of 'character'), and a user registers with an 8-character password, but then you upgrade Raku and now that user's password is only 7 characters and the form won't let them log in.

To avoid bugs in cases like that, you need to realise beforehand that Raku's definition of 'character' is not stable and you need to implement some alternative form of length counting for any persistent data. I don't mean that's a huge problem, but it's unfortunate that the language's default seemingly-simple string API is creating those traps for unwary programmers.

>> The representation is basically UTF-32
> No. NFG is a new internal storage form that uses strands (cf ropes).
> None of the strands are UTF-32, though some will be sequences of 32 bit integers.

If there are no multi-codepoint graphemes and a single strand and at least one non-ASCII character, then it's sequences of 32-bit integers that are the numeric values of Unicode code points, i.e. in that basic case it's UTF-32 :-) . And that has the benefits of UTF-32 (you can find the Nth character in constant time) and the drawbacks (4 bytes of memory per code point; worse than UTF-8 even for CJK), which are not really affected by the more advanced features that NFG adds.

>> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.
>
> Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/... ?
>
> Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?

By "dictionary" I don't mean a specific data structure like a Python dict / hash table, I just mean any kind of key-value mapping. In MoarVM it looks like there's actually two: MVM_nfg_codes_to_grapheme maps from a sequence of code points to a synthetic (i.e. a negative 32-bit number), or allocates a new synthetic if it hasn't seen this sequence before, and is implemented with a trie; and MVM_nfg_get_synthetic_info maps from a synthetic to a struct containing the sequence of code points (and some other data), implemented with an array. Both of those are limited to 2^31 entries before they'll run out of unique synthetics.

>> If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before the run out of synthetic graphemes. :)

You can get a VM with 256GB RAM for maybe $1.50/hour - that's not a lot of memory by modern standards. It might have been a reasonable limit in a programming language designed a couple of decades ago, but it seems quite shortsighted to design that now, particularly in a language that's meant to be good at text processing.

I think it's not obvious that the implementation could be optimised later, without significant tradeoffs in performance or complexity. Specifically the performance guarantees of the string API require the string to be stored as an array of fixed-size graphemes (to allow the O(1) indexing), so the only obvious way to increase the grapheme limit is to increase the element size, which would greatly increase memory usage and reduce performance in the vast majority of programs that don't use billions of graphemes, and/or to dynamically switch between multiple string representations. (There may be non-obvious solutions of course; I'm certainly not an expert and haven't investigated this very deeply, and I'd be interested if there were existing discussions about this.)

(I don't particularly care about technical limitations of a non-production-ready compiler/runtime which can be fixed later; but I am interested when those limitations are an inevitable consequence of the language definition. Raku expects the O(1) grapheme indexing and that constrains all future implementations of the language, and it's interesting to compare that to other languages' string models.)

(Incidentally I'm ignoring strands here, because it looks like MoarVM has a fixed limit of 64 strands per string. Finding the Nth character might require iterating through 64 strands before doing an array lookup, but technically that's still O(1) even if the constant factors will be bad.)

> Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8

It can, as long as you want to do almost nothing with the string apart from decode and encode it (in which case why bother using a Unicode string at all? You could just keep it as bytes). Even if it's perfectly valid UTF-8, but it's NFD instead of NFC, you'll get garbage if you try to print it or encode it like a normal Unicode string:

say Buf.new(0x6f, 0xcc, 0x88).decode('utf8');
# ö
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8');
# o􏿽xCC􏿽x88
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8').encode('utf8');
# utf8:0x<6F F4 8F BF BD 78 43 43 F4 8F BF BD 78 38 38>

The single UTF-8 grapheme gets decoded into 3, so any grapheme-based text processing algorithms will misbehave. The only way to get correct processing is to not use UTF8-C8, and let Raku lossily normalize your string.

That contrasts with Swift where strings aren't stored in normalized form but most string operations behave as if they were. E.g.:

let a = String(decoding: [0x6f, 0xcc, 0x88], as: UTF8.self)
let b = String(decoding: [0xc3, 0xb6], as: UTF8.self)
print(a, b, a == b)
# ö ö true
print(a.count, b.count, a.utf8.count, b.utf8.count)
# 1 1 3 2

so it operates over graphemes but it preserves the original bytes when you re-encode as UTF-8.

>> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)
> What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"?

I was slightly wrong about the "walk" because I mixed up the grapheme array and the trie - reading the string only needs the array, which is much cheaper than the trie.

But still, if you're doing an operation where you need to access each code point (e.g. to encode the string) you will iterate over the string's 32-bit elements, and if an element is a synthetic then you have to read nfg->synthetics[n].codes (which will likely be a cache miss, at least once per unique grapheme in the string) then read nfg->synthetics[n].codes[0..m] (another cache miss). That sounds slower than if the string was just stored as UTF-8/UTF-16/UTF-32, where all the relevant data is stored sequentially and will be automatically prefetched.

Admittedly that's only particularly relevant when iterating over code points, which doesn't seem too important outside of encoding. Encoding does seem quite important though. I don't have any benchmarks or anything, just a vague concern about non-linear data structures in general. It's a deliberate tradeoff to get better performance in some grapheme operations, but I worry it's a high cost for questionable benefit.

> And there are string processing applications where you really need O(1) character handling.

Do you have specific examples where it is really needed? I suspect that in a large majority of cases, all you really need is forward/backward iteration and the ability to store iterators in memory. E.g. for backtracking in pattern matching / regexing / parsing, you just store a stack of iterators to jump back to, instead of storing a stack of numeric indexes. That can be done with a variable-sized-grapheme string representation (like UTF-8 or UTF-32), avoiding the compromises that are required by a fixed-size-grapheme representation.

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 7:01 UTC (Thu) by ssmith32 (subscriber, #72404) [Link]

>splitting by code point is wrong and buggy but it works well enough for typical Western text that you don't care enough to implement it properly?

Of course, the good news is that the "typical Western text" more & more includes useful test cases like 🤦🏽‍♀️

I find putting emojis in my code & git commits not only serves as a nice outlet, but also as a great test!

Mixing it up with some right to left scripts for an added bonus & language practice!

Of course, until some annoying person decided method names must be ascii. And, thus, my unit tests are no longer ممتاز

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:18 UTC (Tue) by khim (subscriber, #9252) [Link] (5 responses)

> Having each character the same size makes a lot of things much simpler

A lot? It makes it possible to do something useless and pointless quickly, yes.

> such as finding the length of a string (in characters, not bytes)

Wow. What an achievement! What do you plan to do with than info? Even if you used monospaced font you still have to deal with the simple fact that 14 character “Ｈｅｌｌｏ, ｗｏｒｌｄ！” is 2x wider than 13 “Hello, world!”

And if your editor supposed diacritics then “length of a string” becomes even more pointless.

I literally couldn't recall any algorithm which works with characters and which is useful for any purpose if you have to support the whole range of human scripts.

And if you don't need to support all types of scripts then it's not clear why would you need Unicode at all: use iso-8859-x or koi-r (depending on what languages you need to support) why would you need anything else?

> indexing to a specific string position

No, you can't. Zero-width joiner and non-joiner, bidirection and many other surprises would make such index pointless.

> copying substrings around

Works perfectly in UTF-8, too.

> UTF-8 is a great interchange format, but not the best internal representation format.

Well, maybe in some exotic cases, but in general UTF-8 is absolutely the best format: most algorithms either work fine with UTF-8, or they don't work with UCS-2 or UCS-4 either.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:59 UTC (Tue) by jafd (subscriber, #129642) [Link]

Interestinly enough, PyPy switched to UTF-8 as the internal representation of their strings, and they say that performance improvements are numerous.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 0:09 UTC (Wed) by roc (subscriber, #30627) [Link] (2 responses)

This is absolutely right. Unicode code point offsets and lengths are almost entirely useless in their own right. (Of course they're still needed when you implement or use legacy APIs that require them.) In fact they're worse than useless because they encourage developers to write code that works OK for simple cases but fails with nontrivial grapheme clusters etc.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 4:06 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

That's why I love emojis! They are common and they stress-test pretty much all of the Unicode machinery.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:23 UTC (Wed) by felix.s (guest, #104710) [Link]

s/love/tolerate/

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 22:38 UTC (Thu) by bartoc (guest, #124262) [Link]

Well, the unicode grapheme clusterization algorithm works directly on characters and is useful for exactly this :)

More seriously you can get useful results for "display width" with just the simple, non-tailored grapheme clusterization algorithm. If you're a word processor or text layout system then you're probably going to want to tailor clusterization results for both language/region and font (some sequences of code-points are considered as "one character" to users in some cultures but not others, even when everyone is using the same font!)

Also: if you're doing a ton of clusterization and decoding (like in a text editor) then utf-16 may be cheaper to go through and decode than utf-8, just in terms of number of instructions. (obviously it may loose out since the characters are bigger, but still, benchmarking would be required). If you're smart with how you're decoding things it shouldn't matter either way though, on modern machines.

Watson: Launchpad now runs on Python 3

Posted Aug 5, 2021 11:28 UTC (Thu) by immibis (subscriber, #105511) [Link]

You wrote "characters" but I think you meant "code points".

Is counting code points really useful? ü can be two code points (not sure whether the one I typed is). Is the string "ü" two characters long? Can I put my cursor between the "u" and the "◌̈" ?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 19:36 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (10 responses)

I admit, I was fully aware that this was a trick question. I apologize for triggering everyone intentionally.

In general I'm convinced that anyone suggesting that having code points directly accessible/be the same size helps with anything that a human cares about, simply hasn't learned enough about doing Unicode correctly yet (the latter applies to me as well). Seese has summed up pretty well what the relevant topics are:

> What you want is either the size in bytes,…

for copying strings around in memory or storing strings somewhere, e.g. a database or a file format with fields with a certain length limit given in bytes

> …the length in number of grapheme clusters,…

for doing anything related to displaying the string, e.g. knowing where you may split so that the human perceives the resulting parts as being correct or deciding how far to move the cursor when the human presses a cursor key (you want one cursor press to skip over a whole emoji, no matter how many code points that emoji consists of)

> …or the width in pixels (given a font).

again for displaying the string somewhere or calculating how wide controls become.

Neither of those three topics benefits from using UCS-4 or, even worse, UCS-2 (which wchar_t is on Windows), because it's only the code points that consist of fixed number of bytes, but not the grapheme clusters which can still consist of several code points. For the purpose of handling Unicode correctly, all three encodings (UTF-8, UTF-16 aka UCS-2, UTF-32 aka UCS-4) are actually variable-length encodings.

Of course if one doesn't care about doing Unicode correctly one could simply split a string between arbitrary code points. Then you can also chose to simplify your life in other ways, e.g. https://xkcd.com/221/

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 19:46 UTC (Tue) by mbunkus (subscriber, #87248) [Link]

Huh, looks like I've even triggered myself 😁

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 23:08 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (8 responses)

>> What you want is either the size in bytes,…
> for copying strings around in memory or storing strings somewhere, e.g. a database or a file format with fields with a certain length limit given in bytes

PostgreSQL VARCHAR counts Unicode code points ("characters" according to the documentation). T-SQL wants you to use NVARCHAR instead of VARCHAR, and NVARCHAR has the annoying property of counting UTF-16 code units (so BMP code points count as 1, non-BMP code points count as 2). Oracle can be instructed to count code points instead of bytes (but it uses bytes by default). MySQL and SQLite use bytes (to the extent that the latter even has such limits; it collapses all CHAR/VARCHAR types to TEXT, but TEXT is not completely unlimited), so I suppose this claim is not completely false. But it is misleading, by suggesting that *all* databases count bytes. They do not, and in my experience, when the engines differ on points like this, PostgreSQL's behavior is usually the closest to the actual SQL standard (which is not available online as far as I can tell, so I cannot confirm what it says to do for characters vs. bytes).

On a related note, you also have to consider interfacing with other systems. Twitter, for example, counts Unicode code points and not bytes or grapheme clusters when determining the maximum length of a tweet. I imagine there are other APIs which similarly impose code point-based limits and not byte-based limits. It is unrealistic to assume that all APIs will "be reasonable" when so much of the computing industry was operating under the "Unicode is a 16-bit encoding" falsehood for such a long period of time (and many people still wrongly believe this!).

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 1:14 UTC (Wed) by excors (subscriber, #95769) [Link] (1 responses)

> Twitter, for example, counts Unicode code points and not bytes or grapheme clusters when determining the maximum length of a tweet.

That's not correct now, per https://developer.twitter.com/en/docs/counting-characters - there's a limit of 280 "characters" where Latin-1 code points and most punctuation code points count as 1, all other code points (most notably CJK characters) count as 2, certain emoji count as 2 (even a family like "👨‍👩‍👧‍👦" which is 7 code points, plus you could add skin tone and gender modifiers and I think it would still count as 2), URLs count as 23, etc. Also the string is converted to NFC before counting. And they've had at least three different versions of the counting algorithm.

(That illustrates the general uselessness of counting code points; you need something far more sophisticated if you want it to feel reasonably fair to users.)

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 1:50 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

I had not heard that. I can sort of understand double-counting CJK, as those scripts are (usually) more concise than phonetic scripts. But their decision to refer to the entire range U+0000 to U+10FF as "the Latin-1 code pages" is immensely frustrating. Latin-1 is U+0000 to U+00FF, and their bigger range includes a huge swath of completely unrelated code points, many of which were never even in the ISO-8859 family. U+10FF is, as far as I can tell, a completely arbitrary and nonsensical cutoff, although it does at least have the virtue of falling at the end of a block.

OTOH, the emoji counting makes sense, because they preprocess the emoji and turn them into static images (emoji support varies significantly by target system). So you might as well give users a sensible limit instead of counting code points. Similarly, URLs count as 23 because they are automatically link-shortened (whether you want it to or not...).

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:10 UTC (Wed) by mbunkus (subscriber, #87248) [Link]

> But it is misleading, by suggesting that *all* databases count bytes.

That is not at all what I intended to say. I'd hoped writing "_a_ database… with …length limit given in bytes" instead of "…databases which all give length limits in bytes" or simply just "…databases…" made that clear enough, but it seems it didn't.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 11:15 UTC (Wed) by Sesse (subscriber, #53779) [Link] (2 responses)

> MySQL and SQLite use bytes

MySQL fairly consistently uses code points for Unicode collations (including the default utf8mb4_0900_ai_ci). There is next to no support for grapheme clusters. If you cast to BINARY, you can use bytes, e.g. LENGTH(CAST(x AS BINARY)) should give you the number of bytes needed to store x, but LENGTH(x) will give you the number of code points.

(I work on the MySQL optimizer team, which among others, is responsible for collation functionality.)

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:03 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

I was responding to a comment about a "length limit given in bytes." When I was reading https://dev.mysql.com/doc/refman/8.0/en/char.html, I must have misinterpreted it, because it does indeed say that length is in "characters." However, the examples it gives are all US-ASCII and there is only a very brief mention of character sets. Furthermore, it has a whole table showing byte counts of various US-ASCII strings, and how many bytes they occupy in the database. So one might forgive me for being confused by the ASCII-centric nature of this documentation page. PostgreSQL, by contrast, has an explicit note on their VARCHAR(n) page calling out the "characters, not bytes" distinction.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:10 UTC (Wed) by Sesse (subscriber, #53779) [Link]

You wouldn't be the first to be confused; the term “character” is ambiguous (does it mean bytes? codepoints? grapheme clusters?), and I would prefer avoiding it altogether, but unfortunately, the SQL standard insists on using it. And it seems to mostly mean Unicode code points there.

Internally in RAM, MySQL allocates (non-blob) strings as (max width) * (max number of bytes used for a single code point) bytes. This is why “utf8” historically means UTF-8 with max. 3 bytes per code point (utf8mb3); at the time, supporting full UTF-8 would mean having utf8mb6, which I guess was seen as prohibitively expensive wrt. latin1, and emoji wasn't a thing. Now, Unicode has made it clear that nothing will be allocated in the astral planes, support for 5- and 6-byte UTF-8 has been entirely removed, and the recommended default is utf8mb4 (the “utf8” alias still expands to utf8mb3, but you get a deprecation warning that it might change in the future).

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 20:02 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

> It is unrealistic to assume that all APIs will "be reasonable" when so much of the computing industry was operating under the "Unicode is a 16-bit encoding" falsehood for such a long period of time (and many people still wrongly believe this!).

I agree (and even explicitly wrote that before) that in some kind of alternate world, where Linux died and Windows won UTF-16 would have been sensible choice simply because of API compatibility issues.

But in today's world enough of APIs are in UTF-8 (and even Windows finally supports it properly).

Thus today use of UTF-16 should be relegated to legacy apps.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 22:19 UTC (Wed) by mbunkus (subscriber, #87248) [Link]

> But in today's world enough of APIs are in UTF-8 (and even Windows finally supports it properly).

Oh wow, I hadn't been aware of that. That might mean that several libraries ported poorly from Linux might actually start working with non-ASCII paths. "Ported poorly" here means that they still use the POSIX API calls, which in turn means that passing any kind of path with characters that aren't part of the ANSI code page cannot be found. Examples of such libraries include more obscure ones (e.g. libdvdread) and some popular ones (e.g. libintl/gettext).

Thanks for pointing that out; will investigate whether this means POSIX APIs, too, not just the various -A variants of Windows' native API.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 18:42 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> UTF8 is a (and clearly the best) interchange format. It's not clearly the winner if, eg, you're writing a word processor. A wchar_t in-memory format is probably best for that case.

That nonsense again... If you're writing a modern text processor then you need to be able to decompose text into grapheme clusters that can consist of multiple wchar_t code points. There is no functional difference between translating utf-8 or utf-32 into grapheme clusters.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:40 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

> What Python 3 does with NOT just using UTF-8 is, frankly, stupid.

Right! And so how do you represent a filename? That's an arbitrary sequence of bytes, but one that is *usually* a valid UTF-8 string, and it is common to want to, say, regex-match on it. (The same for the contents of text files.)

Emacs once did this via a brutally complex encoding of its own, then migrated to UTF-8-with-stuff-on-top later on, because *most* things fit into it. But... not everything does, and you do still have to represent the things that don't without data loss in a general-purpose programming language.

(non) UTF-8 filenames

Posted Aug 4, 2021 0:46 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

One good idea (as with many problems) is to avoid dealing with it whenever you didn't have to, which is very often.

For example Rust will let you write:

let x = std::fs::File::open("/some/file/name.extension")?;

You can write a LOT of code with this and never realise non-UTF8 filenames can even exist, you never use one. But, if you know perfectly well the file has some crazy non-UTF8 name, you can write it as an OSStr by whatever means and open() that instead since Rust knows how to deal with that. OSStr is a type designed to allow whatever insanity is expected by your operating system, so, arbitrary non-zero bytes for Linux and arbitrary non-zero u16s for Windows, doubtless other possibilities exist.

But then the other side of the coin. Maybe we've got a filename, but it might not be valid UTF-8?

Rust presents such a filename as the Path type and offers you all three plausible ways forward to: 1. Ask the Path for a UTF-8 string, if that works you've got a UTF-8 filename, otherwise you get None. 2. Lossily convert the Path into a UTF-8 string. If it wasn't Unicode already then Unicode's replacement character U+FFFD is presented instead of each sequence that could not be decoded so it's safe to display, however you shouldn't open this (possibly different) name as a file. 3. Convert the Path into a raw OSStr, getting the raw OS name for this file and figure out your own path forward for what to do with that.

From the command line this is nice because you can get the program arguments as OSStr if you want, so you can take "invalid" (non-UTF-8) filenames as command line parameters without either pretending they're raw bytes, which they definitely aren't, or being unable to write error messages about them.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 13:33 UTC (Tue) by martin.langhoff (guest, #61417) [Link] (36 responses)

I like Ubuntu -- both the distro and the folks behind it -- but can't help notice in the blog entry that all the description of Launchpad seems to be describing an infra with a lot of tech debt.

This is a worry – Launchpad is in a key spot in the supply chain to quite a bit of internet infra.

...

Semi-off-topic Python3 rant... Python 3's over-eager use of UTF-8 is a massive PITA for applications that don't control their inputs.

I maintain a "simple" code analysis tool (SLOC and other stats), which must be able to process random 3rd party code.

The switch from Py2 to Py3 was relatively straightforward -- but the "long tail" of dealing with input files that are not UTF-8, or that are _mostly_ in one codepage but have a couple of stray bytes is... long indeed. I've applied all the tricks in the book, and I'm -still- dealing with frequent "barfs on this input" reports.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:18 UTC (Tue) by mb (subscriber, #50428) [Link] (32 responses)

>Python 3's over-eager use of UTF-8

I don't think Python enforces UTF-8 anywhere.

>but the "long tail" of dealing with input files that are not UTF-8, or that are _mostly_ in one codepage but have a couple of stray bytes is... long indeed. I've applied all the tricks in the book, and I'm -still- dealing with frequent "barfs on this input" reports.

Well, so your data is broken.
How do you want to deal with that?
Raise exceptions?
Replace invalid characters?
Ignore invalid characters?

Python let's you choose any of these options at decode() time. It doesn't enforce one.
You can even choose to never decode and work purely with bytes instead.
The language cannot answer these questions for you.
You, as the developer, have to write your code to handle these cases.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:29 UTC (Tue) by martin.langhoff (guest, #61417) [Link] (20 responses)

> Well, so your data is broken.

That's crazy talk. _I don't control the inputs_. There's spaces where you "own" the data and its formatting. There's spaces where you _don't_.

For example, version control systems will try to "parse" text in limited ways -- newlines for diff/patching -- while being agnostic about individual characters. And Unix allows any old crud in a directory entry, _it's a bag of bytes_. The Mercurial folks had a similarly hard times with Py3.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:18 UTC (Tue) by mb (subscriber, #50428) [Link] (11 responses)

>I don't control the inputs

But you do control how you process that data.
If your input data might be broken, then of course you can't just decode it with the default parameters, because that will raise exceptions and abort if not caught. (and that's a sane default).
You need to tell the program/libraries/language what you want to do.

>newlines for diff/patching -- while being agnostic about individual characters.

That's just waiting to blow up, if you just scan for values at bytes boundaries, which could be in the middle of any Unicode character.
But if you really want to do such things, just do _not_ decode it and work with bytes (b'\n', etc...).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:50 UTC (Tue) by comex (subscriber, #71521) [Link] (2 responses)

UTF-8 is intentionally designed so that the encoding of one character will never appear within the encoding of another character. As part of that, non-ASCII characters are encoded using a series of bytes that all have the highest bit set. So as long as you only care about UTF-8 rather than legacy multibyte character sets, you don’t have to worry about finding newline bytes in the middle of other characters.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:26 UTC (Tue) by mb (subscriber, #50428) [Link] (1 responses)

>UTF-8 is intentionally designed so that the encoding of one character will never appear within the encoding of another character.

But that doesn't help, if the input data encoding is not known or if it's not even known to be correctly encoded.
That's where the error handling options of Python's codecs come into play.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 23:41 UTC (Tue) by lsl (subscriber, #86508) [Link]

Most of the time I work with protocols or interfaces where certain bytes values (e.g. '/', '\n' or '\0') are assigned some meaning but I don't have to care at all what meaning is encoded into other byte values, I just need to be able to reproduce them again at some later point.

Python 3 made that very inconvenient, especially in inital versions where there weren't any byte string versions of common functionality (e.g. getenv).

Even today, many Python libraries and programs just explode because they don't use the byte string versions of standard library functions even when it would be appropriate (common examples are Unix filesystems or environment variables as well as many network protocols). The str variants tend to be more prominently documented and more convenient so that's what many authors use.

Reliability suffers as a consequence.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:07 UTC (Tue) by khim (subscriber, #9252) [Link] (7 responses)

> You need to tell the program/libraries/language what you want to do.

Except Python3 made it, initially, impossible. Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.

Python3 had binary strings which were possible to use for processing such data but they were limited (on purpose).

That was huge regression compared to Python 2 (not even sure they fixed everything since I have never switched from Python 2 to Python 3).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:23 UTC (Tue) by mb (subscriber, #50428) [Link] (6 responses)

>Except Python3 made it, initially, impossible.

Do you have an example?

>Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.

If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.
That's how it should have been done in the Python 2 script, too.

>That was huge regression compared to Python 2

What exactly is impossible to do in Python 3, that had worked fine in Python 2?
A short example helps the discussion.

>not even sure they fixed everything since I have never switched from Python 2 to Python 3

Ok, so this is just FUD?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:50 UTC (Tue) by khim (subscriber, #9252) [Link] (4 responses)

> Do you have an example?

Just read an announcement. One example is proudly shown right there: Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.

How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?

> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.

How do you propose to do that if we are dealing with XML which includes random binary strings? Yes, I know it's not valid XML, but you know, customers don't care. If old version of program works and new doesn't then they would just say it's broken and would ask to fix it.

> What exactly is impossible to do in Python 3, that had worked fine in Python 2?

Python 2.6 and Python 3.0: the ability to get the filename and, you know, open the file, e.g.

Yes, I know, they fixed that in Python 3.4. By stopping to pretend that filenames are strings.

And maybe, by now, they have even made it possible to combine them with raw sequences of binaries.

But I'm not really interested in that now: I have left the Python camp when they broke strings in Python 3. That was better choice than waiting for when they would make it kinda-sorta usable again.

> Ok, so this is just FUD?

If press-release by language makers is FUD to you then yeah, that FUD, I guess.

Python 2.6 was much better than Python 3.0 and Python 2.7 was much better than Python 3.3. I later found out that issues with filenames was fixed in Python 3.4 (kinda-sorta: you still can't work with filenames as well as you can in 2.7, but at least you can support them now… what an achievement), but I suspect even latest versions of Python 3 are still worse than Python 2.7 (although, true, I don't use it thus I wouldn't know).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 23:54 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Just read an announcement. One example is proudly shown right there: Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.
>
> How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?

This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 6:28 UTC (Wed) by khim (subscriber, #9252) [Link] (2 responses)

> This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.

Not if we are discussing history. Switch to Python 3 was barely mitigated disaster: it replaced working string model with a broken one and added enough incompatibilities that it took 10 years.

Granted, all three “big P” languages (Perl, PHP and Python) tried to do that and only Python managed to break the developers expectations yet hold the mindshare, but I wonder what would have happened in an alternate history where at least one camp would have tried to evolve language without huge breakages.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:08 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> Not if we are discussing history. Switch to Python 3 was barely mitigated disaster: it replaced working string model with a broken one and added enough incompatibilities that it took 10 years.

Meh. So what? Python 2 lost support well over a year ago. As far as I'm concerned, this is all ancient history at this point. But every time anyone mentions Python, in any capacity, on LWN (or Hacker News, for that matter), the comments *always* turn into a ridiculous flame war over it.

Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:39 UTC (Wed) by khim (subscriber, #9252) [Link]

> Python 2 lost support well over a year ago.

Yup. Means now it's time to think what to do about that mess. Look here, for example

Distributions are starting to remove Python 2 which means that simple solution which worked for years (ignore the Python3 existence and use Python 2) no longer works. Enterprise guys are starting to switch. Some of them. But that saga is still far from being over. When RHEL 8 would go out of support? 2029? Well, I guess by 2030 or so we may declare Python 2 dead and buried then.

> As far as I'm concerned, this is all ancient history at this point.

Sorry but if that's all “ancient history at this point” then why do you even leave comments under article which proves that it's most definitely not an ancient history?

Look on it's title for chrissake! You may try to go away and pretend that it doesn't exist but it's a stupid to pretend that all that pain happened years ago somehow when commenting something that shows that story is still ongoing.

> Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?

Sure. If that would have been what mb would have said then I would have stopped the discussion. But that's not what happened.

It's a bit like pain if switching from Winows Classic to Windows XP or from Mac Classic to MacOS X. You may say what's “done is done” (and that would be true!) but that still doesn't change the fact that it was painful and unnecessary.

Computer industry is now mature industry. Decades-long deprecation cycles are typical. You can't avoid it.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 18:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Do you have an example?

For example, it was initially impossible in Python 3.0 to use byte strings in "%" formatting. It was fixed only in Python 3.5: https://www.python.org/dev/peps/pep-0461/

> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode. That's how it should have been done in the Python 2 script, too.

In many cases what I want is to read the data, parse it a bit and then give it verbatim to the next layer that might make more sense about it. Even if the data can't be properly represented as valid UTF-8.

Filesystems were a great example. Up until Py3.6 the built-in Python filesystem module simply SKIPPED undecodable file names in directory listings. How's that for reliability?

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 5:49 UTC (Wed) by dvdeug (guest, #10998) [Link] (7 responses)

> That's crazy talk. _I don't control the inputs_. There's spaces where you "own" the data and its formatting. There's spaces where you _don't_.

There's no space where you can't reject data, because there's simply data you can't make head-or-tails of. And you can always delegate massaging data into a sane format into a separate function.

> And Unix allows any old crud in a directory entry, _it's a bag of bytes_.

That's broken, because it's not. What's stored in 62696e? If we weren't having this discussion, would you even start to think that was a standard Unix directory with a three-letter name? https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Common... says that certain libraries must be installed with runtime names like libcrypt.so.1. At no point does it mention that that's a transliteration into ASCII. (It could be κιβγρψοτ.σξ.1, with ELOT 927; that's an entirely acceptable reading of a bag of bytes.) If it were a bag of bytes, then the standard would be careful not to be implicit about it. In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:43 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (1 responses)

I wasn't aware that "bgn" was a standard Unix directory.

Live and learn, I guess.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:44 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

... Derp. I should *finish* my coffee, that's "bin".

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 11:11 UTC (Wed) by HelloWorld (guest, #56129) [Link] (4 responses)

> In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.
People have been using file names in encodings other than UTF-8 (such as ISO-8859-1) for decades, and that can easily result in arbitrary byte sequences. There's nothing in the APIs that prevents you from creating non-UTF-8 file names, and it's easy to create multiple files with names that represent the same sequence of graphemes when interpreted as UTF-8. The simple fact of the matter is: the only sane way to make sense of Unix file names is to treat them as a bag of bytes.

> At no point does it mention that that's a transliteration into ASCII.
Because most real-world encodings (UTF-8, ISO-8859-*) are ASCII compatible, so it doesn't matter. And besides: LSB? Nobody cares.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:27 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

And is there anything in the spec for Unix that forbids EBCDIC? or KOI-8?

Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern. (I've had systems where control characters were deliberately inserted into file identifiers to protect the files as much as possible from accidental damage.)

While I think utf-8 can contain pretty much the entire sequence of valid bytes, there are order limits so it's not a *random* byte pattern.

Cheers,
Wol

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:40 UTC (Wed) by HelloWorld (guest, #56129) [Link] (1 responses)

> And is there anything in the spec for Unix that forbids EBCDIC? or KOI-8?
No, because everybody knows that when file names are specified as strings in such specifications, there's an implicit assumption that the encoding is ASCII. And that works because such specifications generally don't specify file names with non-ASCII characters in them.

> Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern.
Exactly: a file name is a bag of bytes except 0 and 2F. Would another format be better? Perhaps so, but it's too late to enforce that at this point.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 21:20 UTC (Wed) by dvdeug (guest, #10998) [Link]

> Would another format be better? Perhaps so, but it's too late to enforce that at this point.

It's been discussed; see https://dwheeler.com/essays/fixing-unix-linux-filenames.html . The POSIX requirement is for incredibly limited names only, and it's easy to restrict it greatly at the cost of zero to most users.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 20:58 UTC (Wed) by dvdeug (guest, #10998) [Link]

> the only sane way to make sense of Unix file names is to treat them as a bag of bytes.

It's not a bag. I skipped this first time around, but a bag is generally a name for a multiset, or at least some other data structure that has no order. A Unix file name is a sequence of bytes.

Again, nobody treats it as a sequence of bytes. Coreutils will make sure not to dump random noise to the screen, but there's no standard C functions to do that that I recall, and I don't know if GNU Libc has something there. It's not sane if everyone just treats them as strings, and anyone wanting to handle them as a sequence of bytes has to write special code and tiptoe around everything.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:35 UTC (Tue) by martin.langhoff (guest, #61417) [Link] (10 responses)

To be clear, we are laboriously using "open(fpath, 'r', encoding=encoding_found, errors="surrogateescape")" where encoding_found is seeded from libmagic + chardet.

It still blows up in random ways.

Maybe there's a couple tricks I am missing (hey, I can always learn more) but it should not involve mysterious tricks nor be this fiddly.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:21 UTC (Tue) by mb (subscriber, #50428) [Link] (9 responses)

>It still blows up in random ways.

What does that mean?
Do you have an example?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 21:32 UTC (Tue) by martin.langhoff (guest, #61417) [Link] (8 responses)

> What does that mean?
> Do you have an example?

Just today, some XML files that seem to be UTF-16 throw a UnicodeDecodeError: utf-16-le can't decode bytes ... illegal encoding.

There will be an explanation for this -- an infrequent codepath, maybe some open() or decode() I haven't hunted down yet, because Python 3 needs significantly more handholding when opening a file than Py2.

surrogateescape

Posted Aug 4, 2021 1:35 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (7 responses)

I very much doubt that "surrogateescape" is what you actually need here.

The idea of "surrogateescape" is that hey, if we could smuggle gibberish through the decoder/ encoder pipeline, and the rest of our code never looks at the resulting non-sense, then we can have what we "want" of just pretending it's ASCII. This is a bad idea, and for security reasons it blows up in some cases, so it can't achieve this false panacea anyway.

Without knowing way too many details about your application it's impossible to be certain, but I'd suggest "replace". Here's the consequences of "replace" so you can assess if they're closer to what you need:

1. Any time your guess doesn't work and the decoder is looking at stuff that is not in fact Unicode text in whatever encoding you hoped for, sequences that don't decode turn into the Unicode code point U+FFFD, a code point reserved specifically for this purpose. This code point is definitely not: a letter, a digit, punctuation, whitespace, part of a variable name, an ASCII control character, or anything else, but it is definitely itself. There's an excellent chance your code needs no extra work to cope with this which is nice. If it does, this code point can (but usually doesn't) exist in Unicode data anyway, so you likely should fix that. No exceptions, no decode errors, just Unicode data, some of which may be U+FFFD which again, was already valid Unicode.

2. If your user/customers see this replacement character U+FFFD it looks like this: � and that's pretty obviously not what they meant. Or I guess, in a few cases maybe it's exactly what they meant. Maybe they're writing a Unicode decoder. But unlike escaping and other nonsense it's very obvious what's going on here. You can even Google for it.

3. On output, in the worst case where you were sure the output should be ASCII, Python has no choice but to choose ? because that's the best it can do. The good news is that humans can generally realise that function_???????_??? means something went wrong. The better news is that you probably don't see this too often when you're outputting ASCII. Any time your output is some flavour of Unicode, Python can emit an actual U+FFFD.

Now, personally I think "surrogateescape" is as much Python's fault as yours. A programming language should encourage its users not to set themselves on fire. Providing "I'm sure I know what I'm doing" features is near guaranteed to be contrary to that purpose. If you're C++ then you have some excuse, but I don't see how Python does. PEP 383 should not have been accepted.

surrogateescape

Posted Aug 4, 2021 2:07 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (6 responses)

The sole (intended) purpose of surrogateescape is to work around the "feature" of Unix where filenames can be any arbitrary string of bytes, but at the same time they are usually UTF-8 (or, on older systems, some ASCII-superset 8-bit encoding like ISO-8859-1). If you want to enumerate a directory and open the files in it, you need to be able to smuggle mis-encoded garbage through your strings and have exactly the same mis-encoded garbage come out the other side, because if the byte sequences don't match, the filesystem won't be able to find the file you wanted.

Of course, the *correct* way to do that is to use a high-level library like pathlib instead, so that you don't have to fiddle around with individual strings in the first place, and it can do all the ugly string magic that Unix requires. So in practice, nobody has any business using surrogateescape, unless you're implementing a pathlib-like library and know exactly what you are doing.

Alternatively, the correct way is to keep your paths as bytes, but then you can't do most string-like operations on them, which is painful. Also, the Windows API will not give you filenames in any format other than UTF-16, unless you want to use some random 8-bit encoding that Microsoft confusingly refers to as "ANSI," but which in fact could be almost any Windows code page depending on the user's locale. Faced with two bad options, Python chose UTF-16, which means it is now in the position of having to convert those UTF-16 strings (which can also contain invalid code points and IIRC even mismatched surrogate pairs!) in such a way that your code (which you wrote to run correctly on a Unix platform) doesn't break on them. Hence, "do nothing and return raw bytes" was never really a good option at the language level, for a language that wants to be cross-platform.

surrogateescape

Posted Aug 4, 2021 17:08 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (5 responses)

> If you want to enumerate a directory and open the files in it, you need to be able to smuggle mis-encoded garbage through your strings and have exactly the same mis-encoded garbage come out the other side, because if the byte sequences don't match, the filesystem won't be able to find the file you wanted.

The better answer here would be to treat filenames as opaque blobs of unstructured data, and perform any necessary conversion at the UI level—without surrogates. The same goes for other interfaces such as the argument list and environment variables where the data is not guaranteed to be UTF-8. It's not "mis-encoded garbage", you're just applying an inappropriate decoding. There is no good reason to attempt UTF-8 decoding while enumerating a directory when you're just getting data from one filesystem API and passing it to another without presenting any of this to the user.

If you do need to display the data to the user, or otherwise treat it as human-readable text e.g. for collation, *at that point* attempt to decode it as UTF-8 (or whatever the actual locale is set to) and substitute the U+FFFD replacement character for anything that can't be decoded. This does imply that there is probably no way to name certain files through, say, a GUI text box, but one should at least be able to select them from a file chooser if the program is properly designed.

Not being able to do string operations on filenames isn't really much of a handicap. This is like complaining that you can't use strcmp() to compare arbitrary binary data; filenames aren't strings, so it follows that string operations are not well-defined on filenames. Apart from concatenation, which works in (almost) every encoding, the only other operations you need are platform-specific anyway: joining with a platform-specific path separator, splitting on the same separator (or platform-specific alternatives, like '\' and '/' on Windows), and pattern matching. These operations are better handled by something like pathlib than by string functions. If you *really* need to treat a filename as a string for some reason (e.g. to perform a search) then go ahead and convert to UTF-8, but with the understanding that this conversion is lossy and converting the result back won't necessarily give you the same filename.

surrogateescape

Posted Aug 4, 2021 17:45 UTC (Wed) by Wol (subscriber, #4433) [Link]

> If you *really* need to treat a filename as a string for some reason (e.g. to perform a search) then go ahead and convert to UTF-8, but with the understanding that this conversion is lossy and converting the result back won't necessarily give you the same filename.

So do what DataBASIC does (which can get confusing) and have multiple representations of the data within a single variable. The canonical form is always a string, but it can be a file id, a number, etc etc so the variable is internally a structure. You can compare against the utf-8 version to check whether it's what you're looking for, and then go back to the original when you actually want to access the file.

Cheers,
Wol

surrogateescape

Posted Aug 4, 2021 18:40 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (3 responses)

> This does imply that there is probably no way to name certain files through, say, a GUI text box, but one should at least be able to select them from a file chooser if the program is properly designed.

You can see some fruits of these "do it for the UI" munging and then losing track of reality in `explorer.exe`. First, make a file named `NUL` available in some way (usually via a Samba share hosted on Linux). Explorer doesn't like this, so it gives it some other mangled name (let's say `zxcv` for example). Whatever it is, make another file with *that* name in the share on the host. Explorer will render these as the same filename and deleting *either one* will delete the file actually named `zxcv` after which you can delete the `NUL` file by selection. I have no idea what the file open dialog ends up doing though. Or what happens if one is a directory and the other is a file for that matter.

surrogateescape

Posted Aug 4, 2021 19:20 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

You don't need a Samba share to get into that sort of trouble. Just write a Python (or any language) program that creates a file with a name like the following:

\\?\C:\Path\To\File

It has to be an absolute path, and it has to start with that funny \\?\ prefix. This will let you smuggle almost any string under the sun through the vast majority of Windows's sanity checks, as long as it's valid UTF-16 ("Unicode" to use Microsoft's confusing terminology) and the underlying volume is NTFS (and not FAT or some other legacy filesystem). It bypasses the MAX_PATH check, and purportedly even allows you to use names like "." or ".." without path resolution interpreting them as current/parent directory.

Of course, nothing in the Microsoft ecosystem can handle such files correctly. Explorer is completely lost, most programs will get confused and tell you the file does not exist, etc.

surrogateescape

Posted Aug 5, 2021 21:18 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

NTFS doesn't require UTF-16 names. It hardly could, it was invented before UTF-16. Like Linux, file names are a bunch of meaningless symbols, in this case 16-bit rather than 8-bit, with certain rules about which symbols can be in names.

The symbols are in some sense UTF-16 code units, but the actual name needn't be a UTF-16 string. So the four u16s DFFF DFFF D800 0041 make a perfectly reasonable name for NTFS, but obviously that's not a valid UTF-16 string.

Your Windows UI won't like that very much but the filesystem and core OS services think that's a perfectly reasonable name for a file.

This is all getting far off topic. I was rather hoping martin.langhoff might have feedback on my suggestion instead :(

surrogateescape

Posted Aug 6, 2021 0:18 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

> Explorer will render these as the same filename and deleting *either one* will delete the file actually named `zxcv` after which you can delete the `NUL` file by selection.

Yes, that's where the "if the program is properly designed" part comes into play. There would be situations where different filenames were rendered as the same UTF-8—you can get this even with valid UTF-8 if you're not enforcing a particular normalization at the filesystem level—but the file list should be keeping track of the original opaque filename for each file in the list so that when you select a file (distinguished e.g. by modification date) and instruct the tool to delete it the tool deletes the correct file. It shouldn't take the converted UTF-8 name which was only suited for presentation and delete something unrelated which just happens to have a similar UTF-8 version of its name.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:06 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (2 responses)

I like Ubuntu -- both the distro and the folks behind it -- but can't help notice in the blog entry that all the description of Launchpad seems to be describing an infra with a lot of tech debt.

That's not too surprising, given that the basic topic of the essay is describing how they went about paying down a bunch of that technical debt. The topic of technical debt is going to dominate a discussion like that, even if the project is in good shape overall. That's not to say that Launchpad is in good shape, just that an article about paying down technical debt isn't necessarily the best basis for judging the health of the project.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 1:24 UTC (Wed) by martin.langhoff (guest, #61417) [Link] (1 responses)

Agreed, not surprising, but concerning. High time for the foundational Linux distros to be running on anonymous/cattle non-persistent container images for key software supply chain infrastructure, no?

Named servers have a tendency to go hand-in-hand with on-disk secrets, and missing component updates.

I just hope Colin has a chance to rally some troops / get some help and revamp the whole infra. Not just the Python code.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 14:09 UTC (Wed) by cjwatson (subscriber, #7322) [Link]

Yes, this is definitely something we're working on incrementally.

Leak diagnosis

Posted Aug 4, 2021 2:10 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (3 responses)

Colin mentions a huge memory leak. Unfortunately he never finds out the cause of the leak, having instead entirely replaced the suspect component (zope.server) as part of an existing migration and then the leak went away.

Still, for anybody in this sort of situation a reminder: When you have really BIG leaks like this, you can use the technique Raymond Chen describes. Since most of your heap is now leaked, it stands to reason that any random page of heap from your process is most likely part of the leak. So, pick a random page of heap and display it, then stare at the data and you may be enlightened. Both a hex dump and ASCII might be helpful depending on what's leaking and how well you understand your program.

Some of us don't remember off the top of our heads how to dump a random page of heap data from a running process. So I built a tool:
https://github.com/tialaramex/leakdice is the original C program, and https://github.com/tialaramex/leakdice-rust is the same program ported to learn Rust

Leak diagnosis

Posted Aug 4, 2021 10:32 UTC (Wed) by t-v (subscriber, #112111) [Link] (1 responses)

For a memory leak in Python, I'd probably just look at the types of stuff from gc.get_objects() first.

import gc
import collections
c = collections.Counter(str(type(o)) for o in gc.get_objects())
c.most_common(20)

shows a lot of somewhat surprising parso objects and not so surprising matplotlib the notebook where I tried. The gc module also has get_referrers and friends.

Leak diagnosis

Posted Aug 4, 2021 14:13 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

The nice thing with Raymond's method, and thus leakdice, is you don't need to do any instrumentation, or logging, or have a way to go interactive, you don't need a debugger, you don't need source code, you don't need anything the OS doesn't already provide.

It's extremely unsophisticated, but for really huge leaks it's been effective several times, and so that's why I wrote leakdice. I rewrote it in Rust because (a) I wanted to learn Rust and it's easier to do that with a non-toy problem and (b) Rust prevents a bunch of problems but it cannot prevent Leaks, indeed Rust provides an explicit and _safe_ way to leak objects as Box::leak() because you might have a good reason to do that and the language can't tell. Of course just because you might sometimes want to leak memory does not mean the leaks you actually do have are wanted.

Leak diagnosis

Posted Aug 7, 2021 9:35 UTC (Sat) by cjwatson (subscriber, #7322) [Link]

FWIW I did try various approaches along these lines. The problem is that if something is holding excess references to something like the top-level request object which in turn hold references to everything else your program allocates, then the results are not very useful: they end up just being a sample of your program's normal activity, because most of your program's normal activity is not being freed. (I don't think it was exactly that, but that's a better approximation of what I was seeing to start from. In different circumstances I would absolutely have preferred to chase this down all the way.)

In other words, Chen's technique works better when the problem is large memory leaks in non-reference-counted systems, or where only a small number of kinds of objects are affected. If you have reference leaks such that the graph of uncollectable objects is large and heterogeneous, you need different approaches. https://mg.pov.lt/objgraph/ came closest to helping me track things down.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 3:07 UTC (Wed) by pabs (subscriber, #43278) [Link] (1 responses)

I wonder if they are using Breezy, or if they ported Bazaar to Python 3.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 14:08 UTC (Wed) by cjwatson (subscriber, #7322) [Link]

I covered this in the previous article in this series (https://www.chiark.greenend.org.uk/~cjwatson/blog/lp-pyth...). We switched to Breezy - repeating the Bazaar porting work would have been a lot of duplicated effort that we weren't interested in.