Watson: Launchpad now runs on Python 3
I’m not going to defend the Python 3 migration process; it was pretty rough in a lot of ways. Nor am I going to spend much effort relitigating it here, as it's already been done to death elsewhere, and as I understand it the core Python developers have got the message loud and clear by now. At a bare minimum, a lot of valuable time was lost early in Python 3's lifetime hanging on to flag-day-type porting strategies that were impractical for large projects, when it should have been providing for "bilingual" strategies (code that runs in both Python 2 and 3 for a transitional period) which is where most libraries and most large migrations ended up in practice. For instance, the early advice to library maintainers to maintain two parallel versions or perhaps translate dynamically with 2to3 was entirely impractical in most non-trivial cases and wasn't what most people ended up doing, and yet the idea that 2to3 is all you need still floats around Stack Overflow and the like as a result. (These days, I would probably point people towards something more like Eevee's porting FAQ as somewhere to start.)
Posted Aug 3, 2021 5:23 UTC (Tue)
by swilmet (subscriber, #98424)
[Link] (34 responses)
For writing large programs, Python (or any other interpreted language) is a huge mistake in my opinion.
Of course this has already been debated many times…
What really helps is: a good compiler, plus a syntax where it's trivial to know the type of any variable (for both humans and the compiler).
Posted Aug 3, 2021 8:43 UTC (Tue)
by jafd (subscriber, #129642)
[Link] (5 responses)
Posted Aug 3, 2021 9:42 UTC (Tue)
by epa (subscriber, #39769)
[Link] (4 responses)
Posted Aug 3, 2021 15:02 UTC (Tue)
by jafd (subscriber, #129642)
[Link] (3 responses)
Theoretically, you could have had them; it would not have been easy, true, but the developers didn't bother at all. Better to be "courageous", as they say at one fruit company, and waste 10 years on migration, then just shoot Python 2 in the head and make sure no one ever picks it up, with trademark infringement threats and all.
This is not cool in the least.
Posted Aug 3, 2021 16:12 UTC (Tue)
by niner (subscriber, #26151)
[Link] (2 responses)
But then, Python has never been optimized for its users but instead been optimized for the developers of Python compilers and runtimes. E.g. maintainability of the runtime's code has always been preferred over performance. The reason why one could not name a method "not" or "and" was not that this would have conflicted with builtin syntax. The reason was that prohibiting those made writing the parser easier. And this is not conjecture. It was even documented officially.
Posted Aug 4, 2021 9:10 UTC (Wed)
by ehiggs (subscriber, #90713)
[Link] (1 responses)
I think it would. Otherwise how would you import python2 code from python3?
Posted Aug 5, 2021 9:27 UTC (Thu)
by niner (subscriber, #26151)
[Link]
Like Inline::Python in Perl https://metacpan.org/dist/Inline-Python/view/Python.pod or Inline::Perl5 in Raku: https://github.com/niner/Inline-Perl5
Once you realize that inter-operation between languages (and Python 2 and 3 can be considered different languages) is a communications problem, it quickly follows that distance is only a factor for performance, not for the possibilities. Being in the same process just makes it faster, but everything you do can be done just as well with multiple processes and even over the network. That's why I mentioned performance.
Posted Aug 3, 2021 9:29 UTC (Tue)
by rsidd (subscriber, #2582)
[Link] (17 responses)
The issue is not that it is an interpreted language, it is the incompatible change in language. This happened previously with C++ and other compiled languages too.
Posted Aug 3, 2021 10:27 UTC (Tue)
by ibukanov (subscriber, #3942)
[Link] (4 responses)
Posted Aug 3, 2021 17:14 UTC (Tue)
by khim (subscriber, #9252)
[Link] (1 responses)
Sadly that was a different world where C++ developers still cared about users. Today they have adopted the schizophrenic stance that it's OK to break perfectly working programs without warnings but it's a big no-no to make them non-compilable. Thus you get almost the same issue as with Python 3: after each update you hit some random issues which you are supposed to fix without any guidance from the compiler, because, essentially, “according to our lawyers this code was always incorrect and worked for 30 years on billions of computers by pure accident”.
Posted Aug 12, 2021 17:21 UTC (Thu)
by codewiz (subscriber, #63050)
[Link]
Features that have been deprecated and later removed from C++, such as trigraphs and auto_ptr, were never very popular in the first place.
Posted Aug 4, 2021 7:36 UTC (Wed)
by gdt (subscriber, #6284)
[Link] (1 responses)
Posted Aug 4, 2021 12:20 UTC (Wed)
by pizza (subscriber, #46)
[Link]
Posted Aug 3, 2021 15:09 UTC (Tue)
by HelloWorld (guest, #56129)
[Link] (11 responses)
Compilation vs. interpretation are implementation techniques, and not properties of the languages themselves. Virtually all language implementations use compilation these days, including CPython, which will internally compile your source code into bytecode. It can also write that bytecode to disk (e. g. using the py_compile module).
Presumably what you meant by compiled languages is statically typed languages, and you suggest that statically typed languages trade programmer productivity for performance while dynamically typed languages do the reverse. This is a myth that is based on outdated and underpowered type systems such as those found in languages like C. Modern, statically typed languages are superior in basically every way, and for several different reasons. Static types are a form of machine-checked documentation. This helps human programmers, who now know precisely what they can pass to a function and what they can do with the function's return value. But it also helps when creating tools, such as compilers or editors. These can now point out problems with your code and even make suggestions to fix them. And this is why people bring up static typing when confronted with incompatible changes: they're just much, much easier to deal with when the compiler tells you exactly which pieces of the code need to be fixed. Scala 3 was released not long ago, and we can already see that the migration is going much more smoothly than the Python 3 migration did.
There may be good reasons to use Python (such as the large library ecosystem), but dynamic typing isn't one of them. Even the Python community is starting to recognize this with the support for type hints and tools like mypy. And this isn't limited to Python, every dynamic language is doing it: Ruby is growing support for type signatures: https://github.com/ruby/rbs. TypeScript as a type-safe JavaScript variant is gaining more and more traction. It's pretty unambiguous at this point. “dynamic typing” is a failed experiment.
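(For illustration, a minimal sketch with made-up function names of the kind of bug a checker such as mypy reports before the code ever runs, while plain CPython only fails once the bad call is actually executed:)

from typing import List

def total_length(words: List[str]) -> int:
    return sum(len(word) for word in words)

total_length(["ab", "cde"])   # fine
total_length([1, 2, 3])       # mypy flags the List[int] argument statically;
                              # plain CPython only raises TypeError at run time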
Posted Aug 3, 2021 15:53 UTC (Tue)
by rsidd (subscriber, #2582)
[Link] (10 responses)
No, by compiled languages I meant languages that compile to machine code. I use Julia and know how good it is at type inference, how readily it catches certain bugs at compile time, and how fast it is. It is also as easy to write as python. But that doesn't mean one should use Julia everywhere.
Posted Aug 3, 2021 16:15 UTC (Tue)
by HelloWorld (guest, #56129)
[Link] (9 responses)
> I use Julia and know how good it is at type inference, how readily it catches certain bugs at compile time, and how fast it is. It is also as easy to write as python. But that doesn't mean one should use Julia everywhere.
Posted Aug 3, 2021 22:11 UTC (Tue)
by rschroev (subscriber, #4164)
[Link] (8 responses)
In theory, yes. In practice, not so much. Python, for example, makes it very hard to compile to machine code, because of its extremely dynamic nature. Have you ever looked at the byte code that CPython compiles to? It's still very dynamic. Its BINARY_ADD instruction, to name a simple example, necessarily does dynamic dispatch, resulting in arbitrarily complex operations. Yes, it's compiled, but it's worlds apart from "lea eax, [rdi + rsi]".
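(A quick way to see this for yourself, using nothing but the standard library; on CPython 3.10 and earlier the opcode is literally called BINARY_ADD, 3.11 renamed it to BINARY_OP, but the dispatch stays dynamic either way:)

import dis

# Disassemble a trivial addition; the interpreter cannot know whether this
# will add integers, concatenate strings, or call some __add__ method until
# the operands are examined at run time.
dis.dis("a + b")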
But Python is not CPython ... in theory. When people speak about Python, they almost always refer to CPython. www.python.org is all about CPython. There are other implementations, but they are hopelessly outdated, or are not complete, or have poor support for external libraries. PyPy is the exception: only slightly behind CPython, mostly compatible. None of those compile Python to machine code (some of the implementations use a JIT compiler, which is not the same thing). In practice, Python is almost always CPython and other implementations don't (yet?) bring all that much difference to the table conceptually.
Consider the difference with languages like C or C++. There exist interpreters for both, but those are pretty exotic. The normal way of using C or C++ is with a compiler that compiles to machine code.
Language design has a profound effect on the feasibility of compiling to machine code vs byte code.
Posted Aug 3, 2021 22:42 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
Well, there's also Cython, which can transpile nearly any Python into C. But it has its own drawbacks, since what it really "wants to" do is let you write code in a mixture of Python and Python-with-C-type-annotations, and selectively speed up the latter on a line-by-line basis. If you try to read the C that it produces, you will find that it's almost completely impenetrable (each line of pure Python translates into 4-5 lines of calls to the C/Python API, error handling, dispatch, etc., and the whole thing also has a huge wall of #defines and other preprocessor garbage). So in practice, it's more like compiling to machine code, but in two steps.
Posted Aug 4, 2021 11:20 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (6 responses)
And besides, you've missed the point of the thread. rsidd implied in his original post that Python had some advantage over other languages due to the fact that the most commonly used implementation is an interpreter, and I have yet to see any evidence for that claim. Even more confusingly, he even stated that Julia is faster, just as easy to write and catches more bugs at compile time, so at that point one has to ask what advantages Python as a language could possibly bring to the table, and how they relate to the fact that CPython doesn't compile to machine code (and obviously my presumption is that they don't, but I'd like to hear rsidd's arguments).
Posted Aug 4, 2021 12:18 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (4 responses)
Python is a good language, with a large ecosystem, and wide expertise. Those things matter. Julia's a very good language and its ecosystem is getting there. Python currently has one other advantage over Julia: the latter has very large start-up time (reduced recently but still significant) because of JIT compilation. Great throughput, terrible latency.
If Julia can get over the JIT delays, and especially if it can produce machine-executable binaries, I don't see why it can't be used for large projects (generally, not just in science) instead of python. Especially given its ability to import python packages (increasingly not needed, IMO).
But, even in that case, it makes no sense to rewrite an existing large python codebase, from a time when Julia didn't exist, in Julia.
Many physicists still use Fortran 77. The code exists and it works.
Posted Aug 4, 2021 12:32 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (3 responses)
Posted Aug 4, 2021 16:12 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (2 responses)
C/C++ (compiled)
As I said, things are different now, Julia may be a great choice for a new project, so (more likely, for general purpose projects) may Go or Rust or Scala.
"Good language" is not "entirely subjective" any more than taste in music is "entirely subjective". Experts agree that Bach and Beethoven were great.
Paul Graham wrote this about Python in 2004. It is largely true 17 years later, even though the entire landscape has changed. Python may not necessarily be the best choice today for new projects, but it's still a very good choice if they aren't CPU-bound.
Posted Aug 5, 2021 9:22 UTC (Thu)
by niner (subscriber, #26151)
[Link] (1 responses)
Posted Aug 5, 2021 9:36 UTC (Thu)
by rsidd (subscriber, #2582)
[Link]
Posted Aug 17, 2021 17:57 UTC (Tue)
by JanC_ (guest, #34940)
[Link]
Posted Aug 3, 2021 11:57 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (2 responses)
Microsoft’s PowerShell shows how easy it is to have *optional* type specification in an interpreted language with strong type inference. You literally just stick a type on a variable and it’s enforced with type errors. Or don’t do it and quickly write your easy script.
Because, yes: unrestrictable dynamic typing is bad for large programs. I actually don’t think it’s much debated any more - it’s a positive for smaller stuff and scripts and a negative for larger stuff. It’s not a fatal negative, but it’s absolutely a negative.
Posted Aug 9, 2021 11:54 UTC (Mon)
by swilmet (subscriber, #98424)
[Link] (1 responses)
Good to hear that it's evolving. A compiler (or a static-analysis tool) makes it possible to discover lots of problems in the code without having to run the specific code path that would be needed to trigger the error with an interpreter. In case of a migration to a new major version of the language or a library, it's up to the language/library's developers to trigger compilation errors for not-yet-migrated code. And explicitly writing types alongside variable declarations is primarily useful for us humans, to get a better understanding of the code (what does this variable represent? What functions can I call on it?). It seems that all modern languages present type inference as a good thing (e.g., Rust), where it's possible to omit type information in many places… Anyway, I wrote the following essay some time ago, with solutions for code migration (see section 3.3: Build system support to have a smooth repercussion of an API break):
Posted Aug 10, 2021 10:19 UTC (Tue)
by HelloWorld (guest, #56129)
[Link]
Posted Aug 3, 2021 15:25 UTC (Tue)
by juliank (guest, #45896)
[Link] (6 responses)
Posted Aug 3, 2021 17:00 UTC (Tue)
by LtWorf (subscriber, #124958)
[Link] (5 responses)
Types are mostly not used at runtime, so at runtime they make stuff slower.
Of course people (including me) started to use types at runtime too, so the plan to make them into strings didn't really work out and is to be discussed more before changes are made. (https://lwn.net/Articles/858576/)
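(A small sketch of what the "make them into strings" plan, PEP 563, looks like when opted into explicitly: annotations are stored as plain strings and only resolved to real types on demand.)

from __future__ import annotations
import typing

def greet(name: str) -> str:
    return "hello " + name

print(greet.__annotations__)          # {'name': 'str', 'return': 'str'} -- just strings
print(typing.get_type_hints(greet))   # resolved to the actual classes, on request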
Posted Aug 3, 2021 18:08 UTC (Tue)
by juliank (guest, #45896)
[Link] (1 responses)
Import time is only relevant for short-lived scripts, not the large applications that swilmet said Python is unsuited for.
For big applications, and especially web applications not launched by CGI (but WSGI or FastCGI or just doing HTTP themselves) or other daemons, imports are a non-issue.
Posted Aug 4, 2021 18:51 UTC (Wed)
by LtWorf (subscriber, #124958)
[Link]
Posted Aug 5, 2021 11:05 UTC (Thu)
by sammythesnake (guest, #17693)
[Link] (2 responses)
The easy way around this particular problem, however, is to do something like this:
from typing import TYPE_CHECKING
This way, lots of imports can be completely skipped at runtime.
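(Spelled out a little more, with a hypothetical module and class name, the idiom looks roughly like this; the import only happens while mypy or another checker analyses the file, never at run time:)

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from some_heavy_module import HeavyThing   # hypothetical, checker-only import

def process(thing: "HeavyThing") -> None:      # quoted, so no runtime import is needed
    ...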
Personally, I wish there were more runtime type checking support, such as being able to do things like:
if not isinstance(foo, Mapping[int, str]):
(At runtime, isinstance(foo, Mapping) has to suffice)
Better still, not have to add that boilerplate in the first place (perhaps with a @validate_parameter_types decorator to activate it where appropriate). I could write that decorator, but I'd be reimplementing a bunch of esoteric stuff that isn't really the meat of my current project...
Given the choice, I'd prefer to have comprehensive type safety before runtime (running mypy or the like as a faux "compilation" step would be sufficient if the checks were solid enough). Catching (and fixing) unexpected type problems at runtime in embedded code is troublesome!
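(For what it's worth, a rough sketch of such a decorator, its name borrowed from the comment above and deliberately simplified: parameterized generics like Mapping[int, str] are only checked against their origin type, i.e. the "isinstance(foo, Mapping) has to suffice" compromise.)

import functools
import inspect
import typing

def validate_parameter_types(func):
    hints = typing.get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if expected is None:
                continue
            # Mapping[int, str] -> Mapping; plain classes pass through unchanged
            origin = typing.get_origin(expected) or expected
            if isinstance(origin, type) and not isinstance(value, origin):
                raise TypeError(f"{name}={value!r} is not a {expected}")
        return func(*args, **kwargs)
    return wrapper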
Posted Aug 5, 2021 11:22 UTC (Thu)
by juliank (guest, #45896)
[Link]
Posted Aug 5, 2021 15:55 UTC (Thu)
by LtWorf (subscriber, #124958)
[Link]
You can do something like
typedload.load([1,2,3,4,5], List[int])
and it will return a List[int] or raise an exception.
It has a few options about casting and so on.
load(["1",1], Tuple[int, int])
will return (1,1), but if casting is not wanted basiccast=False can be passed.
It doesn't support generic mapping, it must be a dict or frozendict as output.
load({1: 'a', 2: 'b', 3: 3}, Dict[int, Union[str, int]])
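(Put together with the imports they need, the snippets above amount to roughly the following, assuming the typedload package is installed:)

from typing import Dict, List, Tuple, Union
import typedload

typedload.load([1, 2, 3, 4, 5], List[int])            # -> [1, 2, 3, 4, 5]
typedload.load(["1", 1], Tuple[int, int])              # -> (1, 1), via basic casting
typedload.load({1: 'a', 2: 'b', 3: 3}, Dict[int, Union[str, int]])

try:
    typedload.load(["1", 1], Tuple[int, int], basiccast=False)
except Exception as exc:
    print("rejected without basic casting:", exc)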
Posted Aug 3, 2021 10:54 UTC (Tue)
by scientes (guest, #83068)
[Link] (40 responses)
Posted Aug 3, 2021 11:24 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
And what if the user does not want a UTF-8 string?
Cheers,
Posted Aug 3, 2021 11:42 UTC (Tue)
by willy (subscriber, #9762)
[Link] (36 responses)
I do not presume to tell the Python people how they should represent strings in memory.
Posted Aug 3, 2021 12:44 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link] (34 responses)
Posted Aug 3, 2021 13:31 UTC (Tue)
by dskoll (subscriber, #1630)
[Link] (22 responses)
Having each character the same size makes a lot of things much simpler, such as finding the length of a string (in characters, not bytes), indexing to a specific string position, copying substrings around, etc. UTF-8 is a great interchange format, but not the best internal representation format.
Posted Aug 3, 2021 13:45 UTC (Tue)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Posted Aug 3, 2021 15:23 UTC (Tue)
by khim (subscriber, #9252)
[Link] (2 responses)
They are using UTF-16 because of backward compatibility requirements. When Unicode 1.0 was developed and Windows API was developed and Java was developed UCS-2 (not UTF-16!) made sense. It stopped making sense in 1996, when Unicode 2.0 was released, but it took industry decades to finally switch to UTF-8. But python's strings were broken in 2008, by that time choosing anything else but UTF-8 if you weren't forced by compatibility reasons was just stupid.
Posted Aug 3, 2021 15:49 UTC (Tue)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Citation needed. You can compile it for UTF-8, so clearly they are not API-constrained.
Posted Aug 3, 2021 17:00 UTC (Tue)
by khim (subscriber, #9252)
[Link]
Sigh. Can you open Wikipedia and read? It's almost in the beginning: Java internationalization classes were then ported to C++ and C as part of a library known as ICU4C ("ICU for C"). Java was developed before Unicode 2.0 and its first version was even released before Unicode 2.0. Unicode 1.0 created a huge mess with CJK specifically to try to fit into 2 bytes. And if you have everything as two bytes then yes, it simplifies things and the memory waste is not that big (you couldn't fit into one byte anyway and even with UTF-8 lots of characters require 2 bytes or more). When the Java classes were ported to C++ they, naturally, retained the UCS-2 representation — even if it was obvious by then that that's not the best choice… they were, essentially, stuck because they wanted to do what Java did. And what Windows NT did. And what browsers did, too. So yes, they absolutely were API-constrained. True, they could have introduced translation layers, but the more they dealt with UTF-16 the harder it was to switch. In some alternate world, where Windows would have won on servers, not just on the client, this would have been the end of the story, but in our world Windows lost to BSD and Linux on servers. And these were UTF-8 based, thus eventually ICU4C got a UTF-8 mode, too. But objectively speaking, if you are not API-bound, UTF-8 is the best option today: UCS-4 overhead is just too big (CPUs are fast today, memory is slow, don't forget) and UTF-16, essentially, combines issues from UTF-8 and issues from UCS-4 (too much overhead for ASCII-mostly texts like XML and you still have to deal with surrogate pairs and all that mess).
Posted Aug 3, 2021 14:03 UTC (Tue)
by excors (subscriber, #95769)
[Link] (10 responses)
That depends on what you mean by "character". If you're dealing with human-readable text, and you want to extract substrings that look like sensible fragments of the original string, you probably need to split on grapheme cluster boundaries. If you just split by code point, "🤦🏼♂️ " might turn into the two substrings " 🤦🏼" and " ♂️ " which (depending on font) looks very different and will confuse the user. (Example from https://hsivonen.fi/string-length/)
Grapheme clusters can contain an arbitrary number of code points, so you can't have a fixed-sized representation of them. And once you're forced to use a variable-sized representation, there's no extra harm in using a variable number of UTF-8 code units rather than a variable number of UTF-32 code units.
Are there many situations where splitting by code point is an actually useful thing to do, apart from situations where you really ought to split by grapheme cluster and splitting by code point is wrong and buggy but it works well enough for typical Western text that you don't care enough to implement it properly? (The latter seems very common, but is not something that should be encouraged, for the same reason we now encourage people to write Unicode-aware code instead of pretending everything is ASCII and hoping it won't mangle UTF-8 inputs too badly. I find it hard to think of many valid use cases for splitting by code point, so that doesn't feel like something that programming languages should be optimised for.)
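(To make the difference concrete in Python terms: the emoji is the one from hsivonen's article, and the grapheme segmentation relies on the third-party regex module and whatever Unicode data it bundles.)

import regex   # third-party; its \X pattern matches extended grapheme clusters

s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # man-facepalming with skin tone
print(len(s))                          # 5 code points
print(len(s.encode("utf-8")))          # 17 bytes
print(len(regex.findall(r"\X", s)))    # 1 grapheme cluster (with current Unicode data)
print(s[:2])                           # a code-point slice splits the cluster apart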
Posted Aug 3, 2021 15:17 UTC (Tue)
by HelloWorld (guest, #56129)
[Link] (8 responses)
Posted Aug 3, 2021 18:34 UTC (Tue)
by excors (subscriber, #95769)
[Link] (7 responses)
Hmm, I wondered how Raku compares to Swift. Swift seems relatively simple - it uses UTF-8 internally (since Swift 5: https://swift.org/blog/utf8-string/) while the API is based around grapheme clusters (https://docs.swift.org/swift-book/LanguageGuide/StringsAn...), and you can't use integer indexes into strings (though you can construct an Index type with an "offsetBy" integer, with O(n) cost), and String.count is O(n). Strings are not stored or processed in normalised form, though comparison operators are based on canonical representation.
(https://hsivonen.fi/string-length/ explains some drawbacks with that: "🤦🏼♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database. That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
I can't find any detailed documentation of Raku's approach, but it sounds like a similar grapheme-centric API to Swift, except that it allows random access to graphemes and the operations are O(1). The representation is basically UTF-32, but first it converts all strings to NFC, and then for every grapheme cluster that's >1 code point it dynamically allocates a new integer (outside of Unicode's 21-bit code point range) and adds that to a global dictionary, so the string is stored as a sequence of 32-bit integers that are either single-grapheme code points or indexes into the grapheme dictionary.
That sounds interesting, but kind of problematic (if I'm understanding it correctly). A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings. And if you try to write a carefully RAM-limited DOS-resistant Raku network server that only accepts small strings from each client, an attacker could just send lots of small strings with unique grapheme combinations, until eventually your server is bloated with many gigabytes of grapheme dictionary (because MoarVM doesn't attempt to garbage-collect old unused graphemes). Then there's other problems like being unable to roundtrip Unicode text (it will all get converted to NFC), and worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses), and worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8). And it's not supported by the JVM backend for Raku.
hsivonen argues that random access to code points is almost never actually useful, and I think the same applies to grapheme clusters. You only need iterators. Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.
Posted Aug 4, 2021 8:10 UTC (Wed)
by niner (subscriber, #26151)
[Link] (3 responses)
Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.
Posted Aug 4, 2021 11:36 UTC (Wed)
by excors (subscriber, #95769)
[Link] (2 responses)
Are there ideas on how that would be addressed? 2^32 graphemes seems like an unfortunately low limit that might be hit under plausible use cases, now that we've got enough RAM for programs to easily slurp a multi-gigabyte file into memory, so it seems bad if the language can't handle those strings correctly. But I don't see how it can handle those strings without changing the basic NFG idea of representing each grapheme with a unique 32-bit integer.
> Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.
You shouldn't need to re-iterate forwards through the whole string to jump backwards, you can just iterate backwards over code points from the current position until you find the next grapheme cluster boundary. (Unicode says the grapheme cluster rules can be implemented as a DFA (https://unicode.org/reports/tr29/#State_Machines), and it's always possible to reverse a DFA. That's defined over code points but it's easy to layer it over a backwards UTF-8 byte iterator.)
But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later? If you're doing grapheme-based iteration over the string, you don't need to push grapheme indexes onto the stack, you can just push an opaque iterator corresponding to the current position (which may be implemented internally as a byte index so it's trivial to dereference the iterator, and the implementation can guarantee it's always the index of the first byte of a grapheme). So there's still no need for random access or even for reverse iterators.
Posted Aug 4, 2021 12:32 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
So you've just confirmed the previous poster's point! To do this, you need random byte-count access into the string - which is exactly what is allegedly impossible (or not necessary, I think we're all losing track of the arguments :-)
Cheers,
Posted Aug 4, 2021 13:48 UTC (Wed)
by excors (subscriber, #95769)
[Link]
An array of code points (as used by Python) supports random access to code points but not to grapheme clusters. Raku's NFG supports random access to grapheme clusters but not to code points. An array of UTF-8 bytes (as used by Rust and Swift) or an array of UTF-16 code units (as used by Java) support neither.
In all of them, given an iterator (i.e. a pointer to an existing element) you can move forward/backward by one element in O(1) time, and you can save the iterator in some other data structure (like a stack) and go back to it later.
I think the interesting question around Raku is whether random access to grapheme clusters is actually useful in practice, or whether iterators are almost always sufficient (in which case you can implement strings as UTF-8 and avoid the complexity/performance sacrifices that Raku makes).
Posted Aug 6, 2021 0:09 UTC (Fri)
by raiph (guest, #89283)
[Link] (2 responses)
I'll start by noting where Raku has a problem you didn't mention, and one I think that's shared by Raku, Swift, and Elixir:
> "🤦🏼♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database.
Raku doesn't rely on a user's system's ICU. Instead it bundles (automatically extracted and massaged) relevant data with the compiler.
Thus "🤦🏼♂️".chars, and more generally operations on Unicode strings, give the same results when using the same compiler version.
So Rakudo has the opposite problem: when and how to update the Unicode data bundled with the compiler.
For example, Unicode 9, released in 2016, introduced prepend combiners, with consequences for grapheme clustering. Afaik that wasn't introduced into a Rakudo compiler until the end of 2016.
I note the result for "🤦🏼♂️".chars is 3 for Rakudo versions until the end of 2016, and then 1 thereafter. I don't know why, but I presume it was Unicode 9.
Unicode evolution is a blessing (dealing with real problems that need to be solved) and a curse.
I think it'll be a decade or so before grapheme handling becomes a truly mature aspect of Unicode, Swift, Raku, Elixir, and whichever other new PLs are brave enough to join them in making their standard string type be a sequence of elements, each of which is "what a user thinks of as a character" (quoting the Unicode standard).
> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
Right.
Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.
> I can't find any detailed documentation of Raku's approach
The best *overview* I know of of the MoarVM implementation is https://github.com/MoarVM/MoarVM/blob/master/docs/strings...
Beyond that I think one has to read the comments in the relevant MoarVM source code.
> The representation is basically UTF-32
No. NFG is a new internal storage form that uses strands (cf ropes).
None of the strands are UTF-32, though some will be sequences of 32 bit integers.
> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.
Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/...
Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?
As for the DoS vulnerability, let me pick two exchanges that I think distil the Raku position.
The first focuses on the technical. See core devs discussing it one day in 2017: https://logs.liz.nl/moarvm/2017-01-12.html (Note that this log service is in its early days following the freenode --> libera transition, and the search functionality is buggy.) A key comment by jnthn, MoarVM's lead dev:
> All the negative numbers = space for over 2 billion synthetics. ... there's easily 100 bytes of data stored per synthetic, perhaps more. If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before the run out of synthetic graphemes. :)
The second focuses on the need to eventually deal with it -- folk will want to use Rakudo with more than 200GB RAM used for strings. From an exchange in 2018 (TimToady is Larry Wall):
timotimo: we need a --fast flag for rakudo that just throws out all checks for invariants :P
Short term, Rakudo users need to accept that you don't run Rakudo in a production setting and let it go over 200GB RAM used for strings. Longer term, something better has to be done.
> unable to roundtrip Unicode text (it will all get converted to NFC)
Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8
> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)
Aiui NFG iterates the encoded data *once* when a string is first encountered, and thereafter it is a rope of fixed width strands, and that's mostly it. What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"? Was that a guess?
> worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8).
As per several earlier notes, that's not the case.
> it's not supported by the JVM backend for Raku.
True, but the JVM backend is experimental status, with several problems, not just NFG. It's about lack of tuits of those working on the JVM backend, not the merits of NFG.
> hsivonen argues that random access to code points is almost never actually useful
Fwiw I fully agree. One could say it's almost always *dangerous*. Why bother with random access to code points when their correct interpretation as text absolutely requires iteration?
> I think the same applies to grapheme clusters.
Why? To be clear, a grapheme cluster is (the Unicode standard's abstraction of) what the Unicode standard defines as "what a user thinks of as a character".
A character, something an ordinary human understands, is not remotely comparable to a code point, which is a computing implementation detail.
And there are string processing applications where you really need O(1) character handling.
> You only need iterators.
I don't think that's true. I'll return to that in another comment.
> Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.
Larry Wall committed to NFG, fully aware Rakoons would have to deal with that vulnerability. It requires something like 15GBs worth of carefully constructed attack strings to overflow the grapheme space, which, in 2017, would allocate more than 200GB of RAM for string related storage. It's a theoretical risk that must be addressed longer term; in the near term one must constrain a Rakudo instance to reject allocating string space beyond 200GB or so.
The UTF-32 memory usage expense you thought existed does not. The strand implementation can be expanded as necessary to include forms other than the 8 bit width strand that was designed for efficiently handling mostly ASCII strings.
Does the iteration performance expense you describe exist? I'd appreciate it if you could link to a line or lines in the source code I linked, indicating what you mean. TIA.
In the meantime, aiui there is great value to NFG's O(1) character handling. In short, it's about meeting the performance needs of a string processing PL that, like Swift and Elixir, takes Unicode seriously, but, in Raku's case, also takes seriously a need for directly PL integrated performant Unicode era string pattern matching / regexing / parsing. That's not Swift's focus, which is fair enough, but is fundamental to Raku and its features.
Posted Aug 6, 2021 16:43 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link]
Posted Aug 6, 2021 18:15 UTC (Fri)
by excors (subscriber, #95769)
[Link]
I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku's built-in definition of 'character'), and a user registers with an 8-character password, but then you upgrade Raku and now that user's password is only 7 characters and the form won't let them log in.
To avoid bugs in cases like that, you need to realise beforehand that Raku's definition of 'character' is not stable and you need to implement some alternative form of length counting for any persistent data. I don't mean that's a huge problem, but it's unfortunate that the language's default seemingly-simple string API is creating those traps for unwary programmers.
>> The representation is basically UTF-32
If there are no multi-codepoint graphemes and a single strand and at least one non-ASCII character, then it's sequences of 32-bit integers that are the numeric values of Unicode code points, i.e. in that basic case it's UTF-32 :-) . And that has the benefits of UTF-32 (you can find the Nth character in constant time) and the drawbacks (4 bytes of memory per code point; worse than UTF-8 even for CJK), which are not really affected by the more advanced features that NFG adds.
>> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.
By "dictionary" I don't mean a specific data structure like a Python dict / hash table, I just mean any kind of key-value mapping. In MoarVM it looks like there's actually two: MVM_nfg_codes_to_grapheme maps from a sequence of code points to a synthetic (i.e. a negative 32-bit number), or allocates a new synthetic if it hasn't seen this sequence before, and is implemented with a trie; and MVM_nfg_get_synthetic_info maps from a synthetic to a struct containing the sequence of code points (and some other data), implemented with an array. Both of those are limited to 2^31 entries before they'll run out of unique synthetics.
>> If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before the run out of synthetic graphemes. :)
You can get a VM with 256GB RAM for maybe $1.50/hour - that's not a lot of memory by modern standards. It might have been a reasonable limit in a programming language designed a couple of decades ago, but it seems quite shortsighted to design that now, particularly in a language that's meant to be good at text processing.
I think it's not obvious that the implementation could be optimised later, without significant tradeoffs in performance or complexity. Specifically the performance guarantees of the string API require the string to be stored as an array of fixed-size graphemes (to allow the O(1) indexing), so the only obvious way to increase the grapheme limit is to increase the element size, which would greatly increase memory usage and reduce performance in the vast majority of programs that don't use billions of graphemes, and/or to dynamically switch between multiple string representations. (There may be non-obvious solutions of course; I'm certainly not an expert and haven't investigated this very deeply, and I'd be interested if there were existing discussions about this.)
(I don't particularly care about technical limitations of a non-production-ready compiler/runtime which can be fixed later; but I am interested when those limitations are an inevitable consequence of the language definition. Raku expects the O(1) grapheme indexing and that constrains all future implementations of the language, and it's interesting to compare that to other languages' string models.)
(Incidentally I'm ignoring strands here, because it looks like MoarVM has a fixed limit of 64 strands per string. Finding the Nth character might require iterating through 64 strands before doing an array lookup, but technically that's still O(1) even if the constant factors will be bad.)
> Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8
It can, as long as you want to do almost nothing with the string apart from decode and encode it (in which case why bother using a Unicode string at all? You could just keep it as bytes). Even if it's perfectly valid UTF-8, but it's NFD instead of NFC, you'll get garbage if you try to print it or encode it like a normal Unicode string:
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8');
The single UTF-8 grapheme gets decoded into 3, so any grapheme-based text processing algorithms will misbehave. The only way to get correct processing is to not use UTF8-C8, and let Raku lossily normalize your string.
That contrasts with Swift where strings aren't stored in normalized form but most string operations behave as if they were. E.g.:
let a = String(decoding: [0x6f, 0xcc, 0x88], as: UTF8.self)
so it operates over graphemes but it preserves the original bytes when you re-encode as UTF-8.
>> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)
I was slightly wrong about the "walk" because I mixed up the grapheme array and the trie - reading the string only needs the array, which is much cheaper than the trie.
But still, if you're doing an operation where you need to access each code point (e.g. to encode the string) you will iterate over the string's 32-bit elements, and if an element is a synthetic then you have to read nfg->synthetics[n].codes (which will likely be a cache miss, at least once per unique grapheme in the string) then read nfg->synthetics[n].codes[0..m] (another cache miss). That sounds slower than if the string was just stored as UTF-8/UTF-16/UTF-32, where all the relevant data is stored sequentially and will be automatically prefetched.
Admittedly that's only particularly relevant when iterating over code points, which doesn't seem too important outside of encoding. Encoding does seem quite important though. I don't have any benchmarks or anything, just a vague concern about non-linear data structures in general. It's a deliberate tradeoff to get better performance in some grapheme operations, but I worry it's a high cost for questionable benefit.
> And there are string processing applications where you really need O(1) character handling.
Do you have specific examples where it is really needed? I suspect that in a large majority of cases, all you really need is forward/backward iteration and the ability to store iterators in memory. E.g. for backtracking in pattern matching / regexing / parsing, you just store a stack of iterators to jump back to, instead of storing a stack of numeric indexes. That can be done with a variable-sized-grapheme string representation (like UTF-8 or UTF-32), avoiding the compromises that are required by a fixed-size-grapheme representation.
Posted Aug 5, 2021 7:01 UTC (Thu)
by ssmith32 (subscriber, #72404)
[Link]
Of course, the good news is that the "typical Western text" more & more includes useful test cases like 🤦🏽♀️
I find putting emojis in my code & git commits not only serves as a nice outlet, but also as a great test!
Mixing it up with some right to left scripts for an added bonus & language practice!
Of course, until some annoying person decided method names must be ascii. And, thus, my unit tests are no longer ممتاز
Posted Aug 3, 2021 15:18 UTC (Tue)
by khim (subscriber, #9252)
[Link] (5 responses)
A lot? It makes it possible to do something useless and pointless quickly, yes. Wow. What an achievement! What do you plan to do with that info? Even if you used a monospaced font you would still have to deal with the simple fact that a “14-character” string need not occupy 14 columns on screen. And if your editor supports diacritics then “length of a string” becomes even more pointless. I literally couldn't recall any algorithm which works with characters and which is useful for any purpose if you have to support the whole range of human scripts. And if you don't need to support all types of scripts then it's not clear why you would need Unicode at all: use iso-8859-x or koi8-r (depending on which languages you need to support); why would you need anything else? No, you can't. Zero-width joiners and non-joiners, bidirectional text and many other surprises would make such an index pointless. Works perfectly in UTF-8, too. Well, maybe in some exotic cases, but in general UTF-8 is absolutely the best format: most algorithms either work fine with UTF-8, or they don't work with UCS-2 or UCS-4 either.
Posted Aug 3, 2021 15:59 UTC (Tue)
by jafd (subscriber, #129642)
[Link]
Posted Aug 4, 2021 0:09 UTC (Wed)
by roc (subscriber, #30627)
[Link] (2 responses)
Posted Aug 4, 2021 4:06 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Aug 4, 2021 7:23 UTC (Wed)
by felix.s (guest, #104710)
[Link]
Posted Aug 5, 2021 22:38 UTC (Thu)
by bartoc (guest, #124262)
[Link]
More seriously you can get useful results for "display width" with just the simple, non-tailored grapheme clusterization algorithm. If you're a word processor or text layout system then you're probably going to want to tailor clusterization results for both language/region and font (some sequences of code-points are considered as "one character" to users in some cultures but not others, even when everyone is using the same font!)
Also: if you're doing a ton of clusterization and decoding (like in a text editor) then utf-16 may be cheaper to go through and decode than utf-8, just in terms of number of instructions. (Obviously it may lose out since the characters are bigger, but still, benchmarking would be required.) If you're smart with how you're decoding things it shouldn't matter either way though, on modern machines.
Posted Aug 5, 2021 11:28 UTC (Thu)
by immibis (subscriber, #105511)
[Link]
Is counting code points really useful? ü can be two code points (not sure whether the one I typed is). Is the string "ü" two characters long? Can I put my cursor between the "u" and the "◌̈" ?
Posted Aug 3, 2021 19:36 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link] (10 responses)
In general I'm convinced that anyone suggesting that having code points directly accessible/be the same size helps with anything that a human cares about, simply hasn't learned enough about doing Unicode correctly yet (the latter applies to me as well). Sesse has summed up pretty well what the relevant topics are:
> What you want is either the size in bytes,…
for copying strings around in memory or storing strings somewhere, e.g. a database or a file format with fields with a certain length limit given in bytes
> …the length in number of grapheme clusters,…
for doing anything related to displaying the string, e.g. knowing where you may split so that the human perceives the resulting parts as being correct or deciding how far to move the cursor when the human presses a cursor key (you want one cursor press to skip over a whole emoji, no matter how many code points that emoji consists of)
> …or the width in pixels (given a font).
again for displaying the string somewhere or calculating how wide controls become.
None of those three topics benefits from using UCS-4 or, even worse, UCS-2 (which wchar_t is on Windows), because it's only the code points that consist of a fixed number of bytes, but not the grapheme clusters, which can still consist of several code points. For the purpose of handling Unicode correctly, all three encodings (UTF-8, UTF-16 aka UCS-2, UTF-32 aka UCS-4) are actually variable-length encodings.
Of course if one doesn't care about doing Unicode correctly one could simply split a string between arbitrary code points. Then you can also chose to simplify your life in other ways, e.g. https://xkcd.com/221/
Posted Aug 3, 2021 19:46 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link]
Posted Aug 3, 2021 23:08 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (8 responses)
PostgreSQL VARCHAR counts Unicode code points ("characters" according to the documentation). T-SQL wants you to use NVARCHAR instead of VARCHAR, and NVARCHAR has the annoying property of counting UTF-16 code units (so BMP code points count as 1, non-BMP code points count as 2). Oracle can be instructed to count code points instead of bytes (but it uses bytes by default). MySQL and SQLite use bytes (to the extent that the latter even has such limits; it collapses all CHAR/VARCHAR types to TEXT, but TEXT is not completely unlimited), so I suppose this claim is not completely false. But it is misleading, by suggesting that *all* databases count bytes. They do not, and in my experience, when the engines differ on points like this, PostgreSQL's behavior is usually the closest to the actual SQL standard (which is not available online as far as I can tell, so I cannot confirm what it says to do for characters vs. bytes).
On a related note, you also have to consider interfacing with other systems. Twitter, for example, counts Unicode code points and not bytes or grapheme clusters when determining the maximum length of a tweet. I imagine there are other APIs which similarly impose code point-based limits and not byte-based limits. It is unrealistic to assume that all APIs will "be reasonable" when so much of the computing industry was operating under the "Unicode is a 16-bit encoding" falsehood for such a long period of time (and many people still wrongly believe this!).
Posted Aug 4, 2021 1:14 UTC (Wed)
by excors (subscriber, #95769)
[Link] (1 responses)
That's not correct now, per https://developer.twitter.com/en/docs/counting-characters - there's a limit of 280 "characters" where Latin-1 code points and most punctuation code points count as 1, all other code points (most notably CJK characters) count as 2, certain emoji count as 2 (even a family like "👨👩👧👦" which is 7 code points, plus you could add skin tone and gender modifiers and I think it would still count as 2), URLs count as 23, etc. Also the string is converted to NFC before counting. And they've had at least three different versions of the counting algorithm.
(That illustrates the general uselessness of counting code points; you need something far more sophisticated if you want it to feel reasonably fair to users.)
Posted Aug 4, 2021 1:50 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
OTOH, the emoji counting makes sense, because they preprocess the emoji and turn them into static images (emoji support varies significantly by target system). So you might as well give users a sensible limit instead of counting code points. Similarly, URLs count as 23 because they are automatically link-shortened (whether you want it to or not...).
Posted Aug 4, 2021 7:10 UTC (Wed)
by mbunkus (subscriber, #87248)
[Link]
That is not at all what I intended to say. I'd hoped writing "_a_ database… with …length limit given in bytes" instead of "…databases which all give length limits in bytes" or simply just "…databases…" made that clear enough, but it seems it didn't.
Posted Aug 4, 2021 11:15 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (2 responses)
MySQL fairly consistently uses code points for Unicode collations (including the default utf8mb4_0900_ai_ci). There is next to no support for grapheme clusters. If you cast to BINARY, you can use bytes, e.g. LENGTH(CAST(x AS BINARY)) should give you the number of bytes needed to store x, but LENGTH(x) will give you the number of code points.
(I work on the MySQL optimizer team, which among others, is responsible for collation functionality.)
Posted Aug 4, 2021 19:03 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Aug 4, 2021 19:10 UTC (Wed)
by Sesse (subscriber, #53779)
[Link]
Internally in RAM, MySQL allocates (non-blob) strings as (max width) * (max number of bytes used for a single code point) bytes. This is why “utf8” historically means UTF-8 with max. 3 bytes per code point (utf8mb3); at the time, supporting full UTF-8 would mean having utf8mb6, which I guess was seen as prohibitively expensive wrt. latin1, and emoji wasn't a thing. Now, Unicode has made it clear that nothing will ever be allocated beyond the astral planes (above U+10FFFF), support for 5- and 6-byte UTF-8 has been entirely removed, and the recommended default is utf8mb4 (the “utf8” alias still expands to utf8mb3, but you get a deprecation warning that it might change in the future).
Posted Aug 4, 2021 20:02 UTC (Wed)
by khim (subscriber, #9252)
[Link] (1 responses)
I agree (and even explicitly wrote that before) that in some kind of alternate world, where Linux died and Windows won, UTF-16 would have been a sensible choice simply because of API compatibility issues. But in today's world enough APIs are UTF-8 (and even Windows finally supports it properly). Thus today the use of UTF-16 should be relegated to legacy apps.
Posted Aug 4, 2021 22:19 UTC (Wed)
by mbunkus (subscriber, #87248)
[Link]
Oh wow, I hadn't been aware of that. That might mean that several libraries ported poorly from Linux might actually start working with non-ASCII paths. "Ported poorly" here means that they still use the POSIX API calls, which in turn means that any path containing characters that aren't part of the ANSI code page cannot be opened or found. Examples of such libraries include more obscure ones (e.g. libdvdread) and some popular ones (e.g. libintl/gettext).
Thanks for pointing that out; will investigate whether this means POSIX APIs, too, not just the various -A variants of Windows' native API.
Posted Aug 3, 2021 18:42 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
That nonsense again... If you're writing a modern text processor then you need to be able to decompose text into grapheme clusters that can consist of multiple wchar_t code points. There is no functional difference between translating utf-8 or utf-32 into grapheme clusters.
Posted Aug 3, 2021 17:40 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Right! And so how do you represent a filename? That's an arbitrary sequence of bytes, but one that is *usually* a valid UTF-8 string, and it is common to want to, say, regex-match on it. (The same for the contents of text files.)
Emacs once did this via a brutally complex encoding of its own, then migrated to UTF-8-with-stuff-on-top later on, because *most* things fit into it. But... not everything does, and you do still have to represent the things that don't without data loss in a general-purpose programming language.
Posted Aug 4, 2021 0:46 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
For example Rust will let you write:
let x = std::fs::File::open("/some/file/name.extension")?;
You can write a LOT of code with this and never realise non-UTF8 filenames can even exist, you never use one. But, if you know perfectly well the file has some crazy non-UTF8 name, you can write it as an OsStr by whatever means and open() that instead since Rust knows how to deal with that. OsStr is a type designed to allow whatever insanity is expected by your operating system, so, arbitrary non-zero bytes for Linux and arbitrary non-zero u16s for Windows, doubtless other possibilities exist.
But then the other side of the coin. Maybe we've got a filename, but it might not be valid UTF-8?
Rust presents such a filename as the Path type and offers you all three plausible ways forward: 1. Ask the Path for a UTF-8 string, if that works you've got a UTF-8 filename, otherwise you get None. 2. Lossily convert the Path into a UTF-8 string. If it wasn't Unicode already then Unicode's replacement character U+FFFD is presented instead of each sequence that could not be decoded so it's safe to display, however you shouldn't open this (possibly different) name as a file. 3. Convert the Path into a raw OsStr, getting the raw OS name for this file and figure out your own path forward for what to do with that.
From the command line this is nice because you can get the program arguments as OSStr if you want, so you can take "invalid" (non-UTF-8) filenames as command line parameters without either pretending they're raw bytes, which they definitely aren't, or being unable to write error messages about them.
Posted Aug 3, 2021 13:33 UTC (Tue)
by martin.langhoff (guest, #61417)
[Link] (36 responses)
This is a worry – Launchpad is in a key spot in the supply chain to quite a bit of internet infra.
...
Semi-off-topic Python3 rant... Python 3's over-eager use of UTF-8 is a massive PITA for applications that don't control their inputs.
I maintain a "simple" code analysis tool (SLOC and other stats), which must be able to process random 3rd party code.
The switch from Py2 to Py3 was relatively straightforward -- but the "long tail" of dealing with input files that are not UTF-8, or that are _mostly_ in one codepage but have a couple of stray bytes is... long indeed. I've applied all the tricks in the book, and I'm -still- dealing with frequent "barfs on this input" reports.
Posted Aug 3, 2021 15:18 UTC (Tue)
by mb (subscriber, #50428)
[Link] (32 responses)
I don't think Python enforces UTF-8 anywhere.
>but the "long tail" of dealing with input files that are not UTF-8, or that are _mostly_ in one codepage but have a couple of stray bytes is... long indeed. I've applied all the tricks in the book, and I'm -still- dealing with frequent "barfs on this input" reports.
Well, so your data is broken. How do you want to deal with that? Raise exceptions? Replace invalid characters? Ignore invalid characters? You can even choose to never decode and work purely with bytes instead.
Python lets you choose any of these options at decode() time. It doesn't enforce one. The language cannot answer these questions for you. You, as the developer, have to write your code to handle these cases.
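(A concrete illustration of those options, on a made-up Latin-1 byte string:)

data = b"caf\xe9"   # Latin-1 bytes for "café": not valid UTF-8

try:
    data.decode("utf-8")                                 # strict, the default
except UnicodeDecodeError as exc:
    print("strict:", exc)

print(data.decode("utf-8", errors="replace"))            # 'caf' + U+FFFD
print(data.decode("utf-8", errors="ignore"))             # 'caf'
print(repr(data.decode("utf-8", errors="surrogateescape")))  # 'caf\udce9'
print(data.decode("latin-1"))                            # or pick another codec: 'café'
# ...or never decode at all and keep working with bytes.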
Posted Aug 3, 2021 15:29 UTC (Tue)
by martin.langhoff (guest, #61417)
[Link] (20 responses)
That's crazy talk. _I don't control the inputs_. There are spaces where you "own" the data and its formatting. There are spaces where you _don't_.
For example, version control systems will try to "parse" text in limited ways -- newlines for diff/patching -- while being agnostic about individual characters. And Unix allows any old crud in a directory entry, _it's a bag of bytes_. The Mercurial folks had a similarly hard time with Py3.
Posted Aug 3, 2021 16:18 UTC (Tue)
by mb (subscriber, #50428)
[Link] (11 responses)
But you do control how you process that data.
>newlines for diff/patching -- while being agnostic about individual characters.
That's just waiting to blow up, if you just scan for values at bytes boundaries, which could be in the middle of any Unicode character.
If your input data might be broken, then of course you can't just decode it with the default parameters, because that will raise exceptions and abort if not caught (and that's a sane default). You need to tell the program/libraries/language what you want to do. But if you really want to do such things, just do _not_ decode it and work with bytes (b'\n', etc...).
Posted Aug 3, 2021 16:50 UTC (Tue)
by comex (subscriber, #71521)
[Link] (2 responses)
Posted Aug 3, 2021 17:26 UTC (Tue)
by mb (subscriber, #50428)
[Link] (1 responses)
But that doesn't help, if the input data encoding is not known or if it's not even known to be correctly encoded.
Posted Aug 3, 2021 23:41 UTC (Tue)
by lsl (subscriber, #86508)
[Link]
> You need to tell the program/libraries/language what you want to do.
Python 3 made that very inconvenient, especially in initial versions where there weren't any byte string versions of common functionality (e.g. getenv).
Even today, many Python libraries and programs just explode because they don't use the byte string versions of standard library functions even when it would be appropriate (common examples are Unix filesystems or environment variables as well as many network protocols). The str variants tend to be more prominently documented and more convenient so that's what many authors use.
Reliability suffers as a consequence.
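(For reference, the byte-oriented variants being referred to look roughly like this on a Unix system; os.environb is not available on Windows, and the sample data is invented:)

import os

print(os.listdir(b"."))               # directory entries as bytes, not str
print(os.environb.get(b"HOME"))       # environment as raw bytes (Unix only)
print(os.fsencode("café"))            # str -> bytes via the filesystem encoding
print(repr(os.fsdecode(b"caf\xe9")))  # undecodable bytes become lone surrogates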
Posted Aug 3, 2021 17:07 UTC (Tue)
by khim (subscriber, #9252)
[Link] (7 responses)
Except Python3 made it, initially, impossible. Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”. Python3 had binary strings which could be used for processing such data, but they were limited (on purpose). That was a huge regression compared to Python 2 (and I'm not even sure they have fixed everything since, because I never switched from Python 2 to Python 3).
Posted Aug 3, 2021 17:23 UTC (Tue)
by mb (subscriber, #50428)
[Link] (6 responses)
Do you have an example? A short example helps the discussion.
>Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.
If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.
>That was huge regression compared to Python 2
What exactly is impossible to do in Python 3, that had worked fine in Python 2?
>not even sure they fixed everything since I have never switched from Python 2 to Python 3
Ok, so this is just FUD?
Posted Aug 3, 2021 17:50 UTC (Tue)
by khim (subscriber, #9252)
[Link] (4 responses)
Just read the announcement. One example is proudly shown right there:
> Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.
How, pray tell, should I format anything if I'm told not to use %, format doesn't work with real data, and % is supposed to be removed?
> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.
How do you propose to do that if we are dealing with XML which includes random binary strings? Yes, I know it's not valid XML, but you know, customers don't care. If the old version of a program works and the new one doesn't, they would just say it's broken and ask for it to be fixed.
Compare Python 2.6 and Python 3.0 on the ability to get the filename and, you know, open the file, e.g. Yes, I know, they fixed that in Python 3.4, by stopping pretending that filenames are strings. And maybe, by now, they have even made it possible to combine them with raw sequences of bytes. But I'm not really interested in that now: I left the Python camp when they broke strings in Python 3. That was a better choice than waiting for them to make it kinda-sorta usable again.
If a press release by the language makers is FUD to you then yeah, that's FUD, I guess. Python 2.6 was much better than Python 3.0 and Python 2.7 was much better than Python 3.3. I later found out that the issues with filenames were fixed in Python 3.4 (kinda-sorta: you still can't work with filenames as well as you can in 2.7, but at least you can support them now… what an achievement), but I suspect even the latest versions of Python 3 are still worse than Python 2.7 in this respect (although, true, I don't use it, so I wouldn't know).
Posted Aug 3, 2021 23:54 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.
Posted Aug 4, 2021 6:28 UTC (Wed)
by khim (subscriber, #9252)
[Link] (2 responses)
Not if we are discussing history. The switch to Python 3 was a barely mitigated disaster: it replaced a working string model with a broken one and added enough incompatibilities that the migration took 10 years. Granted, all three “big P” languages (Perl, PHP and Python) tried to do that, and only Python managed to break developers' expectations yet hold on to the mindshare, but I wonder what would have happened in an alternate history where at least one camp had tried to evolve the language without huge breakages.
Posted Aug 4, 2021 19:08 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Meh. So what? Python 2 lost support well over a year ago. As far as I'm concerned, this is all ancient history at this point. But every time anyone mentions Python, in any capacity, on LWN (or Hacker News, for that matter), the comments *always* turn into a ridiculous flame war over it.
Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?
Posted Aug 4, 2021 19:39 UTC (Wed)
by khim (subscriber, #9252)
[Link]
Yup. It means now it's time to think about what to do with that mess. Look here, for example: distributions are starting to remove Python 2, which means that the simple solution which worked for years (ignore Python 3's existence and use Python 2) no longer works. Enterprise guys are starting to switch. Some of them. But that saga is still far from being over. When does RHEL 8 go out of support? 2029? Well, I guess by 2030 or so we may declare Python 2 dead and buried, then.
Sorry, but if that's all “ancient history at this point”, then why do you even leave comments under an article which proves that it's most definitely not ancient history? Look at its title, for chrissake! You may try to go away and pretend that it doesn't exist, but it's stupid to pretend that all that pain happened years ago somehow while commenting on something that shows the story is still ongoing.
Sure. If that had been what mb had said, then I would have stopped the discussion. But that's not what happened. It's a bit like the pain of switching from Windows Classic to Windows XP or from Mac Classic to MacOS X. You may say "what's done is done" (and that would be true!) but that still doesn't change the fact that it was painful and unnecessary.
The computer industry is now a mature industry. Decades-long deprecation cycles are typical. You can't avoid it.
Posted Aug 3, 2021 18:53 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
For example, it was initially impossible in Python 3.0 to use byte strings in "%" formatting. It was fixed only in Python 3.5: https://www.python.org/dev/peps/pep-0461/
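(For the record, the bytes formatting PEP 461 restored looks like this; the data is invented, and the same expression raised TypeError on Python 3.0 through 3.4:)

chunk = b"\x00\x01 raw binary"
line = b"field=%s len=%d\r\n" % (chunk, len(chunk))   # works again on 3.5+
print(line)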
> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode. That's how it should have been done in the Python 2 script, too.
In many cases what I want is to read the data, parse it a bit and then give it verbatim to the next layer that might make more sense about it. Even if the data can't be properly represented as valid UTF-8.
Filesystems were a great example. Up until Py3.6 the built-in Python filesystem module simply SKIPPED undecodable file names in directory listings. How's that for reliability?
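(A minimal sketch of that pass-it-through-verbatim style, assuming plain files in the current directory; nothing here is anyone's actual code:)

import os

for name in os.listdir(b"."):          # bytes in, bytes out: nothing gets skipped
    path = os.path.join(b".", name)
    if not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        header = f.read(16)            # hand the raw bytes to the next layer verbatim
    print(name, header[:8])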
Posted Aug 4, 2021 5:49 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (7 responses)
There's no space where you can't reject data, because there's simply data you can't make head-or-tails of. And you can always delegate massaging data into a sane format into a separate function.
> And Unix allows any old crud in a directory entry, _it's a bag of bytes_.
That's broken, because it's not. What's stored in 62696e? If we weren't having this discussion, would you even start to think that was a standard Unix directory with a three-letter name? https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Common... says that certain libraries must be installed with runtime names like libcrypt.so.1. At no point does it mention that that's a transliteration into ASCII. (It could be κιβγρψοτ.σξ.1, with ELOT 927; that's an entirely acceptable reading of a bag of bytes.) If it were a bag of bytes, then the standard would be careful not to be implicit about it. In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.
Posted Aug 4, 2021 7:43 UTC (Wed)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Live and learn, I guess.
Posted Aug 4, 2021 7:44 UTC (Wed)
by mpr22 (subscriber, #60784)
[Link]
Posted Aug 4, 2021 11:11 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (4 responses)
People have been using file names in encodings other than UTF-8 (such as ISO-8859-1) for decades, and that can easily result in arbitrary byte sequences. There's nothing in the APIs that prevents you from creating non-UTF-8 file names, and it's easy to create multiple files with names that represent the same sequence of graphemes when interpreted as UTF-8. The simple fact of the matter is: the only sane way to make sense of Unix file names is to treat them as a bag of bytes.
> At no point does it mention that that's a transliteration into ASCII.
Because most real-world encodings (UTF-8, ISO-8859-*) are ASCII compatible, so it doesn't matter. And besides: LSB? Nobody cares.
> If it were a bag of bytes, then the standard would be careful not to be implicit about it.
No, because everybody knows that when file names are specified as strings in such specifications, there's an implicit assumption that the encoding is ASCII. And that works because such specifications generally don't specify file names with non-ASCII characters in them.
Posted Aug 4, 2021 12:27 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern. (I've had systems where control characters were deliberately inserted into file identifiers to protect the files as much as possible from accidental damage.)
While I think UTF-8 can use pretty much every byte value, there are ordering constraints, so it's not a *random* byte pattern.
Cheers,
Wol
Posted Aug 4, 2021 12:40 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (1 responses)
> Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern.
Exactly: a file name is a bag of bytes except 0 and 2F. Would another format be better? Perhaps so, but it's too late to enforce that at this point.
Posted Aug 4, 2021 21:20 UTC (Wed)
by dvdeug (guest, #10998)
[Link]
It's been discussed; see https://dwheeler.com/essays/fixing-unix-linux-filenames.html . The POSIX requirement is for incredibly limited names only, and it's easy to restrict it greatly at the cost of zero to most users.
Posted Aug 4, 2021 20:58 UTC (Wed)
by dvdeug (guest, #10998)
[Link]
It's not a bag. I skipped this the first time around, but a bag is generally a name for a multiset, or at least some other data structure that has no order. A Unix file name is a sequence of bytes.
Again, nobody treats it as a sequence of bytes. Coreutils will make sure not to dump random noise to the screen, but there are no standard C functions to do that that I recall, and I don't know if GNU libc has something there. It's not sane if everyone just treats them as strings, and anyone wanting to handle them as a sequence of bytes has to write special code and tiptoe around everything.
Posted Aug 3, 2021 15:35 UTC (Tue)
by martin.langhoff (guest, #61417)
[Link] (10 responses)
It still blows up in random ways.
Maybe there's a couple tricks I am missing (hey, I can always learn more) but it should not involve mysterious tricks nor be this fiddly.
Posted Aug 3, 2021 16:21 UTC (Tue)
by mb (subscriber, #50428)
[Link] (9 responses)
What does that mean?
Posted Aug 3, 2021 21:32 UTC (Tue)
by martin.langhoff (guest, #61417)
[Link] (8 responses)
Just today, some XML files that seem to be UTF-16 throw a UnicodeDecodeError: utf-16-le can't decode bytes ... illegal encoding.
There will be an explanation for this -- an infrequent codepath, maybe some open() or decode() I haven't hunted down yet, because Python 3 needs significantly more handholding when opening a file than Py2.
Posted Aug 4, 2021 1:35 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (7 responses)
The idea of "surrogateescape" is that hey, if we could smuggle gibberish through the decoder/ encoder pipeline, and the rest of our code never looks at the resulting non-sense, then we can have what we "want" of just pretending it's ASCII. This is a bad idea, and for security reasons it blows up in some cases, so it can't achieve this false panacea anyway.
Without knowing way too many details about your application it's impossible to be certain, but I'd suggest "replace". Here's the consequences of "replace" so you can assess if they're closer to what you need:
1. Any time your guess doesn't work and the decoder is looking at stuff that is not in fact Unicode text in whatever encoding you hoped for, sequences that don't decode turn into the Unicode code point U+FFFD, a code point reserved specifically for this purpose. This code point is definitely not: a letter, a digit, punctuation, whitespace, part of a variable name, an ASCII control character, or anything else, but it is definitely itself. There's an excellent chance your code needs no extra work to cope with this which is nice. If it does, this code point can (but usually doesn't) exist in Unicode data anyway, so you likely should fix that. No exceptions, no decode errors, just Unicode data, some of which may be U+FFFD which again, was already valid Unicode.
2. If your user/customers see this replacement character U+FFFD it looks like this: � and that's pretty obviously not what they meant. Or I guess, in a few cases maybe it's exactly what they meant. Maybe they're writing a Unicode decoder. But unlike escaping and other nonsense it's very obvious what's going on here. You can even Google for it.
3. On output, in the worst case where you were sure the output should be ASCII, Python has no choice but to choose ? because that's the best it can do. The good news is that humans can generally realise that function_???????_??? means something went wrong. The better news is that you probably don't see this too often when you're outputting ASCII. Any time your output is some flavour of Unicode, Python can emit an actual U+FFFD.
Now, personally I think "surrogateescape" is as much Python's fault as yours. A programming language should encourage its users not to set themselves on fire. Providing "I'm sure I know what I'm doing" features is near guaranteed to be contrary to that purpose. If you're C++ then you have some excuse, but I don't see how Python does. PEP 383 should not have been accepted.
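(Concretely, the "replace" suggestion amounts to something like the following; the filename and the encoding guess are placeholders, not part of anyone's tool:)

# Decode with a guessed encoding; anything that doesn't fit becomes U+FFFD
# instead of raising UnicodeDecodeError.
with open("input.xml", encoding="utf-16-le", errors="replace") as f:
    text = f.read()

bad = text.count("\ufffd")
if bad:
    print(f"warning: {bad} undecodable sequence(s) replaced")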
Posted Aug 4, 2021 2:07 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Of course, the *correct* way to do that is to use a high-level library like pathlib instead, so that you don't have to fiddle around with individual strings in the first place, and it can do all the ugly string magic that Unix requires. So in practice, nobody has any business using surrogateescape, unless you're implementing a pathlib-like library and know exactly what you are doing.
Alternatively, the correct way is to keep your paths as bytes, but then you can't do most string-like operations on them, which is painful. Also, the Windows API will not give you filenames in any format other than UTF-16, unless you want to use some random 8-bit encoding that Microsoft confusingly refers to as "ANSI," but which in fact could be almost any Windows code page depending on the user's locale. Faced with two bad options, Python chose UTF-16, which means it is now in the position of having to convert those UTF-16 strings (which can also contain invalid code points and IIRC even mismatched surrogate pairs!) in such a way that your code (which you wrote to run correctly on a Unix platform) doesn't break on them. Hence, "do nothing and return raw bytes" was never really a good option at the language level, for a language that wants to be cross-platform.
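(A small sketch of the surrogateescape round-trip that makes this work, assuming a UTF-8 locale and an invented filename:)

import os
from pathlib import Path

raw = b"report-\xe9.txt"                 # not valid UTF-8
name = os.fsdecode(raw)                  # 'report-\udce9.txt' under a UTF-8 locale
assert os.fsencode(name) == raw          # the smuggled bytes round-trip losslessly

p = Path(name)                           # pathlib carries the surrogates along
print(p.suffix, repr(p.stem))            # '.txt' 'report-\udce9'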
Posted Aug 4, 2021 17:08 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (5 responses)
The better answer here would be to treat filenames as opaque blobs of unstructured data, and perform any necessary conversion at the UI level—without surrogates. The same goes for other interfaces such as the argument list and environment variables where the data is not guaranteed to be UTF-8. It's not "mis-encoded garbage", you're just applying an inappropriate decoding. There is no good reason to attempt UTF-8 decoding while enumerating a directory when you're just getting data from one filesystem API and passing it to another without presenting any of this to the user.
If you do need to display the data to the user, or otherwise treat it as human-readable text e.g. for collation, *at that point* attempt to decode it as UTF-8 (or whatever the actual locale is set to) and substitute the U+FFFD replacement character for anything that can't be decoded. This does imply that there is probably no way to name certain files through, say, a GUI text box, but one should at least be able to select them from a file chooser if the program is properly designed.
Not being able to do string operations on filenames isn't really much of a handicap. This is like complaining that you can't use strcmp() to compare arbitrary binary data; filenames aren't strings, so it follows that string operations are not well-defined on filenames. Apart from concatenation, which works in (almost) every encoding, the only other operations you need are platform-specific anyway: joining with a platform-specific path separator, splitting on the same separator (or platform-specific alternatives, like '\' and '/' on Windows), and pattern matching. These operations are better handled by something like pathlib than by string functions. If you *really* need to treat a filename as a string for some reason (e.g. to perform a search) then go ahead and convert to UTF-8, but with the understanding that this conversion is lossy and converting the result back won't necessarily give you the same filename.
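(A rough sketch of that separation, with the display form derived only for presentation; it just lists the current directory and is not meant as anyone's actual UI code:)

import os

# Keep the opaque on-disk name; derive a display form only for the UI.
entries = {raw: raw.decode("utf-8", errors="replace") for raw in os.listdir(b".")}

for raw, shown in entries.items():
    print(shown)                          # may contain U+FFFD; purely for humans
    # Any real operation still uses the original bytes, e.g.:
    # os.remove(os.path.join(b".", raw))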
Posted Aug 4, 2021 17:45 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
So do what DataBASIC does (which can get confusing) and have multiple representations of the data within a single variable. The canonical form is always a string, but it can be a file id, a number, etc etc so the variable is internally a structure. You can compare against the utf-8 version to check whether it's what you're looking for, and then go back to the original when you actually want to access the file.
Cheers,
Wol
Posted Aug 4, 2021 18:40 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
You can see some fruits of this "do it for the UI" munging and then losing track of reality in `explorer.exe`. First, make a file named `NUL` available in some way (usually via a Samba share hosted on Linux). Explorer doesn't like this, so it gives it some other mangled name (let's say `zxcv` for example). Whatever it is, make another file with *that* name in the share on the host. Explorer will render these as the same filename, and deleting *either one* will delete the file actually named `zxcv`, after which you can delete the `NUL` file by selection. I have no idea what the file open dialog ends up doing, though. Or what happens if one is a directory and the other is a file, for that matter.
Posted Aug 4, 2021 19:20 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
\\?\C:\Path\To\File
It has to be an absolute path, and it has to start with that funny \\?\ prefix. This will let you smuggle almost any string under the sun through the vast majority of Windows's sanity checks, as long as it's valid UTF-16 ("Unicode" to use Microsoft's confusing terminology) and the underlying volume is NTFS (and not FAT or some other legacy filesystem). It bypasses the MAX_PATH check, and purportedly even allows you to use names like "." or ".." without path resolution interpreting them as current/parent directory.
Of course, nothing in the Microsoft ecosystem can handle such files correctly. Explorer is completely lost, most programs will get confused and tell you the file does not exist, etc.
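(For illustration only: this is pure path manipulation, so it runs anywhere, but actually opening such a path needs Windows and NTFS; the directory names are made up:)

from pathlib import PureWindowsPath

deep = PureWindowsPath(r"\\?\C:\projects" + r"\node_modules" * 30 + r"\index.js")
print(len(str(deep)))        # comfortably past the old 260-character MAX_PATH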
Posted Aug 5, 2021 21:18 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link]
The symbols are in some sense UTF-16 code units, but the actual name needn't be a UTF-16 string. So the four u16s DFFF DFFF D800 0041 make a perfectly reasonable name for NTFS, but obviously that's not a valid UTF-16 string.
Your Windows UI won't like that very much but the filesystem and core OS services think that's a perfectly reasonable name for a file.
This is all getting far off topic. I was rather hoping martin.langhoff might have feedback on my suggestion instead :(
Posted Aug 6, 2021 0:18 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link]
Yes, that's where the "if the program is properly designed" part comes into play. There would be situations where different filenames were rendered as the same UTF-8—you can get this even with valid UTF-8 if you're not enforcing a particular normalization at the filesystem level—but the file list should be keeping track of the original opaque filename for each file in the list so that when you select a file (distinguished e.g. by modification date) and instruct the tool to delete it the tool deletes the correct file. It shouldn't take the converted UTF-8 name which was only suited for presentation and delete something unrelated which just happens to have a similar UTF-8 version of its name.
Posted Aug 3, 2021 17:06 UTC (Tue)
by rgmoore (✭ supporter ✭, #75)
[Link] (2 responses)
That's not too surprising, given that the basic topic of the essay is describing how they went about paying down a bunch of that technical debt. The topic of technical debt is going to dominate a discussion like that, even if the project is in good shape overall. That's not to say that Launchpad is in good shape, just that an article about paying down technical debt isn't necessarily the best basis for judging the health of the project.
Posted Aug 4, 2021 1:24 UTC (Wed)
by martin.langhoff (guest, #61417)
[Link] (1 responses)
Named servers have a tendency to go hand-in-hand with on-disk secrets, and missing component updates.
I just hope Colin has a chance to rally some troops / get some help and revamp the whole infra. Not just the Python code.
Posted Aug 4, 2021 14:09 UTC (Wed)
by cjwatson (subscriber, #7322)
[Link]
Posted Aug 4, 2021 2:10 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (3 responses)
Still, for anybody in this sort of situation a reminder: When you have really BIG leaks like this, you can use the technique Raymond Chen describes. Since most of your heap is now leaked, it stands to reason that any random page of heap from your process is most likely part of the leak. So, pick a random page of heap and display it, then stare at the data and you may be enlightened. Both a hex dump and ASCII might be helpful depending on what's leaking and how well you understand your program.
Some of us don't remember off the top of our heads how to dump a random page of heap data from a running process. So I built a tool:
https://github.com/tialaramex/leakdice is the original C program, and https://github.com/tialaramex/leakdice-rust is the same program ported to learn Rust.
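(Not leakdice itself, just a rough Python sketch of the same idea for a Linux process you are allowed to ptrace; the pid comes from the command line, and restricting the sample to the [heap] mapping and a 4 KiB page size are simplifications:)

import random
import sys

PAGE = 4096

def dump_random_heap_page(pid: int) -> None:
    # Find the [heap] mapping; big leaks may also live in anonymous mmap
    # regions, which this sketch ignores.
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            if line.rstrip().endswith("[heap]"):
                start, end = (int(x, 16) for x in line.split()[0].split("-"))
                break
        else:
            raise RuntimeError("no [heap] mapping found")

    offset = random.randrange(start, end, PAGE)
    with open(f"/proc/{pid}/mem", "rb") as mem:
        mem.seek(offset)
        page = mem.read(PAGE)

    # Crude hex+ASCII dump: stare at it and hope the leaked data is recognisable.
    for i in range(0, PAGE, 16):
        chunk = page[i:i + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"{offset + i:016x}  {hexpart:<47}  {text}")

if __name__ == "__main__":
    dump_random_heap_page(int(sys.argv[1]))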
Posted Aug 4, 2021 10:32 UTC (Wed)
by t-v (subscriber, #112111)
[Link] (1 responses)
For a memory leak in Python, I'd probably just look at the types of stuff from gc.get_objects() first.
import gc
import collections
c = collections.Counter(str(type(o)) for o in gc.get_objects())
c.most_common(20)
shows a lot of somewhat surprising parso objects and not so surprising matplotlib objects in the notebook where I tried.
The gc module also has get_referrers and friends.
Posted Aug 4, 2021 14:13 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
It's extremely unsophisticated, but for really huge leaks it's been effective several times, and so that's why I wrote leakdice. I rewrote it in Rust because (a) I wanted to learn Rust and it's easier to do that with a non-toy problem and (b) Rust prevents a bunch of problems but it cannot prevent Leaks, indeed Rust provides an explicit and _safe_ way to leak objects as Box::leak() because you might have a good reason to do that and the language can't tell. Of course just because you might sometimes want to leak memory does not mean the leaks you actually do have are wanted.
Posted Aug 7, 2021 9:35 UTC (Sat)
by cjwatson (subscriber, #7322)
[Link]
In other words, Chen's technique works better when the problem is large memory leaks in non-reference-counted systems, or where only a small number of kinds of objects are affected. If you have reference leaks such that the graph of uncollectable objects is large and heterogeneous, you need different approaches. https://mg.pov.lt/objgraph/ came closest to helping me track things down.
Posted Aug 4, 2021 3:07 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Aug 4, 2021 14:08 UTC (Wed)
by cjwatson (subscriber, #7322)
[Link]