Fedora and Python 2
Posted Apr 5, 2018 9:28 UTC (Thu) by ecree (guest, #95790) In reply to: Fedora and Python 2 by cyperpunks
Parent article: Fedora and Python 2
The real problem is that while it's a _different_ language, it's not monotonically a _better_ language. In particular, if you're trying to write the kind of system software that shouldn't care what text encoding its input uses and shouldn't crash when fed badly-encoded text, working around Python 3's halpful Unicode handling leads to considerable friction; you basically have to either pretend that everything in the outside world is Latin-1, or use bytes objects _everywhere_ and accept that half the standard library won't work. Ah well, at least they eventually added %-formatting to bytes objects.
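For readers who haven't hit this, here is a minimal sketch of the two workarounds mentioned above, assuming Python 3.5 or later (which is where %-formatting on bytes arrived). The variable names are made up for illustration.

```python
# Workaround 1: pretend the outside world is Latin-1.
# Every byte value 0..255 maps to exactly one code point, so the
# round trip is lossless even for data that is not really Latin-1.
raw = bytes(range(256))
as_text = raw.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# Workaround 2: stay in bytes everywhere and format with %.
# Both the template and the %s argument must be bytes.
line = b"%s: %d bytes" % (b"\xff\xfe", len(raw))

print(round_tripped == raw)
print(line)
```

The Latin-1 trick keeps the data intact but gives you strings whose characters are meaningless for anything that was not actually Latin-1; the bytes route keeps the meaning honest but only works with library functions that accept bytes.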
Posted Apr 5, 2018 9:49 UTC (Thu)
by k8to (guest, #15413)
[Link] (43 responses)
When python 3 was more or less announced, I was sure they had the right idea, because the weird exceptions you'd get when a str hit a unicode object in python2 were just no good. But now that I've had time to experience the python 3 way, I think it's worse. I usually would rather deal with bytes. There are so many situations where the code I deal with doesn't know the encoding of the bytes it's receiving, and python3 doesn't give me a reasonable way to accept those bytes and use most of the tools I would use in python2.
One of those unintended outcomes sort of things, I think. It feels like python3 was a few years too early to pick the right strategy with unicode.
Posted Apr 5, 2018 10:09 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (5 responses)
(Perl 6 is a different story altogether…)
Posted Apr 5, 2018 17:21 UTC (Thu)
by k8to (guest, #15413)
[Link] (4 responses)
Posted Apr 5, 2018 18:52 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
However, I have personally converted a largish console-based Perl5 codebase from running on Latin-1 (or Latin-9 to be exact, as soon as the € happened) to UTF-8. Let me tell you that this is an exercise in hunting down annoying hard-to-reproduce bugs that you wouldn't wish on your worst enemy. We had the whole gamut – from mojibake in the database through strings which crashed the interpreter when printed to tearing our hair out trying to write code that works correctly in both locales. The only thing that saved us is the fact that you can unscramble real-world mixed UTF-8/Latin-9 content safely, thanks to the way UTF-8 is encoded.
Python3's way of strict separation between bytes and strings may be more annoying when you start off, esp. on Windows, but IMHO it's a whole lot easier to make sure that the end result is actually correct.
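The unscrambling trick described above can be sketched roughly like this in Python 3. It relies on the fact that text in a single-byte Latin encoding containing non-ASCII characters is very unlikely to also be valid UTF-8, so a strict UTF-8 decode is tried first. The function name and the choice of ISO-8859-15 as the fallback are assumptions for illustration, not the code that was actually used.

```python
def decode_mixed(raw: bytes, fallback: str = "iso-8859-15") -> str:
    """Decode one record that is either UTF-8 or a single-byte Latin encoding."""
    try:
        # Strict decoding: any invalid UTF-8 sequence raises.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy single-byte encoding.
        return raw.decode(fallback)


print(decode_mixed("€".encode("utf-8")))
print(decode_mixed(b"\xa4"))          # the euro sign in ISO-8859-15
print(decode_mixed(b"caf\xe9"))
```

This works per record, not per file; a record that mixes both encodings inside one string would still need to be split first.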
Posted Apr 7, 2018 23:21 UTC (Sat)
by flussence (guest, #85566)
[Link]
Those are all things I encountered just in a single project. To be fair it was one where I ended up needing to write custom Encode::* modules, so maybe not representative… but it's still a lot of pain for the sake of not breaking code written pre-Y2K.
Posted Apr 6, 2018 18:35 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
Python's TLS implementation wants to check whether the TLS server has presented a certificate that's valid for the name of the TLS server you're trying to connect to. This makes sense, certainly as a default, if I said I wanted a TLS connection to foo.example.com, I definitely shouldn't need to roll my own certificate validator, or even explicitly say "Wait, this connection I have, is it _really_ to foo.example.com?" because duh, of course I want those things, let the handful of people who don't want checks ask *not* to check.
And all the people involved in designing this stuff at the IETF stage were conscious that this problem could be hard, and we don't want hard because this is a security system. So the certificates have DNS A-labels inside them. All you need to do is match the DNS name in the certificate (written with A-labels, i.e. ASCII) against the DNS name you looked up in your DNS system, which is also written with A-labels, i.e. ASCII. This is really boring and easy. Users don't necessarily understand A-labels, they might be gibberish, but the presentation layer is quite separate, securing that is a UX problem and not relevant to TLS or other low level components. All the tricky human stuff is pushed into the layer that was already dealing with humans, everything that's machine-to-machine needn't care about human cultural complexities like language.
Too simple for Python though, they decided to handle everything with U-labels so they can mark all the types "str". So now suddenly this low-level bit banging code that's supposed to securely move packets ends up with the entire i18n system baked into it, and inherits mysterious presentation layer related bugs. Problem with the matching? Oh sorry, you need to go fix this whole other Python sub-system that has nothing whatsoever to do with TLS ...
Eventually, literally in February this year, sanity finally prevailed and the latest Python 3 actually just does what the RFCs said it ought to do in the first place, massively simplifying the code _and_ making it more correct. Most users will never notice, because this was after all the Right Thing anyway.
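To make the A-label point concrete, here is a rough sketch of "convert once at the edge, then compare ASCII". It is not how Python's ssl module is implemented; the function names are hypothetical, it uses the built-in "idna" codec (which implements the older IDNA 2003 rules), and it ignores wildcard certificates entirely.

```python
def to_a_label(hostname: str) -> bytes:
    """Convert a possibly non-ASCII host name to its ASCII (A-label) form."""
    return hostname.encode("idna").lower()


def names_match(cert_dns_name: bytes, requested_host: str) -> bool:
    """Compare the certificate's ASCII name with the requested host.

    DNS names are case-insensitive, so both sides are lower-cased.
    Wildcards are deliberately not handled in this sketch.
    """
    return cert_dns_name.lower() == to_a_label(requested_host)


print(to_a_label("bücher.example"))
print(names_match(b"XN--BCHER-KVA.example", "bücher.example"))
```

Once both sides are A-labels, the comparison itself needs no knowledge of Unicode at all, which is the simplification being argued for.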
Posted Apr 9, 2018 13:01 UTC (Mon)
by cortana (subscriber, #24596)
[Link]
Posted Apr 5, 2018 11:56 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (32 responses)
Posted Apr 5, 2018 16:19 UTC (Thu)
by k8to (guest, #15413)
[Link] (31 responses)
Posted Apr 6, 2018 13:55 UTC (Fri)
by barryascott (subscriber, #80640)
[Link] (30 responses)
If you know that it's bytes that you care about, use the APIs that give you the bytes.
You can get the env vars as bytes, see file system names as bytes and read files in bytes.
I do not understand the criticism.
Barry
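For reference, the bytes-level APIs mentioned above look roughly like this. The helper names are invented for the example; os.environb only exists where os.supports_bytes_environ is true (POSIX, not Windows), so the sketch falls back to os.fsencode there.

```python
import os


def list_names_as_bytes(directory):
    """Return directory entries as bytes, with no decoding applied."""
    # Passing a bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


def read_file_bytes(path):
    """Read a file without any text decoding."""
    with open(path, "rb") as handle:
        return handle.read()


def env_as_bytes(name):
    """Look up an environment variable by bytes name, returning bytes or None."""
    if os.supports_bytes_environ:
        return os.environb.get(name)
    value = os.environ.get(os.fsdecode(name))
    return None if value is None else os.fsencode(value)
```

A call such as list_names_as_bytes(".") then gives back names like b"data.bin" that can be handed straight to open() without ever guessing an encoding.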
Posted Apr 8, 2018 12:59 UTC (Sun)
by togga (guest, #53103)
[Link] (29 responses)
Python3 is no longer a powerful glue environment (just look at the BCC encode/decode mess) and has no apparent strong side anymore, maybe syntax like a new BASIC.
After successfully using Python2 for 14 years I now don't recommend python for anything. There are more modern languages like Go that have simple syntax, can also be used interactively and, contrary to both python2 and python3, don't need a ton of workarounds to be efficient.
Posted Apr 8, 2018 13:43 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (28 responses)
Posted Apr 8, 2018 19:16 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (26 responses)
Posted Apr 8, 2018 19:59 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (23 responses)
Anyway, the right way to fix this is to report a bug. Bitch about it on LWN only when whoever is responsible for the code refuses to fix it.
Posted Apr 8, 2018 20:06 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (21 responses)
I prefer code that works 100% of the time, barring unrelated hardware/system software issues.
Posted Apr 8, 2018 21:12 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (20 responses)
To that effect I will now take that example code and do what somebody else should have done long ago, i.e. file a bug.
Posted Apr 8, 2018 21:27 UTC (Sun)
by mjblenner (subscriber, #53463)
[Link] (10 responses)
>>> t = type('iface', (ctypes.Structure,), {'_fields_': [(b'c_string_symbol', ctypes.CFUNCTYPE(ctypes.c_uint32))]})
isn't really a bug. The 'c_string_symbol' is the python-side handle to the C structure field. In python3 it needs to be unicode (i.e. like the python source file), since you do something like
>>> s = t(some_dll[b'c_string_symbol']) # bytes used here to get the C function symbol
Posted Apr 8, 2018 22:41 UTC (Sun)
by smurf (subscriber, #17840)
[Link] (9 responses)
On the other hand: Python is written in UTF-8 (duh) and Python's way to access symbols is by using attributes (also duh). Requiring code to cater to corner cases that don't actually occur in the real world is a surefire recipe for code bloat but doesn't help anybody.
Posted Apr 8, 2018 23:00 UTC (Sun)
by mjblenner (subscriber, #53463)
[Link] (8 responses)
Uh, no. The bit that gets the symbols from the dll is using bytes. This bit:
some_dll[b'c_function_name']
You can't refer to it in python by the same random bytes though (why would that matter?).
Posted Apr 11, 2018 22:06 UTC (Wed)
by togga (guest, #53103)
[Link] (7 responses)
Because scripting is mostly about automation. It's just broken in this case to convert these symbols to another representation. This is just one example of many in this theme since many of the libraries and third party extensions need string representation.
In a glue language scenario (which has been a strong side of Python), if I want to grab symbol (or blob) X from system A, handle it and pass it to system B, I do not want to have Y=f(X) as intermediate representation and then do the inverse function before passing it to B.
Simple things such as reading from a UART or a socket have become a minefield as soon as you want to use something in Python dealing with strings. Especially annoying when developing and things might not be that clean, nice and tidy.
This whole problem-domain is something Python3 has created. For me, Python itself lacks both in performance and multi-threading and doesn't really have anything language-wise (apart from lots of third party libraries) to compensate for this loss of productivity.
Posted Apr 11, 2018 23:52 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (1 responses)
Well, all I can say is that my experience (both with Perl5 and Python2) is rather different. Our fight with Perl5, incrementally switching our corporate code base to UTF8 compatibility, was … ugly.
Thus I'm very happy about the fact that Python3 spews a large unfriendly stack dump to your terminal when I forget to specify how an external byte stream is encoded. While it's somewhat annoying when you JUST KNOW that all your data is UTF8, or latin1, or randombytes … things change, and when it "suddenly" isn't, you get mojibake. Or worse. No thanks.
Posted Apr 12, 2018 16:29 UTC (Thu)
by togga (guest, #53103)
[Link]
The latter sounds to me like utopia and an endless job for achieving nothing but explains lots of the attitudes of the Python community.
Posted Apr 12, 2018 1:12 UTC (Thu)
by mjblenner (subscriber, #53463)
[Link] (1 responses)
OK. I kind of get where you're coming from. Although I'm a bit confused. Or you're a bit confused.
ctypes is an ABI interface, so having the structure field be a different name to the function symbol is of no relevance for using python to glue various other C functions together (even when passing that structure around).
i.e. here:
> type('iface', (ctypes.Structure,), {'_fields_': [(b'c_string_symbol', ...
Anyway, the easy answer is to just use python strings there. ctypes function symbol lookup converts strings to utf-8, so 99%+ of the time, this will work.
Otherwise...
Decode the symbol name to a string with errors='surrogateescape' for use in python, and use the same error handler to encode back to the original bytes for getting the symbol out of the library.
Or you could add a layer of indirection between the structure field names and the function symbols.
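A small sketch of the surrogateescape round trip suggested above. The symbol bytes are made up, and no library is loaded; the ctypes part only shows that a str field name obtained this way is accepted in _fields_.

```python
import ctypes

# A symbol name as it might arrive from outside, not valid UTF-8.
raw_symbol = b"c_string_\xff"

# Undecodable bytes become lone surrogates (here U+DCFF) instead of raising.
name = raw_symbol.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# For the python-side field name, a decoded str is what ctypes expects.
field_name = b"c_string_symbol".decode("utf-8", errors="surrogateescape")
iface = type(
    "iface",
    (ctypes.Structure,),
    {"_fields_": [(field_name, ctypes.CFUNCTYPE(ctypes.c_uint32))]},
)

print(restored == raw_symbol)
print(hasattr(iface, "c_string_symbol"))
```

The str in the middle is only a carrier: it is not meaningful text, but it survives being stored, compared and passed around until it is encoded back.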
Posted Apr 12, 2018 16:36 UTC (Thu)
by togga (guest, #53103)
[Link]
I'm not confused, I've just experienced lots of issues I didn't have before Python3's software castle in the air.
> "the easy answer is to just use python strings"
Isn't this kind of bloated? These strings can come from anywhere and might not even be visible in Python code at all.
> "Decode the symbol name to a string with errors='surrogateescape' for use in python, and use the same error handler to decode back to the original bytes for getting the symbol out of the library."
You mean use Python3 and stick with tons of workarounds and issues just for its sake? Change the whole world to Python3? I value my time much more than that.
Posted Apr 12, 2018 23:47 UTC (Thu)
by dvdeug (guest, #10998)
[Link] (2 responses)
If you're just passing something from system A to system B, you shouldn't have to change the data. But there's a fairly thin region where you can choose to not unmangle something and still expect to be able to do anything with it. Stuff not being clean, nice and tidy is all the more reason to make sure you know exactly how the data you're handling is formatted.
Posted Apr 15, 2018 15:13 UTC (Sun)
by togga (guest, #53103)
[Link] (1 responses)
1. Doesn't scale. Changing representation requires one additional pass over the data. Python is already slow to begin with.
Posted Apr 15, 2018 20:53 UTC (Sun)
by dvdeug (guest, #10998)
[Link]
We're not talking about human readable symbols; we're talking about "non-ASCII symbols in your program" that aren't Unicode. Even if the editor mangles it for you, how is bsymFFEAA9 worse than \xff\xea\xa9? Something slightly smarter than base64 would preserve human readable names and only mangle unreadable names, but the only case where not worrying about mangling is going to cause problems in Python 3 is when it's not human readable.
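As an illustration of the "slightly smarter than base64" idea, here is one hypothetical mangling scheme that leaves readable ASCII identifiers alone and hex-encodes everything else. The "bsym" prefix is taken from the example above; the functions themselves are invented for this sketch and are not an existing convention.

```python
PREFIX = "bsym"


def mangle(symbol: bytes) -> str:
    """Map an arbitrary bytes symbol to a valid Python identifier."""
    try:
        text = symbol.decode("ascii")
    except UnicodeDecodeError:
        text = None
    # Readable identifiers pass through unchanged, unless they would
    # collide with the mangling prefix.
    if text is not None and text.isidentifier() and not text.startswith(PREFIX):
        return text
    return PREFIX + symbol.hex().upper()


def demangle(name: str) -> bytes:
    """Invert mangle()."""
    if name.startswith(PREFIX):
        return bytes.fromhex(name[len(PREFIX):])
    return name.encode("ascii")


print(mangle(b"sprintf"))
print(mangle(b"\xff\xea\xa9"))
```

Names that are already readable stay readable for scripts, tests and debugging, and only the unreadable ones pay the mangling cost.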
Posted Apr 11, 2018 22:31 UTC (Wed)
by togga (guest, #53103)
[Link] (8 responses)
$ python2 -c "import json; print(json.dumps(b'xx'))"
Posted Apr 11, 2018 23:39 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
Your second example works when you set a locale in which \xFF is a valid character.
$ echo -n -e "\xFF" | env LC_ALL=iso-8859-1 python3 -c "import sys; print(repr(sys.stdin.read()))"
Posted Apr 11, 2018 23:58 UTC (Wed)
by mjblenner (subscriber, #53463)
[Link] (6 responses)
Features.
> python2 -c "import json; print(json.dumps(b'xx'))"
JSON is UTF-{8|16|32}. What, exactly, do you want python to do with random bytes?
> echo -n -e "\xFF" | python2 -c "import sys; print(repr(sys.stdin.read()))"
Use sys.stdin.buffer to get bytes rather than UTF-8.
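A minimal sketch of the sys.stdin.buffer suggestion. To keep it runnable without piping anything in, the demonstration wraps an in-memory byte stream the same way sys.stdin wraps the real one; the helper name is made up.

```python
import io
import sys


def read_all_bytes(stream=None):
    """Read everything from a text stream's underlying binary buffer."""
    stream = sys.stdin if stream is None else stream
    return stream.buffer.read()


# Stand-in for a stdin that received the single byte 0xFF.
fake_stdin = io.TextIOWrapper(io.BytesIO(b"\xff"), encoding="utf-8")
data = read_all_bytes(fake_stdin)
print(repr(data))

# The text layer on the same input refuses to guess.
strict_stdin = io.TextIOWrapper(io.BytesIO(b"\xff"), encoding="utf-8")
try:
    strict_stdin.read()
    text_read_failed = False
except UnicodeDecodeError:
    text_read_failed = True
```

In a real script, calling read_all_bytes() with no argument reads the process's standard input as bytes.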
Posted Apr 12, 2018 8:15 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Apr 12, 2018 10:24 UTC (Thu)
by mjblenner (subscriber, #53463)
[Link] (3 responses)
Never ever mix them because if you do it will mostly work? (OK...)
Anyway, sounds like you need
PYTHONIOENCODING="utf-8:surrogateescape"
or use open(0, 'rb') or something, depending on what you're trying to do.
Posted Apr 12, 2018 17:05 UTC (Thu)
by togga (guest, #53103)
[Link] (2 responses)
Scripts should then always start by setting this parameter, or is it too late? Are we talking shell wrappers here or refuse to start if set incorrectly?
Posted Apr 12, 2018 17:52 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
Setting the encoding to whatever is actually used is simple enough – besides, that stuff happens to work correctly when your data and your locale match. Surprise: they usually do. And if you want to process binary data, then use "sys.stdin/out.buffer" (or binary mode). This is documented.
On the other hand, allowing a random mix of differently-encoded strings (which is what Python2 or Perl do) and then trying to disentangle the resulting mojibake (or even figure out what causes it) is a frustrating and sometimes futile exercise in preventing data loss after it's too late. Been there, done that, bitten the carpet.
"Explicit is better than implicit" is one of Python's mottos. I happen to think that it's helpful. If you don't, well, there are other languages.
Posted Apr 12, 2018 21:59 UTC (Thu)
by togga (guest, #53103)
[Link]
> "Explicit is better than implicit" is one of Python's mottos. I happen to think that it's helpful. If you don't, well, there are other languages.
Posted Apr 12, 2018 16:54 UTC (Thu)
by togga (guest, #53103)
[Link]
Python2 just keeps them as is, works wonderfully.
"Use sys.stdin.buffer to get bytes rather than UTF-8."
Thanks. It worked. I made a compatible version. Awesome.
$ echo -n -e "\xFF" | python3 -c "import sys; S=type('S', (bytes,), {'__repr__': lambda s: bytes.__repr__(s)[1:]}); read_stdin=sys.stdin.buffer.read; sys.stdin.read = lambda: S(read_stdin()); print(repr(sys.stdin.read()))"
Posted Apr 11, 2018 21:41 UTC (Wed)
by togga (guest, #53103)
[Link]
Is it really? Python3 has clearly chosen a design-path incompatible with my use-cases and this is on a fundamental level not fixed by a "bug report".
For me, discussing on LWN isn't bitching. It's one very good platform to discuss and transfer information, ideas and knowledge. Now that my Py2 floor is starting to crack I find LWN one good source to know what to do from here. Also, since I'm invested in Py2 I'm motivated to put some work on alternatives and here is one place to find people in similar situations that could help.
I advise you to join us with a positive attitude and constructive ideas, I think we need less bitch-talk.
Posted Apr 12, 2018 4:33 UTC (Thu)
by njs (subscriber, #40338)
[Link] (1 responses)
Ctypes does support using bytestrings for symbols:
In [4]: libc = ctypes.CDLL("libc.so.6")
In [5]: libc[b"sprintf"]
So I think this criticism is simply mistaken.
Posted Apr 12, 2018 5:44 UTC (Thu)
by njs (subscriber, #40338)
[Link]
Posted Apr 11, 2018 21:29 UTC (Wed)
by togga (guest, #53103)
[Link]
"And why are you using them, instead of strings, in the first place?"
Posted Apr 5, 2018 12:53 UTC (Thu)
by mordae (guest, #54701)
[Link] (3 responses)
I don't really get this. Python 3 made the encoding handling explicit and much more predictable. There is an implicit conversion of bytes into str assuming ASCII, which I think is sensible nowadays (with EBCDIC gone). And apart from that, it's impossible to convert str into bytes without being explicit about the encoding.
> There are so many situations where the code I deal with doesn't know the encoding of the bytes it's receiving...
Can you name one?
Posted Apr 5, 2018 16:11 UTC (Thu)
by lsl (subscriber, #86508)
[Link] (1 responses)
Reading from standard input. Reading a file system directory's contents. Opening a file with user-specified name. The values of environment variables. All kinds of networking protocols that only reserve a small subset of ASCII (e.g. '\n') for control purposes with the rest being opaque bytes.
Posted Apr 5, 2018 16:23 UTC (Thu)
by k8to (guest, #15413)
[Link]
It's very common on that platform that the encoding of command output is special to the command. Some things follow the system selected locale, some don't. Some even mix fixed strings in ascii with filename based output in utf-16 and crap like that.
Python3 would be okay for this if bytes were as full-featured as Python2 strs used to be, but they're really not. A lot of the standard library really insists on strings.
Posted Apr 6, 2018 11:15 UTC (Fri)
by jwilk (subscriber, #63328)
[Link]
Python 2 has implicit bytes→unicode conversion. Python 3 doesn't have it.
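A quick way to see the difference in Python 3: mixing the two types raises instead of converting, and the only way across is an explicit decode or encode. The helper name is just for the example.

```python
def concat_mixed():
    """Try to concatenate bytes and str the way Python 2 would allow."""
    try:
        return b"abc" + "def"
    except TypeError:
        # Python 3 refuses to guess an encoding.
        return None


# The explicit route: say which encoding the bytes are in.
explicit = b"abc".decode("ascii") + "def"

print(concat_mixed())
print(explicit)
```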
I certainly didn't believe that "let the container have bytes, and describe the encoding if known" was the right way around the time py3k was being clarified, but I do now. I think the industry at large hadn't come to that decision at all at that time.
>>> import ctypes
>>> t = type('iface', (ctypes.Structure,), {'_fields_': [(b'c_string_symbol', ctypes.CFUNCTYPE(ctypes.c_uint32))]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_fields_' must be a sequence of (name, C type) pairs
And why are you using them, instead of strings, in the first place?
>>> s.c_string_symbol()
> "Or you could add a layer of indirection between the structure field names and the function symbols."
2. Accessing human readable symbols is convenient when needed by scripts, tests or debug.
"xx"
$ echo -n -e "\xFF" | python2 -c "import sys; print(repr(sys.stdin.read()))"
'\xff'
You can change that. See "pydoc3 json".
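One way to "change that" is the default= hook of json.dumps, which is called for objects the encoder cannot serialize. The sketch below maps bytes to str via Latin-1, which is only one possible policy and an assumption of this example; base64 would be another.

```python
import json


def bytes_default(obj):
    """Fallback for json.dumps: represent bytes as a Latin-1 string."""
    if isinstance(obj, (bytes, bytearray)):
        # Each byte becomes the code point with the same number.
        return bytes(obj).decode("latin-1")
    raise TypeError("not JSON serializable: %r" % type(obj).__name__)


plain = json.dumps(b"xx", default=bytes_default)
binary = json.dumps({"blob": b"\xff"}, default=bytes_default)
print(plain)
print(binary)
```

Whoever reads the JSON back has to know which policy was used, since JSON itself has no bytes type.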
'\udcff'
And remember to never, ever mix them in the same program. Oh, and it'll mostly work if you forget .buffer in one place. It'll just crash with bad data sometimes.
If we do multiple things with multiple needs for encoding, do we need different settings for different incoming data, in other words set it with each read?
'\xff'
Out[5]: <_FuncPtr object at 0x7f14c0f88110>
These end up as attributes; that'll not just be a bug report but will propagate everywhere and probably bring up the need for a Python4.
Because these are not hand-written in python scripts; they are read from external sources in various places, like C strings from C APIs.
>
> Can you name one?