DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:24 UTC (Mon) by ddevault (subscriber, #99589)
In reply to: DeVault: Announcing the Hare programming language by excors
Parent article: DeVault: Announcing the Hare programming language

This is a case that we were aware of when designing these interfaces. For a time, most filesystem-related interfaces accepted a tagged union of either (str | []u8). However, we ultimately decided that the additional complexity was not justified because the use-case is not justified: filenames *should* be UTF-8.

However, it is still possible to bend the language to your will if you know your use-case demands otherwise. You can force non-UTF-8 data into a str type (knowing that you're breaking the language invariants, and that the stdlib and third-party code rely on the assumptions you are breaking) via strings::fromutf8_unsafe. You can then pass this into os::remove or something to get rid of your bad file. A tool like rm could be written specifically with this in mind, to clean up bad files. Additionally, in your git example, if ls and rm are implemented in Hare, then so too is probably your git implementation - or your kernel, which would enforce UTF-8 filenames when opening files.

I will note that this particular decision was a bit agonizing, and that we may revisit it before 1.0.


DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:21 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (4 responses)

I 100% agree with you that, on filesystems that (a) store filenames as byte sequences and (b) do not embed metadata in the filename part of directory entries, all filenames should be well-formed UTF-8.

Sadly, they aren't, and despite entirely agreeing with you that all filenames on Linux systems should be well-formed UTF-8, I find the behaviour excors reports unacceptable in a programming language's standard runtime library.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:27 UTC (Mon) by ddevault (subscriber, #99589) [Link] (3 responses)

I understand where you're coming from in this respect. Again, this was a difficult choice, and we may revisit it, but it simplifies the situation considerably for the 99.999% of use-cases where non-UTF-8 filenames are not present. In the remainder, it's very unlikely that anything worse than the program aborting will occur (e.g. overwriting such files). I would encourage you to present your case for this at the standard library design acceptance review committee (or the filesystem committee, a likely spin-off), which will be formed prior to 1.0.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:39 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (2 responses)

Didn't Python 3 already solve this problem with surrogateescape?

I'm not saying it's an elegant solution, or even necessarily a good solution, but it's less painful for the 99% case (as compared to using a tagged union), and it works for the 1% case, usually without losing any data (unless you're trying to manipulate filenames as strings, in which case any solution will probably be terrible anyway because there's just no good way to do that).
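For readers unfamiliar with the mechanism, here is a minimal sketch of the round trip that surrogateescape provides. It uses the codec error handler directly rather than os.fsdecode/os.fsencode (which apply the same handler on POSIX systems) so that the result does not depend on the locale; the filename is made up for illustration.

```python
# Sketch: round-tripping a non-UTF-8 filename through str with surrogateescape.
# The filename is a made-up example: 0xE9 is Latin-1 "é", which is not valid
# UTF-8 when followed by an ASCII byte.
raw = b"report-\xe9.txt"

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))       # shows the escaped byte as \udce9
print(restored == raw)   # the original bytes survive the round trip
```

The str value is usable as a filename argument to the os module, but it is no longer a valid Unicode string in the strict sense.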

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:42 UTC (Mon) by ddevault (subscriber, #99589) [Link]

I mean, it's a complex set of trade-offs. To consider the Python approach, we'd have to untangle a pretty large can of worms having to do with string handling. We can't just lift surrogateescape wholesale - Python and Hare have very different goals and we have to think carefully through every implication of such a change so that the language remains consistent and reliable in its design throughout. And yes, it's an inelegant solution - and we prefer the elegant ones.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:40 UTC (Mon) by excors (subscriber, #95769) [Link]

surrogateescape means that code like print(*os.listdir()) can throw a UnicodeEncodeError. The programmer has to manually keep track of which values of the 'str' type are really Unicode and will work with all the standard string functions, and which are nearly-Unicode but will occasionally throw if you try to print or encode them like normal strings. It might be the best hack that's possible in Python given its compatibility constraints, but I think it creates as many problems as it solves. A statically typed language should be able to do better - tracking this kind of information about the range of values is what type systems are for.

At least if you get it wrong in Python, you'll probably just get an exception. In a non-memory-safe language, some code might rely on the promise that strings are Unicode and violating that could cause undefined behaviour; you need to either strictly enforce that promise, or not make the promise at all.
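The failure mode can be reproduced without touching the filesystem. The following is a sketch, not a recommendation: is_real_unicode is a hypothetical helper name, and the strict UTF-8 encode stands in for what print does when stdout is a strict UTF-8 stream.

```python
# Sketch: a surrogate-escaped str looks like any other str to the type
# system, but fails as soon as it is encoded strictly.
name = b"report-\xe9.txt".decode("utf-8", errors="surrogateescape")


def is_real_unicode(s: str) -> bool:
    """Hypothetical helper: True if s encodes as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(is_real_unicode("report.txt"))  # an ordinary filename
print(is_real_unicode(name))          # the nearly-Unicode filename

try:
    name.encode("utf-8")  # same error print() hits on a strict UTF-8 stdout
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

Nothing in the type of name distinguishes it from a valid string, which is the bookkeeping burden described above.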

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:04 UTC (Mon) by excors (subscriber, #95769) [Link] (4 responses)

Deliberately violating str's UTF-8 invariant sounds scary, and an application developer should probably never pass such strings into any standard library function that expects a str (because they can't know if it'll e.g. try to iterate over codepoints and crash), so I expect they'd have to implement their own 'path' type which wraps a []u8, and write their own functions to concatenate and split and compare path-strings and convert to/from str, and use raw syscalls instead of the os module. And that would be needed by every application that wants to work reliably on real-world Unix systems (where filenames occasionally come from ancient backups and from FAT32 USB sticks and from zip files and from malicious users etc, which won't respect anyone's desire for a perfect Unicode world). That sounds like exactly the sort of widely-used low-level functionality that should be the responsibility of the standard library. And it's the language's responsibility to provide features so the library can implement an API that's both correct and convenient.
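To give a sense of what such a hand-rolled path type involves, here is a rough sketch in Python rather than Hare. RawPath and all of its methods are hypothetical names invented for this illustration; a real version would also need normalization, root handling, comparison rules, and so on.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RawPath:
    """Hypothetical path type: opaque bytes, deliberately not a str."""
    data: bytes

    @classmethod
    def from_str(cls, s: str) -> "RawPath":
        # Strict encode: a valid str always yields valid UTF-8 bytes.
        return cls(s.encode("utf-8"))

    def join(self, name: "RawPath") -> "RawPath":
        # Concatenate with exactly one separator between the parts.
        return RawPath(self.data.rstrip(b"/") + b"/" + name.data)

    def split(self) -> Tuple["RawPath", "RawPath"]:
        # Split into (directory, final component) at the last separator.
        head, _, tail = self.data.rpartition(b"/")
        return RawPath(head), RawPath(tail)

    def to_str(self) -> Optional[str]:
        # Conversion to str can fail; the caller must handle None.
        try:
            return self.data.decode("utf-8")
        except UnicodeDecodeError:
            return None

    def display(self) -> str:
        # Lossy form for messages only; never pass this back to the OS.
        return self.data.decode("utf-8", errors="replace")


base = RawPath.from_str("/srv/share")
bad = base.join(RawPath(b"caf\xe9.txt"))  # made-up Latin-1 filename
print(bad.to_str())    # not convertible to a strict str
print(bad.display())   # printable, with a replacement character
print(bad.split()[1] == RawPath(b"caf\xe9.txt"))
```

The bytes in bad.data could be handed to os.remove, which accepts bytes paths. The point of the sketch is that the fallible conversion is visible in the types, so printing a path forces an explicit choice between to_str and display.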

Otherwise nearly everyone will write applications with the standard library, and it'll be fine for 99.999% of users, then a few years later they'll get a bug report saying it crashes for one user with a mysterious error message and they'll spend hours debugging it and then spend days replacing the standard library with a new library that actually works, and repeat for every application that has a large number of users. That's a lot of effort that would have been saved by doing it correctly from the start.

(But even if application developers do try to avoid the standard library's path handling, os/+linux/environ.ha's init_environ runs before main and asserts when a non-UTF-8 string is passed on the command line.)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:11 UTC (Mon) by ddevault (subscriber, #99589) [Link] (2 responses)

>Deliberately violating str's UTF-8 invariant sounds scary

Well, Hare is standardized, and open source, and runs in a standardized environment (x86, though as someone who has read the Intel and AMD CPU manuals, I can attest that it's not very fun). If you need to break the invariant, it's a serious move to consider, must be very well justified, and should raise eyebrows during code review - but you can objectively evaluate the consequences of that decision by examining where your tainted string will end up and planning for its behavior. We even make it easy for you to vendor standard library modules so you can ensure their behavior is consistent with an earlier evaluation. This is an example of "trust the programmer" - it's pretty ill-advised to do this, but if you really need to, you can. Breaking the str invariant is probably a case where you should really reconsider, though. There are less severe examples - forcing a bad value into a global (e.g. null into a non-nullable pointer) and fixing it up during @init is one I've encountered from time to time.

> (But even if application developers do try to avoid the standard library's path handling, os/+linux/environ.ha's init_environ runs before main and asserts when a non-UTF-8 string is passed on the command line.)

Good catch. You can still technically get around this (vendor os and patch it, don't import os and use rt to make the syscalls directly, etc), but I admit that it's going to be very contrived to get around this problem.

Like I said, we were well aware of all of these issues and this is why it was a very difficult decision to go UTF-8-only for paths.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:20 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

> If you need to break the invariant, it's a serious move to consider, must be very well justified, and should raise eyebrows during code review

*My* concern is less about the code review that adds the `_unsafe` call. I worry more about the later review of an edit to the same function, where the `_unsafe` call sits outside the diff's default context and the change does something "convenient" like printing the string. Maybe the variable would handily be named `path_for_os_calls_only`, but my experience is that no one is that nice to their time-separated co-developers.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:23 UTC (Mon) by ddevault (subscriber, #99589) [Link]

At the very least, I would expect any use of _unsafe to include a comment explaining why it was done in spite of the risks. Would not look forward to being that future colleague regardless.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:30 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

> And that would be needed by every application that wants to work reliably on real-world Unix systems (where filenames occasionally come from ancient backups and from FAT32 USB sticks and from zip files and from malicious users etc, which won't respect anyone's desire for a perfect Unicode world)

Well, I can say for certain that there are many places where it's not just ancient backups nor FAT32 USB sticks, but just regular file names used every day. As soon as you have shared file servers for lots of employees, there's never a single moment where you can declare that the encoding will change because you'll break a lot of shortcuts and file names for plenty of employees. Thus you keep in place the perfectly working system you used to have, and do that for decades if needed because in the end the one without encoding is still the one that works best (most users access only their own files with their machine's encoding, and shared files rarely use fancy chars). I personally never put non-ASCII chars in my file names so I'm fine but I've seen quite a bunch of filesystems with mixes of CP1252 from Windows users via a Samba share and ISO8859-1 from UNIX/Linux users via an NFS share.

