
DeVault: Announcing the Hare programming language

Posted May 2, 2022 8:45 UTC (Mon) by FSMaxB (subscriber, #106415)
Parent article: DeVault: Announcing the Hare programming language

What immediately disqualifies hare as a serious contender in the space of systems programming languages in my opinion is that it proclaims to never support proprietary operating systems in the upstream implementation. Yes, others can provide an implementation for these, but this will both create an unnecessary divide and also facilitate the accidental introduction of unportable features or behavior in the standard library, adding to the divide.

I also think that having manual memory management is a big mistake that will lead to some (although definitely not all) of the mistakes from the C world that so regularly provide us with exploitable vulnerabilities. There are things with which a programmer just shouldn't be trusted, given their track record of screwing it up.

I applaud the effort in significantly cutting down on the amount of undefined behavior and in the enormous ergonomics improvements over C though, not least the error handling improvements.

I guess I have to out myself as one of the "rust fanboys", but all that means is that I follow a set of principles that rust happens to cater to best at the moment (in the realm of systems programming). Some of those are:
* Don't trust the programmer, humans make mistakes, so let's help prevent them. (note that this will always need an escape hatch like "unsafe" because no system can be perfect in preventing all mistakes but also allowing all "good" behavior [without going into a definition of "good" here])
* Provide the programmer with powerful tools to constrain themselves even further from doing incorrect things (that would usually mean a quite advanced type system)
* Be ergonomic to use, not just the language, but the tooling around it as well (build system, dependency management, library documentation etc.)
* Work on the most commonly used platforms with the possibility of supporting even more.

Rust is definitely far from perfect. Some of the issues that I bump into regularly:
* Compile times are horrendously slow when compared to Go, for example (haven't tried that out with hare yet, but probably similarly fast?). That definitely is a hindrance and reduces the speed at which you can iterate when developing. Slower iterations mean less productivity (although some of the other features more than make up for the loss in productivity IMO, but that doesn't mean faster wouldn't be even better).
* Rust is eating resources like crazy. Notably CPU time and disk space (some of those target directories on my machine have grown to >100GiB at times if not regularly cleaned)
* What's up with these abbreviations, like "fmt" and "std"? Typing a standard library's main scope into a search engine shouldn't get you results about "sexually transmitted diseases". And what's up with "Vec"? C++ already admitted that "vector" was the wrong name for a dynamically sized array; rust not only copies that but then abbreviates it on top. Also "fn" and "str" and "recv" and "ctx" etc.

Btw. I really like hare's bootstrapping capabilities, but I still think that the importance of bootstrapping is a bit overstated. In this day and age, we usually use cross-compilation to get code running on new targets. That's even true for C. Ever bootstrapped C on a new platform by writing an assembler in machine code, then using assembly to write a more expressive language, then using that language to write an even more expressive language, and so forth until you have implemented something for which a C compiler exists? That's just not the way things are done in this day and age, except maybe for fun and educational purposes. If it's OK for C to target new platforms not by being bootstrapped but by being cross-compiled, why should the expectation for a systems programming language competing with it be different? And that comes from someone who bootstrapped OpenJDK 8 from GCJ on PowerPC 32 in a 2 week effort a few years back on Gentoo, so I know at least somewhat what I'm talking about. I REALLY wouldn't want to bootstrap rust from C on PowerPC 32; that won't be done in 2 weeks, of that I'm sure. Instead I would install a binary rust toolchain that has been cross-compiled and then go from there.


DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:01 UTC (Mon) by pabs (subscriber, #43278) [Link] (10 responses)

I think Bootstrappable Builds is one of the most important projects that exists in FLOSS these days, alongside Reproducible Builds. You shouldn't have to trust binaries to be able to use a language. The Bootstrappable Builds folks are working on a bootstrap path from ~512 bytes of machine code plus tons of source code all the way up to a full distro.

https://bootstrappable.org/
https://reproducible-builds.org/

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:26 UTC (Mon) by FSMaxB (subscriber, #106415) [Link] (1 responses)

> I think Bootstrappable Builds is one of the most important projects that exists in FLOSS these days, alongside Reproducible Builds. You shouldn't have to trust binaries to be able to use a language.

To me it is a question of priorities. I would rather have a language be easy to bootstrap than hard, and rust for example would definitely benefit from easier bootstrapping. But then again, it would benefit from a lot of things, and all of these require effort that cannot be invested in other things.

Reproducible builds are much more important than easy bootstrapping in my opinion, because that way you need less trust because you can infer trust in one step of the bootstrap chain from the one before without having to do the entire bootstrap chain from scratch. And with reproducible builds, it only takes one person going through the full bootstrap chain to notice if something fishy is going on in a binary release.

Also note that nobody can ever review all the code that they're using, so a reliance on trusting other people is always required. Running a bootstrap yourself can help you only marginally with that.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 2:56 UTC (Tue) by pabs (subscriber, #43278) [Link]

A quote from the Bootstrappable Builds IRC channel:

<vagrantc> reproducible builds doesn't mean much without bootstrappability ...
<vagrantc> and bootstrapping is a lot stronger with reproducible builds...

The CREV folks are working on socially scalable cryptographically verifiable code review:

https://github.com/crev-dev/

DeVault: Announcing the Hare programming language

Posted May 2, 2022 9:34 UTC (Mon) by FSMaxB (subscriber, #106415) [Link]

Also note that reproducible cross compilation also allows you to trust binaries for a new platform without having to go through the bootstrap chain on that new platform.

So an easy bootstrap on one platform (e.g. x86_64) that produces a trusted binary is only one reproducible cross compilation away from a trusted binary on a different platform.
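The check that reproducible builds make possible is small enough to sketch. The example below is only an illustration: the function names `sha256_hex` and `matches_release` are hypothetical, and the byte strings stand in for real build artifacts. It compares a locally rebuilt (or cross-compiled) artifact against the digest published for a binary release.

```python
import hashlib
import hmac


def sha256_hex(artifact: bytes) -> str:
    """Return the SHA-256 digest of a build artifact as lowercase hex."""
    return hashlib.sha256(artifact).hexdigest()


def matches_release(rebuilt: bytes, published_digest: str) -> bool:
    """True if the rebuilt artifact has the same digest as the release."""
    return hmac.compare_digest(sha256_hex(rebuilt), published_digest.lower())


# Stand-in bytes for a release binary and two independent rebuilds.
release = b"\x7fELF...toolchain..."
honest_rebuild = b"\x7fELF...toolchain..."
tampered_rebuild = b"\x7fELF...toolchain...backdoor"

published = sha256_hex(release)
print(matches_release(honest_rebuild, published))
print(matches_release(tampered_rebuild, published))
```

If the build is reproducible, a mismatch means the release and the rebuild differ in inputs or toolchain, which is exactly the signal a single person redoing the bootstrap would be looking for.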

DeVault: Announcing the Hare programming language

Posted May 2, 2022 10:42 UTC (Mon) by taladar (subscriber, #68407) [Link] (3 responses)

What is important about it though? Bootstrapping via so much source code that you can't review it all doesn't give you anything useful over having to trust binaries.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 2:55 UTC (Tue) by pabs (subscriber, #43278) [Link] (2 responses)

Binaries are unreviewable by most people, so it is better to start from smaller binaries to reduce the chance of them containing backdoors. The CREV folks are working on socially scalable cryptographically verifiable code review to solve the review problem:

https://github.com/crev-dev/

DeVault: Announcing the Hare programming language

Posted May 4, 2022 16:20 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

> Binaries are unreviewable by most people, so it is better to start from smaller binaries to reduce the chance of them containing backdoors.

Great! Tell me how you plan to exclude huge binaries written in Verilog and VHDL (and then baked into the ASICs you actually use) and then you would have a point.

The truth is: from a practical POV, bootstrappability is just a gimmick which buys you nothing, since free hardware doesn't exist (please don't bring up the FSF's joke of certifying hardware which includes megabytes of binary code handling the storage, that is: precisely the part which you, somehow, need to trust to run anything at all).

Today so much proprietary code is executed before and after the first bytes of code you may actually control that the whole exercise is mostly pointless.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 7:09 UTC (Thu) by pabs (subscriber, #43278) [Link]

I do not know how they plan to approach it, but the Bootstrappable Builds folks have definitely thought about those things; I have seen discussions on their channel about bootstrapping hardware from scratch. They are nothing if not ambitious.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:14 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

> The Bootstrappable Builds folks are working on a bootstrap path from ~512 bytes of machine code plus tons of source code all the way up to a full distro.

Write your C compiler in Lisp or Forth. Forth certainly, and Lisp I think, both have a core that can be expressed in precious little assembly, and then just build everything from source on top.

Cheers,
Wol

DeVault: Announcing the Hare programming language

Posted May 3, 2022 2:28 UTC (Tue) by pabs (subscriber, #43278) [Link] (1 responses)

I asked about this on their IRC channel and got this response from oriansj:

Well we did bootstrap a FORTH from hex: https://github.com/oriansj/stage0/blob/master/stage2/forth.s

and we did bootstrap a garbage collecting Lisp from hex: https://github.com/oriansj/stage0/blob/master/stage2/lisp.s

but if you notice: https://github.com/oriansj/stage0/blob/master/stage2/cc_x...

writing a C compiler in assembly that supports structs, unions, arrays, inline assembly and a bunch more was done in less than 24 hours by an inexperienced C programmer. Who then after started doing bootstrapping speed runs to demonstrate how trivial of a problem it is to implement that level of functionality in a C compiler.

In the decades for which Lisp and FORTH existed, why didn't they solve such a trivial problem?

Or better yet, now that you can see how it is done. Could anyone actually produce a C compiler with the same level of functionality in Lisp or FORTH in the same amount of time (or less?).

It is easy to talk a big game and words are cheap, we have the entire cc_* family of C compilers written in assembly for multiple architectures and in even cross-platform arrangements. If your language was any good at bootstrapping you'd be able to beat that. Then show your language written in less lines of assembly than cc_x86 to prove the point.

Please prove me wrong with working code.

Assembly and C have working code good enough to bootstrap GCC+Guile+Linux for more than a year now. https://github.com/fosslinux/live-bootstrap https://github.com/oriansj/stage0-posix

It is time for Lisp and FORTH to either deliver or learn to stop talking about something they never were good at in the first place and learn to admit Assembly and C won not because "worse is better" but because objectively they are better languages for bootstrapping new and better tools.

DeVault: Announcing the Hare programming language

Posted May 7, 2022 20:35 UTC (Sat) by anton (subscriber, #25547) [Link]

> In the decades for which Lisp and FORTH existed, why didn't they solve such a trivial problem?

What makes you think they didn't? There is CC64, although you may be unhappy with the language features.

The other answer is: What itch would a Lisp or Forth (rather than assembler) programmer scratch by writing a C compiler?

DeVault: Announcing the Hare programming language

Posted May 2, 2022 13:25 UTC (Mon) by excors (subscriber, #95769) [Link] (24 responses)

> What immediately disqualifies hare as a serious contender in the space of systems programming languages in my opinion is that it proclaims to never support proprietary operating systems in the upstream implementation.

That seemed surprising, but it is indeed proclaimed by https://harelang.org/platforms/ :

> Hare (will) support a variety of platforms. Adding new (Unix-like) platforms and architectures is relatively straightforward.
>
> Hare does not, and will not, support any proprietary operating systems.

So it's not just a "this is an early release and we're focused on Unix-like platforms", it sounds like active hostility towards Windows and macOS. And that sounds like a huge barrier to adoption - why would I bother learning a language that wants to stop me writing applications that most users could run?

Even if a third-party compiler provided Windows support, I assume some Unixisms would creep into the standard library and make it hard to port applications. One significant portability issue for a systems language is filenames (which are arbitrary 8-bit sequences on Linux and arbitrary 16-bit sequences on Windows)... except it looks like Hare's path handling is already inadequate for Linux, because all the fs:: functions use 'str' which is specified as a UTF-8-encoded Unicode string (and violating that is undefined behaviour, according to strings::fromutf8_unsafe), so it can't handle all valid Linux filenames, and the inability to handle all Windows filenames is the least of its problems. (Rust has the OsString type to handle both platforms robustly.)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 17:12 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

> it sounds like active hostility towards Windows and macOS. And that sounds like a huge barrier to adoption - why would I bother learning a language that wants to stop me writing applications that most users could run?

I agree. In haproxy we initially used to laugh when users asked whether it worked on Windows. Until one day someone sent us patches to support cygwin and said "OK the performance is terrible and the limitations abysmal but it allows me to check my configs and to progressively familiarize my coworkers with the product without having to start with the big jump". It made sense. We merged the patches (which were not very invasive) and nowadays the cygwin build is part of the CI alongside the other ones and even helps spot portability issues. Is it really used for anything serious on Windows? Probably not; at best they use it as an HTTP sniffer/debugger. Do some users value this availability for their own training and ease of testing? Certainly.

I'm glad we did nothing to make this impossible. That would only have brought some misplaced pride on our side and annoyed users with no valid reason.

Lesson learned: never say "we will never support X or Y", rather "we have no intent of doing it so far".

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:09 UTC (Mon) by excors (subscriber, #95769) [Link] (22 responses)

> it looks like Hare's path handling is already inadequate for Linux

To be a bit more concrete here: If I create a file like "\xffdummy.txt", and try to read the filename with glob::glob or os::diropen, or pass it as a command-line argument, then Hare dies with:

> Abort: Assertion failed: /usr/src/hare/stdlib/strings/cstrings.ha:33:1

because of an "assert(utf8::valid(s));"

If I create a file like "\xf8dummy.txt", then utf8::valid (wrongly) thinks it is valid, so I can read the filename into a str. strings::iter says the first rune (aka Unicode codepoint) is U+935B6D. strings::riter says:

> Abort: /usr/src/hare/stdlib/strings/iter.ha:68:22: Invalid UTF-8 string (this should not happen)

At least it's not a memory-safety error in this case, though it wouldn't be surprising if some other function did non-bounds-checked accesses under the assumption that strings are UTF-8 (as promised by the specification).

Even when it just aborts, that seems unfortunate for a systems programming language - maybe you clone a Git repository with some non-UTF-8 filenames, and your 'ls' and 'rm' were written in Hare so now you can't see or delete the files.

I think the fundamental problem here is that if you want to build a language with Unicode strings, and use it to interact with external systems, you need a good way to handle strings that are not quite Unicode. C/C++/Go/etc just don't bother guaranteeing Unicode. Python 3 got really into Unicode before discovering it didn't work for filenames or lots of other real-world data, and bodged it with surrogateescape (so now Python's Unicode strings aren't Unicode) and with many duplicated APIs between str/bytes/bytearray. Rust uses its type system to provide Path/OsString/etc which handle non-Unicode strings safely, with trait-based conversions that mean the easy cases are still easy to write (File::open("foo.txt") etc) and explicit fallible/lossy conversions when you really need a Unicode string.

Hare wants Unicode strings (which I think is a good goal), but the standard library needs to provide an interface to the not-quite-Unicode real world, and I'm not sure if the language has enough features to ever implement it well.
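The "separate path type with explicit fallible and lossy conversions" idea can be sketched in a few lines. This is only an illustration of the shape of such an interface, not Rust's or Hare's actual API: `RawPath` and its method names are hypothetical, loosely modelled on the fallible/lossy split described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RawPath:
    """Hypothetical path type that stores exactly the bytes the OS handed us."""

    raw: bytes

    @classmethod
    def from_str(cls, text: str) -> "RawPath":
        # The easy case stays easy: any ordinary Unicode string is a valid path.
        return cls(text.encode("utf-8"))

    def join(self, other: "RawPath") -> "RawPath":
        # Path manipulation works on bytes, so it never needs valid UTF-8.
        return RawPath(self.raw.rstrip(b"/") + b"/" + other.raw)

    def to_str(self) -> Optional[str]:
        """Fallible conversion: None if the bytes are not valid UTF-8."""
        try:
            return self.raw.decode("utf-8")
        except UnicodeDecodeError:
            return None

    def to_str_lossy(self) -> str:
        """Lossy conversion for display: bad bytes become U+FFFD."""
        return self.raw.decode("utf-8", errors="replace")


bad = RawPath(b"\xffdummy.txt")
print(bad.to_str())
print(bad.to_str_lossy())
print(RawPath.from_str("repo").join(bad).raw)
```

The point of the design is that code which only opens, renames, or deletes files never converts to a Unicode string at all, while code that prints a name has to choose between failing and losing information.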

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:24 UTC (Mon) by ddevault (subscriber, #99589) [Link] (10 responses)

This is a case that we were aware of when designing these interfaces. For a time, most filesystem-related interfaces accepted a tagged union of either (str | []u8). However, we ultimately decided that the additional complexity was not justified because the use-case is not justified: filenames *should* be UTF-8.

However, it is still possible to bend the language to your will if you know your use-case demands this to be otherwise. You can force non-UTF-8 data into a str type (knowing that you're breaking the language invariants and that the stdlib and third-party code relies on your broken assumptions) via strings::fromutf8_unsafe. You can then pass this into os::remove or something to get rid of your bad file. A tool like rm could be written specifically with this in mind, to clean up bad files. Additionally, in your git example, if ls and rm are implemented in Hare, then so too is probably your git implementation - or your kernel, which would enforce UTF-8 filenames when opening files.

I will note that this particular decision was a bit agonizing, and that we may revisit it before 1.0.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:21 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (4 responses)

I 100% agree with you that, on filesystems that (a) store filenames as byte sequences and (b) do not embed metadata in the filename part of directory entries, all filenames should be well-formed UTF-8.

Sadly, they aren't, and despite entirely agreeing with you that all filenames on Linux systems should be well-formed UTF-8, I find the behaviour excors reports unacceptable in a programming language's standard runtime library.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:27 UTC (Mon) by ddevault (subscriber, #99589) [Link] (3 responses)

I understand where you're coming from in this respect. Again, this was a difficult choice, and we may revisit it, but it simplifies the situation considerably for the 99.999% of use-cases where non-UTF-8 filenames are not present. In the remainder, it's very unlikely that anything worse than the program aborting will occur (e.g. overwriting such files). I would encourage you to present your case for this at the standard library design acceptance review committee (or the filesystem committee, a likely spin-off), which will be formed prior to 1.0.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:39 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (2 responses)

Didn't Python 3 already solve this problem with surrogateescape?

I'm not saying it's an elegant solution, or even necessarily a good solution, but it's less painful for the 99% case (as compared to using a tagged union), and it works for the 1% case, usually without losing any data (unless you're trying to manipulate filenames as strings, in which case any solution will probably be terrible anyway because there's just no good way to do that).

DeVault: Announcing the Hare programming language

Posted May 2, 2022 19:42 UTC (Mon) by ddevault (subscriber, #99589) [Link]

I mean, it's a complex set of trade-offs. To consider the Python approach, we'd have to untangle a pretty large can of worms having to do with string handling. We can't just lift surrogateescape wholesale - Python and Hare have very different goals and we have to think carefully through every implication of such a change so that the language remains consistent and reliable in its design throughout. And yes, it's an inelegant solution - and we prefer the elegant ones.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:40 UTC (Mon) by excors (subscriber, #95769) [Link]

surrogateescape means that code like print(*os.listdir()) can throw a UnicodeEncodeError. The programmer has to manually keep track of which values of the 'str' type are really Unicode and will work with all the standard string functions, and which are nearly-Unicode but will occasionally throw if you try to print or encode them like normal strings. It might be the best hack that's possible in Python given its compatibility constraints, but I think it creates as many problems as it solves. A statically typed language should be able to do better - tracking this kind of information about the range of values is what type systems are for.

At least if you get it wrong in Python, you'll probably just get an exception. In a non-memory-safe language, some code might rely on the promise that strings are Unicode and violating that could cause undefined behaviour; you need to either strictly enforce that promise, or not make the promise at all.
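The surrogateescape behaviour described above can be reproduced without touching the filesystem. The sketch below decodes a non-UTF-8 byte string with Python's `surrogateescape` error handler, shows that it round-trips, and shows that the resulting `str` fails a normal strict encode. The helper names are hypothetical; the codec behaviour is standard Python.

```python
def decode_filename(raw: bytes) -> str:
    """Decode like Python does for filenames: bad bytes become lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_filename(name: str) -> bytes:
    """Reverse of decode_filename: lone surrogates turn back into raw bytes."""
    return name.encode("utf-8", errors="surrogateescape")


def is_really_unicode(name: str) -> bool:
    """True if the str can be encoded as strict UTF-8 (no smuggled bytes)."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = b"\xffdummy.txt"
name = decode_filename(raw)

print(repr(name))                     # the 0xFF byte is carried as U+DCFF
print(encode_filename(name) == raw)   # round trip preserves the bytes
print(is_really_unicode(name))        # but a strict encode would raise
```

Both kinds of value have the same static type, `str`, which is why the programmer has to track the difference by hand.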

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:04 UTC (Mon) by excors (subscriber, #95769) [Link] (4 responses)

Deliberately violating str's UTF-8 invariant sounds scary, and an application developer should probably never pass such strings into any standard library function that expects a str (because they can't know if it'll e.g. try to iterate over codepoints and crash), so I expect they'd have to implement their own 'path' type which wraps a []u8, and write their own functions to concatenate and split and compare path-strings and convert to/from str, and use raw syscalls instead of the os module. And that would be needed by every application that wants to work reliably on real-world Unix systems (where filenames occasionally come from ancient backups and from FAT32 USB sticks and from zip files and from malicious users etc, which won't respect anyone's desire for a perfect Unicode world). That sounds like exactly the sort of widely-used low-level functionality that should be the responsibility of the standard library. And it's the language's responsibility to provide features so the library can implement an API that's both correct and convenient.

Otherwise nearly everyone will write applications with the standard library, and it'll be fine for 99.999% of users, then a few years later they'll get a bug report saying it crashes for one user with a mysterious error message and they'll spend hours debugging it and then spend days replacing the standard library with a new library that actually works, and repeat for every application that has a large number of users. That's a lot of effort that would have been saved by doing it correctly from the start.

(But even if application developers do try to avoid the standard library's path handling, os/+linux/environ.ha's init_environ runs before main and asserts when a non-UTF-8 string is passed on the command line.)

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:11 UTC (Mon) by ddevault (subscriber, #99589) [Link] (2 responses)

> Deliberately violating str's UTF-8 invariant sounds scary

Well, Hare is standardized, and open source, and runs in a standardized environment (x86, though as someone who has read the Intel and AMD CPU manuals, I can attest that it's not very fun). If you need to break the invariant, it's a serious move to consider, must be very well justified, and should raise eyebrows during code review - but you can objectively evaluate the consequences of that decision by examining where your tainted string will end up and planning for its behavior. We even make it easy for you to vendor standard library modules so you can ensure their behavior is consistent with an earlier evaluation. This is an example of "trust the programmer" - it's pretty ill-advised to do this, but if you really need to, you can. Breaking the str invariant is probably a case where you should really reconsider, though. There are less severe examples - forcing a bad value into a global (e.g. null into a non-nullable pointer) and fixing it up during @init is one I've encountered from time to time.

> (But even if application developers do try to avoid the standard library's path handling, os/+linux/environ.ha's init_environ runs before main and asserts when a non-UTF-8 string is passed on the command line.)

Good catch. You can still technically get around this (vendor os and patch it, don't import os and use rt to make the syscalls directly, etc), but I admit that it's going to be very contrived to get around this problem.

Like I said, we were well aware of all of these issues and this is why it was a very difficult decision to go UTF-8-only for paths.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:20 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

> If you need to break the invariant, it's a serious move to consider, must be very well justified, and should raise eyebrows during code review

*My* concern is less about the code review that adds the `_unsafe` call. I worry more about the code review that later edits the function with the `_unsafe` outside of the default context view doing something "convenient" like printing it. Maybe the variable would handily be named `path_for_os_calls_only`, but my experience is that no one is that nice to their time-separated co-developers.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 20:23 UTC (Mon) by ddevault (subscriber, #99589) [Link]

At the very least, I would expect any use of _unsafe to include a comment explaining why it was done in spite of the risks. Would not look forward to being that future colleague regardless.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 14:30 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

> And that would be needed by every application that wants to work reliably on real-world Unix systems (where filenames occasionally come from ancient backups and from FAT32 USB sticks and from zip files and from malicious users etc, which won't respect anyone's desire for a perfect Unicode world)

Well, I can say for certain that there are many places where it's not just ancient backups nor FAT32 USB sticks, but just regular file names used every day. As soon as you have shared file servers for lots of employees, there's never a single moment where you can declare that the encoding will change because you'll break a lot of shortcuts and file names for plenty of employees. Thus you keep in place the perfectly working system you used to have, and do that for decades if needed because in the end the one without encoding is still the one that works best (most users access only their own files with their machine's encoding, and shared files rarely use fancy chars). I personally never put non-ASCII chars in my file names so I'm fine but I've seen quite a bunch of filesystems with mixes of CP1252 from Windows users via a Samba share and ISO8859-1 from UNIX/Linux users via an NFS share.

DeVault: Announcing the Hare programming language

Posted May 2, 2022 18:43 UTC (Mon) by bartoc (guest, #124262) [Link] (9 responses)

C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else). This means constructing them from a char* that you already know is valid utf-8 is….. a memcpy. They are a complete and utter mess. C has such a type too, and it's just as big of a mess there too.

Python's PEP 383 is actually a pretty cool scheme, but is a bit “nuts”.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 10:41 UTC (Tue) by peter-b (guest, #66996) [Link] (2 responses)

> C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else).

To my enormous displeasure, unfortunately it isn't UB for a char8_t* to point to a string buffer that doesn't contain UTF-8.

> They are a complete and utter mess.

I can absolutely confirm this.

DeVault: Announcing the Hare programming language

Posted May 3, 2022 11:07 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> it isn't UB for a char8_t* to point to a string buffer that doesn't contain UTF-8.

One reason that comes to my mind is that it would need to eschew `u8ptr++` because if it currently points at a code point encoded with multiple bytes, incrementing one byte would make it UB, no? Or would `++` inspect the byte encoded length of the current code point and jump an appropriate amount? That certainly seems novel too.

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:50 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Because both of the increment operators in C++ can be overridden for a type, this isn't difficult to do, if C++ wanted to do it.

But I don't expect C++ to actually do that lifting for the same reason it still has both these silly operators in the first place. Backward compatibility trumps all other considerations.

Rust's str actually provides both iterations, if you want the underlying *bytes* you can have those, and of course individual bytes are just UTF-8 code units and on their own don't necessarily mean anything specifically, but if you want "characters" (Rust's char) you can iterate over those and under the hood it is indeed moving forward the appropriate number of bytes each time.
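The "move forward the appropriate number of bytes each time" step can be sketched directly. The example below walks a UTF-8 byte string one code point at a time by reading the sequence length from the lead byte, and contrasts that with plain byte iteration. It assumes well-formed UTF-8 input, and the function names are hypothetical; this is an illustration of the idea, not how Rust implements its iterators.

```python
from typing import Iterator


def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


def iter_code_points(data: bytes) -> Iterator[str]:
    """Yield one character per step, advancing by the encoded length."""
    index = 0
    while index < len(data):
        length = utf8_sequence_length(data[index])
        # decode() still validates the continuation bytes of each sequence.
        yield data[index:index + length].decode("utf-8")
        index += length


encoded = "a\u00e9\u20ac\U0001F600".encode("utf-8")

print(len(encoded))                          # code units (bytes)
print(len(list(iter_code_points(encoded))))  # code points
```

Iterating `encoded` directly yields the individual code units, which on their own do not necessarily mean anything; `iter_code_points` yields whole characters.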

DeVault: Announcing the Hare programming language

Posted May 3, 2022 15:02 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (5 responses)

> C++ has char8_t, which is utf-8 or UB and also pointers to it don't alias char*s (or anything else). This means constructing them from a char* that you already know is valid utf-8 is….. a memcpy. They are a complete and utter mess. C has such a type too, and it's just as big of a mess there too.

char* can alias anything, so I would be very surprised if they made a special exception for char8_t... are you absolutely sure that it really is UB to alias them?

DeVault: Announcing the Hare programming language

Posted May 3, 2022 19:39 UTC (Tue) by foom (subscriber, #14868) [Link] (4 responses)

You are allowed to access any object as bytes via a `char*`, but the opposite -- accessing a char-array as some other type -- is (surprisingly!!), not obviously allowed. See e.g. this discussion, https://stackoverflow.com/questions/12612488/aliasing-t-w...

DeVault: Announcing the Hare programming language

Posted May 4, 2022 1:37 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (3 responses)

Of course that's not allowed. Then there would effectively be no aliasing rule at all, because "X can be aliased as Y" is generally understood to be a transitive relationship.[1] If you can alias anything to char* and then alias char* back to anything else, then you can alias anything to anything else and all bets are off.

Anyway, any halfway-decent compiler will identify the memcpy as redundant and optimize it out, so you should just bite the bullet and call memcpy.

[1]: It has to be transitive. The alternative would be "I have a Y*, but I don't know whether it can be aliased as a Z*, because someone else gave it to me, and I don't know whether it was originally a Y* or originally an X* that they aliased to a Y*."

DeVault: Announcing the Hare programming language

Posted May 4, 2022 9:09 UTC (Wed) by excors (subscriber, #95769) [Link] (2 responses)

I don't think it's meaningful to talk about transitivity of aliasing, because aliasing is not specified as a relationship between two things of the same kind. It's a relationship between the type of an object, and the type of the pointer being used to access it. (In particular it's not a relationship between two pointers, so it can't be transitively extended to a third pointer.)

If I understand correctly, C++20 says the type of an object is determined by its definition, or by (placement) new, or by assignment to a union, etc, or objects can be "implicitly created" by malloc()/memcpy()/etc and the compiler will act as if it magically determined the type of those objects in order to avoid undefined behaviour when they are subsequently used.

You can access an object of any type T through a pointer to char. You can't access an object of type char through a pointer to T, unless T is char (or similar). That's the easy part.

I think the tricky part is: When does an object have type char? C++20 says "An operation that begins the lifetime of an array of char, unsigned char, or std::byte implicitly creates objects within the region of storage occupied by the array". I think that means "char buf[256];" and "char *buf = new char[256];" are implicitly creating objects of an as-yet-unknown type and size. Regardless of the type, you can access it through char*. If you subsequently access it through char8_t*, that means the type of the implicitly created objects must have been char8_t (to avoid undefined behaviour here), so accessing through char8_t* is also fine.

That means there should be no need to memcpy into an array that was explicitly declared as char8_t, you can just cast the char pointer, because it's really a pointer to char8_t objects even though you declared it and allocated it and used it as char.

(But as is often the case with C++, I'm only about 60% confident in that interpretation.)

DeVault: Announcing the Hare programming language

Posted May 5, 2022 0:32 UTC (Thu) by foom (subscriber, #14868) [Link] (1 responses)

"Technically", I think it might only be defined in C++20 and later, via the change in <https://wg21.link/p0593> -- though, to the best of my knowledge, no compiler made any changes to become compliant with these changes.

DeVault: Announcing the Hare programming language

Posted May 6, 2022 0:38 UTC (Fri) by khim (subscriber, #9252) [Link]

Without that change it's not really possible to use mmap (or the Windows equivalent), so it was just a matter of fixing the standard. None of the compilers ever broke any code which does these things.

The only thing which may trigger UB there is unaligned access (and yes, it may even happen on x86, since the compiler can autovectorize and use SSE instructions).

DeVault: Announcing the Hare programming language

Posted May 2, 2022 23:54 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

utf8::valid() is not a great function as it stands. If the invariants actually hold, it's just a long-winded way to say "true". When in fact they don't hold, it's hard to say with confidence anything about the state of the system without an intimate knowledge of how all the parts fit together, a view that Hare's users would not have.

Attempts to inhabit states where you lack confidence in your own invariants are a bad idea. Hare should either say (like for example Go) that str is just a bunch of bytes and might not be UTF-8, and thus utf8::valid is a useful function, or, it should admit that yeah, there are a lot of APIs where it's just bytes and we need to manage that at the interface to have str work.
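For comparison, here is a rough sketch of what "managing that at the interface" looks like in Rust (not Hare): raw bytes are validated once at the boundary, and everything past that point can rely on the invariant. The function name `decode` is made up for illustration; `std::str::from_utf8` and `Utf8Error::valid_up_to` are the real standard library calls.

```rust
/// Validates raw bytes once, at the boundary.
/// On success the caller gets a &str whose UTF-8 invariant holds.
/// On failure the caller gets the length of the longest valid prefix.
/// `decode` is a hypothetical helper name.
fn decode(raw: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(raw).map_err(|err| err.valid_up_to())
}

fn main() {
    // Well-formed input passes straight through.
    println!("{:?}", decode(b"hare"));
    // 0xff can never appear in UTF-8, so validation stops after "ok".
    println!("{:?}", decode(&[b'o', b'k', 0xff]));
}
```

With this shape there is never a str in circulation whose validity is in doubt, so a later "is this valid?" check has nothing to report.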

Rust frequently needs some of the fairly big guns of parametric polymorphism to do what seems (as the user) like basic stuff. To make file open on a string work, for example, Rust needs AsRef<Path>. That's two types you probably never think about, and a bunch of implementation boilerplate. No actual machine code is emitted; we're only satisfying the type system that, in fact, we know what we're doing. For this reason, I can see Hare won't want to go that route even though I personally prefer it.
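A small sketch of the pattern, using the same `P: AsRef<Path>` bound that `std::fs::File::open` declares. The function `file_stem_of` is a made-up example that avoids touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Accepts anything that can be viewed as a Path: &str, String, PathBuf, ...
/// `file_stem_of` is a hypothetical helper used only to show the bound.
fn file_stem_of<P: AsRef<Path>>(path: P) -> Option<String> {
    path.as_ref()
        .file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
}

fn main() {
    // The caller never mentions AsRef or Path; the bound does the work.
    println!("{:?}", file_stem_of("notes/todo.txt"));
    println!("{:?}", file_stem_of(String::from("notes/todo.txt")));
    println!("{:?}", file_stem_of(PathBuf::from("notes/todo.txt")));
}
```

All three calls go through the same generic function; the conversion is a reinterpretation of the existing bytes rather than a copy.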

DeVault: Announcing the Hare programming language

Posted May 3, 2022 20:25 UTC (Tue) by ssokolow (guest, #94568) [Link]

> * What's up with these abbreviations, like e.g. "fmt" and "std". Typing a standard library's main scope into a search engine shouldn't get you results about "sexually transmitted diseases" and what's up with "Vec"? C++ already admitted that "vector" was the wrong name for a dynamically sized array, rust not only copies that but then abbreviates it on top. Also "fn" and "str" and "recv" and "ctx" etc.

The abbreviations are part of its OCaml ancestry. That's also where the syntax for named lifetimes comes from (it's OCaml's syntax for generic type parameters). More generally, abbreviations like that are common in functional programming languages. See, for example, Lisp's famous "defun".


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds