
The long road to getrandom() in glibc

By Jonathan Corbet
January 9, 2017
The GNU C library (glibc) 2.25 release is expected to be available at the beginning of February; among the new features in this release will be a wrapper for the Linux getrandom() system call. One might well wonder why getrandom() is only appearing in this release, given that kernel support arrived with the 3.17 release in 2014 and that the glibc project is supposed to be more receptive to new features these days. A look at the history of this particular change highlights some of the reasons why getting new features into glibc is still hard.

Glibc remains a conservative project. There are a number of good reasons for that, but it does mean that developers proposing new features tend to run into roadblocks; that has certainly happened with getrandom(). The kernel's random number subsystem maintainer, Ted Ts'o, has been known to complain about the delay in support for this system call; he has suggested that "maybe the kernel developers should support a libinux.a library that would allow us to bypass glibc when they are being non-helpful." Peter Gutmann resorted to channeling Sir Humphrey Appleby when describing the glibc project's approach to getrandom(). But what really caused the delay here?

Glibc bug 17252, requesting the addition of getrandom(), was filed in August 2014, roughly two months before the 3.17 kernel release. Glibc developer Joseph Myers responded twice in the following six months, suggesting that, if anybody wanted getrandom() in glibc, they would need to go onto the project's mailing list and work to drive the development forward. The first reason for the delay is thus simple: nobody stepped up to do the work.

One might wonder why it took so long for somebody to come along and implement a simple system-call wrapper. In its essence, the code that will appear in the 2.25 release is:

    /* Write up to LENGTH bytes of randomness starting at BUFFER.  Return
       the number of bytes written on success and -1 on failure.  */
    ssize_t
    getrandom (void *buffer, size_t length, unsigned int flags)
    {
      return SYSCALL_CANCEL (getrandom, buffer, length, flags);
    }

Such a function does not seem particularly hard to write. The original patch for getrandom() support, finally posted by Florian Weimer in June 2016, was rather more complicated than that, though. Weimer, knowing that the glibc project is conservative and wants the library to work in almost all situations, attempted to cover every base he could think of. So the patch included documentation updates, test programs, and several other details that, in turn, led to a number of sticking points that surely slowed the eventual acceptance of the patch.

The first obstacle, though, had little to do with the patch itself; it was, instead, brought about by the project's reluctance to add wrappers for Linux-specific system calls at all. Glibc does not see itself as a Linux-specific project, so it naturally prefers standardized interfaces that can be supported on all systems. The project has sporadically discussed its policy around Linux-specific calls over the last couple of years. In 2015, Myers described it as:

The result is a de facto status of "syscall wrappers present for almost all syscalls added up to Linux 3.2 / glibc 2.15 but for nothing added since then", which certainly doesn't make sense.

A draft policy for Linux-specific wrappers has existed since about then but, lacking consensus in a strongly consensus-oriented project, it has never achieved any sort of official status. Thus, even though this policy states that system-call wrappers should be added by default in the absence of reasons to the contrary, Roland McGrath responded to the initial patch posting with a terse message saying: "You need to start with rationale justifying the new nonstandard API and why it belongs in libc." That justification was not hard, given that a number of projects have been asking for this wrapper, and that adding the BSD getentropy() interface on top of it is easily done, but this challenge foreshadowed much of what was to come.
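To illustrate why layering getentropy() on top of getrandom() is considered easy, here is a minimal sketch. It is not the glibc implementation; the name sketch_getentropy() is hypothetical, and the code assumes a system whose <sys/random.h> declares getrandom() (glibc 2.25 or later). The 256-byte limit and the EIO error follow the documented BSD interface.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(); assumes glibc 2.25 or later */
#include <sys/types.h>

/* Hypothetical helper, not the glibc code.  Fill BUFFER with exactly
   LENGTH random bytes.  Return 0 on success, -1 with errno set on
   failure.  */
int
sketch_getentropy (void *buffer, size_t length)
{
  unsigned char *p = buffer;

  /* The BSD interface refuses requests larger than 256 bytes.  */
  if (length > 256)
    {
      errno = EIO;
      return -1;
    }

  while (length > 0)
    {
      /* getrandom() may return fewer bytes than requested.  */
      ssize_t n = getrandom (p, length, 0);
      if (n < 0)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */
          return -1;
        }
      p += n;
      length -= (size_t) n;
    }
  return 0;
}
```

The loop is what turns the "up to LENGTH bytes" contract of the system call into the all-or-nothing contract that getentropy() callers expect.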

A trickier question was: what should glibc do when running on pre-3.17 kernels (or non-Linux kernels) that lack getrandom() support? The initial patch included a set of emulation functions so that getrandom() calls would always work; they would read the data from /dev/random or /dev/urandom as appropriate. Doing so involved keeping open file descriptors to those devices (lest later calls fail if the application does a chroot()). But using file descriptors in libraries is always fraught with perils; applications may have their own ideas of which descriptors are available, or may simply run a loop closing all descriptors. So the code took pains to use high-numbered descriptors that applications presumably don't care about, and it used fstat() to ensure that the application had not closed and reopened its descriptors between calls.

This usage of file descriptors drew a number of comments; it is something that glibc tries to avoid whenever possible. After some discussion, it was concluded that glibc should provide only a wrapper for the system call, without emulation. If an application calls getrandom() on a kernel where that system call is not supported, the glibc wrapper will simply fail with errno set to ENOSYS and it will be up to the application to use a fallback. That decision removed a fair amount of code and one obstacle to merging.

In writing the patch, Weimer worried that there may be a number of applications out there with their own function called getrandom(), which may or may not provide the same interface and semantics as the glibc version. The prospect was especially troubling because a getrandom() call that does not actually return random data may not cause any visible problems in the application at all — until some attacker notices this behavior and exploits it. So he employed a bunch of macro and symbol-versioning trickery to detect and prevent confusion over which getrandom() function to use.

This feature, too, was unpopular; glibc does not normally add extra layers of protection around its symbols in this way. The tricks made it impossible to take the address of the function, among other things. After extensive discussion, Weimer backed down and removed the interposition protection, but he clearly was not entirely happy about it.

The most extensive argument, though, was over whether getrandom() should be a thread cancellation point. In other words, what should happen if pthread_cancel() is called on a thread that is currently blocked in getrandom()? The original patch did make getrandom() into a cancellation point; it still behaves that way in the version merged for 2.25, but it had to survive a lot of argument to get there.

Weimer wanted getrandom() to be a cancellation point because the system call can block indefinitely, even if it almost never blocks at all. The Python os.urandom() episode showed that this blocking can, in rare situations, cause real problems. So, he said, it should be possible for a cancellation-aware program to respond to an overly slow getrandom() call.

The objections here seemed to be, for the most part, objections to cancellation points in general. It is true that cancellation points are problematic in a number of ways. To the implementation issues one can add the fact that most programs are not cancellation-aware and may not respond well to a thread cancellation in an unexpected place. A version of getrandom() that adds a new cancellation point could thus lead to unfortunate behavior. Additionally, getrandom() is supposed to always succeed; the possibility of cancellation adds a failure mode that is not a part of the system call itself.

On the other hand, Carlos O'Donell argued that getrandom() is analogous to read() and thus should behave the same way; read() is a cancellation point. The argument went back and forth over months, and included detours into whether there should be a separate getrandom_nocancel() function or an additional "cancellation point please" argument to getrandom(). In the end, getrandom() remained an unconditional cancellation point. The BSD-compatible getentropy() implementation included in the patch is not a cancellation point, though.
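What being a cancellation point means in practice can be seen with read(), the call used as the analogy above. The sketch below blocks a thread in read() on an empty pipe and then cancels it; the function names are hypothetical, and read() stands in for getrandom() only because getrandom() so rarely blocks that it makes a poor demonstration. Older toolchains need -pthread when compiling.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Thread body: block in read() on a descriptor that never gets data.  */
static void *
blocked_reader (void *arg)
{
  int fd = *(int *) arg;
  char byte;

  /* read() is a cancellation point: the thread either blocks here until
     it is cancelled, or acts on an already-pending cancellation request
     as soon as it enters the call.  */
  ssize_t n = read (fd, &byte, 1);
  (void) n;
  return NULL;                  /* not reached when cancelled */
}

/* Hypothetical demonstration: return true if a thread blocked in read()
   was terminated by pthread_cancel().  */
bool
cancel_blocked_read (void)
{
  int fds[2];
  pthread_t thread;
  void *result = NULL;

  if (pipe (fds) != 0)
    return false;
  if (pthread_create (&thread, NULL, blocked_reader, &fds[0]) != 0)
    {
      close (fds[0]);
      close (fds[1]);
      return false;
    }

  pthread_cancel (thread);
  pthread_join (thread, &result);

  close (fds[0]);
  close (fds[1]);
  return result == PTHREAD_CANCELED;
}
```

A program that is not cancellation-aware would leak any resources the thread held at the moment of cancellation, which is the root of the objections described above.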

With these issues resolved, the conversation came to a close on December 12 when getrandom() and getentropy() were merged into the glibc repository. A feature that has been shipping in the Linux kernel for over two years will finally be available to application developers without the need to create special system-call wrappers. Now all that's left is all the other Linux-specific system calls that still lack glibc wrappers.



The long road to getrandom() in glibc

Posted Jan 9, 2017 23:06 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

It's frustrating to see resistance to using internal file descriptors in a library. One of the goals of the O_CLOEXEC work from a few years ago was to make it easier for libraries to use file descriptors without having to coordinate with other libraries in the same process. If we still can't use private file descriptors, what was the point? I say that we consider private file descriptors 100% fine and consider processes that do silly things like loop over all FDs and reopen them to simply be broken.

FWIW, Windows programmers have no qualms about keeping private HANDLEs around despite the ability under Windows to enumerate HANDLE values, do the equivalent of dup2, and so on.

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

There's a difference - HANDLE namespace is not dense, so developers are not tempted to do stupid close-and-reopen tricks.

The long road to getrandom() in glibc

Posted Jan 10, 2017 0:57 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

I'm also in favor of the various perennial proposals to cast off the oppressive yoke of POSIX and just randomize file descriptor allocation.

The long road to getrandom() in glibc

Posted Jan 10, 2017 20:53 UTC (Tue) by lsl (subscriber, #86508) [Link]

Impossible. That would break lots of real programs, not just POSIX. Those might not be recently-written programs (at least one would hope so) but people still rely on them.

The long road to getrandom() in glibc

Posted Jan 10, 2017 22:08 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Making it opt-in would work. I.e. a new O_ option for open() and friends to allocate an FD from the new space.

And perhaps a thread/process-wide default.

The long road to getrandom() in glibc

Posted Jan 13, 2017 13:03 UTC (Fri) by alonz (subscriber, #815) [Link]

For even cleaner semantics, make all such new-space FDs always behave as if O_CLOEXEC was set. You then magically get rid of (all? most?) arguments against FD randomization.

The long road to getrandom() in glibc

Posted Jan 13, 2017 19:59 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Makes sense. And if somebody wants to have non O_CLOEXEC descriptors then they can just turn it off explicitly.

The long road to getrandom() in glibc

Posted Jan 13, 2017 21:20 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

If you really want that behaviour it seems a lot saner to enforce CLOEXEC being set, rather than silently behaving as if it were set.

The long road to getrandom() in glibc

Posted Feb 14, 2017 18:53 UTC (Tue) by nix (subscriber, #2304) [Link]

You could do this by adding an internal do-not-use __O_RANDOM_ALLOC and making the publicly-visible O_RANDOM_ALLOC equal to __O_RANDOM_ALLOC | O_CLOEXEC.

The long road to getrandom() in glibc

Posted Jan 10, 2017 8:04 UTC (Tue) by vstinner (subscriber, #42675) [Link]

Python uses a private FD with O_CLOEXEC set, but that doesn't prevent users or libraries from making mistakes and running into issues: http://bugs.python.org/issue21207

"I've seen an issue with using urandom on Python 3.4. I've traced down to fd being closed (not by core CPython, but by third party library code). After this, access to urandom fails. (....) OSError: [Errno 9] Bad file descriptor"

The workaround is to call fstat() and store st_dev and st_ino to check if the FD was *closed or replaced*. If the FD was replaced, os.urandom() leaves the FD open because "it probably points to something important for some third-party code" and opens a new FD... Not ideal, but "it works"...

Getting random bytes from the OS in a portable way takes around 600 lines of code: https://github.com/python/cpython/blob/master/Python/rand... !

The long road to getrandom() in glibc

Posted Jan 10, 2017 8:39 UTC (Tue) by vstinner (subscriber, #42675) [Link]

> Python uses a private FD with O_CLOEXEC set (...)

Hum, I forgot to explain why, it's also an interesting story. With a lot of threads and high system load, the Python os.urandom() function failed with the NotImplementedError("/dev/urandom ...") exception:
http://bugs.python.org/issue18756

The C code considered that the device is not available if open("/dev/urandom", O_RDONLY) fails with an error. There was no specific case for EMFILE or ENFILE error. The private FD was added to use at most one file descriptor.

Note: Java also keeps one persistent FD to /dev/urandom.

The long road to getrandom() in glibc

Posted Jan 10, 2017 13:28 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

I wouldn't even bother trying to detect replacement or closing or whatever. Programs that close file descriptors that they do not own are broken. We should not try to accommodate broken programs.

The long road to getrandom() in glibc

Posted Jan 10, 2017 14:36 UTC (Tue) by pizza (subscriber, #46) [Link]

> We should not try to accommodate broken programs.

Meanwhile, in the real world, the programs dictate the choice of your system, not the other way around.

The long road to getrandom() in glibc

Posted Jan 10, 2017 16:14 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Programs used to dereference NULL and get zero. Programs used to expect to be mapped in the same location every time. Programs used to expect unrestricted ptrace. None of these assumptions hold today, and we're better off for it.

The long road to getrandom() in glibc

Posted Jan 10, 2017 14:53 UTC (Tue) by nix (subscriber, #2304) [Link]

But 'loop and close everything' has been a de-facto standard for literally decades. You just declared much of the Unix world broken and not worth accommodating. This is not the way glibc development, at least, works. (Thank goodness.)

Programs that do clearly broken things like dereferencing NULL pointers are freely broken (unlike on, say, Solaris or God forbid hpux) -- but programs that do things that Unix programs have been doing for decades in very large numbers can't just be broken by fiat like that, even if they are objectively horrible things to do.

The long road to getrandom() in glibc

Posted Jan 10, 2017 16:16 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Looping and closing *in* *a* *child* *process* before exec is fine and won't break anything. A program looping over FDs and expecting to keep running is badly broken, and very few programs actually do that. If I'm wrong about that last bit --- about programs needing precise control over all their file descriptors --- please point me at an example.

The long road to getrandom() in glibc

Posted Jan 10, 2017 16:51 UTC (Tue) by excors (subscriber, #95769) [Link]

haypo pointed at http://bugs.python.org/issue21207 which points at https://github.com/fail2ban/fail2ban/blob/1c65b946171c3bb... which daemonizes itself by forking then closing all file descriptors, and it doesn't call exec. Searching the web for instructions on how to write a daemon shows plenty of other people promoting the same pattern.

The long road to getrandom() in glibc

Posted Jan 10, 2017 17:07 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Then these programs must change. If libraries cannot have private file descriptors, far too many useful constructs become impossible. What do you propose? SysV-like ad-hoc handles for resources instead of file descriptors? Some kind of alternate file descriptor namespace?

Programs that access resources that they do not own are broken and need to be fixed no matter how painful it may be. We have symbol versioning and such to preserve old, broken behavior for old, broken programs, and the versioning approach will continue to work for private file descriptors. Programs compiled these days need to be modified so that they don't free resources that they don't own.

The long road to getrandom() in glibc

Posted Jan 10, 2017 18:15 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Then these programs must change.
And we also need flying unicorns.

> If libraries cannot have private file descriptors
They cannot, not safely.

> SysV-like ad-hoc handles for resources instead of file descriptors? Some kind of alternate file descriptor namespace?
The same descriptor namespace, but growing down from some large value (MAXINT?) and not having all the brain-deadness associated with regular descriptors.

The long road to getrandom() in glibc

Posted Jan 11, 2017 14:53 UTC (Wed) by mirabilos (subscriber, #84359) [Link]

No, they just need to un-deprecate sysctl(), in particular the Linux-specific kernel.random.uuid — on BSD, we also just use sysctl CTL_KERN.KERN_ARND for this, and it works in chroots and everything else.

The long road to getrandom() in glibc

Posted Jan 21, 2017 4:32 UTC (Sat) by njs (guest, #40338) [Link]

I know the thread has drifted, but you realize that the original article was all about how Linux did exactly that some years ago :-) (modulo using a dedicated syscall instead of overloading sysctl)

The long road to getrandom() in glibc

Posted Jan 10, 2017 10:13 UTC (Tue) by smcv (subscriber, #53363) [Link]

> One of the goals of the O_CLOEXEC work from a few years ago was to make it easier for libraries to use file descriptors without having to coordinate with other libraries in the same process

If you fork-and-exec, you still have to coordinate with the other libraries in the same process to make sure they use O_CLOEXEC, SOCK_CLOEXEC, FD_CLOEXEC, etc. on every fd (in practice not necessarily feasible if you use lots of libraries); or iterate through fds between fork and exec to close them all, except for a whitelist of deliberately-inherited fds. Otherwise, your activated D-Bus service inherits "private" fds from dbus-daemon's use of libselinux, for example.

(I maintain libdbus and contribute to GLib, both of which: want to use both private fds and fork/exec; have had bugs where their private fds were not close-on-exec despite going to some effort to add the right CLOEXEC flags everywhere; and aim to be portable to platforms that don't have CLOEXEC, so have to have fallback code to use FD_CLOEXEC every time they open a fd anyway.)

The long road to getrandom() in glibc

Posted Jan 10, 2017 13:33 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

If you fork and close file descriptors in the child before you exec, there's no problem. The reason we have O_CLOEXEC is to solve the *opposite* problem, where a process forks, closes only the file descriptors it knows about, and then execs. In this case, a library keeping a private file descriptor without O_CLOEXEC is dangerous, since there's no way of keeping that file descriptor out of child processes. Setting O_CLOEXEC after the fact with F_SETFD is racy. With O_CLOEXEC, this problem disappears.

If you want to be portable, sure, close FDs in the child before exec. You'll be the only thread running, and you'll be running only async-signal-safe code anyway, so what's the problem?

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:14 UTC (Mon) by nix (subscriber, #2304) [Link]

Only two years? That's nothing! Florian also shepherded something of mine through review in this cycle which had been sitting in durance vile for *eight* years. (Most of that was my fault, though -- Uli nacked it, and I didn't move very fast to try to get it in when the glibc governance style changed. Counting more conventionally it was only going through review for about a year, which isn't bad for something that might break just about any architecture you name and uncovered half a dozen mostly latent bugs -- but counting in the same way, getrandom() took six months.)

Florian is a tower of strength without which glibc would be much less secure than it is now: almost all the release-to-release security improvements in glibc have his imprint on them somewhere. (Joseph is a similar power in the land with regard to making libm work better.)

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:43 UTC (Mon) by PaXTeam (guest, #24616) [Link]

i'm not sure that adding the obsolete and weak SSP to a system library in 2016AD is that much of a success when much better alternatives have existed for longer than SSP itself...

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:56 UTC (Mon) by ay (subscriber, #79347) [Link]

Could you please qualify how SSP is obsolete and weak?

The long road to getrandom() in glibc

Posted Jan 10, 2017 4:19 UTC (Tue) by thestinger (subscriber, #91827) [Link]

It only defends against linear stack overflows. An earlier implementation from 1999 (StackGuard XOR canaries) also protected return pointers against direct writes. There are better alternatives now.

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:38 UTC (Tue) by nix (subscriber, #2304) [Link]

Yeah, I'm working on improvements now, albeit slowly: XOR canaries seem at first sight likely to have a prohibitive performance impact (pipeline stalls), though I'll try it out before giving up on that idea: but for now I'm going to aim for straight comparisons, probably of per-ELF-object random objects filled in by the kernel a-la OpenBSD, which will hopefully let us stack-protect the one unprotected piece, ld.so. (As to how, say, -fstack-protector-strong is an improvement over *nothing at all*, I think that speaks for itself. Obviously there are better approaches, but are there any ready to apply to entire distros? No. Not without a *lot* more work and testing if nothing else, while -fstack-protector is known to work at the scale of distros because it's already being *used* there. I'm merely extending protections that are already there for the entire distro bar glibc to glibc as well.)

The long road to getrandom() in glibc

Posted Jan 10, 2017 17:09 UTC (Tue) by thestinger (subscriber, #91827) [Link]

-fstack-protector-strong is better than nothing, but there are tons of memory corruption vulnerabilities and the most common ones are now heap overflows and use-after-free, not stack overflows and particularly not linear stack overflows. It doesn't even protect against relatively small (not arbitrary) stack out-of-bounds accesses beyond adding some usually irrelevant padding. It's also a probabilistic defence relying on the canary not being leaked. I really don't think you gain much by aiming to cover the last remaining bits of code without it. How often are ld.so vulnerabilities relevant, particularly stack buffer overflows and even more specifically linear ones? If you're worried about suid binaries, you should really just be setting NO_NEW_PRIVS on all application layer code as Android does and also doing away with a system of privilege escalation where unsafe code runs in an untrusted environment vs. memory safe code exposing a service. Speaking of which, why the hell is all of the new systemd code written in C? They don't even have a use case for a language like Rust, it could all just be in a GC language like Go with no issues and perhaps even better performance due to easier maintenance and reusing better event loop code.

SafeStack is ready to be applied to entire distributions. HardenedBSD is using it globally and Google is working on integrating it into Android (it's mostly stalled on bikeshedding at the moment). It separates the stack into one with all of the return pointers, register spills and safe data (no overflows or address leaks) and keeps the data where overflows can occur separately. It barely has a performance hit, and if you really want to you can still use SSP with it. Linear overflows can't get to the safe stack as long as there are guard pages, so it's a deterministic defence against that with probabilistic mitigation of arbitrary writes via ASLR / libc stack randomization (reserve random runs of guard pages on either side - and with typical 8M stacks, there's nearly zero cost since they already span multiple 2M regions anyway). It's too bad that there aren't hardware features available for a great implementation anymore (i.e. segmentation).

Clang's CFI implementation is similarly ready for deployment and a subset of it (C++ virtual method calls) is being used by Chrome's 64-bit Linux builds (distributions can use it too but they typically don't care about stuff like this) already with the type cast checking on the horizon. It's annoying though, since it depends on LTO and requires fixing a bunch of undefined indirect calls. It only protects indirect calls rather than also covering returns (performance / size would be a major issue for returns). LTO is risky since it makes all of the latent undefined behavior in real world code significantly more dangerous and we don't have good tools to catch it. UBSan and ASan are able to catch a subset but are missing tons of coverage, and only catch it when it occurs at runtime, which may not happen for UB resulting in vulnerabilities in edge cases, unless you use trapping UBSan, which is costly and can be painful to debug but is production ready at least in Clang.

The long road to getrandom() in glibc

Posted Jan 13, 2017 18:04 UTC (Fri) by nix (subscriber, #2304) [Link]

> -fstack-protector-strong is better than nothing, but there are tons of memory corruption vulnerabilities and the most common ones are now heap overflows and use-after-free, not stack overflows and particularly not linear stack overflows
Well yeah, but there was no point my implementing a fix for that because Florian's already working on one.

It specifically protected me against CVE-2015-7547. That one made headlines and was a remote exploit. Obviously it doesn't protect you against everything: why on earth would anyone assume that it would?

The long road to getrandom() in glibc

Posted Feb 14, 2017 19:04 UTC (Tue) by nix (subscriber, #2304) [Link]

> SafeStack is ready to be applied to entire distributions.
SafeStack is really cool, but by its very nature splitting the stack in two is a great big ABI break, requiring coordinated rebuilds of literally everything. There's a reason Android and the BSDs are doing it first -- as integrated systems, that sort of big cross-project change is much easier for them. (Also, they frankly don't have as much weird edge-case software doing strange things as Linux does. Most of that software is quite unimportant, I suppose.)

The long road to getrandom() in glibc

Posted Jan 10, 2017 21:12 UTC (Tue) by PaXTeam (guest, #24616) [Link]

> XOR canaries seem at first sight likely to have a prohibitive performance impact (pipeline stalls),

replacing speculation with real data, ssp-all has an overhead of >25% on a workload where RAP has <5%. should i ask Intel for a refund since my CPU doesn't seem to know or care about those 'pipeline stalls'? ;)

> probably of per-ELF-object random objects filled in by the kernel a-la OpenBSD,

just for the record, that idea doesn't originate from OpenBSD but from a hardened gentoo discussion back in around 2003-2004 or so (maybe even earlier).

> which will hopefully let us stack-protect the one unprotected piece, ld.so.

i've had no problems protecting glibc (ld.so included) with RAP, so i guess you're just running into (and not fixing) the implementation mistakes of SSP.

> Obviously there are better approaches, but are there any ready to apply to entire distros? No.

i recompiled a gentoo system with enough packages to run chromium and had no problems with RAP's XOR cookie approach whatsoever.

> I'm merely extending protections that are already there for the entire distro bar glibc to glibc as well.)

IMHO your time would be much better spent on fixing glibc's abuse of function pointers as there's lots of horror and actual bugs to be found there...

The long road to getrandom() in glibc

Posted Jan 11, 2017 16:33 UTC (Wed) by clump (subscriber, #27801) [Link]

Why all the condescending remarks? I've read your posts over the years and can't imagine what good comes from such an approach.

The long road to getrandom() in glibc

Posted Jan 11, 2017 17:10 UTC (Wed) by PaXTeam (guest, #24616) [Link]

what good comes from your condescending remarks? useless rhetorics cuts both ways you see. now if you can stop attacking the messenger and address the technical points (you know, "try to be polite, respectful, and informative") then we can actually have a conversation.

The long road to getrandom() in glibc

Posted Jan 11, 2017 18:12 UTC (Wed) by pizza (subscriber, #46) [Link]

> now if you can stop attacking the messenger and address the technical points (you know, "try to be polite, respectful, and informative") then we can actually have a conversation.

Methinks you would do well to follow your own advice.

The long road to getrandom() in glibc

Posted Jan 11, 2017 18:23 UTC (Wed) by PaXTeam (guest, #24616) [Link]

did you try to say something about SSP?

The long road to getrandom() in glibc

Posted Jan 13, 2017 18:08 UTC (Fri) by nix (subscriber, #2304) [Link]

Quite. There are very few posters less polite and respectful than PaXTeam on LWN, and quite possibly only one that remains unbanned.

The long road to getrandom() in glibc

Posted Jan 14, 2017 20:43 UTC (Sat) by jospoortvliet (subscriber, #33164) [Link]

Agreed. The value of the comments posted by PAX is hugely diminished by their condescending, impolite and disrespectful tone. I wouldn't mind a ban - no matter how technically great somebody is, if you're not capable of communicating it halfway decent it means nothing.

The long road to getrandom() in glibc

Posted Jan 15, 2017 0:53 UTC (Sun) by spender (subscriber, #23067) [Link]

Let's go back to what inspired all this virtue signaling from people who have no technical content of their own to share:

"replacing speculation with real data, ssp-all has an overhead of >25% on a workload where RAP has <5%. should i ask Intel for a refund since my CPU doesn't seem to know or care about those 'pipeline stalls'? ;)"

That is literally the only thing he said that could potentially be construed as "impolite" -- personally I find it impolite to be spreading false information or making bogus claims with nothing at all to back it up, especially when you're talking to the person who *has already done the work and knows you are wrong*.

Then we have nix's reply (he often likes to get the last word in via tone argument when he's lost the technical argument, as he always does), where he says: "in the absence of actual facts rather than venom" aka exactly the phrasing the PaX Team used to inspire these stupid comments (followed by attacking a strawman to make him look smart, unfortunately it has nothing to do with what the PaX Team was referring to re: abuse of function pointers. Thinking for 10 seconds about what kind of function pointer abuse one would learn of after having developed a CFI system that essentially enforces type correctness for indirect function calls might clue a person in to what was being talked about, especially when this work has been applied to glibc, but not nix -- his focus is pointless kneejerk responses to make himself feel better). In reply, someone calls his response "graceful". Give me a break.

What's disrespectful, impolite, and condescending is everyone's pointless tone arguments and spreading of false information. If you don't know what you're talking about, the proper way to respond to someone is to ask a question and learn something, not to pose as an expert in something you're clearly not and waste time arguing with someone who knows better. I doubt you will ever find us being "impolite" to anyone asking legitimate questions who are trying to learn. People who aren't here to learn and are just misinforming others with their own ignorance are the problem. And we all know who these people are, because they show up in nearly every thread on this site, and surprise, they think they're an expert in every topic here. Put your collective big boy pants on, and quit it with these pointless replies.

-Brad

The long road to getrandom() in glibc

Posted Feb 14, 2017 18:43 UTC (Tue) by nix (subscriber, #2304) [Link]

(back after several weeks away because spender angered me enough that it poisoned my view of the whole site for some time)
> he often likes to get the last word in via tone argument when he's lost the technical argument, as he always does
See, that's the thing, isn't it? You consider this an 'argument' that you have to win and be seen to be right, no matter the consequences. I see this as a cooperative venture, all together. The two worldviews are incompatible. I think you're repulsive and that complaining about your tone is crucial because it makes everything you do less useful to everyone: you think I'm ignorant and that I'm complaining about your tone as some sort of get-out because I can't attack your content. (I don't want to attack your content, since there's nothing wrong with it when you actually bother to back up your arguments rather than just posting MD5 hashes as proof that you knew something before anyone else, or randomly insulting people who don't treat everything you say as gospel truth. I have never claimed not to be ignorant: of course I don't know everything about everything, and of course not everything I say is always right. That's true of everyone who hasn't taken a vow of silence. Except, it appears, for you, or so you appear to believe.)
> Put your collective big boy pants on, and quit it with these pointless replies.
Maybe you should tell PaXTeam that, since he is the very one who responded to my comment by casting aspersions on my work but leaving it up to everyone else to do the necessary research to tell what he was talking about. (Said comment was intended to compliment the glibc developers, not toot my own horn, not that you appeared to grasp that; it was not addressed to either of you, and in fact I'd be very glad if you killfiled me). That sort of response seems to me to be the very definition of a pointless reply: actually worse than pointless, since it's an attack without the evidence to back it up, on a site whose comment threads used to be known for their civility as well as their utility.

But, of course, that was before you two came along.

The long road to getrandom() in glibc

Posted Jan 13, 2017 18:07 UTC (Fri) by nix (subscriber, #2304) [Link]

It's an excellent way of making me decide (yet again) never to pay attention to his offensively unpleasant and frequently ignorant spewage, I must say.

e.g. his flaming about glibc's function pointer use, for instance, well, in the absence of actual facts rather than venom it's hard to tell what he's talking about, since glibc's function pointers have been XORed with a random cookie for a very long time now. I guess he's talking about a few remaining unrandomized libio pointers in the FILE * -- and, y'know, if we didn't give a damn about compatibility we could randomize those too. Unfortunately, we do, and randomizing them breaks real existing applications.

The long road to getrandom() in glibc

Posted Jan 13, 2017 19:02 UTC (Fri) by clump (subscriber, #27801) [Link]

I felt compelled to point out the tone of the conversation because I admire the work you've put in, and also how gracefully you responded here. On behalf of myself and surely others, thank you.

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:44 UTC (Mon) by karkhaz (subscriber, #99844) [Link]

> A trickier question was: what should glibc do when running on pre-3.17 kernels (or non-Linux kernels) that lack getrandom() support?

Perhaps a dumb question, but why not surround the function in IFDEFs so that it doesn't get compiled in on such kernels? So that clients building for other systems get a compilation error rather than a runtime error that they might not even be checking for.

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:47 UTC (Mon) by corbet (editor, #1) [Link]

The function is indeed stubbed out on platforms where the system call cannot exist. But the kernel the library is compiled under is not necessarily the kernel on which it will run at any given time; the two components are not that tightly tied together.

The long road to getrandom() in glibc

Posted Jan 9, 2017 23:55 UTC (Mon) by ay (subscriber, #79347) [Link]

I read the man page for notes like "available on Linux 3.17 or later" and then if the code in question needs to use this interface and I do expect to hit this in the field (that is, it doesn't say "since 2.2" or something truly ancient) I use uname(2) in my program and put in my own workaround path or error messages, I don't call the routine and see what happens (though with what they did that would be sensible). This seems like it isn't a big deal in general.

The long road to getrandom() in glibc

Posted Jan 10, 2017 2:48 UTC (Tue) by ncm (subscriber, #165) [Link]

You could call it a few times and see if the results seem random enough...

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:01 UTC (Tue) by josh (subscriber, #17465) [Link]

> I use uname(2) in my program and put in my own workaround path or error messages, I don't call the routine and see what happens

Please don't ever do that.

Version detection would break, for instance, if the kernel configuration compiles out the system call to save space, supported for an increasing number of system calls (and hopefully all of them eventually). It would also break if someone backported the system call. Or if the kernel wired up the system call in different versions for different architectures or ABIs.

Always call the syscall, check for ENOSYS, and fall back or error out as appropriate.

The long road to getrandom() in glibc

Posted Jan 10, 2017 15:10 UTC (Tue) by ay (subscriber, #79347) [Link]

That makes sense, thank you.

The long road to getrandom() in glibc

Posted Jan 10, 2017 17:25 UTC (Tue) by zlynx (subscriber, #2285) [Link]

Version checks are really a bad way to do it. They are so easy to get wrong. Some time in the future when Linux 10 gets released your version check may decide it is too old because you only looked at the first character or something.

Of course, _your_ code will never do it wrong. But _someone's_ will and then the OS will need yet another lame hack to report version 9.99.
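
To make the failure mode concrete, here is a small C sketch. Both function names are hypothetical, and the release strings in the usage note are illustrative inputs rather than values from any real report. The first function shows the first-character bug; the second parses numerically and tolerates a missing patch level. Even the careful version cannot see backported or compiled-out system calls, so it illustrates the bug rather than recommending version checks.

```c
#include <stdio.h>

/* The bug described above: only the first character is examined,
   so a release string such as "10.0.1" is read as major version 1. */
int major_from_first_char(const char *release)
{
    return release[0] - '0';
}

/* Hypothetical, more careful comparison: parse "major.minor" numerically
   and ignore whatever follows (patch level, distribution suffix).
   Returns 1 if release >= want_major.want_minor, 0 if it is older,
   and -1 if no leading number could be parsed. */
int release_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0, minor = 0;
    int fields = sscanf(release, "%d.%d", &major, &minor);

    if (fields < 1)
        return -1;                  /* unparseable: let the caller decide */
    if (major != want_major)
        return major > want_major;
    return minor >= want_minor;     /* a missing minor is treated as 0 */
}
```

With these definitions, major_from_first_char("10.0.1") yields 1, while release_at_least("10.0.1", 3, 17) and release_at_least("4.2", 3, 17) both yield 1.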

As an example from another OS, Microsoft has got tired enough of these version check bugs that they've made getting the actual OS version quite difficult. The regular version check only returns the minimum of the OS or the program manifest so that software built for Windows 10 will always return version 10 even on Windows 15. And if there isn't a manifest it gets Windows 8.1.

The long road to getrandom() in glibc

Posted Jan 12, 2017 3:12 UTC (Thu) by cjwatson (subscriber, #7322) [Link]

Some proprietary code I was paid to work on long ago had that exact bug, only with Solaris 10. Took us a while to work out why it was suddenly going lots slower on the shiny new build ...

The long road to getrandom() in glibc

Posted Jan 21, 2017 4:42 UTC (Sat) by njs (guest, #40338) [Link]

If we're sharing anecdotes... I had a proprietary backup app decide that it couldn't use inotify and switch to repeatedly scanning the entire filesystem looking for changes (!). It turned out to be because Debian's kernel was reporting a version string like "4.2" while the app was expecting something like "4.2.1", and when the parse failed it assumed the worst. Never mind that you have to go back to like the early 2000s to find a kernel without inotify support...

The long road to getrandom() in glibc

Posted Jan 10, 2017 4:04 UTC (Tue) by busterb (subscriber, #560) [Link]

Why not just make the wrapper only build conditionally based on ‘--enable-kernel=version’ being set to a high-enough version?

That should make glibc itself not work on kernels too old to support the syscall, removing the need for ENOSYS or backward compatibility shims. Then from an application point-of-view, the wrapper either exists, or the code doesn't run in the first place.

That's not a lot different than glibc 2.24 requiring kernel 2.6.32 or later on x86 (others?). Or is this also an optional syscall even on newer kernels?

Kernel and libc versions

Posted Jan 10, 2017 8:11 UTC (Tue) by vstinner (subscriber, #42675) [Link]

It's common that packages are built on a host with a different kernel and libc version than the versions used by users. It can be more recent or older. In both cases, the code should handle compatibility issues.

If the builder is too old, the program lacks new features (e.g. it doesn't try to use getrandom()). If the builder is too recent, the program tries to use a too-recent function, which fails with ENOSYS (or differently, sometimes in subtle ways, see below).

Python is full of runtime checks for recent Linux kernel features: open(O_CLOEXEC), socket(SOCK_CLOEXEC), getrandom(), etc.

For open(O_CLOEXEC), the check is not as simple as ENOSYS or EINVAL. On older kernels, the flag was simply ignored! Python has to check on the first open() call if the flag was correctly set. Otherwise, it remembers that the flag is ignored and sets the flag in a second syscall (ioctl or fcntl, depending on whether the ioctl is available).

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:51 UTC (Tue) by nix (subscriber, #2304) [Link]

Because many distros build with as old an --enable-kernel=$version as they can get away with. Doing it this way would mean denying those distros glibc support for getrandom() even if glibc was built on a system with new-enough kernel headers to have the syscall *and* it's running on a kernel new enough to have the syscall. That seems like the worst of all worlds.

The long road to getrandom() in glibc

Posted Jan 10, 2017 5:55 UTC (Tue) by eru (subscriber, #2753) [Link]

> the project's reluctance to add wrappers for Linux-specific system calls at all.

Is glibc really being used on non-Linux systems in practice? BSD's seem to prefer their own BSD-licensed libc. I guess there is Cygwin, but it has to create emulations for most of Linux calls anyway, so it could add getrandom() itself.

The long road to getrandom() in glibc

Posted Jan 10, 2017 6:08 UTC (Tue) by pabs (subscriber, #43278) [Link]

Debian kFreeBSD and Hurd use it. There are other GNU distros that have BSD kernel ports too.

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:51 UTC (Tue) by nix (subscriber, #2304) [Link]

Cygwin uses newlib, not glibc.

The long road to getrandom() in glibc

Posted Jan 10, 2017 7:35 UTC (Tue) by jaromil (guest, #97970) [Link]

Good read. Yet both in this and the previous article on glibc wrappers to Linux calls (https://lwn.net/Articles/655028/) there is no explicit mention of the reasons why one would desire to have functions wrapping a kernel syscall(). In other words, why can't people simply use the syscall?
I'm not (yet) partial to that, just forming my opinion, but noticing this narrative implicitly considers the wrapping function a need. I'd love to understand why that is.

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:54 UTC (Tue) by nix (subscriber, #2304) [Link]

Using syscalls directly is really quite unpleasant. Every architecture numbers them differently: many arches have different obscure rules regarding everything from sign-extension to return values to passing in >N arguments (often 5) which you only learn about when you are left holding the broken pieces of your program... direct syscall() is something you do only when you have to, and people shouldn't have to do it for a syscall like getrandom(), which has widely-used alternatives that are simple, obvious, and dangerously wrong in ways that are only obvious long after deployment (or if you've been paying enough attention to realise why and when weak RNGs are risky).

The long road to getrandom() in glibc

Posted Jan 10, 2017 11:55 UTC (Tue) by nix (subscriber, #2304) [Link]

... also, of course, if you want something to be a cancellation point, glibc *has* to get involved, because cancellation points are internal to glibc: there is no mechanism to designate something outside the library to be a cancellation point. But whether getrandom() should be a cancellation point or not is debatable enough that it might not be seen as an advantage. (It definitely is for, say, read().)

The long road to getrandom() in glibc

Posted Jan 10, 2017 13:19 UTC (Tue) by mm7323 (subscriber, #87386) [Link]

You can create cancellation points anywhere with void pthread_testcancel(void). The harder thing is stopping cancellation in a library (you can only really defer it with pthread_setcancelstate()).

The long road to getrandom() in glibc

Posted Jan 10, 2017 15:54 UTC (Tue) by carlos.odonell (subscriber, #99737) [Link]

You cannot easily create arbitrary deferred cancellation points inside a syscall (without recreating what glibc does with cancellation regions and special signal handlers). Consider the scenario where the cancellation arrives after you tested for it but before you enter the syscall. In this case you would block in the syscall and only handle the cancellation when you return and potentially test for cancellation again. The alternative would be to enable asynchronous cancellation around the syscall, but in doing so you must ensure all asynchronous signal handlers use only asynchronous-cancellation safe functions (all 3 of them). So while theoretically possible it's overly restrictive and few applications can make use of such "asynchronous cancellation wrappers." The most useful thing we can do is keep adding support for all the new Linux syscalls, since that's what developers need.
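
The race described above can be pictured with a short C sketch using POSIX threads. The names polling_worker() and run_and_cancel() are hypothetical, and the raw system call is only marked by a comment. The sketch shows where pthread_testcancel() sits and where the unprotected window lies; it does not reproduce glibc's internal cancellation machinery.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical worker that creates its own cancellation point. */
void *polling_worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Explicit cancellation point: a pending request is acted on here. */
        pthread_testcancel();

        /* RACE WINDOW: a cancellation request that arrives at this point,
           after the test, would not be noticed by a raw blocking
           syscall(...) placed here. The thread would stay blocked until
           that call returned and the loop reached pthread_testcancel()
           again. */
        sched_yield();              /* stand-in for the real work */
    }
    return NULL;
}

/* Hypothetical driver: start the worker, cancel it, and report whether
   it ended through cancellation. Returns 1 if so, 0 if not, -1 on error. */
int run_and_cancel(void)
{
    pthread_t thread;
    void *result = NULL;

    if (pthread_create(&thread, NULL, polling_worker, NULL) != 0)
        return -1;
    if (pthread_cancel(thread) != 0)
        return -1;
    if (pthread_join(thread, &result) != 0)
        return -1;
    return result == PTHREAD_CANCELED;
}
```

Because the worker never blocks, the request is always honored at the next pthread_testcancel(); replacing sched_yield() with a blocking raw system call is what exposes the window.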

The long road to getrandom() in glibc

Posted Jan 10, 2017 22:08 UTC (Tue) by mm7323 (subscriber, #87386) [Link]

> You cannot easily create arbitrary deferred cancellation points inside a syscall
I think the original question was about creating cancellation points in a library. It can be done.

That said, thread cancellation is very messy, non-obvious in the code and prone to resource leaks, corruption and race conditions. Except in the very simplest of cases, I would advise anyone considering trying to use thread cancellation to find a more reliable method.

Python 3.6, glibc 2.25, getentropy() and kernel < 3.17

Posted Jan 10, 2017 8:24 UTC (Tue) by vstinner (subscriber, #42675) [Link]

It's fun to see an article on LWN about the availability of the getrandom() function in glibc 2.25, because I just got a bug report this week on Fedora 26: Python 3.6 fails to get random numbers to initialize Python.
https://bugzilla.redhat.com/show_bug.cgi?id=1410175

When I wrote Python/random.c, I added support for OpenBSD getentropy(). On OpenBSD, packages are built on the same OpenBSD version as the one used to run the program, so the Python function calling getentropy() doesn't check for ENOSYS. glibc 2.25 added getrandom() and getentropy(), but Python tries getentropy() first. Sadly, the Python package was built on a host with a more recent kernel and libc than the user's OS, and so users got the initialization error: the getentropy() function calls the getrandom() syscall, which fails with ENOSYS.

I modified Python to handle ENOSYS and EPERM in getentropy(), and also modified the code to prefer getrandom() over getentropy(), because getrandom() supports non-blocking urandom, which is required by Python (the infamous PEP 524).
https://www.python.org/dev/peps/pep-0524/

Note: Python also has to handle EPERM because a user reported that a security policy, called "QNAP", blocked the getrandom() syscall: http://bugs.python.org/issue27955

Again, providing a Python portable os.urandom() function which has almost the same properties on all platforms and all minor platform versions is a hard challenge!

The long road to getrandom() in glibc

Posted Jan 10, 2017 16:02 UTC (Tue) by mmechri (subscriber, #95694) [Link]

Could someone explain to me why Linux developers want Linux-specific system call wrappers in Glibc? I might be naive, but it sounds to me like it would make sense to have a library of Linux system call wrappers instead of stuffing this in Glibc. From what I understand, this is what Ted Ts'o seems to suggest. What are the pros and cons of this approach? Why hasn't it been done already?

The long road to getrandom() in glibc

Posted Jan 10, 2017 17:09 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Because the inconvenience (in distribution, linking, documentation, etc.) is silly --- in practice, glibc is the libc for many Linux systems and not much else, so keeping Linux-specific system calls out of glibc has no practical benefits.

If we do see a separate library of separate system call wrappers, it'll be because we've failed to create a coherent system and instead "shipped the org chart".

The long road to getrandom() in glibc

Posted Jan 12, 2017 2:44 UTC (Thu) by busterb (subscriber, #560) [Link]

Honestly, I'd have preferred to have glibc implement arc4random and family, rather than another low-level primitive that half of the consumers are going to get wrong. getentropy at least was never designed for regular applications to use directly.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds