vDSO, 32-bit time, and seccomp

By Jonathan Corbet
August 2, 2019

The seccomp() mechanism is notoriously difficult to use. It also turns out to be easy to break unintentionally, as the development community discovered when a timekeeping change meant to address the year-2038 problem created a regression for seccomp() users in the 5.3 kernel. Work is underway to mitigate the problem for now, but seccomp() users on 32-bit systems are likely to have to change their configurations at some point.

The virtual dynamic shared object (vDSO) mechanism is an optimization provided by the kernel to reduce the cost of certain frequently used system calls. The vDSO is a small region of kernel-provided memory that is normally mapped into the address space of every user-space process; it contains implementations of system calls that can, in some circumstances at least, do their work in a user-space context. That allows the caller to avoid making a real system call and, thus, to avoid the cost of a context switch into kernel mode. System calls related to timekeeping, such as gettimeofday(), are implemented in the vDSO, since they can often run quickly in user space and they tend to be called frequently.
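One way to see that mapping is to look for the "[vdso]" entry in a process's memory map. The following is a minimal sketch, assuming a Linux system that exposes /proc/self/maps; the helper name find_vdso_mappings() is made up for this illustration, and it simply returns an empty list where the file is not available.

```python
from pathlib import Path


def find_vdso_mappings(maps_path="/proc/self/maps"):
    """Return the lines of a Linux memory-map file that describe the vDSO.

    Returns an empty list if the file cannot be read (for example on a
    non-Linux system) or if no vDSO is mapped into the process.
    """
    try:
        text = Path(maps_path).read_text()
    except OSError:
        return []
    return [line for line in text.splitlines()
            if line.rstrip().endswith("[vdso]")]


if __name__ == "__main__":
    for line in find_vdso_mappings():
        # Each line has the form "start-end perms offset dev inode [vdso]".
        print(line)
```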

The vDSO has generally been implemented in an architecture-specific way, even though the functions it performs are mostly the same across architectures. In the 5.2 development cycle, Vincenzo Frascino added a generic vDSO implementation that factored out much of the architecture-specific code into a single implementation that could be used on all architectures. During the 5.3 merge window, the x86 architecture switched over to the generic version, and all was well — or so it seemed.

seccomp() sadness

In mid-July, Sean Christopherson (among others) reported that the generic vDSO change broke some seccomp() users on 32-bit x86 systems. seccomp(), remember, allows user space to provide a BPF program (still "classic BPF", not eBPF as is used almost everywhere else in a contemporary Linux system) to control which system calls may be made. It is used to reduce the attack surface of code that might be exposed to attackers in one way or another; using it correctly is hard, but the number of users has been on the rise.

While the vDSO can usually implement timekeeping system calls in user space, that is not always possible. If the calling program wants an esoteric clock that has not been implemented, or if the timekeeping hardware available on the system is not amenable to vDSO access, then the vDSO must fall back to calling into the kernel. Prior to 5.3, the architecture-specific vDSO used the native clock_gettime() call on the system it was running on; that meant calling the 32-bit clock_gettime() on 32-bit kernels.

The 32-bit time format is, of course, going to run out of range in January 2038. Quite a bit of work has gone into preparing systems for this particular apocalypse, though much work still remains. Given this problem, adding new users of 32-bit time interfaces is a way to become rather unpopular in kernel-development circles, so the generic vDSO implementation naturally used clock_gettime64() as the fallback timekeeping system call on all architectures. That is not the sort of thing that one would ordinarily even have to think about much; nobody wants to create a generic vDSO implementation that contains yet another year-2038 problem in need of fixing.
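The January 2038 date falls directly out of the arithmetic: a signed 32-bit count of seconds since the start of 1970 tops out at 2^31 - 1. A small sketch of that calculation, along with a model of what happens one second later (the function names are invented for the example):

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit seconds counter can hold
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_32bit_time():
    """Return the last instant representable as a signed 32-bit Unix time."""
    return EPOCH + timedelta(seconds=INT32_MAX)


def wrap_to_int32(seconds):
    """Model two's-complement wraparound of a seconds count stored in 32 bits."""
    return (seconds + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(last_32bit_time().isoformat())  # 2038-01-19T03:14:07+00:00
    print(wrap_to_int32(INT32_MAX + 1))   # -2147483648
```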

But there is a problem here. A surprising number of programs want to know what time it is at some point or another. Anybody putting together a seccomp() policy for a given program will almost certainly allow system calls like gettimeofday(); otherwise the target program will probably break. A program that fails to run is generally secure, but users, being generally unreasonable, tend to get disgruntled anyway.

Any rational seccomp() policy will, thus, allow for the fallback system call when the vDSO is unable to provide the time directly. But it turns out that, while these policies allowed clock_gettime() on 32-bit systems, they lacked the foresight to let clock_gettime64() through as well. The end result is that, when a program protected by one of these seccomp() policies runs on a 5.3 kernel, it is quickly and rudely killed when it tries to make a disallowed system call.
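The failure mode can be modeled in a few lines. This is a toy simulation only: real seccomp() filters are classic-BPF programs operating on system-call numbers, and make_filter(), vdso_clock_gettime(), and the Killed exception are all invented here to stand in for the kernel's behavior.

```python
class Killed(Exception):
    """Stands in for the kernel killing a process over a disallowed call."""


def make_filter(allowed):
    """Return a toy allowlist filter keyed on system-call names."""
    allowed = frozenset(allowed)

    def check(syscall):
        if syscall not in allowed:
            raise Killed(syscall)
        return syscall

    return check


def vdso_clock_gettime(check, fast_path_ok, fallback):
    """Model the vDSO: answer in user space if possible, else make a real call."""
    if fast_path_ok:
        return "user space"     # no system call is made, so the filter never runs
    return check(fallback)      # the fallback system call is subject to the filter


if __name__ == "__main__":
    policy = make_filter({"read", "write", "clock_gettime", "gettimeofday"})
    # Older 32-bit behavior: the fallback is clock_gettime(), which is allowed.
    print(vdso_clock_gettime(policy, False, "clock_gettime"))
    # Generic vDSO behavior: the fallback is clock_gettime64().
    try:
        vdso_clock_gettime(policy, False, "clock_gettime64")
    except Killed as exc:
        print("killed on", exc)
```

Note that the breakage only shows up when the fast path is unavailable, which is part of why a policy like this can pass testing and still fail in the field.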

Kernel developers might protest that this change is required to avoid year-2038 problems. They might also be naturally inclined to disregard lame excuses about how clock_gettime64() was never needed before, or about how that system call didn't even exist until the 5.1 release. But, in the end, this is a regression, and the kernel community's policy on such things is fairly unambiguous. Somehow, programs running under existing seccomp() policies will need to continue to work when the final 5.3 kernel comes out.

Fixing the problem

Various ideas were raised for how that could be done, starting with a not-entirely-serious suggestion that the generic vDSO change could simply be reverted. Perhaps seccomp() rules could be bypassed for system calls that originate in the vDSO; this idea didn't get far given that, among other things, faking a vDSO return address is not a difficult thing to do. Bypassing seccomp() for clock_gettime64() specifically is an option, but that would defeat administrators who want to block all access to timekeeping information. The concept of "system-call aliases" was circulated, initially by Andy Lutomirski; it would create a short list of "equivalent" system calls that take the same arguments and do the same thing. If one call in the list was rejected by a seccomp() filter, the kernel would retry the policy with any aliases that might exist.
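The alias idea amounts to a retry loop around the policy check. A rough sketch follows, with the caveat that no such mechanism exists in the kernel; the ALIASES table and allowed_with_aliases() are hypothetical, and the policy is reduced to a simple predicate on system-call names.

```python
# Hypothetical alias table; the discussion never settled on actual contents.
ALIASES = {
    "clock_gettime64": ("clock_gettime",),
}


def allowed_with_aliases(syscall, policy_allows, aliases=ALIASES):
    """Return True if the policy accepts the call itself or any of its aliases."""
    if policy_allows(syscall):
        return True
    return any(policy_allows(alias) for alias in aliases.get(syscall, ()))


if __name__ == "__main__":
    old_policy = {"read", "write", "clock_gettime"}.__contains__
    print(allowed_with_aliases("clock_gettime64", old_policy))  # True
    print(allowed_with_aliases("statx", old_policy))            # False
```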

The alias idea got further than many, but it has problems of its own. For example, authors of seccomp() policies might genuinely want to discriminate between "equivalent" system calls. It seems like the sort of mechanism that could generate surprising results in general. Aliases might still be the long-term solution for this problem but, as Lutomirski pointed out, "it's getting quite late to start inventing new seccomp features to fix this". Something simpler is needed, at least for the 5.3 release.

That something is likely to be based on this patch series from Thomas Gleixner, which simply causes the vDSO to fall back to the 32-bit clock_gettime() system call on 32-bit systems. It is a solution that is pleasing to nobody, but it solves the regression issue for now.

Some other solution will be required eventually; it is not possible to support 32-bit time indefinitely. One possibility is that the authors of seccomp() policies change their code to allow clock_gettime64() as well. But, even if that could be done and widely deployed, there is no strong incentive for developers to do this work, since their existing policies will continue to function as intended. Some sort of multi-year deprecation process could be considered as a way to force policies to be fixed. But the eventual solution may just have to live in seccomp() instead, perhaps in the form of an alias list or other special exception. A long-term solution that is pleasing to everybody is difficult to envision.

This situation highlights a problem with seccomp() in general: it is difficult to write robust policies at that level of detail, and the resulting policies tend to be brittle in the best of times. Even if the kernel community avoids incompatible changes, a change in a library somewhere can invoke a new system call that a given seccomp() policy may frown upon. While the OpenBSD pledge() mechanism may not offer the degree of control provided by seccomp(), its use of relatively broad categories of functionality makes it easier to avoid problems like this. But Linux has seccomp(), with all its power and complexity. It seems highly likely that developers will unwittingly run into this sort of regression again in the future.
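The difference in granularity can be illustrated with a toy model of category-based policy. The promise names below echo pledge(), but the system-call sets are invented for the example and are not OpenBSD's actual lists; promise_allows() is likewise a made-up helper.

```python
# Illustrative promise table. The promise names echo pledge(), but the
# syscall sets here are invented for the example, not OpenBSD's real lists.
PROMISES = {
    "stdio": {"read", "write", "close", "clock_gettime", "gettimeofday"},
    "rpath": {"open", "stat"},
}


def promise_allows(promises, syscall, table=PROMISES):
    """Return True if any promise the program made covers the system call."""
    return any(syscall in table.get(name, ()) for name in promises)


if __name__ == "__main__":
    program_promises = ["stdio"]  # fixed when the program was written
    print(promise_allows(program_promises, "clock_gettime64"))  # False

    # The system, not the application, extends the category later on.
    updated = {**PROMISES, "stdio": PROMISES["stdio"] | {"clock_gettime64"}}
    print(promise_allows(program_promises, "clock_gettime64", updated))  # True
```

In this model the program's own policy never changes; whoever maintains the category table absorbs the new system call on its behalf.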

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 18:15 UTC (Fri) by dullfire (guest, #111432) [Link] (13 responses)

Sounds like seccomp rules should be a runtime config instead of a compile time thing (aka program reads in the seccomp rules from a file and then loads them, instead of being program data).

Or we just ditch all precompiled 32-bit programs with builtin seccomp in 2038

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 21:12 UTC (Fri) by arnd (subscriber, #8866) [Link]

Precompiled programs are not even the main problem, as long as the C library doesn't start using the new system calls to implement the compatibility symbols for the old time32-based interfaces.

The problem with seccomp is much bigger when an application is recompiled with the time64 C library interfaces that have to use the 64-bit system calls. However, when you do that, you also have to deal with other problems; this is just one of many things we need to address to have a 32-bit distro that can survive y2038, and one of many things that can go wrong with seccomp as we add new system calls that act as replacements for old ones.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 22:41 UTC (Sat) by flussence (guest, #85566) [Link] (11 responses)

Or we could ditch seccomp and go with something sane that doesn't require constant invasive changes to the kernel, libc and applications...

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 20:40 UTC (Sun) by roc (subscriber, #30627) [Link] (10 responses)

Which is what?

It's not pledge(), which is "we have studied all the applications anyone has ever written or ever will write and come up with a set of policies that work for them".

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 20:45 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

pledge() is definitely not a universal cure, but it's extremely practical and works well for a surprising variety of servers.

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 22:10 UTC (Sun) by khim (subscriber, #9252) [Link] (1 responses)

Well, it's not hard to turn seccomp into a pledge-wannabe: make it accept "target kernel version" in addition to BPF.

Then the kernel would know which syscalls are "alien" to this particular version and would use its alias database.

Heck, this way you could introduce some "fake" versions which only know about very few syscalls (and thus only allow rough yet simple setup).

This is similar to how Android (well, bionic) handles such things - and works well enough in practice (even if it's easy to construct an artificial example which would fall apart in such a scheme)

vDSO, 32-bit time, and seccomp

Posted Aug 17, 2019 5:13 UTC (Sat) by gnoack (subscriber, #131611) [Link]

This would be a good start, but the problems with not understanding user space behaviour are still big compared to kernel compatibility issues.

For example, different libcs use different syscalls, which is the first thing to be compatible with.

Shared library loading can lead to very unexpected behaviour as well. LD_PRELOAD is one example. Another one is that when resolving hostnames, libnss in glibc loads shared modules for resolution behavior, and it's very difficult to predict what these will do. (OpenBSD's pledge has a special case for DNS as well, I believe so that they can distinguish between DNS and other UDP.)

In the end, with seccomp you need a very good control of how a program is built, which libc it uses, and in the case of glibc+DNS even how the system is configured. That seems unrealistic.

vDSO, 32-bit time, and seccomp

Posted Aug 6, 2019 7:42 UTC (Tue) by mm7323 (subscriber, #87386) [Link] (5 responses)

Perhaps system-call sites could be annotated and processed by the compiler (or a plug in) to add an ELF section describing the source and return address of each system call. This section could then be read or mapped into protected memory when an executable or shared object is loaded and used to provide policy that automatically checks the program is executing as intended when it was compiled and that it hasn't obviously been compromised.

Relocation processing and such may make this fiddly to implement, but given most things would be dynamically linked against glibc where the system calls commonly come from, it might be possible to reduce overhead to just when loading that shared library with minimal loss for most other programs.

vDSO, 32-bit time, and seccomp

Posted Aug 6, 2019 23:37 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

That sort of thing could be good but it's hard because it's very important you check the system-call number, and obtaining that statically is impossible in general (e.g. for the syscall performed by the syscall() function).

vDSO, 32-bit time, and seccomp

Posted Aug 7, 2019 11:31 UTC (Wed) by mm7323 (subscriber, #87386) [Link]

syscall() may well be a potential problem, though I imagine 99% of uses would pass the first argument from a SYS_xxx constant - it might be possible to use macro trickery with __builtin_constant_p() to still make the correct specific annotation data.

The other 1% of uses may be either bugs, bad code, or exploitable gadgets? It would be interesting research to find out.

vDSO, 32-bit time, and seccomp

Posted Aug 9, 2019 15:03 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

This trivially requires solving the halting problem in the limit. The problem isn't really which syscall is invoked: other than syscall() that is easy to determine. The problem is that figuring out the source and return addresses, which might well be in intricately-constructed data structures, is *horrifically* hard. You can't do it by brute force (it's not *quite* as bad as enumerating all busy beavers but it's clearly ridiculous) which means you have to do it by formal analysis of the program. And that... well, good luck doing it without significant programmer help. Not a chance at all of your being able to do it in randomly-chosen C programs.

vDSO, 32-bit time, and seccomp

Posted Aug 9, 2019 15:06 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)

Oh wait you're not talking about checking the arguments, are you? I'm talking nonsense. If you're just checking that the syscalls invoked are syscalls present in the program, and that they're being called from (and for most syscalls returning to) the right places... that sounds practical, at least for PT_GNU_STACK binaries that do not intentionally execute code that the compiler didn't generate. :) But it won't help stop attackers who are using ROP gadgets: the whole point of those is that they carry out arbitrary computation *using* only code actually present in the program, via sufficiently demented crafted stacks. (You'd have to check that the stacks return to loci where there are actually function calls, and that's going to be much more expensive.)

vDSO, 32-bit time, and seccomp

Posted Aug 10, 2019 6:47 UTC (Sat) by mm7323 (subscriber, #87386) [Link]

> But it won't help stop attackers who are using ROP gadgets

That's why I suggest verifying the return addresses as well as call sites - to make chaining ROP gadgets harder. Combined with something like Pointer Authentication Codes in user space, this could button up call flows nicely to ensure code executes as designed when compiled.

That said, I'm not sure if it is possible to 'fake' the return address of a supervisor call or exception on any architectures.

> (You'd have to check that the stacks return to loci where there are actually function calls, and that's going to be much more expensive.)

All security has an overhead. The question is whether such a system could be made efficient enough to be worth the benefit. The idea here is to leverage the compiler to produce the needed records and fix them up when loading/dynamic linking so that execution overhead could be as simple as some table lookups in the kernel around system calls. It will never be for free, and even hardware assisted things like PAC add instructions.

vDSO, 32-bit time, and seccomp

Posted Aug 8, 2019 17:25 UTC (Thu) by flussence (guest, #85566) [Link]

If not pledge(), maybe we could have a LSM that reads Content-Security-Policy headers. I hear they're popular with the kids and as easy to understand as SELinux rules.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 18:21 UTC (Fri) by luto (guest, #39314) [Link]

The fix doesn’t quite fall back to the 32-bit syscall on 32-bit systems. It falls back to the 32-bit syscall for 32-bit clock_gettime() calls. This seems fine in the long run. A Y2038-ready 32-bit program will use the vDSO’s clock_gettime64, which will fall back to the new syscall.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 20:02 UTC (Fri) by chris_se (subscriber, #99706) [Link] (32 responses)

Isn't this a much more generic problem with seccomp? Let's say glibc decides to switch its stat() wrapper to use the new statx() system call (for similar reasons) - then any seccomp policy (which is defined by programs outside of glibc) allowing stat() but not statx() would suddenly start to kill programs left and right. Sure, in this case it was the vDSO of the kernel instead of glibc that caused the problem, but in both cases the upgrade of a very basic system component broke the application.

And from a historical perspective it's always been the case that any wrapper around a system call may internally do other things as well, as long as it follows the documented contract. seccomp() breaks this understanding that has long existed to some extent.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 20:27 UTC (Fri) by nix (subscriber, #2304) [Link] (31 responses)

Yes, it is. glibc's change to not cache getpid() results (so that it worked better with containers, etc) in the 2.25 timeframe broke BIND's named because it was relying on the cache and not allowing getpid() through its seccomp rules. In the end seccomp support was just removed from BIND because it was reducing reliability more than it was gaining in security.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 23:19 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (30 responses)

Right. I'm really not a fan of using seccomp and SELinux to ban random system calls to "reduce attack surface". This practice can cause hard-to-debug problems when programs that legitimately use supported system calls in rare cases see unexpected errors. Security should, IMHO, work on the basis of protecting data, not code.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 23:33 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (28 responses)

The Android people had the gall to complain to me because the shell I maintain uses stat() for the test builtin (things like file existence) and they disallow stat in their SELinux policies…

I agree, this is ridiculous.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 0:57 UTC (Sat) by nix (subscriber, #2304) [Link] (27 responses)

... how else are you supposed to implement it? Use fstatat in particular, or something? And then what happens if someone else has a different policy?

This is ridiculous. It drives a truck through ABI stability guarantees, even guarantees as carefully maintained as (say) glibc's.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 1:06 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (20 responses)

The practice of blacklisting arbitrary system calls also creates a perverse incentive: if I, a program author, want to maintain flexibility, I should have my program call as many different system calls as I can lest I lose access to the ones I don't call.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 4:38 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (19 responses)

I think there is a balance here.

A hypothetical crypto library should not need to call into the sockets API, create processes, manipulate shared memory, access the filesystem, or do a wide variety of other I/O-ish things. A malicious actor trying to exploit a buffer overrun would very much like to do those things, for all manner of reasons, but particularly for key exfiltration. We can reasonably foresee a malicious actor being able to cause such a buffer overrun in a crypto library, because it's actually happened numerous times. Not all of those bugs would have been stopped by seccomp (see for example Heartbleed), but no security measure claims to solve all problems.

At the other extreme, of course a shell is going to call all manner of I/O syscalls (except *maybe* for the sockets API). It really doesn't make sense to try and limit what a shell can do, because the whole point of a shell is to facilitate arbitrary code execution (by the user who is typing commands). Yes, restricted shells exist, but those tend to be sandboxed along different dimensions than "which syscalls are fair game."

Most software is going to fall somewhere between these extremes. So where does that leave us? If I were an upstream, the lesson I would take from this is to just write sensible code, and let downstreams figure out their own security policies. If they file a bug telling me that some of my code is unreasonable, and therefore tripping seccomp, I might fix it. If they file a bug telling me that my code does something that is inconvenient for them, but not unreasonable from where I sit, I would WONTFIX it and let the pieces fall where they may.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 6:17 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> A hypothetical crypto library should not need to
> call into the sockets API
Except to set up the kernel-level TLS acceleration. Or it might need to make outgoing connections to validate CRLs, for example.

> create processes
OK.

> manipulate shared memory
Except if it wants to use uring, maybe?

> access the filesystem
> or do a wide variety of other I/O-ish things.
Read CA bundles.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 7:01 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Except to set up the kernel-level TLS acceleration.

Sure, if that's the specific thing that you are doing. But then the application logic knows you are doing that, and can avoid sandboxing it.

> Or it might need to make outgoing connections to validate CRLs, for example.

Gods, no. If the application wants to use a CRL, it downloads it separately, and before applying the sandbox. The crypto library could, of course, provide a helper function for that, but it should not be part of the "main" codepath unless the caller has somehow asked for it. You don't make outgoing connections behind the application code's back.

> Read CA bundles.

read(2) poses substantially less of a security risk than write(2) and open(2), so I don't actually have a problem with this.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 9:24 UTC (Sat) by storner (subscriber, #119) [Link] (2 responses)

> > Or it might need to make outgoing connections to validate CRLs, for example.

> Gods, no. If the application wants to use a CRL, it downloads it separately, and before applying the sandbox. The crypto library could, of course, provide a helper function for that, but it should not be part of the "main" codepath unless the caller has somehow asked for it. You don't make outgoing connections behind the application code's back.

Gods, no. CRL's from a public CA are huge and the cost (time, bandwidth, storage) of downloading one would be prohibitive in most cases. You normally use OCSP which requires an HTTP(S) network connection. So socket/network access is needed.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 10:56 UTC (Sat) by chris_se (subscriber, #99706) [Link]

> Gods, no. CRL's from a public CA are huge and the cost (time, bandwidth, storage) of downloading one would be prohibitive in most cases. You normally use OCSP which requires an HTTP(S) network connection. So socket/network access is needed.

Although in an ideal world everybody would use OCSP Stapling - that way it wouldn't require the client to do OCSP requests to arbitrary destinations, and only each server would need to perform such a request every two days or so, and that only to its own CA.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 18:20 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

A MitM can cause OCSP requests to fail, at which point most stacks choose fail-open. So OCSP provides no security benefit and should be removed to reduce attack surface and network chatter. Or else you should make it fail-closed, but nobody actually does that.

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 20:27 UTC (Sun) by rwmj (subscriber, #5474) [Link] (8 responses)

Filtering on system calls is somewhat ridiculous anyway. The proper way to do this is with capabilities. You are given a ticket which allows certain operations (eg. access a subdirectory in the filesystem), and you can create new tickets which are subsets of those operations that you hand down to the libraries and components you use. Capabilities are supported by the operating system so diagnosing problems and working out what capabilities are needed to carry out a whole task can be done at the level of the whole system.

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:00 UTC (Sun) by roc (subscriber, #30627) [Link] (7 responses)

I'm all for capabilities but the goal of seccomp is to reduce the attack surface of kernel code that the confined process can trigger execution of, and capabilities aren't always an appropriate way to express that.

For example almost every application needs read(). Most don't need the features provided by preadv2(), and those features trigger execution of a bunch of relatively new and untested kernel code. How would you use capabilities to control the ability of a confined process to access those features?

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:11 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (4 responses)

preadv2 and other new system calls provide new capabilities. These new capabilities let programs do a better job of serving the user. How are these programs supposed to deliver this improved utility to users if security policy blocks the new system calls?

It's circular: we have to block them because they're rare, and they're rare because we block them. We can't make progress that way.

I'm all for addressing specific known vulnerabilities, but this practice of reflexively blocking anything new has got to stop.

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:36 UTC (Sun) by roc (subscriber, #30627) [Link] (3 responses)

In practice, security needs vary, seccomp policies vary, and lots of software runs without a seccomp policy at all, so there is no circular deadlock.

Also, many seccomp policies are tailored to the needs of the software they confine, rather than the other way around. Don't tell Chrome or Firefox that they should stop using seccomp policies to sandbox their browser processes because the kernel community needs additional testing of kernel code ... which their browser processes only exercise if they've been compromised.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 0:04 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Chrome actually works just fine with pledge() - http://undeadly.org/cgi?action=article;sid=20160107075227

Raw syscall filtering really is looking like a bad solution.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 0:49 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

Sure, after modifying pledge() to make it work just fine with Chrome. https://marc.info/?l=openbsd-cvs&m=145207222327683&...

But that has nothing to do with this sub-thread, which is about whether capabilities obviate the need for seccomp.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 3:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

I honestly don't mind this approach. Fully generalized systems are not always the best solution.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 14:06 UTC (Mon) by MarcB (guest, #101804) [Link] (1 responses)

This raises the question of what seccomp is supposed to be.

Should it be some "personal firewall" to protect potentially vulnerable kernel code or should it restrict the functionality available to processes based on their needs (i.e. classical sandboxing)?

Personally, I think only the second concept is feasible. In that approach, there would be no difference whatsoever between read() and preadv2() - or clock_gettime64() and clock_gettime(). Those syscalls are equivalent in the sense that they allow a process to do exactly the same things.

If seccomp is used to filter arbitrary syscalls, this will lead to ossifications (can't reliably use new syscalls) and maintenance or portability nightmares (just look at the circumstances needed to trigger this problem here). And frankly, if the Linux kernel really needed such a protective filter, it would be high time to switch operating systems (or to significantly change Linux' development process wrt syscalls).

Applications and administrators should define security in terms of the security model provided by the operating system and not start second-guessing it. Doing so would cause the same madness operating system developers are currently experiencing with those hardware vulnerabilities, but on a much larger scale.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 21:48 UTC (Mon) by roc (subscriber, #30627) [Link]

Google developed seccomp-bpf for the Chrome sandbox and "protect potentially vulnerable kernel code" was explicitly a goal. I don't know why you think that isn't feasible; it is feasible, and it's working.

> And frankly, if the Linux kernel really needed such a protective filter,

It does. See https://events.linuxfoundation.org/wp-content/uploads/201...
The situation has not improved.

> it would be high time to switch operating systems (or to significantly change Linux' development process wrt syscalls).

Maybe so but for now seccomp-bpf is needed.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 18:22 UTC (Sat) by dullfire (guest, #111432) [Link] (4 responses)

> A hypothetical crypto library should not need to call into the sockets API, create processes, manipulate shared memory, access the filesystem, or do a wide variety of other I/O-ish things.

A crypto lib in a program that cannot do any of those things is kind of useless. (Or alternately: last I checked, seccomp applies to processes, not shared libs.)

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 19:51 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

True! And isn’t t̲h̲a̲t̲ part of the problem why the current solutions are useless (or rather, do more harm and create unreliability than they do good and create security)?

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 13:09 UTC (Mon) by leromarinvit (subscriber, #56850) [Link] (2 responses)

I thought the intended way to use seccomp was to compartmentalize your program into different processes, each allowed to use only the syscalls they need, communicating via some IPC mechanism? Of course that's more work than using plain function calls into a library, but is there anything stopping a library from implementing something like that internally, with the user-visible API just passing the data to the actual worker process?

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 13:27 UTC (Mon) by dullfire (guest, #111432) [Link] (1 responses)

you aren't wrong, but the example in question explicitly precludes IPC

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 15:59 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

The example doesn't preclude IPC. The crypto library doesn't need to be able to open files, set up new sockets, or create/map/unmap shared memory areas, but it can use files, sockets, or shared memory areas which are provided to it. For file- or socket-based IPC it just needs the read() and write() system calls inside the sandbox.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 1:15 UTC (Sat) by mirabilos (subscriber, #84359) [Link] (4 responses)

access(2), which is broken in many different ways on too many operating systems to list here.

In contrast to the freedesktop.org/systemd/GNOME people and, apparently, Google, I care for more than just GNU/Linux/{amd,arm}64.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 15:01 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

I'm just amazed that anyone could expect you to implement an entire shell without using stat-family syscalls but only using access(). WTF no that's just ridiculous. (Or that anyone would think that sandboxing a *shell* with seccomp, the very definition of something whose whole purpose is to execute arbitrary code, made any sense at all.)

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 16:17 UTC (Sat) by nivedita76 (subscriber, #121790) [Link]

Not seccomp, selinux. Though the overall point remains.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 16:32 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

This is a bit of a tangent, but: if you use stat to implement test, doesn't that require reimplementing (a subset of) the permission model yourself, and potentially missing system-specific mechanisms such as "there's an additional ACL here granting permission" or "the underlying filesystem is read-only / noexec"?

Also, I'd be curious what problems you've observed with the access system call on various operating systems.

a tangent (was vDSO, 32-bit time, and seccomp)

Posted Aug 22, 2019 22:13 UTC (Thu) by mirabilos (subscriber, #84359) [Link]

Hi,

the shell uses stat and looks at the various bits (mtime, mode, …) for tests.

The condition “read-only filesystem” is not in the scope of the tests (it’s more of a run-time vs. how-the-fs-tree-is-set-up question) and EROFS will be thrown on actual accesses by the kernel.

Most tests are very low-level:

-g file: file's mode has the setgid bit set.

Others aren’t, but…

-w file: file exists and is writable.

… considering this is a Unix shell, the Unix file attributes are checked, no extended ones, and I know of no portable way to check for them. (That being said, I do not deal with extended attributes at all, and mksh is normally developed on MirBSD which doesn’t have them anyway, but I understand at least OS/2 and Cygwin/Interix/UWIN/PW32 out of the supported platforms do, if HPFS/NTFS is the underlying filesystem; I’m not familiar enough with these.)

I’d have to look into why access(2) is not normally used. If it’s only false negatives, we could check _both_ access and stat, and return a failure if either one fails. This would be dead slow on most operating systems, so I’d only enable it for those that really need it.

I do know that access(2) says the file is executable if the caller is root and the file isn’t. There’s already an access wrapper in the code, and another one for OS/2 (that deals with adding .exe automatically if needed)…
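To illustrate the two approaches being compared, here is a rough sketch in Python rather than the shell's own C. The function names are hypothetical and this is not mksh's implementation; it only assumes the classic owner/group/other model for the stat-based check, which is exactly the part that cannot see ACLs or a read-only mount.

```python
import os
import stat

def has_setgid(path: str) -> bool:
    """Like `test -g`: True if the file's mode has the setgid bit set."""
    try:
        return bool(os.stat(path).st_mode & stat.S_ISGID)
    except OSError:
        return False

def writable_by_mode(path: str) -> bool:
    """Like `test -w`, judged from the owner/group/other bits only."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    euid = os.geteuid()
    if euid == 0:
        return True  # root bypasses the write bits
    if st.st_uid == euid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid == os.getegid() or st.st_gid in os.getgroups():
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)

def writable_by_access(path: str) -> bool:
    """Ask the kernel instead; os.access() wraps access(2)."""
    return os.access(path, os.W_OK)
```

The two writable checks can disagree: the mode-bit version knows nothing about ACLs or mount options, while access(2) answers for the real rather than the effective user ID by default.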

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 22:37 UTC (Sun) by marcH (subscriber, #57642) [Link]

> the architecture-specific vDSO used the native clock_gettime() call on the system it was running on; that meant calling the 32-bit clock_gettime() on 32-bit kernels.

> the generic vDSO implementation naturally used clock_gettime64() as the fallback timekeeping system call on all architectures.

> During the 5.3 merge window, the x86 architecture switched over to the generic version,

If the version of clock_gettime() invoked was really the *internal* implementation detail it seemed to be, there wouldn't have been any issue. Just like firewalls, the seccomp approach doesn't seem to care about layers and abstractions. This basically "promotes" internal implementation details to API rank, right? What could possibly go wrong.

> Even if the kernel community avoids incompatible changes, a change in a library somewhere can invoke a new system call that a given seccomp() policy may frown upon.

Sounds like a "yes".

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:04 UTC (Sun) by roc (subscriber, #30627) [Link]

It makes sense to protect code, as well as data, because the more kernel code a malicious process can cause to execute, the more kernel bugs it can trigger for exploitation.

Put a version number on the policy

Posted Aug 4, 2019 7:49 UTC (Sun) by epa (subscriber, #39769) [Link]

Surely the seccomp policy should be tagged with the kernel version it was originally written for. If the policy is for an old kernel (or predates the existence of version numbers), then apply the weird backwards-compatibility workarounds, where allowing one system call is taken to mean that you really intended to allow another one too. If the policy is for a new enough kernel, then apply it as-is. If the policy is truly ancient, the kernel could refuse it altogether (so the backwards-compatibility code doesn’t have to be maintained for ever and ever).
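A sketch of how such a scheme might behave, as a plain-Python model rather than anything seccomp offers today: there is no version tag in the real interface. The table `COMPAT_ALIASES`, the cutoff `OLDEST_SUPPORTED`, and the function `effective_allowlist` are all hypothetical, and the version numbers are illustrative.

```python
from typing import Optional, Set, Tuple

Version = Tuple[int, int]

# Hypothetical table: a policy written for a kernel older than `before`
# that allows the key is treated as also allowing the newer call.
COMPAT_ALIASES = {
    "clock_gettime": ("clock_gettime64", (5, 3)),
}

# Hypothetical cutoff below which policies are refused outright.
OLDEST_SUPPORTED: Version = (3, 5)

def effective_allowlist(allowed: Set[str],
                        written_for: Optional[Version] = None) -> Set[str]:
    """Expand a policy's allowlist according to the version it was written for.

    A policy with no version tag is treated as legacy and gets every alias.
    """
    if written_for is not None and written_for < OLDEST_SUPPORTED:
        raise ValueError("policy too old; compatibility support was dropped")
    result = set(allowed)
    for old_call, (new_call, before) in COMPAT_ALIASES.items():
        legacy = written_for is None or written_for < before
        if legacy and old_call in allowed:
            result.add(new_call)
    return result
```

With this model, a policy tagged (5, 2) that allows clock_gettime also gets clock_gettime64, while one tagged (5, 3) or later is applied exactly as written.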

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 19:07 UTC (Mon) by madscientist (subscriber, #16861) [Link] (2 responses)

> If the calling program wants an esoteric clock that has not been implemented

The use of "esoteric" here is IMO misleading. Any clock that doesn't have vDSO support is essentially useless unless you only want to call it rarely... and most of the nonstandard clocks are there precisely to provide the kind of precise timing that is needed when calling them often.

I'm glad that as of the generic rewrite it appears that CLOCK_MONOTONIC_RAW will _finally_ get the vDSO treatment (on Intel). This clock has been known to be virtually useless for years, with many blog posts pointing out (often without understanding why) that it's hundreds of times slower than CLOCK_MONOTONIC, even though its behavior is actually what people want when measuring time intervals and the clock_gettime() man page makes it sound like it should be the most efficient option.

If you investigate the reasons why vDSO CLOCK_MONOTONIC_RAW isn't already available you'll run across a somewhat depressing example of the kernel development model failing.
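For anyone who wants to check the per-call cost on their own kernel, a rough sketch of a micro-benchmark follows. The helper names are hypothetical, and interpreter overhead in Python dilutes the gap considerably compared to a C loop, so treat the numbers as relative hints only. It assumes a Unix system where `time.clock_gettime()` is available.

```python
import time

def per_call_ns(clock_id: int, calls: int = 200_000) -> float:
    """Average wall-clock cost of one clock_gettime() call, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime(clock_id)
    return (time.perf_counter_ns() - start) / calls

def compare_clocks(calls: int = 200_000) -> dict:
    """Time CLOCK_MONOTONIC and, where the platform has it, CLOCK_MONOTONIC_RAW."""
    clocks = {"CLOCK_MONOTONIC": time.CLOCK_MONOTONIC}
    raw = getattr(time, "CLOCK_MONOTONIC_RAW", None)
    if raw is not None:
        clocks["CLOCK_MONOTONIC_RAW"] = raw
    return {name: per_call_ns(cid, calls) for name, cid in clocks.items()}

if __name__ == "__main__":
    for name, cost in compare_clocks().items():
        print(f"{name}: {cost:.0f} ns per call")
```

A clock that is served from the vDSO should show a per-call cost close to the loop overhead; one that falls back to a real system call should be visibly slower.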

vDSO, 32-bit time, and seccomp

Posted Aug 6, 2019 13:24 UTC (Tue) by luto (guest, #39314) [Link] (1 responses)

Please explain how you find it to be a depressing failure.

CLOCK_MONOTONIC_RAW for the x86 vDSO was merged just a couple months after patches showed up. If there were significantly earlier requests, no one told me about them, and I’m the maintainer.

vDSO, 32-bit time, and seccomp

Posted Aug 6, 2019 18:01 UTC (Tue) by madscientist (subscriber, #16861) [Link]

Maybe I've got the timeline wrong: I'm not sure if by "a couple months" you mean the recent merge of the generic vDSO implementation, or something else.

Google shows that patches to add vDSO for Intel CLOCK_MONOTONIC_RAW were sent in March 2018 but it seems they were never applied; I can't find info on them via Google or "git log --grep".

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 23:41 UTC (Mon) by dezgeg (subscriber, #92243) [Link] (1 responses)

> That something is likely to be based on this patch series from Thomas Gleixner, which simply causes the vDSO to fall back to the 32-bit clock_gettime() system call on 32-bit systems. It is a solution that is pleasing to nobody, but it solves the regression issue for now. Some other solution will be required eventually; it is not possible to support 32-bit time indefinitely.

But to match the existing ABI of clock_gettime(), the return value of the function will have to fit a 32-bit struct timespec anyway in the end. So how is it an improvement to have the vDSO make a 64-bit clock_gettime64() call just to immediately truncate the seconds to 32 bits? Am I missing something?
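To make the truncation concrete, here is a small sketch of what happens when a 64-bit seconds count is squeezed into a signed 32-bit field. The helper names are hypothetical and the arithmetic models two's-complement wraparound; it is not the vDSO's code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_time32(seconds64: int) -> int:
    """Keep only the low 32 bits, interpreted as a signed value."""
    return ((seconds64 + 2**31) % 2**32) - 2**31

def as_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)

last_good = 2**31 - 1      # largest value a signed 32-bit field can hold
first_bad = 2**31          # one second later

print(as_utc(to_time32(last_good)))   # 2038-01-19 03:14:07+00:00
print(as_utc(to_time32(first_bad)))   # 1901-12-13 20:45:52+00:00
```

Whichever system call fetches the time, a caller that receives it through a 32-bit seconds field sees the same wraparound.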

vDSO, 32-bit time, and seccomp

Posted Aug 17, 2019 7:43 UTC (Sat) by mcortese (guest, #52099) [Link]

When a function is replaced by a new one with additional features and you know that the old one will eventually become deprecated, then it's preferable to use the new one even if you don't need the additional features.
