vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 20:27 UTC (Fri) by nix (subscriber, #2304)
In reply to: vDSO, 32-bit time, and seccomp by chris_se
Parent article: vDSO, 32-bit time, and seccomp

Yes, it is. glibc's change to not cache getpid() results (so that it worked better with containers, etc) in the 2.25 timeframe broke BIND's named because it was relying on the cache and not allowing getpid() through its seccomp rules. In the end seccomp support was just removed from BIND because it was reducing reliability more than it was gaining in security.
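As a toy illustration of that failure mode (a model only, not real seccomp; `ToyFilter`, `CachingLibc`, `NonCachingLibc` and `run_daemon` are names made up for this sketch), here is how an allowlist that never mentioned getpid() can appear to work while the C library caches the result, and then break once the cache goes away:

```python
# Toy model of an allowlist-style syscall filter; nothing here talks to the kernel.
class SyscallDenied(Exception):
    pass


class ToyFilter:
    """Hypothetical stand-in for a seccomp allowlist."""

    def __init__(self, allowed):
        self.allowed = frozenset(allowed)
        self.active = False

    def syscall(self, name):
        # Before the filter is installed, everything is allowed.
        if self.active and name not in self.allowed:
            raise SyscallDenied(name)
        return 1234 if name == "getpid" else 0


class CachingLibc:
    """Models a libc that caches the PID after the first call."""

    def __init__(self, kernel):
        self.kernel = kernel
        self._pid = None

    def getpid(self):
        if self._pid is None:
            self._pid = self.kernel.syscall("getpid")
        return self._pid


class NonCachingLibc:
    """Models a libc that enters the kernel on every getpid() call."""

    def __init__(self, kernel):
        self.kernel = kernel

    def getpid(self):
        return self.kernel.syscall("getpid")


def run_daemon(libc_class):
    kernel = ToyFilter(allowed={"read", "write"})  # getpid was never listed
    libc = libc_class(kernel)
    libc.getpid()          # called once during startup, before sandboxing
    kernel.active = True   # the filter is installed
    try:
        return libc.getpid()   # called again later, e.g. for logging
    except SyscallDenied as exc:
        return "denied: " + str(exc)
```

With the caching model the second call never reaches the filter; with the non-caching model the same program, unchanged, trips it.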

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 23:19 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (30 responses)

Right. I'm really not a fan of using seccomp and SELinux to ban random system calls to "reduce attack surface". This practice can cause hard-to-debug problems when programs that legitimately use supported system calls in rare cases see unexpected errors. Security should, IMHO, work on the basis of protecting data, not code.

vDSO, 32-bit time, and seccomp

Posted Aug 2, 2019 23:33 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (28 responses)

The Android people had the gall to complain to me because the shell I maintain uses stat() for the test builtin (things like file existence) and they disallow stat in their SELinux policies…

I agree, this is ridiculous.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 0:57 UTC (Sat) by nix (subscriber, #2304) [Link] (27 responses)

... how else are you supposed to implement it? Use fstatat in particular, or something? And then what happens if someone else has a different policy?

This is ridiculous. It drives a truck through ABI stability guarantees, even guarantees as carefully maintained as (say) glibc's.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 1:06 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (20 responses)

The practice of blacklisting arbitrary system calls also creates a perverse incentive: if I, a program author, want to maintain flexibility, I should have my program call as many different system calls as I can lest I lose access to the ones I don't call.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 4:38 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (19 responses)

I think there is a balance here.

A hypothetical crypto library should not need to call into the sockets API, create processes, manipulate shared memory, access the filesystem, or do a wide variety of other I/O-ish things. A malicious actor trying to exploit a buffer overrun would very much like to do those things, for all manner of reasons, but particularly for key exfiltration. We can reasonably foresee a malicious actor being able to cause such a buffer overrun in a crypto library, because it's actually happened numerous times. Not all of those bugs would have been stopped by seccomp (see for example Heartbleed), but no security measure claims to solve all problems.

At the other extreme, of course a shell is going to call all manner of I/O syscalls (except *maybe* for the sockets API). It really doesn't make sense to try and limit what a shell can do, because the whole point of a shell is to facilitate arbitrary code execution (by the user who is typing commands). Yes, restricted shells exist, but those tend to be sandboxed along different dimensions than "which syscalls are fair game."

Most software is going to fall somewhere between these extremes. So where does that leave us? If I were an upstream, the lesson I would take from this is to just write sensible code, and let downstreams figure out their own security policies. If they file a bug telling me that some of my code is unreasonable, and therefore tripping seccomp, I might fix it. If they file a bug telling me that my code does something that is inconvenient for them, but not unreasonable from where I sit, I would WONTFIX it and let the pieces fall where they may.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 6:17 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> A hypothetical crypto library should not need to
> call into the sockets API
Except to set up the kernel-level TLS acceleration. Or it might need to make outgoing connections to validate CRLs, for example.

> create processes
OK.

> manipulate shared memory
Except if it wants to use uring, maybe?

> access the filesystem
> or do a wide variety of other I/O-ish things.
Read CA bundles.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 7:01 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Except to set up the kernel-level TLS acceleration.

Sure, if that's the specific thing that you are doing. But then the application logic knows you are doing that, and can avoid sandboxing it.

> Or it might need to make outgoing connections to validate CRLs, for example.

Gods, no. If the application wants to use a CRL, it downloads it separately, and before applying the sandbox. The crypto library could, of course, provide a helper function for that, but it should not be part of the "main" codepath unless the caller has somehow asked for it. You don't make outgoing connections behind the application code's back.

> Read CA bundles.

read(2) poses substantially less of a security risk than write(2) and open(2), so I don't actually have a problem with this.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 9:24 UTC (Sat) by storner (subscriber, #119) [Link] (2 responses)

> > Or it might need to make outgoing connections to validate CRLs, for example.

> Gods, no. If the application wants to use a CRL, it downloads it separately, and before applying the sandbox. The crypto library could, of course, provide a helper function for that, but it should not be part of the "main" codepath unless the caller has somehow asked for it. You don't make outgoing connections behind the application code's back.

Gods, no. CRLs from a public CA are huge and the cost (time, bandwidth, storage) of downloading one would be prohibitive in most cases. You normally use OCSP, which requires an HTTP(S) network connection. So socket/network access is needed.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 10:56 UTC (Sat) by chris_se (subscriber, #99706) [Link]

> Gods, no. CRLs from a public CA are huge and the cost (time, bandwidth, storage) of downloading one would be prohibitive in most cases. You normally use OCSP, which requires an HTTP(S) network connection. So socket/network access is needed.

Although in an ideal world everybody would use OCSP Stapling - that way it wouldn't require the client to do OCSP requests to arbitrary destinations, and only each server would need to perform such a request every two days or so, and that only to its own CA.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 18:20 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

A MitM can cause OCSP requests to fail, at which point most stacks choose fail-open. So OCSP provides no security benefit and should be removed to reduce attack surface and network chatter. Or else you should make it fail-closed, but nobody actually does that.
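A minimal sketch of the fail-open versus fail-closed decision, assuming a simplified three-way result (the `OcspResult` enum and `accept_certificate` function are invented for illustration and do not correspond to any TLS stack's API):

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNREACHABLE = "unreachable"  # timeout, or an attacker dropping the request


def accept_certificate(result, fail_open):
    """Return True if the connection should proceed."""
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No answer from the responder: fail-open accepts, fail-closed rejects.
    return fail_open
```

An attacker who can block the request turns every REVOKED answer into UNREACHABLE, so under fail-open the revoked certificate is accepted anyway.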

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 20:27 UTC (Sun) by rwmj (subscriber, #5474) [Link] (8 responses)

Filtering on system calls is somewhat ridiculous anyway. The proper way to do this is with capabilities. You are given a ticket which allows certain operations (eg. access a subdirectory in the filesystem), and you can create new tickets which are subsets of those operations that you hand down to the libraries and components you use. Capabilities are supported by the operating system so diagnosing problems and working out what capabilities are needed to carry out a whole task can be done at the level of the whole system.
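To make the ticket idea concrete, here is a rough user-space model (purely hypothetical; the `Ticket` class and its methods are invented for this sketch and are not an existing OS interface). The key property is that a derived ticket can only narrow what its parent allowed:

```python
from dataclasses import dataclass
import posixpath


@dataclass(frozen=True)
class Ticket:
    """Hypothetical capability: a set of operations under one directory."""
    root: str
    ops: frozenset

    def _contains(self, path):
        # True if path is the root itself or lies below it.
        return path == self.root or path.startswith(self.root.rstrip("/") + "/")

    def attenuate(self, subdir=".", ops=None):
        """Derive a ticket that is never more powerful than this one."""
        new_root = posixpath.normpath(posixpath.join(self.root, subdir))
        if not self._contains(new_root):
            raise PermissionError("cannot widen the directory")
        new_ops = self.ops if ops is None else frozenset(ops)
        if not new_ops <= self.ops:
            raise PermissionError("cannot add operations")
        return Ticket(new_root, new_ops)

    def permits(self, op, path):
        return self._contains(posixpath.normpath(path)) and op in self.ops
```

An application holding read/write access to /srv/app could hand a library `app.attenuate("certs", {"read"})` and nothing more.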

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:00 UTC (Sun) by roc (subscriber, #30627) [Link] (7 responses)

I'm all for capabilities but the goal of seccomp is to reduce the attack surface of kernel code that the confined process can trigger execution of, and capabilities aren't always an appropriate way to express that.

For example almost every application needs read(). Most don't need the features provided by preadv2(), and those features trigger execution of a bunch of relatively new and untested kernel code. How would you use capabilities to control the ability of a confined process to access those features?

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:11 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (4 responses)

preadv2 and other new system calls provide new capabilities. These new capabilities let programs do a better job of serving the user. How are these programs supposed to deliver this improved utility to users if security policy blocks the new system calls?

It's circular: we have to block them because they're rare, and they're rare because we block them. We can't make progress that way.

I'm all for addressing specific known vulnerabilities, but this practice of reflexively blocking anything new has got to stop.

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:36 UTC (Sun) by roc (subscriber, #30627) [Link] (3 responses)

In practice, security needs vary, seccomp policies vary, and lots of software runs without a seccomp policy at all, so there is no circular deadlock.

Also, many seccomp policies are tailored to the needs of the software they confine, rather than the other way around. Don't tell Chrome or Firefox that they should stop using seccomp policies to sandbox their browser processes because the kernel community needs additional testing of kernel code ... which their browser processes only exercise if they've been compromised.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 0:04 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Chrome actually works just fine with pledge() - http://undeadly.org/cgi?action=article;sid=20160107075227

Raw syscall filtering really is looking like a bad solution.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 0:49 UTC (Mon) by roc (subscriber, #30627) [Link] (1 responses)

Sure, after modifying pledge() to make it work just fine with Chrome. https://marc.info/?l=openbsd-cvs&m=145207222327683&...

But that has nothing to do with this sub-thread, which is about whether capabilities obviate the need for seccomp.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 3:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

I honestly don't mind this approach. Fully generalized systems are not always the best solution.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 14:06 UTC (Mon) by MarcB (subscriber, #101804) [Link] (1 responses)

This raises the question what seccomp is supposed to be.

Should it be some "personal firewall" to protect potentially vulnerable kernel code or should it restrict the functionality available to processes based on their needs (i.e. classical sandboxing)?

Personally, I think only the second concept is feasible. In that approach, there would be no difference whatsoever between read() and preadv2() - or clock_gettime64() and clock_gettime(). Those syscalls are equivalent in the sense that they allow a process to do exactly the same things.

If seccomp is used to filter arbitrary syscalls, this will lead to ossifications (can't reliably use new syscalls) and maintenance or portability nightmares (just look at the circumstances needed to trigger this problem here). And frankly, if the Linux kernel really needed such a protective filter, it would be high time to switch operating systems (or to significantly change Linux' development process wrt syscalls).
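The difference between the two approaches can be sketched as follows (a toy comparison; the group names and their contents are invented here, loosely in the spirit of pledge(), and are not taken from any real policy):

```python
# Hypothetical functional groups; names and membership are assumptions
# made for this sketch only.
FUNCTION_GROUPS = {
    "read-fd": {"read", "pread64", "preadv", "preadv2"},
    "clock": {"clock_gettime", "clock_gettime64", "gettimeofday"},
}


def allowed_by_name(policy, syscall):
    """Per-syscall allowlist: only exact names pass."""
    return syscall in policy


def allowed_by_function(granted_groups, syscall):
    """Functional policy: any syscall in a granted group passes."""
    return any(syscall in FUNCTION_GROUPS[g] for g in granted_groups)
```

A name-based policy written when only clock_gettime() was in use rejects clock_gettime64(); a policy phrased as "may read the clock" accepts both. This sketch says nothing about the kernel attack-surface argument made elsewhere in the thread.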

Applications and administrators should define security in terms of the security model provided by the operating system and not start second-guessing it. Doing so would cause the same madness operating system developers are currently experiencing with those hardware vulnerabilities, but on a much larger scale.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 21:48 UTC (Mon) by roc (subscriber, #30627) [Link]

Google developed seccomp-bpf for the Chrome sandbox and "protect potentially vulnerable kernel code" was explicitly a goal. I don't know why you think that isn't feasible; it is feasible, and it's working.

> And frankly, if the Linux kernel really needed such a protective filter,

It does. See https://events.linuxfoundation.org/wp-content/uploads/201...
The situation has not improved.

> it would be high time to switch operating systems (or to significantly change Linux' development process wrt syscalls).

Maybe so but for now seccomp-bpf is needed.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 18:22 UTC (Sat) by dullfire (guest, #111432) [Link] (4 responses)

> A hypothetical crypto library should not need to call into the sockets API, create processes, manipulate shared memory, access the filesystem, or do a wide variety of other I/O-ish things.

A crypto lib in a program that cannot do any of those things is kind of useless. (Or alternatively, last I checked, seccomp applies to processes, not shared libs.)

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 19:51 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

True! And isn’t t̲h̲a̲t̲ part of the problem why the current solutions are useless (or rather, do more harm and create unreliability than they do good and create security)?

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 13:09 UTC (Mon) by leromarinvit (subscriber, #56850) [Link] (2 responses)

I thought the intended way to use seccomp was to compartmentalize your program into different processes, each allowed to use only the syscalls they need, communicating via some IPC mechanism? Of course that's more work than using plain function calls into a library, but is there anything stopping a library from implementing something like that internally, with the user-visible API just passing the data to the actual worker process?

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 13:27 UTC (Mon) by dullfire (guest, #111432) [Link] (1 responses)

You aren't wrong, but the example in question explicitly precludes IPC.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 15:59 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

The example doesn't preclude IPC. The crypto library doesn't need to be able to open files, set up new sockets, or create/map/unmap shared memory areas, but it can use files, sockets, or shared memory areas which are provided to it. For file- or socket-based IPC it just needs the read() and write() system calls inside the sandbox.
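A minimal sketch of that arrangement, assuming pipes created by the unsandboxed side and SHA-256 as a stand-in for the real crypto work (`worker_step` and `demo` are hypothetical names; no actual sandbox is installed here, and the worker runs in-process only to keep the example short):

```python
import hashlib
import os


def worker_step(in_fd, out_fd):
    """Handle one request using only read() and write() on inherited fds."""
    request = os.read(in_fd, 4096)
    digest = hashlib.sha256(request).digest()  # stand-in for real crypto work
    os.write(out_fd, digest)


def demo():
    # The unsandboxed side creates the channels before any filter would apply.
    req_r, req_w = os.pipe()
    resp_r, resp_w = os.pipe()
    try:
        os.write(req_w, b"hello")
        # In a real design this would run in a separate, sandboxed process.
        worker_step(req_r, resp_w)
        return os.read(resp_r, 32)
    finally:
        for fd in (req_r, req_w, resp_r, resp_w):
            os.close(fd)
```

The worker never opens anything itself; it only reads and writes descriptors it was handed.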

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 1:15 UTC (Sat) by mirabilos (subscriber, #84359) [Link] (4 responses)

access(2), which is broken in many different ways on too many operating systems to list here.

In contrast to the freedesktop.org/systemd/GNOME people and, apparently, Google, I care for more than just GNU/Linux/{amd,arm}64.

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 15:01 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

I'm just amazed that anyone could expect you to implement an entire shell without using stat-family syscalls but only using access(). WTF no that's just ridiculous. (Or that anyone would think that sandboxing a *shell* with seccomp, the very definition of something whose whole purpose is to execute arbitrary code, made any sense at all.)

vDSO, 32-bit time, and seccomp

Posted Aug 3, 2019 16:17 UTC (Sat) by nivedita76 (subscriber, #121790) [Link]

Not seccomp, selinux. Though the overall point remains.

vDSO, 32-bit time, and seccomp

Posted Aug 5, 2019 16:32 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

This is a bit of a tangent, but: if you use stat to implement test, doesn't that require reimplementing (a subset of) the permission model yourself, and potentially missing system-specific mechanisms such as "there's an additional ACL here granting permission" or "the underlying filesystem is read-only / noexec"?

Also, I'd be curious what problems you've observed with the access system call on various operating systems.

a tangent (was vDSO, 32-bit time, and seccomp)

Posted Aug 22, 2019 22:13 UTC (Thu) by mirabilos (subscriber, #84359) [Link]

Hi,

the shell uses stat and looks at the various bits (mtime, mode, …) for tests.

The condition “read-only filesystem” is not in the scope of the tests (it’s more of a run-time vs. how-the-fs-tree-is-set-up question) and EROFS will be thrown on actual accesses by the kernel.

Most tests are very low-level:

-g file    file's mode has the setgid bit set.

Others aren’t, but…

-w file    file exists and is writable.

… considering this is a Unix shell, the Unix file attributes are checked, no extended ones, and I know of no portable way to check for them. (That being said, I do not deal with extended attributes at all, and mksh is normally developed on MirBSD which doesn’t have them anyway, but I understand at least OS/2 and Cygwin/Interix/UWIN/PW32 out of the supported platforms do, if HPFS/NTFS is the underlying filesystem; I’m not familiar enough with these.)

I’d have to look why access(2) is not normally used. If it’s only false negatives, we could check _both_ access and stat, and if one fails return a failure. This would be dead slow on most operating systems, so I’d only enable it for those that really need it.

I do know that access(2) says the file is executable if the caller is root, even when the file isn’t. There’s already an access wrapper in the code, and another one for OS/2 (that deals with adding .exe automatically if needed)…
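For readers who want to see the two approaches side by side, here is a rough sketch (not mksh's actual code; the function names are made up, and the stat-based check deliberately looks only at the classic Unix mode bits, ignoring ACLs, read-only mounts and any root override):

```python
import os
import stat


def has_setgid(mode):
    """test -g: the setgid bit is set in st_mode."""
    return bool(mode & stat.S_ISGID)


def mode_says_writable(mode, file_uid, file_gid, uid, gids):
    """test -w from classic Unix mode bits only."""
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


def compare(path):
    """Return (stat-based answer, access()-based answer) for writability."""
    st = os.stat(path)
    gids = set(os.getgroups()) | {os.getegid()}
    by_mode = mode_says_writable(st.st_mode, st.st_uid, st.st_gid,
                                 os.geteuid(), gids)
    # Note: access() checks with the real, not effective, UID and GID.
    return by_mode, os.access(path, os.W_OK)
```

The two answers returned by `compare()` can legitimately differ, for example when running as root or on a read-only filesystem.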

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 22:37 UTC (Sun) by marcH (subscriber, #57642) [Link]

> the architecture-specific vDSO used the native clock_gettime() call on the system it was running on; that meant calling the 32-bit clock_gettime() on 32-bit kernels.

> the generic vDSO implementation naturally used clock_gettime64() as the fallback timekeeping system call on all architectures.

> During the 5.3 merge window, the x86 architecture switched over to the generic version,

If the version of clock_gettime() invoked was really the *internal* implementation detail it seemed to be, there wouldn't have been any issue. Just like firewalls, the seccomp approach doesn't seem to care about layers and abstractions. This basically "promotes" internal implementation details to API rank, right? What could possibly go wrong.

> Even if the kernel community avoids incompatible changes, a change in a library somewhere can invoke a new system call that a given seccomp() policy may frown upon.

Sounds like a "yes".

vDSO, 32-bit time, and seccomp

Posted Aug 4, 2019 21:04 UTC (Sun) by roc (subscriber, #30627) [Link]

It makes sense to protect code, as well as data, because the more kernel code a malicious process can cause to execute, the more kernel bugs it can trigger for exploitation.