The problem with the asynchronous bsg interface

By Jonathan Corbet
July 19, 2018

The kernel supports two different "SCSI generic" pseudo-devices, each of which allows user space to send arbitrary commands to a SCSI-attached device. Both SCSI-generic implementations have proved to have security issues in the past as a result of the way their API was designed. In the case of one of those drivers, these problems seem almost certain to lead to the removal of a significant chunk of functionality in the 4.19 development cycle.

The SCSI standard is generally thought of as a way to control storage devices, such as disk and tape drives (younger readers, ask a coworker what the latter were). But SCSI can be thought of as a sort of network protocol with more general capabilities, as demonstrated by its use to control tape-changing robots, scanners, optical-disk writers, and more. Drivers for such devices tend to run in user space; to support those drivers, the SCSI generic (SG) interface was created. This interface provides direct access to the SCSI protocol, allowing user-space code to control devices in ways not supported by the in-kernel disk and tape drivers.

The original SG interface was simply called "sg"; like the "sd" driver for SCSI disks and "st" driver for tape drives, its name highlights the SCSI developers' focus on efficiency, in that no letters were wasted. The sg driver implements a low-level device that interfaces directly with the SCSI midlayer. Back in 2004, Jens Axboe posted a new implementation that he called "bsg"; unlike sg, it worked at the level of the block layer, taking advantage of its request-queue infrastructure to manage SCSI operations. It took a while, but bsg was finally merged for the 2.6.23 release in 2007. Since then, both interfaces have coexisted in the kernel. The sg interface retains a number of users; older code makes up some of them, but some users have found that it works better for their needs (as will be revisited below). The bsg interface, instead, is the only way to gain access to some newer SCSI protocol features.

Both devices implement two different APIs to accomplish the same task. The synchronous interface uses ioctl() commands; results of operations are returned when ioctl() returns. There is also an asynchronous interface based on simple read() and write() calls, where one uses write() to issue a command, followed by a later read() to obtain the results. The system calls involved are simple, but the data that is transferred is not: SCSI commands are executed by writing an sg_io_hdr structure to the device. The structure is complex in its own right, but it can also contain pointers to other ranges of user-space memory. Normally, a write() call will not access memory outside of the provided buffer; with these interfaces, instead, a write() call can cause accesses to memory almost anywhere in the address space.
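To make the indirection concrete, here is a minimal sketch that assumes the sg driver's sg_io_hdr as declared in <scsi/sg.h>. The helper name fill_inquiry_hdr() and the choice of a SCSI INQUIRY command are illustrative assumptions, not something taken from the discussion. The point is that the structure handed to write() (or to the SG_IO ioctl()) carries three pointers to memory outside itself.

```c
#include <scsi/sg.h>   /* sg_io_hdr_t, SG_DXFER_FROM_DEV */
#include <string.h>    /* memset */

/*
 * Hypothetical helper: fill in an sg_io_hdr for a standard SCSI INQUIRY
 * (opcode 0x12). Nothing is sent to a device here; the caller would pass
 * the header to write() or ioctl(fd, SG_IO, hdr).
 */
static void fill_inquiry_hdr(sg_io_hdr_t *hdr, unsigned char cdb[6],
                             unsigned char *data, unsigned char data_len,
                             unsigned char *sense, unsigned char sense_len)
{
    memset(cdb, 0, 6);
    cdb[0] = 0x12;                 /* INQUIRY opcode */
    cdb[4] = data_len;             /* allocation length */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';       /* marks this as an sg_io_hdr */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->cmd_len = 6;
    hdr->cmdp = cdb;               /* pointer 1: the command bytes */
    hdr->dxferp = data;            /* pointer 2: the data buffer */
    hdr->dxfer_len = data_len;
    hdr->sbp = sense;              /* pointer 3: the sense (error) buffer */
    hdr->mx_sb_len = sense_len;
    hdr->timeout = 5000;           /* milliseconds */
}
```

None of the three buffers lies inside the sizeof(sg_io_hdr_t) bytes passed to write(), so the kernel has to follow each pointer on its own.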

The dangers of this kind of interface have become increasingly clear in recent years. In this case, there have been a few security issues related to indirect memory access through the SG devices. There is also the persistent concern that an attacker may succeed in convincing a setuid program to write the wrong thing to such a device, opening up another vulnerability. Worries about this kind of problem led to the recent rejection of the write-based filesystem mounting API. For SG, though, the interfaces have been established for a long time, so they cannot be withdrawn without breaking applications.

For bsg, though, that may not actually be the case.

In June, Jann Horn tried to harden these interfaces by adding more restrictions on the contexts in which they can be used. Almost as an aside, the changelog noted that, in the case of bsg, arbitrary access to memory can also happen in a release() call, when the file descriptor is being closed. That immediately set off a new round of alarms; even a legitimate user-space memory access can run into trouble at release time, when that memory may no longer be present. The results would be unpredictable — but they would be predictably bad.

There was some discussion about how this problem might be fixed, but it didn't take long for Christoph Hellwig to suggest that the asynchronous side of the bsg interface be removed outright. There are reasons to believe that it is not actually being used in the real world, some of which were described by Douglas Gilbert, the maintainer of the sg interface. Among other things, if two processes are issuing commands to the same device, bsg is unable to keep the responses straight. "Once real world users (needing an async SCSI (or general storage) pass-through) find out about that bsg 'feature', they don't use it". Horn did some searching in the Debian Code Search database and concluded that there were no users that needed to be worried about.

The end result of the discussion is that Axboe has merged Hellwig's patch to remove the asynchronous bsg functionality. The synchronous ioctl()-based API, which does not have the same problems (and which is actually used by applications), will remain. Linus Torvalds has stated that this patch should be applied to the stable kernels as well. So, unless some users of the asynchronous API come forward in the near future, this particular feature will soon disappear.


The problem with the asynchronous bsg interface

Posted Jul 20, 2018 2:25 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

I always wondered why interfaces in Linux are designed in such a way that the kernel has to chase pointers into the userspace.

Wouldn't it be easier to pack all the data into a self-contained chunk of memory and then send it to the kernel?

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 3:02 UTC (Fri) by willy (subscriber, #9762) [Link] (3 responses)

A single SCSI command can operate on gigabytes or even exabytes of data. It may not be feasible to copy all of it, let alone be efficient to do so.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 6:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Seriously? The actual data related to the request, that will have to be transferred?

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 14:02 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link]

SG currently limits data transfers to 256 MB. But that is still a lot.

/* excerpt from sg_common_write() in drivers/scsi/sg.c */
	if (hp->dxfer_len >= SZ_256M)	/* reject transfers of 256 MB or more */
		return -EINVAL;

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 7:37 UTC (Fri) by mjthayer (guest, #39183) [Link]

Actually the first thing which came to my mind was splitting the write command into several, one (at least) for each disjoint chunk of memory. No need for additional copies there, but it still eliminates the wild pointers.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 14:08 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

Advanced applications manage their data buffers in complex ways and try to avoid copying if possible, hence the desire for iovecs in readv()/writev()/sendmsg()/recvmsg()/etc. sg also supports iovecs for its data buffers. But sg is more like sendmsg()/recvmsg() where multiple *types* of buffers for different purposes are specified for a single command, e.g. a data buffer for I/O and a "sense" buffer for error information. Requiring them all to be packed together along with the control header information would make the interface less efficient and more difficult to use.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 20:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

iovec is an extension of that - you can send multiple chunks at once, but they are still bounded and don't require pointer chasing.
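As a rough illustration of the iovec approach, the sketch below sends a header and a payload from two separate buffers with a single writev() call. The helper name send_two_chunks() is hypothetical. Every region the kernel will read is named in the syscall arguments together with its length, and no extra copy is made in user space.

```c
#include <sys/types.h> /* ssize_t */
#include <sys/uio.h>   /* writev, struct iovec */
#include <unistd.h>    /* pipe, read, close */
#include <string.h>    /* memcmp (used by callers checking the result) */

/*
 * Hypothetical helper: write a header and a payload held in separate
 * buffers with one writev(). Returns the number of bytes written, or -1
 * on error.
 */
static ssize_t send_two_chunks(int fd, const void *hdr, size_t hdr_len,
                               const void *payload, size_t payload_len)
{
    struct iovec iov[2];

    /* writev() only reads from these buffers; the casts drop const
     * because iov_base is declared as a plain void pointer. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len  = payload_len;

    return writev(fd, iov, 2);
}
```

Called on the write end of a pipe with "HDR:" and "data", this delivers the eight bytes "HDR:data" to the reader as one contiguous stream.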

VFS read/write access under KERNEL_DS

Posted Jul 20, 2018 3:49 UTC (Fri) by TheJH (subscriber, #101155) [Link]

For context, here is how such issues can be abused to corrupt kernel memory, in particular via sys_splice(): https://www.spinics.net/lists/linux-rdma/msg36015.html

Basically, there are a few places in the kernel (sys_splice() is the most interesting one, but there are others) that call VFS read/write handlers under KERNEL_DS, so that copy_to_user()/copy_from_user() can also access kernel memory; all the copy_to_user()/copy_from_user() calls in VFS read/write context can be treated as essentially equivalent to __copy_from_user()/__copy_to_user(). The buffer/length pair provided to the read/write handler is guaranteed to be safe, but if you either access other random addresses or access the provided buffer beyond the provided length, bad stuff happens.

One particularly annoying thing about this kind of bug is that KASAN doesn't see the bogus access, and pagefaults on kernel addresses don't trigger oopses (because they are treated as userspace faults, so you just get -EFAULT). So if you hit this kind of bug with something like a fuzzer, you're unlikely to actually notice anything. I wonder whether I should try to write a patch to change that... maybe let the pagefault handler ignore uaccess fixups when KERNEL_DS is active, with an exception for __probe_kernel_read/__probe_kernel_write or so?

Another slightly related bug, from 2016: https://bugs.chromium.org/p/project-zero/issues/detail?id... - this one wasn't in VFS context, but in handler code for performance counter overflows, which can trigger in pretty much any context.

Another related bug (not from me): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/... - this one is an overflow beyond the end of the provided buffer in a debugfs read handler; if you are root and fiddle around with splice a bit, you can get this to overflow beyond the pipe page, crashing the machine.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 13:37 UTC (Fri) by vtl (guest, #121291) [Link] (2 responses)

"There are reasons to believe that it is not actually being used in the real world"

In one of my past jobs I extended sg to support AIO. AIO, SCSI and per-IO SCSI sense codes were required for our proprietary datapath running in userspace. So there are users, they are just under the radar.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 14:04 UTC (Fri) by felixfix (subscriber, #242) [Link] (1 responses)

But you used sg; this is bsg.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 14:22 UTC (Fri) by vtl (guest, #121291) [Link]

Yes, it was just a side note that not all API users are immediately visible.

The problem with the asynchronous bsg interface

Posted Jul 20, 2018 14:00 UTC (Fri) by dullfire (guest, #111432) [Link] (1 responses)

Wouldn't it have been better to use writev(2) instead of write(2)? I think that would allow you to "write" all the data at once, and still have zero extra userspace memcpy's.

Think ioctl()

Posted Jul 20, 2018 15:16 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link]

Interesting idea, but it is more complicated than that. In the sg API, the write() syscall means "start a command" and the read() syscall means "get command completion result". The command may be a SCSI READ, SCSI WRITE, or something else. So you can start a SCSI READ command with a write() syscall, and you can get the command completion of a SCSI WRITE command with a read() syscall. Add in direct I/O, and you will never be able to simplify the API down to the original syscall meanings. Essentially, the sg read()/write() syscalls behave more like ioctl().

Here is an example:

Example: SCSI READ command using direct I/O
1. Allocate a buffer to hold the data being read.
2. Start the SCSI READ command via a write() syscall to sg, passing the address of the buffer.
3. Wait for command completion; the SCSI HBA DMAs directly into the buffer.
4. Use a read() syscall to sg to get the command result.

If using indirect I/O, you could design an interface where the buffer was passed to the read() syscall at command completion (although the sg driver doesn't work like that). But using direct I/O, the kernel needs the buffer when the command is started, so it is passed in the write() syscall. So there is no way to map "start a command" to the original meaning of the write() (or writev()) syscall using direct I/O. It makes more sense to think of it as an ioctl().
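The direct-I/O flow described above can be sketched in C as follows. This is an illustration under assumptions rather than a reference implementation: struct sg_read_req, sg_start_read10() and sg_finish_read() are hypothetical names, the command is a SCSI READ(10), and the caller is assumed to supply an open sg file descriptor and a data buffer that meets the driver's alignment requirements for direct I/O.

```c
#include <scsi/sg.h>   /* sg_io_hdr_t, SG_DXFER_FROM_DEV, SG_FLAG_DIRECT_IO */
#include <string.h>    /* memset */
#include <sys/types.h> /* ssize_t */
#include <unistd.h>    /* read, write */

/* Hypothetical per-command state; kept alive by the caller until the
 * command has completed. */
struct sg_read_req {
    unsigned char cdb[10];
    unsigned char sense[32];
    sg_io_hdr_t hdr;
};

/* Start a READ(10): the data buffer is handed over in the write() call.
 * Returns 0 if the command was queued, -1 on error. */
static int sg_start_read10(int sg_fd, struct sg_read_req *req, void *buf,
                           unsigned int lba, unsigned short blocks,
                           unsigned int block_size)
{
    memset(req, 0, sizeof(*req));
    req->cdb[0] = 0x28;                     /* READ(10) opcode */
    req->cdb[2] = (lba >> 24) & 0xff;       /* LBA, big-endian */
    req->cdb[3] = (lba >> 16) & 0xff;
    req->cdb[4] = (lba >> 8) & 0xff;
    req->cdb[5] = lba & 0xff;
    req->cdb[7] = (blocks >> 8) & 0xff;     /* transfer length in blocks */
    req->cdb[8] = blocks & 0xff;

    req->hdr.interface_id = 'S';
    req->hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    req->hdr.cmd_len = sizeof(req->cdb);
    req->hdr.cmdp = req->cdb;
    req->hdr.dxferp = buf;                  /* destination of the data */
    req->hdr.dxfer_len = (unsigned int)blocks * block_size;
    req->hdr.sbp = req->sense;
    req->hdr.mx_sb_len = sizeof(req->sense);
    req->hdr.flags = SG_FLAG_DIRECT_IO;     /* ask for direct I/O */
    req->hdr.timeout = 20000;               /* milliseconds */

    /* write() means "start a command", whatever the command is. */
    ssize_t n = write(sg_fd, &req->hdr, sizeof(req->hdr));
    return n == (ssize_t)sizeof(req->hdr) ? 0 : -1;
}

/* Collect the completion: read() means "get the command result".
 * Returns the SCSI status byte, or -1 if read() failed. */
static int sg_finish_read(int sg_fd, struct sg_read_req *req)
{
    ssize_t n = read(sg_fd, &req->hdr, sizeof(req->hdr));
    if (n != (ssize_t)sizeof(req->hdr))
        return -1;
    return req->hdr.status;
}
```

Note that the data buffer never appears as an argument to read(); it is supplied when the command is started, which is why the pair behaves more like an ioctl() than like ordinary file I/O.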