The problem with the asynchronous bsg interface
The SCSI standard is generally thought of as a way to control storage devices, such as disk and tape drives (younger readers, ask a coworker what the latter were). But SCSI can be thought of as a sort of network protocol with more general capabilities, as demonstrated by its use to control tape-changing robots, scanners, optical-disk writers, and more. Drivers for such devices tend to run in user space; to support those drivers, the SCSI generic (SG) interface was created. This interface provides direct access to the SCSI protocol, allowing user-space code to control devices in ways not supported by the in-kernel disk and tape drivers.
The original SG interface was simply called "sg"; like the "sd" driver for SCSI disks and "st" driver for tape drives, its name highlights the SCSI developers' focus on efficiency, in that no letters were wasted. The sg driver implements a low-level device that interfaces directly with the SCSI midlayer. Back in 2004, Jens Axboe posted a new implementation that he called "bsg"; unlike sg, it worked at the level of the block layer, taking advantage of its request-queue infrastructure to manage SCSI operations. It took a while, but bsg was finally merged for the 2.6.23 release in 2007. Since then, both interfaces have coexisted in the kernel. The sg interface retains a number of users; older code makes up some of them, but some users have found that it works better for their needs (as will be revisited below). The bsg interface, instead, is the only way to gain access to some newer SCSI protocol features.
Both devices implement two different APIs to accomplish the same task. The synchronous interface uses ioctl() commands; results of operations are returned when ioctl() returns. There is also an asynchronous interface based on simple read() and write() calls, where one uses write() to issue a command, followed by a later read() to obtain the results. The system calls involved are simple, but the data that is transferred is not: SCSI commands are executed by writing an sg_io_hdr structure to the device. The structure is complex in its own right, but it can also contain pointers to other ranges of user-space memory. Normally, a write() call will not access memory outside of the provided buffer; with these interfaces, instead, a write() call can cause accesses to memory almost anywhere in the address space.
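A minimal sketch of why such a write() reaches beyond its own buffer, assuming the `struct sg_io_hdr` layout from the Linux `<scsi/sg.h>` header. The helpers `fill_inquiry_hdr()` and `inside_hdr()` are hypothetical names invented for this illustration; they only build a header for a SCSI INQUIRY command and check where its embedded pointers land, and they never touch a device.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <scsi/sg.h>

/*
 * Hypothetical helper: describe a 6-byte SCSI INQUIRY command in an
 * sg_io_hdr. The command bytes, the data buffer, and the sense buffer
 * all live OUTSIDE the header; the header only carries pointers to them.
 */
void fill_inquiry_hdr(struct sg_io_hdr *hdr, unsigned char cdb[6],
                      unsigned char *data, unsigned int data_len,
                      unsigned char *sense, unsigned char sense_len)
{
    assert(data_len <= 0xffff);          /* allocation length is 16 bits */

    memset(cdb, 0, 6);
    cdb[0] = 0x12;                       /* INQUIRY opcode */
    cdb[3] = (unsigned char)(data_len >> 8);
    cdb[4] = (unsigned char)(data_len & 0xff);

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';             /* marks the sg v3 interface */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->cmd_len = 6;
    hdr->cmdp = cdb;                     /* pointer #1: the command */
    hdr->dxfer_len = data_len;
    hdr->dxferp = data;                  /* pointer #2: the data buffer */
    hdr->mx_sb_len = sense_len;
    hdr->sbp = sense;                    /* pointer #3: the sense buffer */
    hdr->timeout = 5000;                 /* milliseconds */
}

/* Hypothetical helper: return 1 if p points into the header itself. */
int inside_hdr(const struct sg_io_hdr *hdr, const void *p)
{
    uintptr_t start = (uintptr_t)hdr;
    uintptr_t addr = (uintptr_t)p;

    return addr >= start && addr < start + sizeof(*hdr);
}
```

A caller would pass only `&hdr` and `sizeof(hdr)` to write(), yet `inside_hdr()` returns 0 for `hdr.cmdp`, `hdr.dxferp`, and `hdr.sbp`: the kernel has to follow all three pointers to memory that was never named in the write() call.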
The dangers of this kind of interface have become increasingly clear in recent years. In this case, there have been a few security issues related to indirect memory access through the SG devices. There is also the persistent concern that an attacker may succeed in convincing a setuid program to write the wrong thing to such a device, opening up another vulnerability. Worries about this kind of problem led to the recent rejection of the write-based filesystem mounting API. For SG, though, the interfaces have been established for a long time, so they cannot be withdrawn without breaking applications.
For bsg, though, that may not actually be the case.
In June, Jann Horn tried to harden these interfaces by adding more restrictions on the contexts in which they can be used. Almost as an aside, the changelog noted that, in the case of bsg, arbitrary access to memory can also happen in a release() call, when the file descriptor is being closed. That immediately set off a new round of alarms; even a legitimate user-space memory access can run into trouble at release time, when that memory may no longer be present. The results would be unpredictable — but they would be predictably bad.
There was some discussion about how this problem might be fixed, but it didn't take long for Christoph Hellwig to suggest that the asynchronous side of the bsg interface be removed outright. There are reasons to believe that it is not actually being used in the real world, some of which were described by Douglas Gilbert, the maintainer of the sg interface. Among other things, if two processes are issuing commands to the same device, bsg is unable to keep the responses straight. "Once real world users (needing an async SCSI (or general storage) pass-through) find out about that bsg 'feature', they don't use it". Horn did some searching in the Debian Code Search database and concluded that there were no users that needed to be worried about.
The end result of the discussion is that Axboe has merged Hellwig's patch to remove the asynchronous bsg functionality. The synchronous ioctl()-based API, which does not have the same problems (and which is actually used by applications), will remain. Linus Torvalds has stated that this patch should be applied to the stable kernels as well. So, unless some users of the asynchronous API come forward in the near future, this particular feature will soon disappear.
| Index entries for this article | |
|---|---|
| Kernel | SCSI/Block SCSI generic (bsg) |
Posted Jul 20, 2018 2:25 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Wouldn't it be easier to pack all the data into a self-contained chunk of memory and then send it to the kernel?
Posted Jul 20, 2018 3:02 UTC (Fri)
by willy (subscriber, #9762)
[Link] (3 responses)
Posted Jul 20, 2018 6:05 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jul 20, 2018 14:02 UTC (Fri)
by abatters (✭ supporter ✭, #6932)
[Link]
Posted Jul 20, 2018 7:37 UTC (Fri)
by mjthayer (guest, #39183)
[Link]
Posted Jul 20, 2018 14:08 UTC (Fri)
by abatters (✭ supporter ✭, #6932)
[Link] (1 responses)
Posted Jul 20, 2018 20:18 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jul 20, 2018 3:49 UTC (Fri)
by TheJH (subscriber, #101155)
[Link]
Basically, there are a few places in the kernel (sys_splice() is the most interesting one, but there are others) that call VFS read/write handlers under KERNEL_DS, so that copy_to_user()/copy_from_user() can also access kernel memory; all the copy_to_user()/copy_from_user() calls in VFS read/write context can be treated as essentially equivalent to __copy_from_user()/__copy_to_user(). The buffer/length pair provided to the read/write handler is guaranteed to be safe, but if you either access other random addresses or access the provided buffer beyond the provided length, bad stuff happens.
One particularly annoying thing about this kind of bug is that KASAN doesn't see the bogus access, and pagefaults on kernel addresses don't trigger oopses (because they are treated as userspace faults, so you just get -EFAULT). So if you hit this kind of bug with something like a fuzzer, you're unlikely to actually notice anything. I wonder whether I should try to write a patch to change that... maybe let the pagefault handler ignore uaccess fixups when KERNEL_DS is active, with an exception for __probe_kernel_read/__probe_kernel_write or so?
Another slightly related bug, from 2016: https://bugs.chromium.org/p/project-zero/issues/detail?id... - this one wasn't in VFS context, but in handler code for performance counter overflows, which can trigger in pretty much any context.
Another related bug (not from me): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/... - this one is an overflow beyond the end of the provided buffer in a debugfs read handler; if you are root and fiddle around with splice a bit, you can get this to overflow beyond the pipe page, crashing the machine.
Posted Jul 20, 2018 13:37 UTC (Fri)
by vtl (guest, #121291)
[Link] (2 responses)
In one of my past jobs I extended sg to support AIO. AIO, SCSI and per-IO SCSI sense codes were required for our proprietary datapath running in userspace. So there are users, they are just under the radar.
Posted Jul 20, 2018 14:04 UTC (Fri)
by felixfix (subscriber, #242)
[Link] (1 responses)
Posted Jul 20, 2018 14:22 UTC (Fri)
by vtl (guest, #121291)
[Link]
Posted Jul 20, 2018 14:00 UTC (Fri)
by dullfire (guest, #111432)
[Link] (1 responses)
Posted Jul 20, 2018 15:16 UTC (Fri)
by abatters (✭ supporter ✭, #6932)
[Link]
Here is an example:
Example: SCSI READ command using direct I/O
If using indirect I/O, you could design an interface where the buffer was passed to the read() syscall at command completion (although the sg driver doesn't work like that). But using direct I/O, the kernel needs the buffer when the command is started, so it is passed in the write() syscall. So there is no way to map "start a command" to the original meaning of the write() (or writev()) syscall using direct I/O. It makes more sense to think of it as an ioctl().
SG currently limits data transfers to 256 MB. But that is still a lot.
sg_common_write()
    if (hp->dxfer_len >= SZ_256M)
        return -EINVAL;
VFS read/write access under KERNEL_DS
Think ioctl()
1. Allocate a buffer to hold the data being read.
2. Start the SCSI READ command via a write() syscall to sg, passing the address of the buffer.
3. Wait for command completion; the SCSI HBA DMAs directly to the buffer.
4. Use a read() syscall to sg to get the command result.
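The four steps above can be sketched in C roughly as follows, assuming the sg v3 `struct sg_io_hdr` and `SG_FLAG_DIRECT_IO` from `<scsi/sg.h>`. The function names `build_read10()` and `sg_async_read10()` are hypothetical, error handling is reduced to returning -1, and the caller is assumed to have completed step 1 by allocating a suitably aligned buffer.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <scsi/sg.h>

/* Build a READ(10) CDB: opcode 0x28, big-endian 32-bit LBA and 16-bit count. */
void build_read10(unsigned char cdb[10], uint32_t lba, uint16_t blocks)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x28;
    cdb[2] = (unsigned char)(lba >> 24);
    cdb[3] = (unsigned char)(lba >> 16);
    cdb[4] = (unsigned char)(lba >> 8);
    cdb[5] = (unsigned char)lba;
    cdb[7] = (unsigned char)(blocks >> 8);
    cdb[8] = (unsigned char)blocks;
}

/*
 * Hypothetical helper: issue one READ(10) through an open sg file
 * descriptor using the write()/read() pair. Returns the SCSI status
 * byte (0 means GOOD) or -1 if a system call fails. A real caller
 * would also inspect host_status, driver_status, and the sense data.
 */
int sg_async_read10(int fd, uint32_t lba, uint16_t blocks,
                    void *buf, unsigned int len)
{
    unsigned char cdb[10], sense[32];
    struct sg_io_hdr hdr;
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(buf != NULL);                 /* step 1 is the caller's job */

    build_read10(cdb, lba, blocks);
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_len = len;
    hdr.dxferp = buf;                    /* data buffer handed over at start */
    hdr.mx_sb_len = sizeof(sense);
    hdr.sbp = sense;
    hdr.timeout = 20000;                 /* milliseconds */
    hdr.flags = SG_FLAG_DIRECT_IO;       /* request DMA straight into buf */

    /* Step 2: start the command; the "written" data is only the header. */
    if (write(fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr))
        return -1;

    /* Step 3: wait until a completed command is ready to be collected. */
    if (poll(&pfd, 1, -1) < 0)
        return -1;

    /* Step 4: collect the result; the data itself is already in buf. */
    if (read(fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr))
        return -1;

    return hdr.status;
}
```

Note that `buf` must be supplied in the write() call even though the operation is a read from the device, which is why the comment suggests thinking of the call as an ioctl() rather than as a conventional write().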
