A farewell to set_fs()?
The original role of set_fs() was to set the x86 processor's FS segment register which, in the early days, was used to control the range of virtual addresses that could be accessed by unprivileged code. The kernel has, of course, long since stopped using x86 segments this way. In current kernels, set_fs() works by setting a per-thread variable called addr_limit, but the intended functionality is the same: unprivileged code is only allowed to dereference addresses that are below addr_limit. The kernel's access_ok() function, used to validate user-space accesses throughout the kernel, is a simple check against addr_limit, with the rest of the protection being handled by the processor's memory-management unit.
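As a rough illustration of the check described above, the following user-space C sketch models addr_limit and an access_ok()-style range test. The names model_access_ok(), MODEL_USER_LIMIT, and MODEL_KERNEL_LIMIT, along with the specific limit values, are assumptions made for this example; the kernel's real implementation is architecture-specific and differs in detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for a 64-bit address space; illustrative values only. */
#define MODEL_USER_LIMIT   UINT64_C(0x00007ffffffff000)
#define MODEL_KERNEL_LIMIT UINT64_MAX

/* Stand-in for the kernel's addr_limit. */
static uint64_t model_addr_limit = MODEL_USER_LIMIT;

/*
 * Return true if the range [addr, addr + size) ends at or below the
 * current limit. The comparison is arranged so that addr + size is
 * never computed and therefore cannot wrap around.
 */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    if (size > model_addr_limit)
        return false;
    return addr <= model_addr_limit - size;
}
```

With the limit at its user-space value, a range high in the address space is rejected; raising model_addr_limit to MODEL_KERNEL_LIMIT makes the same range pass, which is the whole effect of a set_fs() call in this model.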
The addr_limit variable, thus, marks the partition between user and kernel space. One might think that such a limit would be fixed, with good reasons for changing it being few and far between. As it happens, there are nearly 400 set_fs() calls in the kernel. Usually, such calls are made to allow code that is normally restricted to accessing user-space memory to operate on a range of kernel memory instead. In the 0.10 kernel, for example, set_fs() was added so that the exec() system call could use the normal filesystem I/O routines to read an executable image into memory that was not yet part of the calling program's address space.
The usual pattern for use of set_fs() looks like this code snippet from the splice() system call:
    old_fs = get_fs();
    set_fs(get_ds());
    res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
    set_fs(old_fs);
This sequence temporarily raises addr_limit so that vfs_readv(), which is normally restricted to reading data into user-space memory, can read data into a kernel-space pipe buffer.
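The save, raise, call, restore sequence can be modeled in user space as follows. This is only a sketch: model_vfs_read() pretends to copy data and merely checks a simulated destination address against the limit, and every model_* name, the limit values, and MODEL_EFAULT are assumptions made for the example rather than kernel interfaces.

```c
#include <stdint.h>

#define MODEL_USER_DS   UINT64_C(0x00007ffffffff000)  /* hypothetical user limit */
#define MODEL_KERNEL_DS UINT64_MAX                    /* effectively "no limit" */
#define MODEL_EFAULT    14

/* Stand-in for addr_limit. */
static uint64_t model_limit = MODEL_USER_DS;

static uint64_t model_get_fs(void)           { return model_limit; }
static void     model_set_fs(uint64_t limit) { model_limit = limit; }

/*
 * Stand-in for a read routine that normally only writes to user space:
 * it refuses any destination range that extends above the current limit.
 * No data is actually copied in this model.
 */
static long model_vfs_read(uint64_t dest, uint64_t len)
{
    if (len > model_limit || dest > model_limit - len)
        return -MODEL_EFAULT;
    return (long)len;
}

/* The set_fs() pattern: raise the limit, call the checked routine, restore. */
static long model_kernel_read(uint64_t kernel_dest, uint64_t len)
{
    uint64_t old_fs = model_get_fs();
    long res;

    model_set_fs(MODEL_KERNEL_DS);
    res = model_vfs_read(kernel_dest, len);
    model_set_fs(old_fs);   /* skipping this line is the bug class at issue */
    return res;
}
```

Calling model_vfs_read() directly with a simulated kernel address fails with -MODEL_EFAULT, while model_kernel_read() succeeds and leaves the limit where it found it.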
In 2010, it was discovered that, if the kernel could be made to oops between the two set_fs() calls, the second call restoring the address limit would never be made; that left kernel data open to being overwritten by user space. Hilarity, as they say, ensued in the form of CVE-2010-4258. That problem is long since fixed. In late 2016, though, an Android bug was reported for an LG touchscreen driver; there was a way to cause that driver to raise addr_limit and return to user space, once again leaving the kernel open to exploitation.
set_fs() is clearly the sort of interface that can easily create severe security bugs. It is also a tempting shortcut that tends to find its way into code of questionable quality such as out-of-tree drivers. In an attempt to harden the system against set_fs() bugs, Thomas Garnier posted a simple patch changing the system-call code so that it would check addr_limit before returning to user space. If it ever finds an incorrect value, it causes a system panic — a severe response, but probably better than allowing an exploit to occur.
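The idea behind such a check can be sketched in a few lines of user-space C. This is not the code from the posted patch; model_syscall() and the two handlers are hypothetical, and where the real patch would panic the system, this model simply reports whether the limit was left in a safe state.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_USER_DS   UINT64_C(0x00007ffffffff000)  /* hypothetical user limit */
#define MODEL_KERNEL_DS UINT64_MAX

/* Stand-in for addr_limit. */
static uint64_t model_limit = MODEL_USER_DS;

typedef long (*model_syscall_fn)(void);

/* Well-behaved handler: raises the limit and restores it before returning. */
static long model_good_handler(void)
{
    uint64_t old = model_limit;

    model_limit = MODEL_KERNEL_DS;
    /* ... work on kernel memory would happen here ... */
    model_limit = old;
    return 0;
}

/* Buggy handler: an error path returns without restoring the limit. */
static long model_buggy_handler(void)
{
    model_limit = MODEL_KERNEL_DS;
    return -1;
}

/*
 * Run a handler, then verify the limit on the way back to "user space".
 * Returns true if it is safe to return; a hardened kernel would panic
 * instead of returning false.
 */
static bool model_syscall(model_syscall_fn fn, long *result)
{
    *result = fn();
    return model_limit == MODEL_USER_DS;
}
```

The check itself is a single comparison, which is why the debate centered on whether even that small cost belongs on every system-call return.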
Nobody disagreed with the goal of this patch, but it ran into a problem that is familiar to security developers: its impact on performance. As Ingo Molnar pointed out, the patch adds several instructions to the system-call path, which is one of the most performance-sensitive parts of the kernel. Adding overhead to system calls will slow down everything the kernel does; when one considers how many Linux machines would be executing this code on every system call, one begins to think that its carbon footprint might rival that of a small country. That is not a cost to be paid lightly.
Molnar suggested adding some sort of static analysis to the kernel build system instead; the standard pattern of set_fs() calls should be amenable to such checking, he said. Kees Cook argued, though, that the problem was not quite so simple and that the cost of the patch was worth paying. "Until we can eliminate set_fs(), we need to add this check", he said.
As it happens, some other developers were already considering removing set_fs(), which has, arguably, hung around for far longer than it really should have. Christoph Hellwig suggested removing all calls outside of the core filesystem and architecture code; Andy Lutomirski went one step further and said they should all go. Without set_fs(), the kernel would be more secure, and the code that checks user-space memory accesses could become that much simpler.
Removing set_fs() depends on replacing those calls with a better alternative, of course. Many set_fs() calls exist to enable I/O to kernel-space memory; it should be possible to replace the bulk of those using the iov_iter interface. Hellwig has already started doing this replacement.
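One way to picture why an iterator-based interface removes the need for set_fs(): the destination carries its own "user or kernel" tag, so the I/O routine can pick the appropriate copy primitive itself rather than depending on a temporarily raised limit. The toy model below is not the kernel's struct iov_iter; struct model_iter, model_copy_to_iter(), and model_copy_to_user() are hypothetical names, and the "user" copy is a plain memcpy() standing in for a checked copy.

```c
#include <stddef.h>
#include <string.h>

/* Tag saying what kind of memory the iterator points at. */
enum model_iter_kind { MODEL_ITER_USER, MODEL_ITER_KERNEL };

struct model_iter {
    enum model_iter_kind kind;
    char *buf;          /* next byte to fill */
    size_t remaining;   /* space left in the destination */
};

/* Stand-in for a checked copy to user space (no checking in this model). */
static size_t model_copy_to_user(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
    return n;
}

/*
 * Copy up to n bytes into the iterator's destination, choosing the copy
 * primitive from the iterator's tag instead of from a global limit.
 * Returns the number of bytes copied.
 */
static size_t model_copy_to_iter(struct model_iter *it, const char *src, size_t n)
{
    if (n > it->remaining)
        n = it->remaining;
    if (it->kind == MODEL_ITER_KERNEL)
        memcpy(it->buf, src, n);
    else
        n = model_copy_to_user(it->buf, src, n);
    it->buf += n;
    it->remaining -= n;
    return n;
}
```

A caller that wants data in a kernel buffer builds a MODEL_ITER_KERNEL iterator and passes it down; nothing about the address limit has to change, so there is nothing to forget to restore.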
Another common pattern occurs in compatibility code where, for example, a structure passed to an ioctl() call from a 32-bit user-space process is converted to the 64-bit equivalent in kernel space, then passed to the regular ioctl() implementation. See do_compat_ioctl() in the media subsystem for an example. In such cases, it's just a matter of splitting that implementation into two pieces: one that fetches the argument from user space, and one that actually performs the desired action.
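The split described above might look roughly like the following user-space sketch. The structure layouts and the model_* functions are invented for illustration and do not correspond to any particular driver; memcpy() stands in for copy_from_user().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical native (64-bit) argument layout. */
struct model_args {
    uint64_t buffer;    /* user pointer, carried as an integer */
    uint32_t length;
};

/* Hypothetical layout of the same argument as seen by a 32-bit process. */
struct model_compat_args {
    uint32_t buffer;
    uint32_t length;
};

/* Piece two: performs the action on an argument already in kernel memory. */
static long model_do_action(const struct model_args *args)
{
    return (long)args->length;   /* placeholder for the real work */
}

/* Piece one, native flavor: fetch the argument, then act on it. */
static long model_ioctl(const void *uarg)
{
    struct model_args args;

    memcpy(&args, uarg, sizeof(args));   /* stands in for copy_from_user() */
    return model_do_action(&args);
}

/*
 * Piece one, compat flavor: fetch the 32-bit layout, widen it, and call
 * the action directly. Because model_do_action() takes a kernel pointer,
 * no address-limit manipulation is needed to reuse it.
 */
static long model_compat_ioctl(const void *uarg)
{
    struct model_compat_args c;
    struct model_args args;

    memcpy(&c, uarg, sizeof(c));
    args.buffer = c.buffer;
    args.length = c.length;
    return model_do_action(&args);
}
```

The set_fs() version would instead have passed the converted, kernel-resident structure to model_ioctl() as though it were a user pointer, which only works with the limit raised.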
Other set_fs() calls will have to be dealt with in other ways.
But it would appear that this ball is now rolling with a certain amount of
momentum. Given the benefits of removing set_fs(), it would not
be surprising to see much of this work merged for 4.13, with the task
completed not long thereafter. It will be the end of a longstanding
traditional kernel-code pattern, but it's doubtful that many developers
will mourn its passing.
Posted May 10, 2017 15:11 UTC (Wed)
by fratti (guest, #105722)
While I get the point that's being made, I doubt the carbon footprint of it would be anywhere close to that of a small country. It's not that I ran the numbers, I'm simply thinking about all the energy inefficient things a lot of people in this day and age do. Next time you open a disposable plastic container such as a pack of gummy bears, think about how far that thing has travelled, and how it's all just going to either end up in a landfill, get incinerated or get washed into the ocean. Doing the ballpark mathematics on this seems difficult due to CPU features such as instruction-level parallelism and various power saving mechanisms.
Posted May 10, 2017 16:22 UTC (Wed)
by NAR (subscriber, #1313)
Posted May 16, 2017 3:35 UTC (Tue)
by vapier (guest, #15768)
Posted Jun 9, 2017 14:43 UTC (Fri)
by mkbosmans (subscriber, #65556)
Assuming:
- 2e9 computers, phones, etc. in the world run Linux
- They are active 10% of the time
- On average they run 3e9 instructions / second
- When active, there are 1e4 syscalls / second
- When active, 30 W of power is used
Then:
- 5 instructions of overhead for each syscall results in 5 * 1e4 / 3e9 * 30 = 0.5 mW of extra power when a computer is active.
- Globally, this means an increased power usage of 0.5mW * 10% * 2e9 = 100 kW
- On a yearly basis, this amounts to 876000 kWh, which equals about 700 tons of carbon emissions.
According to [1], the smallest countries, like Tuvalu, emit 3000 tons per year. So, although not quite there, it is reasonably close.
Posted Jun 9, 2017 23:12 UTC (Fri)
by cjr (guest, #88606)
All the information I could find in a quick search is a bit outdated, but here it is anyways:
This paper has some analysis of mobile phone power consumption:
https://www.usenix.org/legacy/event/atc10/tech/full_paper...
This article from Qualcomm (2013) has some stats based on the above article:
https://developer.qualcomm.com/blog/mobile-apps-and-power...
This article (behind an obtrusive advertisement FYI) says it takes ~1kWh per year to run a mobile phone:
https://www.forbes.com/sites/christopherhelman/2013/09/07...
Chris
Posted Jun 10, 2017 16:09 UTC (Sat)
by mkbosmans (subscriber, #65556)
So the number of desktops, laptops and servers (i.e. computers with a 30W or greater power usage) combined will probably be more than 2 billion right now. Of course not all of them run Linux.
1.6 billion Android phones have been sold to date [2]. Excluding discarded phones and including routers and other embedded devices probably brings the total of active low-power Linux devices into the range of 1-2 billion as well.
Anyway, this whole exercise was more to see whether the power savings could even come close to the power consumption of a small country. Even if a lot of guesses are wrong, I don't think I'm more than a factor 1000x off.
[1] http://www.reuters.com/article/us-computers-statistics-id...
[2] http://www.statisticbrain.com/android-phone-statistics/
Posted Jun 9, 2017 23:37 UTC (Fri)
by cjr (guest, #88606)
Assuming the 30W power consumption was correct, are you sure you have the math right? I get the following:
5 * 1e4 / 3e9 * 30 = 0.00005 = 0.05 mW
Note .05mW rather than .5mW:
0.05mW * 10% * 2e9 = 10 kW
10kW * 24 * 365 = 87600 kWh
After going through this exercise, I also found that I would be surprised if even an average desktop CPU consumed .05mW to execute 5000 instructions. So I looked on Wikipedia [1] to find a Core-i7 940XM, which uses 55W at 2.13GHz. 5000 instructions in that case would take ~.129mW, which means indeed it can take more power than I thought to execute those 5000 instructions. But even if all those 2e9 devices were running that hungry Core-i7, they would still be consuming roughly 4x less power than the original figures.
Thanks for provoking an interesting discussion anyways. Also please correct me if I have gotten anything wrong above.
[1] https://en.wikipedia.org/wiki/List_of_CPU_power_dissipati...
Chris
Posted Jun 10, 2017 16:13 UTC (Sat)
by mkbosmans (subscriber, #65556)
I did this by firing up a browser under strace and looking up a website. That was 45,000 syscalls in 5.5 seconds.
Of course this is a really high estimate, because even when a user is actively using his computer, the average number of syscalls/second will be much lower.
Another big simplification is 1 instruction == 1 cycle. That can certainly be more or less, depending on the specific instructions and other context. But again, this whole exercise was meant as a https://en.wikipedia.org/wiki/Fermi_problem.
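For readers who want to rerun the back-of-the-envelope estimate from this thread with their own guesses, here is a minimal C sketch. It encodes only the thread's stated assumptions (2e9 devices, 10% active time, 3e9 instructions per second, 1e4 system calls per second, 30 W when active, 5 extra instructions per system call) plus the simplification that power scales linearly with instruction count; the struct and function names are invented for the example.

```c
/* Inputs to the estimate; all values are guesses taken from the thread. */
struct fermi_inputs {
    double devices;            /* number of Linux devices */
    double active_fraction;    /* fraction of time a device is active */
    double instr_per_sec;      /* instructions executed per second */
    double syscalls_per_sec;   /* system calls per second while active */
    double active_watts;       /* power draw while active, in watts */
    double extra_instr;        /* added instructions per system call */
};

static const struct fermi_inputs thread_inputs = {
    2e9, 0.10, 3e9, 1e4, 30.0, 5.0
};

/* Extra power per active device, in watts. */
static double extra_watts_per_device(const struct fermi_inputs *in)
{
    double share = in->extra_instr * in->syscalls_per_sec / in->instr_per_sec;

    return share * in->active_watts;
}

/* Extra power summed over all devices, weighted by active time, in watts. */
static double extra_watts_global(const struct fermi_inputs *in)
{
    return extra_watts_per_device(in) * in->active_fraction * in->devices;
}

/* Extra energy per year, in kWh (24 hours a day, 365 days a year). */
static double extra_kwh_per_year(const struct fermi_inputs *in)
{
    return extra_watts_global(in) / 1000.0 * 24.0 * 365.0;
}
```

With thread_inputs, the first step works out to 5e-4 W, or about 0.5 mW per active device, which gives roughly 100 kW globally and about 876,000 kWh per year. Changing any single guess by a factor of ten moves the result by the same factor, which is about the precision one should expect from an estimate of this kind.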