Killing off /dev/kmem
/dev/kmem provides access to the kernel's address space; it can be read from or written to like an ordinary file, or mapped into a process's address space. Needless to say, there are some mild security implications arising from providing that sort of access; even read access to this file is generally enough to expose credentials and allow an attacker to take over a system. As a result, protections on /dev/kmem have always tended to be restrictive, but it remains the sort of open back door into the kernel that makes anybody who worries about security worry even more.
It is a rare Linux system that enables /dev/kmem now. As of the 2.6.26 kernel release in July 2008, the kernel only implements this special file if the CONFIG_DEVKMEM configuration option is enabled. One will have to look long and hard for a distributor that enables this option in 2021; most of them disabled it many years ago. So its disappearance from the kernel is unlikely to create much discomfort.
It's worth noting that Linux systems still support /dev/mem (without the "k"), which once provided similar access to all of the memory in the system. It has long been restricted to I/O memory; system RAM is off limits. The occasional user-space device driver still needs /dev/mem to function, but it's otherwise unused.
One may well wonder why a dangerous interface like /dev/kmem existed in the first place. The kernel goes out of its way to hide its memory from the rest of the system; creating a special file to circumvent that hiding seems like a step in the wrong direction. The answer, in short, is that once there was no other way to get many types of information out of the kernel.
As an example, consider the "load average" numbers printed by tools like top, uptime, or w; they indicate the average length of the CPU run queues over periods of one, five, and 15 minutes. In the distant past, when computers were scarce and it was common to run many tasks on the same machine, jobs that were not time-critical would often consult the load average and defer their work if it was too high. It was the sort of courtesy that was much appreciated by the other users of the machine, of which there may have been dozens.
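The courtesy check itself amounts to a single comparison. What follows is a minimal sketch rather than code from any historical tool; the function name, the per-CPU scaling, and the default threshold are all illustrative assumptions, and the caller is expected to supply the load figure from wherever it can get it.

```python
def should_defer(load_1min, cpu_count=1, max_load_per_cpu=1.0):
    """Return True if a non-urgent job should wait for a quieter moment.

    load_1min is the one-minute load average, however the caller
    obtained it. cpu_count and max_load_per_cpu are hypothetical
    tuning knobs chosen for this illustration.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # Defer only when the run queue is longer than the machine can absorb.
    return load_1min > max_load_per_cpu * cpu_count
```

A batch job might call should_defer() in a loop, sleeping between checks until it returns False.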
But how does one determine the current load average? Unix kernels have maintained those statistics for decades, but they originally kept that information to themselves. User-space code that wanted to know this number would have to do the following:
- Read the symbol table from the executable image of the current kernel to determine the location of the avenrun array.
- Open /dev/kmem and seek to that location.
- Read the avenrun array into a user-space buffer.
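The steps above can be sketched as follows. This is a simplified illustration, not code from that era. It makes several assumptions: the symbol table is a System.map-style text file rather than the kernel executable itself; avenrun is stored the way Linux stores it, as three fixed-point integers with 11 fractional bits; and those integers are 64-bit and little-endian. Both paths are parameters so that the sketch can be pointed at ordinary files. Real kernel addresses on a 64-bit system may not fit in a file offset, which this sketch does not try to handle.

```python
import struct

FSHIFT = 11  # Linux keeps avenrun as fixed-point with 11 fractional bits


def find_symbol(symtab_path, name):
    """Return the address of `name` from a System.map-style text file.

    Each line is assumed to read: <hex address> <type> <symbol>.
    """
    with open(symtab_path) as symtab:
        for line in symtab:
            fields = line.split()
            if len(fields) >= 3 and fields[2] == name:
                return int(fields[0], 16)
    raise KeyError(name)


def read_avenrun(mem_path, address):
    """Seek to `address` in a memory image and decode three load averages.

    Assumes three little-endian 64-bit unsigned integers at that offset.
    """
    with open(mem_path, "rb") as mem:
        mem.seek(address)
        raw = mem.read(3 * 8)
    if len(raw) != 3 * 8:
        raise EOFError("short read while fetching avenrun")
    # Convert each fixed-point value to a floating-point load average.
    return tuple(value / (1 << FSHIFT) for value in struct.unpack("<3Q", raw))
```

Even in this tidy form, the fragility is visible: the reader must know the layout and encoding of a private kernel variable, and any change to either breaks it silently.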
Code from that era can be hard to find, but the truly masochistic can wade through what must be one of the deeper circles of #ifdef hell to find an implementation toward the bottom of this version of getloadavg() from an early GNU make release. In a current Linux system, instead, all that is needed is to read a line from /proc/loadavg.
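For comparison, a minimal sketch of the modern approach is shown here. It relies on the documented /proc/loadavg layout: three load averages, a runnable/total count of scheduling entities, and the most recently assigned PID. The function names and the dictionary keys are choices made for this example.

```python
def parse_loadavg(line):
    """Parse one line in /proc/loadavg format.

    Example input: '0.20 0.18 0.12 1/80 11206'
    """
    fields = line.split()
    one, five, fifteen = (float(field) for field in fields[:3])
    runnable, total = (int(count) for count in fields[3].split("/"))
    return {
        "load": (one, five, fifteen),  # 1-, 5-, and 15-minute averages
        "runnable": runnable,          # currently runnable entities
        "total": total,                # all scheduling entities
        "last_pid": int(fields[4]),    # most recently created PID
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the load-average line from the given file."""
    with open(path) as loadavg:
        return parse_loadavg(loadavg.readline())
```

No symbol lookup, no privileges, and no knowledge of kernel data layouts are required.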
This kind of grubbing around in kernel memory was not limited to the load-average array. Tools with more complex information requirements also had to dig around in /dev/kmem; see, for example, the 2.9BSD implementation of ps. That was just how things were done in those days.
Rooting through the kernel's memory for information about the system has a number of problems beyond the need to implement /dev/kmem. Changes to the kernel could break user space in surprising ways. Multiple reads were often needed to get a complete picture, but that picture could change while the reads were taking place, leading to more surprises. The move away from /dev/kmem and toward well-defined kernel interfaces, such as /proc, sysfs, and various system calls, has cleaned this situation up — and made it possible to disable /dev/kmem.
Now, it seems that /dev/kmem will go away entirely. Linus Torvalds said that he would "happily do this for the next merge window", but he wanted confirmation that distributors are, indeed, not enabling it now. There have been a few responses for specific distributions, but nobody has said that /dev/kmem is still in use anywhere. If there are users of this interface out there, they will want to make their existence known in the near future. Failing that, this back door into kernel memory will soon be removed entirely; but, then, your editor once predicted that it would be removed for 2.6.14, so one never knows.
Posted Apr 5, 2021 16:04 UTC (Mon)
by michaelkjohnson (subscriber, #41438)
[Link] (11 responses)
Not needing /dev/kmem was one of the points when I first created procps lo these many years ago.
Posted Apr 5, 2021 17:31 UTC (Mon)
by josh (subscriber, #17465)
[Link] (10 responses)
Posted Apr 5, 2021 20:30 UTC (Mon)
by michaelkjohnson (subscriber, #41438)
[Link] (9 responses)
The original ps for Linux used /dev/kmem and often had to be rebuilt after a new kernel build if any of the structures had changed. After procps was released, the original ps was sometimes referred to as "kmem ps" to differentiate the two.
The original proc filesystem did not have enough functionality for a full replacement version of ps. I modified it to have all the necessary data for ps and uptime, and then wrote procps as a suite of programs that used the new functionality.
The output was formatted compactly (keep in mind this was when a 386sx16 was a decent machine) and I separated the stat and statm files because of the expense of producing the statm data, then in ps I kept track of whether statm needed to be read in order to produce the output.
As far as I know, my original procps was the original implementation of ps that defaulted to sorted output. All versions of ps that directly read /dev/kmem, as far as I know, listed the data in the order it happened to find it in the kernel memory it was digging through, and I was tired of juggling sort arguments when invoking ps.
I believe my original procps was also among the first, if not the first, to just recognize both BSD and SysV command-line arguments and do what you meant, rather than requiring you to remember which syntax you needed to use on this particular system.
In any case, I don't know whether I was actually the first to introduce either or both of sorting internally and honoring both BSD and SysV arguments, or if one or both were previously invented and I unknowingly reimplemented ideas that already existed.
Posted Apr 6, 2021 11:40 UTC (Tue)
by lyda (subscriber, #7429)
[Link] (4 responses)
I can imagine switching to it saved a massive amount of hassle.
Posted Apr 7, 2021 17:05 UTC (Wed)
by quanstro (guest, #77996)
[Link] (2 responses)
Posted Apr 8, 2021 0:58 UTC (Thu)
by michaelkjohnson (subscriber, #41438)
[Link] (1 responses)
Posted Apr 9, 2021 17:49 UTC (Fri)
by quanstro (guest, #77996)
[Link]
Posted Apr 8, 2021 3:52 UTC (Thu)
by k8to (guest, #15413)
[Link]
Posted Apr 6, 2021 13:09 UTC (Tue)
by acahalan (guest, #151496)
[Link] (3 responses)
Based on that, Michael K Johnson wrote procps.
Somebody else ended up maintaining procps for a while, adding color to the output, but then not much happened.
Michael K Johnson, then at Red Hat, decided (was told?) to maintain procps. He reverted to the pre-color version of the code. He put out a call for help, and Albert Cahalan responded with the suggestion that ps support both BSD and SysV syntax like OSF/1 (later renamed Digital UNIX, then Tru64) and AIX did. This would probably have been 1996, or perhaps 1997. Sorted output was possible, but I don't believe it was the default. It should not be the default, mainly because ps is often used when a system is low on memory but also because partial output is desirable when running on a failing kernel. Sorting with the "O" option appears to have a BSD origin.
Albert Cahalan rewrote procps, initially just to prove that it would be possible to go beyond what OSF/1 and AIX could do, parsing mixed BSD and SysV options. (OSF/1 could only do one or the other, not mixed.) There was then some human conflict relating to "ps -aux" printing a warning. Craig Small over at Debian started using Albert Cahalan's new code. This code definitely did not sort by default.
Michael K Johnson turned over a CVS repository to Rik van Riel and Ingo Molnar, excluding Albert Cahalan without explanation. This was almost certainly in 1997. Albert Cahalan then put a version 3.x.x on SourceForge, where he maintained procps for about a decade. At some point the 2.x.x version was made unreliable, grouping processes as threads if they happened to share various attributes as procps non-atomically looked at them. Albert Cahalan instead enhanced the /proc filesystem by adding the /proc/*/task/ directories and the thread counts. All distributions, including Red Hat, switched over to Albert Cahalan's procps 3.x.x code.
After about a decade maintaining procps, Albert Cahalan became too busy due to a large family. Also, he was demotivated because he found that it was impossible to stop Red Hat from hacking things up in ways that would add bugs and ill-considered compatibility troubles. This led to Craig Small, the Debian package maintainer, joining up with some other people to start the 4.x.x version series elsewhere.
Posted Apr 7, 2021 2:10 UTC (Wed)
by michaelkjohnson (subscriber, #41438)
[Link] (1 responses)
Branko Lankester built kmem ps that came earlier.
In the earliest procps version I found (0.7), I already tried to honor at least SysV arguments e and f for people whose fingers had been trained on SysV, but provided BSD-style output regardless. Your rewrite implemented multiple personalities, which was naturally much better.
It looks like I introduced sorting output in version 0.93 in April 1994, later than I recalled, but before I was aware of you doing work on procps. That version definitely sorts by default, and the "o" option toggles sorting. I also clearly failed to update the man page along with that new feature.
Your memory of the transition is different from mine. I was doing a poor job of being maintainer (slow to apply patches and do new releases) but I certainly didn't "revert" color support, though I suspect it was there in a patch or fork that I hadn't adopted. A fork was the obvious response to an unresponsive maintainer, so no complaints there! I did finally step back formally.
Posted Apr 8, 2021 21:17 UTC (Thu)
by Kamilion (subscriber, #42576)
[Link]
Took me a moment to look at the posters' names and realize they were the very people involved.
Posted Apr 8, 2021 3:58 UTC (Thu)
by k8to (guest, #15413)
[Link]
export I_WANT_A_BROKEN_PS=shutup
Being part of my .profile on Linux for many years. At some point around 2012 I realized I didn't need it anymore.
Posted Apr 5, 2021 17:56 UTC (Mon)
by nickodell (subscriber, #125165)
[Link] (5 responses)
> * Read the symbol table from the executable image of the current kernel to determine the location of the avenrun array.
> * Open /dev/kmem and seek to that location.
> * Read the avenrun array into a user-space buffer.
How did unprivileged code determine the load average? Were unprivileged users allowed to read /dev/kmem in the past?
Posted Apr 5, 2021 18:15 UTC (Mon)
by sjfriedl (✭ supporter ✭, #10111)
[Link]
I think the `ps` command was setuid root; what could go wrong? :-)
Posted Apr 6, 2021 3:56 UTC (Tue)
by markh (subscriber, #33984)
[Link] (1 responses)
Posted Apr 6, 2021 12:08 UTC (Tue)
by michaelkjohnson (subscriber, #41438)
[Link]
Posted Apr 6, 2021 13:53 UTC (Tue)
by foxcrisp (guest, #52781)
[Link] (1 responses)
In a sandboxed container, one doesn't need /dev/kmem, so one could argue that if it's not needed by some apps, it is not needed by any apps.
It is a shame if we lose it, but few apps truly needed it (I used it for dtrace, a while back - but would have to research the proposed alternatives).
Posted Apr 10, 2021 17:45 UTC (Sat)
by quanstro (guest, #77996)
[Link]
Posted Apr 5, 2021 18:34 UTC (Mon)
by ribalda (subscriber, #58945)
[Link] (1 responses)
cat /dev/kmem > core
To debug the current state of the kernel/drivers?
Of course, never enabled in production. But for bringup it is extremely helpful.
Posted Apr 5, 2021 18:43 UTC (Mon)
by josh (subscriber, #17465)
[Link]
Posted Apr 5, 2021 21:05 UTC (Mon)
by luto (guest, #39314)
[Link] (1 responses)
/me runs
Posted Apr 6, 2021 12:39 UTC (Tue)
by tux3 (subscriber, #101245)
[Link]
Posted Apr 5, 2021 21:36 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (3 responses)
Posted Apr 6, 2021 14:03 UTC (Tue)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Apr 7, 2021 14:52 UTC (Wed)
by Paf (subscriber, #91811)
[Link]
Posted May 3, 2021 17:31 UTC (Mon)
by jjulian (guest, #152040)
[Link]
Posted Apr 6, 2021 16:17 UTC (Tue)
by shakkhar (guest, #117388)
[Link] (3 responses)
Can anyone share algorithm / code / doc which exemplifies this practice?
Posted Apr 6, 2021 16:24 UTC (Tue)
by corbet (editor, #1)
[Link]
Posted Apr 6, 2021 16:53 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Apr 7, 2021 2:50 UTC (Wed)
by songmaster (subscriber, #1748)
[Link]
Posted Jan 8, 2022 16:11 UTC (Sat)
by aCOSwt (guest, #156120)
[Link]
AFAIU lilo-24.2 (latest) would also be happy to use it :
if ((fd=open(DEV_DIR "/mem", O_RDONLY)) < 0) return buf_valid=1;
(from the fetch function in probe.c)
this in order to determine misc. hardware (floppies / disks / video ) related information.
OK no harm if it cannot, lilo will just print a warning.
Yes, procps from /proc ps
Linux proc influence
how I remember the history
Well, we remember some different things...
Given the information available in /dev/kmem, setgid kmem is insignificantly different from setuid root. It feels better, but in the end it needs to be treated the same from a security perspective.
group kmem ~= root
gdb vmlinux core
My God. You were *not* kidding about the ifdefs.
Look at sendmail, for example; it will stop processing mail if the system gets too busy.
Wol