Killing off /dev/kmem

By Jonathan Corbet
April 5, 2021

The recent proposal from David Hildenbrand to remove support for the /dev/kmem special file has not sparked a lot of discussion. Perhaps that is because today's youngsters, lacking an understanding of history, may be wondering what that file is in the first place and, thus, be unclear on why it may matter. Chances are that /dev/kmem will not be missed, but in passing it takes away a venerable part of the Unix kernel interface.

/dev/kmem provides access to the kernel's address space; it can be read from or written to like an ordinary file, or mapped into a process's address space. Needless to say, there are some mild security implications arising from providing that sort of access; even read access to this file is generally enough to expose credentials and allow an attacker to take over a system. As a result, protections on /dev/kmem have always tended to be restrictive, but it remains the sort of open back door into the kernel that makes anybody who worries about security worry even more.

It is a rare Linux system that enables /dev/kmem now. As of the 2.6.26 kernel release in July 2008, the kernel only implements this special file if the CONFIG_DEVKMEM configuration option is enabled. One will have to look long and hard for a distributor that enables this option in 2021; most of them disabled it many years ago. So its disappearance from the kernel is unlikely to create much discomfort.

It's worth noting that Linux systems still support /dev/mem (without the "k"), which once provided similar access to all of the memory in the system. It has long been restricted to I/O memory; system RAM is off limits. The occasional user-space device driver still needs /dev/mem to function, but it's otherwise unused.

One may well wonder why a dangerous interface like /dev/kmem existed in the first place. The kernel goes out of its way to hide its memory from the rest of the system; creating a special file to circumvent that hiding seems like a step in the wrong direction. The answer, in short, is that once there was no other way to get many types of information out of the kernel.

As an example, consider the "load average" numbers printed by tools like top, uptime, or w; they indicate the average length of the CPU run queues over periods of one, five, and 15 minutes. In the distant past, when computers were scarce and it was common to run many tasks on the same machine, jobs that were not time-critical would often consult the load average and defer their work if it was too high. It was the sort of courtesy that was much appreciated by the other users of the machine, of which there may have been dozens.

But how does one determine the current load average? Unix kernels have maintained those statistics for decades, but they originally kept that information to themselves. User-space code that wanted to know this number would have to do the following:

* Read the symbol table from the executable image of the current kernel to determine the location of the avenrun array.
* Open /dev/kmem and seek to that location.
* Read the avenrun array into a user-space buffer.
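
The three steps can be sketched roughly as follows. This is an illustration only, not code from any historical tool: it assumes an nm-style symbol table ("address type name"), assumes avenrun holds three little-endian 64-bit fixed-point values with 11 fractional bits (older kernels used other layouts), and uses an in-memory buffer as a stand-in for /dev/kmem. The helper names find_symbol() and read_avenrun() are made up for the example.

```python
import io
import struct

FSHIFT = 11  # assumed number of fractional bits in the fixed-point values


def find_symbol(symtab_text, name):
    """Return the address of `name` from nm-style 'address type name' lines."""
    for line in symtab_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == name:
            return int(parts[0], 16)
    raise KeyError(name)


def read_avenrun(symtab_text, mem):
    """Seek to avenrun in `mem` and decode three 64-bit fixed-point values."""
    mem.seek(find_symbol(symtab_text, "avenrun"))
    raw = struct.unpack("<3Q", mem.read(24))
    return tuple(value / (1 << FSHIFT) for value in raw)


# Stand-in for kernel memory; a real tool would have opened /dev/kmem instead.
symtab = "0000000000000040 D avenrun\n0000000000000100 T schedule\n"
fake_kmem = io.BytesIO(bytes(0x40) + struct.pack("<3Q", 1024, 2048, 3072) + bytes(64))
print(read_avenrun(symtab, fake_kmem))
```

Note that the symbol address and the data layout both have to match the running kernel exactly, which is why tools of this kind tended to break when the kernel changed.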

Code from that era can be hard to find, but the truly masochistic can wade through what must be one of the deeper circles of #ifdef hell to find an implementation toward the bottom of this version of getloadavg() from an early GNU make release. In a current Linux system, instead, all that is needed is to read a line from /proc/loadavg.
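
For comparison, a minimal sketch of the /proc approach is shown here. The function names are invented for the example, and the sample line is only illustrative; the first three whitespace-separated fields of /proc/loadavg are the one-, five-, and 15-minute averages.

```python
import os


def parse_loadavg(line):
    """Parse the first three fields of a /proc/loadavg line as floats."""
    one, five, fifteen = line.split()[:3]
    return float(one), float(five), float(fifteen)


def current_loadavg(path="/proc/loadavg"):
    """Read the load averages from /proc, falling back to os.getloadavg()."""
    try:
        with open(path) as f:
            return parse_loadavg(f.readline())
    except OSError:
        return os.getloadavg()


print(parse_loadavg("0.20 0.18 0.12 1/80 11206"))
```

No symbol table, no privileged device node, and no knowledge of kernel data layout is needed.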

This kind of grubbing around in kernel memory was not limited to the load-average array. Tools with more complex information requirements also had to dig around in /dev/kmem; see, for example, the 2.9BSD implementation of ps. That was just how things were done in those days.

Rooting through the kernel's memory for information about the system has a number of problems beyond the need to implement /dev/kmem. Changes to the kernel could break user space in surprising ways. Multiple reads were often needed to get a complete picture, but that picture could change while the reads were taking place, leading to more surprises. The move away from /dev/kmem and toward well-defined kernel interfaces, such as /proc, sysfs, and various system calls, has cleaned this situation up — and made it possible to disable /dev/kmem.

Now, it seems that /dev/kmem will go away entirely. Linus Torvalds said that he would "happily do this for the next merge window", but he wanted confirmation that distributors are, indeed, not enabling it now. There have been a few responses for specific distributions, but nobody has said that /dev/kmem is still in use anywhere. If there are users of this interface out there, they will want to make their existence known in the near future. Failing that, this back door into kernel memory will soon be removed entirely — but, then, your editor once predicted that it would be removed for 2.6.14, so one never knows.

Killing off /dev/kmem

Posted Apr 5, 2021 16:04 UTC (Mon) by michaelkjohnson (subscriber, #41438) [Link] (11 responses)

Well, that took a while.

Not needing /dev/kmem was one of the points when I first created procps lo these many years ago.

Killing off /dev/kmem

Posted Apr 5, 2021 17:31 UTC (Mon) by josh (subscriber, #17465) [Link] (10 responses)

Did you call it "procps" because it's a ps that uses /proc rather than other methods?

Yes, procps from /proc ps

Posted Apr 5, 2021 20:30 UTC (Mon) by michaelkjohnson (subscriber, #41438) [Link] (9 responses)

Yes, that was the source of the name.

The original ps for Linux, often requiring rebuild after a new kernel build, if any of the structures had changed, used /dev/kmem. And after procps was released, the original ps was sometimes referred to as "kmem ps" to differentiate.

The original proc filesystem did not have enough functionality for a full replacement version of ps. I modified it to have all the necessary data for ps and uptime, and then wrote procps as a suite of programs that used the new functionality.

The output was formatted compactly (keep in mind this was when a 386sx16 was a decent machine) and I separated the stat and statm files because of the expense of producing the statm data, then in ps I kept track of whether statm needed to be read in order to produce the output.
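
As a rough illustration of the stat/statm split described here, the sketch below reads stat unconditionally and statm only when a requested column needs it. It is not the procps code: it assumes the present-day field layout of /proc/<pid>/stat and /proc/<pid>/statm (which may differ from the early-1990s files), and the column names and function names are hypothetical.

```python
STATM_COLUMNS = {"size", "resident", "shared"}  # hypothetical column names


def parse_stat(line):
    """Split a /proc/<pid>/stat line; comm is parenthesised and may contain spaces."""
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    rest = line[close_paren + 1:].split()
    return {
        "pid": int(line[:open_paren]),
        "comm": line[open_paren + 1:close_paren],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_statm(line):
    """Decode the first three /proc/<pid>/statm fields (counts of pages)."""
    size, resident, shared = (int(x) for x in line.split()[:3])
    return {"size": size, "resident": resident, "shared": shared}


def needs_statm(columns):
    """True only when a requested column comes from statm."""
    return bool(STATM_COLUMNS & set(columns))


def process_info(pid, columns, proc_root="/proc"):
    """Read stat always, and statm only when the requested columns need it."""
    with open(f"{proc_root}/{pid}/stat") as f:
        info = parse_stat(f.read())
    if needs_statm(columns):
        with open(f"{proc_root}/{pid}/statm") as f:
            info.update(parse_statm(f.read()))
    return info
```

Searching for the last closing parenthesis, rather than splitting on whitespace, keeps the parse correct when a command name itself contains spaces or parentheses.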

As far as I know, my original procps was the first implementation of ps that defaulted to sorted output. All versions of ps that directly read /dev/kmem, as far as I know, listed the data in the order it happened to find it in the kernel memory it was digging through, and I was tired of juggling sort arguments when invoking ps.

I believe my original procps was also among the first, if not the first, to just recognize both BSD and SysV command line arguments and do what you meant, rather than requiring you to remember which syntax you needed to use on this particular system.

In any case, I don't know whether I was actually the first to introduce either or both of sorting internally and honoring both BSD and SysV arguments, or if one or both were previously invented and I unknowingly reimplemented ideas that already existed.

Yes, procps from /proc ps

Posted Apr 6, 2021 11:40 UTC (Tue) by lyda (subscriber, #7429) [Link] (4 responses)

Can't remember exactly why, but I had to look for a process in C on SCO once and discovered that the two ways to do it were to pull it out of /dev/kmem or to parse ps output. Massive gaping hole in libc in my mind. The /proc fs is a good unix-y solution.

I can imagine switching to it saved a massive amount of hassle.

Yes, procps from /proc ps

Posted Apr 7, 2021 17:05 UTC (Wed) by quanstro (guest, #77996) [Link] (2 responses)

procfs goes back to at least 8ed in 1984, and was included (and expanded) in plan 9 from the beginning.

Linux proc influence

Posted Apr 8, 2021 0:58 UTC (Thu) by michaelkjohnson (subscriber, #41438) [Link] (1 responses)

My confident recollection is that the initial implementation of the Linux proc filesystem was explicitly inspired by plan 9's proc filesystem.

Linux proc influence

Posted Apr 9, 2021 17:49 UTC (Fri) by quanstro (guest, #77996) [Link]

i recall that as well. at the time, nobody had access to plan 9. so linux /proc was inspired by _papers_ about plan9's /proc. one of the things linux missed---perhaps because the plan 9 implementation was not visible---was a dirt cheap way to have lots of little single-job file systems. so linux /proc acquired a few warts. there is more to implement in a linux file system than 9p. but perhaps the 9p subset is good enough for most cases.

Yes, procps from /proc ps

Posted Apr 8, 2021 3:52 UTC (Thu) by k8to (guest, #15413) [Link]

Given variations in proc, ps is still the portable solution, sadly.

how I remember the history

Posted Apr 6, 2021 13:09 UTC (Tue) by acahalan (guest, #151496) [Link] (3 responses)

Somebody wrote the original ps.

Based on that, Michael K Johnson wrote the procps.

Somebody else ended up maintaining procps for a while, adding color to the output, but then not much happened.

Michael K Johnson, then at Red Hat, decided (was told?) to maintain procps. He reverted to the pre-color version of the code. He put out a call for help, and Albert Cahalan responded with the suggestion that ps support both BSD and SysV syntax like OSF/1 (later renamed Digital UNIX, then Tru64) and AIX did. This would have been 1996 probably, or perhaps 1997. Sorted output was possible, but I don't believe it was the default. It should not be the default, mainly because ps is often used when a system is low on memory but also because partial output is desirable when running on a failing kernel. Sorting with the "O" option appears to have a BSD origin.

Albert Cahalan rewrote procps, initially just to prove that it would be possible to go beyond what OSF/1 and AIX could do, parsing mixed BSD and SysV options. (OSF/1 could only do one or the other, not mixed) There was then some human conflict relating to "ps -aux" printing a warning. Craig Small over at Debian started using Albert Cahalan's new code. This code definitely did not sort by default.

Michael K Johnson turned over a CVS repository to Rik van Riel and Ingo Molnar, excluding Albert Cahalan without explanation. This was almost certainly in 1997. Albert Cahalan then put a version 3.x.x on sourceforge, where he maintained procps for about a decade. At some point the 2.x.x version was made unreliable, grouping processes as threads if they happened to share various attributes as procps non-atomically looked at them. Albert Cahalan instead enhanced the /proc filesystem by adding the /proc/*/task/ directories and the thread counts. All distributions, including Red Hat, switched over to Albert Cahalan's procps 3.x.x code.
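
To show what the /proc/*/task/ directories and thread counts make possible, here is a small sketch, not taken from procps. It assumes the current layout, where each thread appears as a numeric directory under /proc/<pid>/task and the count appears on the "Threads:" line of /proc/<pid>/status; the function names and the proc_root parameter are made up for the example.

```python
import os


def thread_ids(pid, proc_root="/proc"):
    """Return the sorted thread IDs listed under <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def thread_count(pid, proc_root="/proc"):
    """Read the 'Threads:' line from <proc_root>/<pid>/status."""
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise ValueError("no Threads: line found")
```

With the kernel reporting thread membership directly, a tool no longer has to guess which processes are threads of one another by comparing their attributes.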

After about a decade maintaining procps, Albert Cahalan became too busy due to a large family. Also, he was demotivated because he found that it was impossible to stop Red Hat from hacking things up in ways that would add bugs and ill-considered compatibility troubles. This led to Craig Small, the Debian package maintainer, joining up with some other people to start the 4.x.x version series elsewhere.

Well, we remember some different things...

Posted Apr 7, 2021 2:10 UTC (Wed) by michaelkjohnson (subscriber, #41438) [Link] (1 responses)

I found some old tarballs to refresh my memory. ☺

Branko Lankester built kmem ps that came earlier.

In the earliest procps version I found (0.7), I already tried to honor at least SysV arguments e and f for people whose fingers had been trained on SysV, but provided BSD-style output regardless. Your rewrite implemented multiple personalities, which was naturally much better.

It looks like I introduced sorting output in version 0.93 in April 1994, later than I recalled, but before I was aware of you doing work on procps. That version definitely sorts by default, and the "o" option toggles sorting. I also clearly failed to update the man page along with that new feature.

Your memory of the transition is different from mine. I was doing a poor job of being maintainer (slow to apply patches and do new releases) but I certainly didn't "revert" color support, though I suspect it was there in a patch or fork that I hadn't adopted. A fork was the obvious response to an unresponsive maintainer, so no complaints there! I did finally step back formally.

Well, we remember some different things...

Posted Apr 8, 2021 21:17 UTC (Thu) by Kamilion (subscriber, #42576) [Link]

Wow, big thanks to Albert Cahalan and Michael K. Johnson for showing up and taking the time to explain to us johnny-come-latelys.

Took me a moment to look at the posters' names and realize they were the very people involved.

how I remember the history

Posted Apr 8, 2021 3:58 UTC (Thu) by k8to (guest, #15413) [Link]

The warning on ps -aux led to

export I_WANT_A_BROKEN_PS=shutup

Being part of my .profile on Linux for many years. At some point around 2012 I realized I didn't need it anymore.

Killing off /dev/kmem

Posted Apr 5, 2021 17:56 UTC (Mon) by nickodell (subscriber, #125165) [Link] (5 responses)

> But how does one determine the current load average? Unix kernels have maintained those statistics for decades, but they originally kept that information to themselves. User-space code that wanted to know this number would have to do the following:

> * Read the symbol table from the executable image of the current kernel to determine the location of the avenrun array.
> * Open /dev/kmem and seek to that location.
> * Read the avenrun array into a user-space buffer.

How did unprivileged code determine the load average? Were unprivileged users allowed to read /dev/kmem in the past?

Killing off /dev/kmem

Posted Apr 5, 2021 18:15 UTC (Mon) by sjfriedl (✭ supporter ✭, #10111) [Link]

> How did unprivileged code determine the load average?

I think the `ps` command was setuid root; what could go wrong? :-)

Killing off /dev/kmem

Posted Apr 6, 2021 3:56 UTC (Tue) by markh (subscriber, #33984) [Link] (1 responses)

/dev/kmem was readable by group kmem, so programs requiring access to it could be made setgid kmem. (That is still the case for /dev/mem and /dev/port.) It's still a security concern, but better than requiring setuid root.
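
A quick way to inspect this kind of arrangement on a given system is sketched below. The helper names are invented for the example; it only reports what os.stat() returns for whatever path is passed in, such as a device node or a setgid binary.

```python
import os
import stat


def access_summary(path):
    """Report owner/group IDs and whether group or others may read `path`."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "group_readable": bool(st.st_mode & stat.S_IRGRP),
        "world_readable": bool(st.st_mode & stat.S_IROTH),
    }


def is_setgid(path):
    """True when the file at `path` has the setgid bit set."""
    return bool(os.stat(path).st_mode & stat.S_ISGID)
```

For instance, access_summary("/dev/mem") would show which group owns the node and whether that group can read it, and is_setgid() could be pointed at a binary to see whether it runs with its group's privileges.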

group kmem ~= root

Posted Apr 6, 2021 12:08 UTC (Tue) by michaelkjohnson (subscriber, #41438) [Link]

Given the information available in /dev/kmem, setgid kmem is insignificantly different from setuid root. It feels better, but in the end it needs to be treated the same from a security perspective.

Killing off /dev/kmem

Posted Apr 6, 2021 13:53 UTC (Tue) by foxcrisp (guest, #52781) [Link] (1 responses)

Earlier unixes required apps to read from /dev/kmem, and know the format of the kernel data structures and location. Linux changed all of that, by exposing most things via /proc - mostly simple text strings. In a security-based world, /dev/mem and /dev/kmem are just holes to allow access to any part of memory. Whilst the early implementations used unix group permissions, that just meant delegating the security mechanisms to the group mechanisms. That simply opens up the surface area (either get yourself root, or get access to the relevant group for reading /dev/kmem).

In a sandboxed container, one doesn't need /dev/kmem, so, one could argue that if it's not needed by some apps, it is not needed by any apps.

It is a shame if we lose it, but few apps truly needed it (I used it for dtrace, a while back - but would have to research the proposed alternatives).

Killing off /dev/kmem

Posted Apr 10, 2021 17:45 UTC (Sat) by quanstro (guest, #77996) [Link]

i suppose if you consider mainstream *nix variants, linux may have been first. however, 8th edition unix introduced the concept of /proc. and iirc, linux was inspired by plan 9, not 8th edition. in plan 9 there is extra expressiveness. for example, /proc allows inspection of another machine's processes without endian/word size concerns via mount and bind. this is how stats(1) works. ioctl is similarly not included.

Killing off /dev/kmem

Posted Apr 5, 2021 18:34 UTC (Mon) by ribalda (subscriber, #58945) [Link] (1 responses)

Am I the only one that was using:

cat /dev/kmem > core
gdb vmlinux core

To debug the current state of the kernel/drivers?

Of course, never enabled in production. But for bringup it is extremely helpful.

Killing off /dev/kmem

Posted Apr 5, 2021 18:43 UTC (Mon) by josh (subscriber, #17465) [Link]

You can do that with /proc/kcore now.

Killing off /dev/kmem

Posted Apr 5, 2021 21:05 UTC (Mon) by luto (guest, #39314) [Link] (1 responses)

Fortunately, eBPF can replace these legacy /dev/kmem uses.

/me runs

Killing off /dev/kmem

Posted Apr 6, 2021 12:39 UTC (Tue) by tux3 (subscriber, #101245) [Link]

eBPF can feel pretty limiting sometimes. Thankfully there are Systemtap Guru scripts for all my "poke at kernel structures without manually writing a module" needs :)

Killing off /dev/kmem

Posted Apr 5, 2021 21:36 UTC (Mon) by Paf (subscriber, #91811) [Link] (3 responses)

“but the truly masochistic can wade through what must be one of the deeper circles of #ifdef hell”
My God. You were *not* kidding about the ifdefs.

Killing off /dev/kmem

Posted Apr 6, 2021 14:03 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

Hah, that's nothing! The original home of nightmares is xterm, and while there is no easily web-accessible canonical xterm source (only tarballs), there are random github mirrors I can point at. Look at main() here: https://github.com/joejulian/xterm/blob/master/main.c. Look at the wonderful tangle in Tinput() here: https://github.com/joejulian/xterm/blob/master/Tekproc.c. Then fear, for this is code people are still using. (Though it could be worse still. It could be procmail.)

Killing off /dev/kmem

Posted Apr 7, 2021 14:52 UTC (Wed) by Paf (subscriber, #91811) [Link]

Oh god, that is .... wow. 😂

Killing off /dev/kmem

Posted May 3, 2021 17:31 UTC (Mon) by jjulian (guest, #152040) [Link]

* Disclaimer: I have nothing to do with any of this! :)

Killing off /dev/kmem

Posted Apr 6, 2021 16:17 UTC (Tue) by shakkhar (guest, #117388) [Link] (3 responses)

> In the distant past, when computers were scarce and it was common to run many tasks on the same machine, jobs that were not time-critical would often consult the load average and defer their work if it was too high.

Can anyone share algorithm / code / doc which exemplifies this practice?

Killing off /dev/kmem

Posted Apr 6, 2021 16:24 UTC (Tue) by corbet (editor, #1) [Link]

Look at sendmail, for example; it will stop processing mail if the system gets too busy.
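
As a rough sketch of the general pattern (not sendmail's actual logic), a non-urgent job might check the one-minute load average and wait until the machine is quieter. The threshold, the polling interval, and the function names here are all made up for the example.

```python
import os
import time


def should_defer(load_1min, cpu_count, per_cpu_limit=1.5):
    """Defer when the one-minute load exceeds a per-CPU threshold."""
    return load_1min > per_cpu_limit * cpu_count


def run_when_quiet(job, poll_seconds=60, max_polls=10, get_load=None, sleep=time.sleep):
    """Run `job` once the load drops, or give up after max_polls checks."""
    get_load = get_load or os.getloadavg
    cpus = os.cpu_count() or 1
    for _ in range(max_polls):
        if not should_defer(get_load()[0], cpus):
            return job()
        sleep(poll_seconds)
    return None
```

The get_load and sleep parameters exist only so the loop can be exercised without waiting on a real machine; by default the load comes from os.getloadavg().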

Killing off /dev/kmem

Posted Apr 6, 2021 16:53 UTC (Tue) by Wol (subscriber, #4433) [Link]

Not Unix, but I worked on minis, and I set up a bunch of work queues with very strict limits, so if you wanted to fire off a load of jobs you could hammer the system without impacting everyone. My favourite was the "quick" queue, which ran at highest priority, but had a wall-clock-limit of 30 seconds. If it over-ran that it just got killed.

Cheers,
Wol

Killing off /dev/kmem

Posted Apr 7, 2021 2:50 UTC (Wed) by songmaster (subscriber, #1748) [Link]

This isn’t an example of reading the loadavg values, but for some code that goes delving into the internals of the OS to get data out of it I recommend looking at the legacy branch of the lsof program at https://github.com/lsof-org/lsof/tree/legacy. It supported many versions of Unix, and had to find and extract many different pieces of data to generate its output. The 00PORTING file at https://github.com/lsof-org/lsof/blob/legacy/00PORTING mentions briefly how it actually did that, and even takes a dig at “some down-sides to the Linux /proc-based lsof.”

Killing off /dev/kmem

Posted Jan 8, 2022 16:11 UTC (Sat) by aCOSwt (guest, #156120) [Link]

"The occasional user-space device driver still needs /dev/mem to function, but it's otherwise unused."

AFAIU lilo-24.2 (latest) would also be happy to use it :

if ((fd=open(DEV_DIR "/mem", O_RDONLY)) < 0) return buf_valid=1;

(from the fetch function in probe.c)

This is in order to determine miscellaneous hardware-related information (floppies, disks, video).

OK no harm if it cannot, lilo will just print a warning.