
Tracking down a "runaway loop"

By Jake Edge
December 10, 2008

The Linux boot process, at least as provided by distributions, depends on help from user space, with drivers being loaded as required from the initial filesystem (initramfs/initrd). Loading drivers requires using tools built into initramfs and if those tools break, the kernel won't boot. But when a working kernel configuration and initramfs are used with a new kernel, the result is expected to be a kernel that successfully boots. When that doesn't happen, bugs are filed regarding kernel regressions but, as a recent example shows, the actual problem may be elsewhere.

The original report was made in late October, but no progress was made until Evgeniy Polyakov saw it again in early December. The symptom was a kernel that hangs after printing:

    request_module: runaway loop modprobe char-major-5-1

four times on the console. Since nothing in user space (the initramfs) or in the kernel configuration had changed, the evidence seemed to point clearly at the kernel itself.

It turns out that the "runaway loop" message is meant to indicate that the request_module() function has been invoked recursively. So, in an effort to load the driver for the character device with major/minor numbers 5/1 (which corresponds to /dev/console), request_module() was invoked again. The code in kernel/kmod.c:

        if (atomic_read(&kmod_concurrent) > max_modprobes) {
                /* We may be blaming an innocent here, but unlikely */
                if (kmod_loop_msg++ < 5)
                        printk(KERN_ERR
                               "request_module: runaway loop modprobe %s\n",
                               module_name);
                atomic_dec(&kmod_concurrent);
                return -ENOMEM;
        }

limits how often that message is printed (the kmod_loop_msg++ < 5 test permits at most five), but the invoker should recognize the -ENOMEM return and handle it appropriately.

The root cause was that something in the kernel was trying to access /dev/console before that device was registered in the kernel. This led the kernel to try to load a module to handle /dev/console, which will fail. Because of the failure, something in the user-space modprobe then tries to access /dev/console, presumably to output an error message, which repeats the kernel module loading process. And so on. Once that recursion exceeds the max_modprobes limit, request_module() will produce the runaway loop message and return -ENOMEM, which should put a stop to the whole process.
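The cycle described above can be mimicked in a few lines of user-space C. This is only a sketch: every sim_* name is hypothetical, MAX_MODPROBES is an arbitrarily small stand-in for the kernel's max_modprobes, and the simulated helper is assumed to be well behaved, making one attempt to reach the console and then giving up.

```c
#include <errno.h>
#include <stdio.h>

#define MAX_MODPROBES 3      /* hypothetical small limit for the demo */

static int kmod_concurrent;  /* simulated helpers currently in flight */
static int kmod_loop_msg;    /* times the warning check has fired */
static int helper_runs;      /* total simulated modprobe invocations */

static int sim_open_console(void);

/* Simulated modprobe: it never finds a module, then makes ONE attempt
 * to report that on the console and gives up if the attempt fails. */
static void sim_modprobe(const char *module_name)
{
    (void)module_name;
    helper_runs++;
    (void)sim_open_console();
}

/* Mirrors the shape of the kmod.c check quoted earlier. */
static int sim_request_module(const char *module_name)
{
    kmod_concurrent++;
    if (kmod_concurrent > MAX_MODPROBES) {
        if (kmod_loop_msg++ < 5)
            fprintf(stderr,
                    "request_module: runaway loop modprobe %s\n",
                    module_name);
        kmod_concurrent--;
        return -ENOMEM;
    }
    sim_modprobe(module_name);
    kmod_concurrent--;
    return 0;
}

/* Simulated open of an unregistered /dev/console (char device 5:1).
 * No driver ever appears here, so the open always fails; the choice
 * of -ENXIO for that case is an assumption of this sketch. */
static int sim_open_console(void)
{
    char alias[32];
    int ret;

    snprintf(alias, sizeof(alias), "char-major-%d-%d", 5, 1);
    ret = sim_request_module(alias);
    return ret < 0 ? ret : -ENXIO;
}
```

A single call to sim_open_console() nests until the limit is hit, prints one warning, and unwinds after three helper runs. The hang reported on Debian would correspond to a helper that calls sim_open_console() again after the failure rather than giving up.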

In an acrimonious thread (and kernel bug report), Alan Cox, Kay Sievers, and Polyakov tried to determine where the problem came from and what to do about it. It didn't help matters that they were using initramfs images from different distributions, so they saw different behavior. Polyakov and Sievers were using Debian user space while Cox was using Fedora. Something in the Debian version continued to try to open /dev/console even after getting ENOMEM. This leads to an infinite loop, and thus a kernel hang.

Sievers eventually tracked it to the kernel cryptographic API:

It is caused by: "modprobe cryptomgr" called from swapper[1]

This modprobe process does try to log an error, accesses /dev/console, which is not initialized in the kernel at that time, and the kernel module loader tries the load a module to support dev_t 5:1, which again runs modprobe, and ...

Setting CONFIG_CRYPTO_MANAGER=y makes it disappear.

It turns out that the crypto layer attempts to load the cryptomgr module as part of its algorithm testing infrastructure. If cryptomgr fails to load, the algorithm registration code can continue without it. It is optional, but modprobe wants to put out a message when it fails to load it, which leads to the runaway loop. As Herbert Xu points out, though, the problem is not crypto-specific at all:

In any case the loop itself does not involve any crypto components so I don't think making changes in the crypto layer is going to make this go away forever as anyone calling request_module early enough will get into this loop.

It is this potential pitfall that Sievers and Polyakov would like to see removed. In general, user-space programs are not required to be concerned with the availability of /dev/console—except when they are run from early kernel initialization. But Cox points out that user-space helpers must concern themselves with avoiding loops because there are multiple possible ways to cause that to happen. As an example, he notes that if UNIX-domain sockets (AF_UNIX) are in a module and syslog() is called before the module is loaded, a similar loop will occur.
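As a rough illustration of the defensive behavior Cox is arguing for, a helper's logging path can make a single attempt and then drop the message. The function below is hypothetical and is not taken from modprobe or any distribution's initramfs; it only shows the shape of a "try once, never retry" policy.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper-side logging routine. It makes exactly one
 * attempt to write msg to the device at path and reports failure to
 * the caller instead of retrying.
 * Returns 0 on success or a negative errno value.
 */
static int log_best_effort(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_NOCTTY | O_NONBLOCK);
    int err;

    if (fd < 0)
        return -errno;   /* no retry: the caller just carries on */

    err = (write(fd, msg, strlen(msg)) < 0) ? -errno : 0;
    close(fd);
    return err;
}
```

A helper built this way can still trigger one nested module request when it opens a device that has no driver, but it cannot keep the cycle going by itself, which leaves the kernel's recursion limit able to end it.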

In an effort to "step back" from the arguments that were going back and forth, Ted Ts'o offers his analysis of the problem along with a suggested course of action:

There is a dispute about whether it is looping forever, or whether it should be getting caught by kernel/kmod.c's modprobe recursion detector. Alan has checked the recursion detector and reports that it works just fine; Evgeniy and Kay are claiming that it in fact loops forever, and the recursion detector is not working.

[...] So I would think the best thing to do is to figure out what Debian's initrd is doing that is evading the recursion detection. Fixing that is going to make things much more robust.

Clearly the recursion detector is working to some extent, or the runaway loop messages would not be seen, but on Debian, at least, that detection doesn't stop the problem. Ts'o's theory is that something outside of the directly invoked helper is actually the culprit: "I'm guessing why it isn't working given Debian's initrd setup is that whatever is ultimately opening /dev/console isn't being called until after the helper script has exited." That seems worth tracking down, as Ts'o points out in a later message:

It would be good to make sure we understand what the root causes for while the modprobe recursion detector is apparently not triggering, since it could be that Debian's initrd might cause some other uncaught recursion loop if we don't drive this problem determination to root cause.

The exact cause of the problem and why Debian and Fedora behave differently is still not known. Digging into Debian's initrd to figure that out, as Ts'o suggests, is clearly the right starting point. That answer will likely lead to sensible fixes, either in user space or the kernel—possibly both. Bickering about where and how to fix the problem before it is fully understood seems counter-productive at best.



Notification of a "runaway" other than /dev/console

Posted Dec 11, 2008 3:14 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

How about playing the Bon Jovi song "Runaway" if /dev/console isn't available? Just a thought...

Don't mind me, I'm just in a silly mood tonight for some odd reason... ;)

Notification of a "runaway" other than /dev/console

Posted Dec 11, 2008 10:54 UTC (Thu) by tyhik (guest, #14747) [Link]

In such a miserable "runaway" situation hearing that great Runaway song would be a big relief, I am sure. Be it played back even through /dev/console if nothing else works.

Mind me neither.

Notification of a "runaway" other than /dev/console

Posted Dec 11, 2008 11:30 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

Developers are on track for this: http://kernelslacker.livejournal.com/132712.html
:-)

Differences between distribution initramfs scripts can cause interesting problems

Posted Dec 11, 2008 11:31 UTC (Thu) by njd27 (subscriber, #5770) [Link]

What is needed is a reference set of scripts, as described in this article:

http://lwn.net/Articles/298593/

Differences between distribution initramfs scripts can cause interesting problems

Posted Dec 12, 2008 6:11 UTC (Fri) by bronson (subscriber, #4806) [Link]

Yes, that's true! We could use more bickering. :)

Tracking down a "runaway loop"

Posted Jan 7, 2011 19:54 UTC (Fri) by daniel (subscriber, #3181) [Link]

"Bickering about where and how to fix the problem before it is fully understood seems counter-productive at best."

Hi Jon,

This problem still exists more than two years out. In all probability the bickering approach would have been more productive.

Tracking down a "runaway loop"

Posted Jun 3, 2011 17:32 UTC (Fri) by cfriedt (guest, #75420) [Link]

Just in case somebody runs into a similar error like the one Vim[1] (& more recently I) ran into

request_module: runaway loop modprobe char-major-0-20481

This is actually the same problem as the missing '/dev/console' but the device files get mangled somehow on the nfs server system.

Note that /dev/console is normally (maj,min) 5,1, but in this case, the kernel is pressed to search for 0,20481 which is notably outside the usual numeric range for device nodes.

I discovered this problem while using VMWare to run a Linux instance with an NFS root hosted from Mac OS X, while Vim discovered this problem doing the same but while running Windows.

It turns out that, if you run 'mknod -m 600 console c 5 1; mknod -m 666 null c 1 3', and then read the result back from within the VMWare instance of Linux, it's identified as follows:

# ls -la console null
crw------- 1 root root 0, 20481 Jun 3 2011 console
crw-rw-rw- 1 root root 0, 4099 Jun 3 2011 null

The solution: don't use mknod on Mac OS X or cygwin ;-)

In my case I created these critical device nodes on the NFS root from within the VMWare Linux instance with the use of a livecd iso, as a last resort.

In case anyone is interested, after re-creating the 'correct' device nodes using the VMWare Linux instance, the respective files on the nfs root (Mac OS X) show up as follows:

$ ls -la console null
crw------- 1 root wheel 0, 1281 Jun 3 13:27 console
crw-rw-rw- 1 root wheel 0, 259 Jun 3 13:28 null
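The numbers in both listings are consistent with the two systems packing major and minor numbers differently into a 32-bit device value. The sketch below rests on assumptions: that Mac OS X keeps the major number in the top 8 bits and the minor in the low 24, that Linux uses the layout of its new_encode_dev()/new_decode_dev() helpers, and that the raw value crosses NFS unchanged. The function names are made up for this example.

```c
#include <stdint.h>

/* Assumed Mac OS X layout: major in bits 31..24, minor in bits 23..0. */
static uint32_t darwin_makedev(unsigned major, unsigned minor)
{
    return ((uint32_t)major << 24) | (minor & 0xffffffu);
}
static unsigned darwin_major(uint32_t dev) { return (dev >> 24) & 0xffu; }
static unsigned darwin_minor(uint32_t dev) { return dev & 0xffffffu; }

/* Assumed Linux external layout: low 8 minor bits in bits 7..0,
 * major in bits 19..8, remaining minor bits from bit 20 upward. */
static uint32_t linux_encode(unsigned major, unsigned minor)
{
    return (minor & 0xffu) | ((uint32_t)major << 8) |
           ((uint32_t)(minor & ~0xffu) << 12);
}
static unsigned linux_major(uint32_t dev) { return (dev & 0xfff00u) >> 8; }
static unsigned linux_minor(uint32_t dev)
{
    return (dev & 0xffu) | ((dev >> 12) & 0xfff00u);
}
```

Under those assumptions, darwin_makedev(5, 1) is 0x05000001, which Linux decodes as major 0, minor 0x5001 = 20481, and darwin_makedev(1, 3) decodes as 0, 4099. In the other direction, linux_encode(5, 1) is 0x0501 = 1281 and linux_encode(1, 3) is 0x0103 = 259, which the Mac OS X layout reads as major 0 with those minors. All four values match the listings above.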

Note: I'm using NFSv2 due to some brokenness with the Mac OS X NFS server implementation.

[1] http://www.mail-archive.com/davinci-linux-open-source@lin...

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds