Toward a larger dev_t

The 2.5.66 kernel includes Andries Brouwer's patches clearing the path for an expansion of the dev_t device number type. A small number of problems have been found, but the changes are working for most people. Andrew Morton has gone a little further and actually changed dev_t to 32 bits in his -mm tree; predictably, the number of problems found there has been a little higher. As a whole, though, the transition appears to be going relatively smoothly.

Badari Pulavarty decided that it was time to play with the possibilities of a larger device number type; he posted a patch that makes the SCSI disk driver take full advantage of the expanded minor number range. Testing with 4000 virtual disks, with 50 real drives at the end of the range, worked, for the most part. Some scaling problems did turn up, however.

The most significant one appears to be in the request queue mechanism. When the kernel wants to issue a block I/O request, the block subsystem needs to be able to set it up quickly. In particular, memory allocations are best avoided at that point; it's possible that the system is out of memory and the kernel is doing I/O in an attempt to free up some space. Trying to allocate memory at that point can lead to deadlocks. So the block subsystem sets aside a number of pre-allocated request structures for every request queue (and there is typically one request queue for each physical drive in the system). That number varies depending on the amount of memory on the system; it can be as low as 32, and as high as 256. Request structures run about 144 bytes each. So, if one assumes that a system hosting 4000 disks really should be equipped with a fair amount of memory, the block subsystem will set aside about a million request structures, at a cost of about 150MB. And that is just the beginning; the deadline I/O scheduler augments each request structure with a separate deadline_rq structure. Other overheads exist as well.
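The arithmetic behind those figures can be checked with a short sketch. The constants are taken from the paragraph above; the helper name is hypothetical, and the calculation assumes that every queue receives the 256-request maximum and ignores the deadline_rq and other overheads.

```python
# Back-of-the-envelope estimate of the memory pinned by pre-allocated
# request structures. Figures come from the article; the helper name and
# the "every queue gets the maximum" assumption are illustrative only.

REQUEST_SIZE_BYTES = 144       # approximate size of one request structure
MAX_REQUESTS_PER_QUEUE = 256   # upper bound on pre-allocated requests per queue


def preallocated_request_bytes(num_queues,
                               requests_per_queue=MAX_REQUESTS_PER_QUEUE,
                               request_size=REQUEST_SIZE_BYTES):
    """Return (request_count, total_bytes) for the given number of queues."""
    count = num_queues * requests_per_queue
    return count, count * request_size


count, total = preallocated_request_bytes(4000)
print(f"{count} request structures, {total / 1e6:.0f} MB")
# prints "1024000 request structures, 147 MB"
```

With the 32-request minimum instead, the same 4000 queues would hold 128,000 structures, or roughly 18 MB.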

The end result is that, when the number of disks gets large, a great deal of memory (which must all be in the low memory zone on 32-bit processors) gets tied up in request queues. As Andrew Morton pointed out, with 4000 disks, enough request structures have been allocated to represent 200GB of current I/O requests. That, perhaps, is a bit more than is really needed in most situations.

The solution, as hacked up by Jens Axboe, is to go to a more dynamic scheme for the allocation of request structures. The mempool mechanism is used to keep an absolute minimum number of request structures available for each queue; all the rest are allocated as needed and freed afterwards. This patch will probably go through a few more iterations, but the immediate scalability problem has been addressed.
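The idea of a small guaranteed reserve backed by ordinary allocation can be modeled in a few lines. This is a toy sketch, not the kernel's mempool API: the class name, method names, and the use of a callback that returns None to simulate allocation failure are all assumptions made for illustration. A real implementation would also let the caller sleep until an object is returned, rather than handing back None.

```python
class ReservePool:
    """Toy model of a pool that keeps a minimum number of objects in reserve.

    Hypothetical names throughout; this only mimics the general idea of
    "try a normal allocation first, fall back to the reserve".
    """

    def __init__(self, min_reserved, try_alloc):
        self.min_reserved = min_reserved
        self._try_alloc = try_alloc    # returns an object, or None on failure
        self._reserve = []
        # Fill the reserve up front, while memory is still available.
        for _ in range(min_reserved):
            obj = try_alloc()
            if obj is None:
                raise MemoryError("could not fill the reserve at setup time")
            self._reserve.append(obj)

    def get(self):
        """Allocate dynamically if possible, else dip into the reserve."""
        obj = self._try_alloc()
        if obj is not None:
            return obj
        if self._reserve:
            return self._reserve.pop()
        return None                    # a real caller would have to wait here

    def put(self, obj):
        """Refill the reserve first; anything beyond it is simply released."""
        if len(self._reserve) < self.min_reserved:
            self._reserve.append(obj)


# Simulated allocator whose success can be switched off.
state = {"memory_available": True}


def try_alloc():
    return {"kind": "request"} if state["memory_available"] else None


pool = ReservePool(min_reserved=4, try_alloc=try_alloc)
state["memory_available"] = False          # simulate memory exhaustion
held = [pool.get() for _ in range(5)]      # four succeed, the fifth does not
```

The point of the design is that the per-queue cost drops to the size of the reserve, while I/O can still make progress when ordinary allocations fail.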

Meanwhile, not everybody is entirely happy with the direction of the dev_t changes for char devices. In particular, Roman Zippel, who has apparently given up on getting the module changes backed out, has now posted a series of patches backing out the char device changes and substituting his approach. That approach includes maintaining the (currently unused) char device hashing scheme and getting rid of the new register_chrdev_region() function. There is, he claims, no particular need to split char minor number ranges into regions, as there is with block devices. Roman's patches have created some discussion, but there does not appear to be a great deal of pressure for a change in direction at this time.

There has also been a bit of discussion on how big the new dev_t should be. The plan has been to expand it to 32 bits: 12 for the major number and 20 for the minor number. That is the way Linus has wanted to do it, but he recently has made noises about being open to the idea of making dev_t even larger. If dev_t were to go to 64 bits, with 32 each for major and minor numbers, there would be little need to worry about running out of device numbers for some time into the future. This decision may not be made for a while; once the work to support the dev_t expansion has been done, setting it to one size or another is a relatively simple task.
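The 12/20 split can be illustrated with a small packing sketch. The article gives only the field widths; placing the major number in the high bits directly above a contiguous minor field is an assumption made here for illustration, and the names make_dev and split_dev are hypothetical.

```python
def make_dev(major, minor, major_bits=12, minor_bits=20):
    """Pack a (major, minor) pair into one integer.

    Assumed layout: major in the high bits, minor in the low bits.
    """
    if not 0 <= major < (1 << major_bits):
        raise ValueError("major number does not fit in the major field")
    if not 0 <= minor < (1 << minor_bits):
        raise ValueError("minor number does not fit in the minor field")
    return (major << minor_bits) | minor


def split_dev(dev, minor_bits=20):
    """Recover (major, minor) from a packed device number."""
    return dev >> minor_bits, dev & ((1 << minor_bits) - 1)


# 32-bit layout: 4096 possible majors, 1048576 minors per major.
dev32 = make_dev(8, 4000)
print(hex(dev32), split_dev(dev32))   # prints "0x800fa0 (8, 4000)"

# The 64-bit variant mentioned above only changes the field widths.
dev64 = make_dev(1, 2**32 - 1, major_bits=32, minor_bits=32)
```

Because the widths are parameters, switching between the two proposals is a one-line change in this sketch, which mirrors the article's point that the final size is easy to set once the supporting work is done.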


Toward a larger dev_t

Posted Mar 27, 2003 12:19 UTC (Thu) by rwmj (subscriber, #5474)

Device major and minor numbers (and indeed block/char special inodes) are one of those early Unix things which have now really outlived their usefulness. The current problems and debates around them are symptomatic of the fact that the whole concept is flawed.

They were originally necessary as a hack so that one could create files which looked like devices on an ordinary filesystem.

There are two better approaches now.

One is a kind of enhanced 'devfs'. It would look a lot like /proc (the clue here is that /proc has files with magical properties, yet it doesn't need block or char special inodes). The newer devfs would be mounted under /dev. When a device is registered, a /proc-like file is created which just intercepts the open/read/write/etc. system calls and calls device-supplied functions to provide the magic properties of the file. No major or minor numbers required.

This would work for /dev but wouldn't allow you to create block and char special files on any ordinary filesystem. I'm proposing that we extend ext3 to add special "object" inodes. An object inode has some metadata - probably just the name of the kernel module which handles that inode. When an object inode is opened, it causes the module to be loaded and the open function in this module to be called. Read/write/etc. operations can be modified by the module to provide whatever kind of magic is required.

One interesting possibility is to replace some configuration files with object inodes. /etc/passwd, for example, could be a special type of file which, when read, looks like an ordinary password file, but the data is actually pulled from NIS or LDAP.

Just my £0.02 anyway.

Rich.

Toward a larger dev_t

Posted Mar 27, 2003 18:28 UTC (Thu) by hummassa (subscriber, #307)

I don't mean to be rude or anything, but did you notice you just invented The HURD? ;-)

regards,
Massa

not even close

Posted Mar 28, 2003 16:27 UTC (Fri) by pflugstad (subscriber, #224)

No, not even close. He describes some simple enhancements to the
filesystem to allow the kernel to do away with major/minor numbers.
Something that should/could have been done a long time ago.

The HURD is something else entirely.

Toward a larger dev_t

Posted Mar 27, 2003 23:23 UTC (Thu) by torsten (guest, #4137)

devfs creates as many problems as it solves. Here are some:

1. Requires modifying applications that are stable, mature, secure, and well-tested.

2. devfs requires an ugly hack overlay daemon to maintain compatibility with the current standard. The correct way to fix this is to change all /dev-dependent programs, but the devfs proponents are not spearheading the work.

3. While there are fewer devices present in /dev, the device naming is more complicated. For example, /dev/hdc1 becomes /dev/ide/disk/1/cb0u0u0p0 (or something like that; it was so complicated that I couldn't remember it exactly). The same reduction in complexity can be had with /dev by deleting unused device nodes (I could easily whittle my /dev down to 10 or 15 device names that way).

4. It is not a unifying solution; it divides opinion. This is the most damaging.

5. The devfs guys have not rewritten all the utilities whose authors refuse to play along with devfs implementation. The overlay daemon is not a long-term solution.

6. Does not handle well the situation where a program requires the existence of a device node before the device is attached to the machine (e.g. plugging and unplugging a device while its associated control software is running).

7. Implements a plan of rapid change. This is not how proper development is done (proper development is obviously far less exciting because of this).

8. Lack of confidence - devfs has been available for a long time, but there is no trend to have it universally adopted.

9. devfs requires more typing - for those of us who still use the command line, devfs means far more typing than /dev. Also, the similar naming of device partitions means the universal [tab] completion does not work well.

10. devfs has more levels of abstraction than needed. As a general rule of thumb, data abstraction layers should match the complexity of data. devfs implements four to five levels of abstraction. In the case there are 10 devices attached to a system, a typical person would not need more than one level of abstraction (I can easily handle names of ten devices in one directory, where having ten devices spread out over four or five levels of subdirectories is useless).


Torsten

devfs transition, not devfs, is the problem

Posted Mar 31, 2003 18:59 UTC (Mon) by giraffedata (subscriber, #1954)

Virtually all of these problems with devfs have to do with moving from major/minor numbers to devfs. Devfs per se remains a far more logical and useful way to name devices than major/minor numbers.

And it is a Linux tradition to accept transition pain as a means of getting to a better place.

The typing argument is an easy one to deal with: Make symbolic links from short file names to the descriptive ones.

"devfs has more levels of abstraction than needed"

This one I don't get. Devfs removes a layer of abstraction: the major and minor numbers. The concrete way to identify a device is "Logical Unit 0 on Target ID 3 of Bus 0 on Controller 0 in the SCSI subsystem." Mapping that to a major and minor number, as is required in the traditional device naming system, is a layer of abstraction. Mapping the device number to a name such as /dev/sdb is another. And one of those layers is extremely problematic, because the mapping is fairly fluid.

It's better not to introduce two additional names for a device. Devfs uses a (derivation of a) name the device already had.

The biggest valid criticism I've seen of devfs is that it doesn't provide an acceptable way to set the permissions to access a device. In the major/minor scheme, you create a permanent special file and set its permissions, and if you can manage to make sure the same device has the same device number all the time, its permissions are permanent.

But permanent permissions have drawbacks too -- many devices are not set up such that the same people are supposed to be able to read and write them all the time -- ttys, serial lines, removable media drives. So again, I think the complaint is just that there are existing applications tailored to the exact limitations of the major/minor number permission system.

Toward a larger dev_t

Posted Mar 29, 2003 23:57 UTC (Sat) by oneukum (subscriber, #3970)

The problem is that devices are becoming dynamic quickly.
Therefore any solution that requires manual intervention cannot work.
Storing the meaning of a major/minor combination in a text file is no longer
an option.

Furthermore the scheme has to work for all systems for years to come.
So thousands of devices on a system have to be supported.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds