The 2.5.66 kernel includes Andries Brouwer's patches clearing the path for
an expansion of the dev_t device number type. A small number of
problems have been found, but the changes are working for most people.
Andrew Morton has gone a little further and actually changed dev_t
to 32 bits in his -mm tree; predictably, the number of problems found there
has been a little higher. As a whole, though, the transition appears to be
going relatively smoothly.
Badari Pulavarty decided that it was time to play with the possibilities of
a larger device number type; he posted a patch
which makes the SCSI disk driver make full use of the expanded minor number
range. Testing with 4000 virtual disks, with 50 real drives at the end of
the range, worked - for the most part. Some scaling problems did turn up.
The most significant one appears to be in the request queue mechanism.
When the kernel wants to issue a block I/O request, the block subsystem
needs to be able to set it up quickly. In particular, memory allocations
are best avoided at that point; it's possible that the system is out of
memory and the kernel is doing I/O in an attempt to free up some space.
Trying to allocate memory at that point can lead to deadlocks. So the
block subsystem sets aside a number of pre-allocated request structures for
every request queue (and there is typically one request queue for each
physical drive in the system). That number varies depending on the amount
of memory on the system; it can be as low as 32, and as high as 256.
Request structures run about 144 bytes each. So, if one assumes that a system
hosting 4000 disks really should be equipped with a fair amount of memory,
the block subsystem will set aside about a million request structures, at a
cost of about 150MB. And that is just the beginning; the deadline I/O
scheduler augments each request structure with a separate
deadline_rq structure. Other overheads exist as well.
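The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above (4000 disks, 32 to 256 requests per queue, about 144 bytes per request) and assumes one request queue per disk; the function name is made up for illustration, and the per-request scheduler overhead is left out.

```python
# Back-of-the-envelope estimate of memory tied up in pre-allocated
# request structures. All figures come from the text; one queue per
# disk is an assumption.
DISKS = 4000
REQUEST_SIZE = 144      # bytes per request structure (approximate)
MIN_PER_QUEUE = 32      # low-memory systems
MAX_PER_QUEUE = 256     # large-memory systems


def request_pool_bytes(queues, requests_per_queue, request_size=REQUEST_SIZE):
    """Total bytes reserved for request structures across all queues."""
    return queues * requests_per_queue * request_size


total_requests = DISKS * MAX_PER_QUEUE
low = request_pool_bytes(DISKS, MIN_PER_QUEUE)
high = request_pool_bytes(DISKS, MAX_PER_QUEUE)

print(f"request structures: {total_requests}")
print(f"memory at 32 per queue:  {low / 1e6:.1f} MB")
print(f"memory at 256 per queue: {high / 1e6:.1f} MB")
```

With 256 requests per queue, this comes to 1,024,000 structures and roughly 147MB, matching the "about a million" and "about 150MB" estimates.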
The end result is that, when the number of disks gets large, a great deal
of memory (which must all be in the low memory zone on 32-bit processors)
gets tied up in request queues. As Andrew Morton pointed out, with 4000 disks, enough request
structures have been allocated to represent 200GB of current I/O requests.
That, perhaps, is a bit more than is really needed in most situations.
The solution, as hacked up by Jens Axboe, is
to go to a more dynamic scheme for the allocation of request structures.
The mempool mechanism is used to keep an
absolute minimum number of request structures available for each queue; all
the rest are allocated as needed and freed afterwards. This patch will
probably go through a few more iterations, but the immediate scalability
problem has been addressed.
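The idea behind that scheme can be illustrated with a small user-space model. This is not Jens Axboe's patch or the kernel's mempool API; the class and function names are hypothetical, and memory pressure is simulated with an allocator that raises MemoryError. It shows only the general pattern: allocate normally when possible, fall back to a small guaranteed reserve when that fails, and refill the reserve when requests are freed.

```python
class Request:
    """Stand-in for a block I/O request structure."""


class RequestPool:
    """Toy model of a per-queue pool with a small guaranteed reserve."""

    def __init__(self, min_reserved, allocate=Request):
        self.min_reserved = min_reserved
        self.allocate = allocate
        # The reserve is filled up front, while memory is still available.
        self.reserve = [Request() for _ in range(min_reserved)]

    def alloc(self):
        # Normal case: allocate on demand, leaving the reserve untouched.
        try:
            return self.allocate()
        except MemoryError:
            # Under memory pressure, dip into the reserve.
            if self.reserve:
                return self.reserve.pop()
            return None  # a real caller would wait for a free() and retry

    def free(self, request):
        # Top the reserve back up first; anything beyond that is released.
        if len(self.reserve) < self.min_reserved:
            self.reserve.append(request)


def failing_allocator():
    """Simulates an allocation failure under memory pressure."""
    raise MemoryError("simulated memory pressure")


pool = RequestPool(min_reserved=2)
first = pool.alloc()                 # ordinary allocation
pool.allocate = failing_allocator    # memory becomes scarce
second = pool.alloc()                # served from the reserve
print(len(pool.reserve))
pool.free(second)                    # reserve is refilled
print(len(pool.reserve))
```

The memory cost per queue drops from hundreds of idle structures to the size of the reserve, while I/O can still make progress when ordinary allocations fail.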
Meanwhile, not everybody is entirely happy with the direction of the
dev_t changes for char devices. In particular, Roman Zippel, who
has apparently given up on getting the module changes backed out, has now
posted a series of patches backing out the char device changes and
substituting his approach. That approach includes maintaining the
(currently unused) char device hashing scheme and getting rid of the new
register_chrdev_region() function. There is, he claims, no
particular need to split char minor number ranges into regions, as there is
with block devices. Roman's patches have created some discussion, but
there does not appear to be a great deal of pressure for a change in
direction at this time.
There has also been a bit of discussion on how big the new dev_t
should be. The plan has been to expand it to 32 bits; 12 for the major
number, and 20 for the minor number. That is the way Linus has wanted to
do it, but he has recently made noises about being open to the idea of
making dev_t even larger. If dev_t were to go to 64
bits, with 32 each for major and minor numbers, there would be little need
to worry about running out of device numbers for some time into the
future. This decision may not be made for a while; once the work to
support the dev_t expansion has been done, setting it to one size
or another is a relatively simple task.
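To make the two splits concrete, here is a minimal sketch of packing and unpacking a device number. The text gives only the bit counts; placing the major number in the high bits of a single integer is an assumption made for illustration, and the helper names are hypothetical.

```python
def make_dev(major, minor, major_bits=12, minor_bits=20):
    """Pack major and minor into one integer, major in the high bits."""
    if not 0 <= major < (1 << major_bits):
        raise ValueError("major number out of range")
    if not 0 <= minor < (1 << minor_bits):
        raise ValueError("minor number out of range")
    return (major << minor_bits) | minor


def split_dev(dev, minor_bits=20):
    """Recover (major, minor) from a packed device number."""
    return dev >> minor_bits, dev & ((1 << minor_bits) - 1)


# 12/20 split: 4096 majors, 1048576 minors per major.
print(1 << 12, 1 << 20)

# Major 8 is the SCSI disk major; minor 4000 would not fit in 8 bits.
dev32 = make_dev(8, 4000)
print(hex(dev32), split_dev(dev32))

# The same pair under a 32/32 split of a 64-bit dev_t.
dev64 = make_dev(8, 4000, major_bits=32, minor_bits=32)
print(hex(dev64), split_dev(dev64, minor_bits=32))
```

Because the widths are just parameters here, moving from one size to the other is a small change, which is the point made above about the final size being an easy decision to defer.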