dev_t expands at last

The expansion of the dev_t device number type has been on the list of goals for 2.6 since the beginning. The only problem is that it has stayed on that list through the entire 2.5 development process; for various reasons, work on that project stalled for a long time. As of September 24, however, the dev_t expansion can be checked off the list; Linus has merged the required changes into his BitKeeper tree. They will appear in the 2.6.0-test6 release.

For some time, it had appeared that dev_t would expand to 64 bits, with 32 bits each for the major and minor numbers. The actual change, however, is to 32 bits, with a 12-bit major number and 20 bits for the minor. That should be adequate for some time, especially given that the new registration mechanisms and sysfs make it much easier for the system to use device numbers more effectively.

Internally, the new kernel dev_t type uses the encoding one would expect: the major number sits in the top twelve bits of a 32-bit value, with the minor number in the bottom 20 bits. The encoding seen by user space is different, however, as shown in the diagram to the right.

[Figure: New user-space dev_t]

Here, the major number sits in bits 8-19, while the minor number is split across bits 20-31 and 0-7. This representation may seem strange, but it has one very nice property: old 16-bit device numbers are still valid in the new scheme. Encoding device numbers this way helps keep no end of applications from breaking with the new device number type.

One might wonder why this workaround is necessary, given that the C library can convert device numbers as needed for the few system calls (mknod(), stat(), etc.) that actually need them. The problem is that device numbers pop up in a number of other contexts, such as in filesystems and ioctl() calls, where the C library is unable to help.
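The two layouts described above can be illustrated with a short C sketch. This is not the kernel's code; the function and macro names (kernel_mkdev(), user_encode_dev(), and so on) are made up for illustration, and only the bit positions come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit widths from the article: 12-bit major, 20-bit minor. */
#define MAJOR_BITS 12
#define MINOR_BITS 20
#define MINOR_MASK ((1u << MINOR_BITS) - 1)

/* Hypothetical helper: kernel-internal layout, with the major number
 * in the top 12 bits and the minor number in the bottom 20. */
uint32_t kernel_mkdev(uint32_t major, uint32_t minor)
{
    assert(major < (1u << MAJOR_BITS));
    assert(minor <= MINOR_MASK);
    return (major << MINOR_BITS) | minor;
}

uint32_t kernel_major(uint32_t kdev) { return kdev >> MINOR_BITS; }
uint32_t kernel_minor(uint32_t kdev) { return kdev & MINOR_MASK; }

/* Hypothetical helper: user-space layout. The low 8 bits of the minor
 * stay in bits 0-7, the major goes in bits 8-19, and the upper 12 bits
 * of the minor move up to bits 20-31. */
uint32_t user_encode_dev(uint32_t kdev)
{
    uint32_t major = kernel_major(kdev);
    uint32_t minor = kernel_minor(kdev);

    return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Inverse of user_encode_dev(): reassemble the split minor number. */
uint32_t user_decode_dev(uint32_t udev)
{
    uint32_t major = (udev >> 8) & 0xfffu;
    uint32_t minor = (udev & 0xffu) | ((udev >> 12) & 0xfff00u);

    return kernel_mkdev(major, minor);
}
```

With these helpers, a device with major 8 and minor 1 encodes to 0x0801 for user space, exactly the value the old 8-bit/8-bit scheme would have produced; that is the compatibility property the split minor field is there to provide. A minor number above 255 simply spills into the upper bits.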

There are places, however, where an explicitly 16-bit value is passed. There is no way to change that without breaking applications. In such cases, the kernel checks whether 16 bits is sufficient; if not, the system call has no choice but to fail with an EOVERFLOW error.
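The check might look something like the following sketch. The function name encode_dev16() is hypothetical, and the sketch assumes the old 16-bit format holds an 8-bit major in the upper byte and an 8-bit minor in the lower byte; the kernel's actual code may be organized differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack a device number into the old 16-bit format
 * (assumed here to be 8 bits of major, 8 bits of minor). Returns 0 on
 * success, or -EOVERFLOW if either number does not fit. */
int encode_dev16(uint32_t major, uint32_t minor, uint16_t *out)
{
    assert(out != NULL);

    if (major > 0xff || minor > 0xff)
        return -EOVERFLOW;   /* cannot be represented in 16 bits */

    *out = (uint16_t)((major << 8) | minor);
    return 0;
}
```

A caller handing a 16-bit field back to user space would pass the error along rather than silently truncating the device number.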

Beyond that, most of the groundwork for the new dev_t had already been laid over the last few months. There are, however, certain to be a few surprises left after such a fundamental change. The next couple of kernels could be interesting to use while the remaining issues get ironed out.



OVERFLOW not such a great idea

Posted Sep 25, 2003 16:20 UTC (Thu) by spitzak (subscriber, #4593) [Link]

This sounds similar to readdir32's emulation, which would produce an error when reading certain disks, because the inode number did not fit when converting from the 64-bit structure. This was extremely annoying because my code was uninterested in the inode number and simply wanted the filename (which could be converted with no problem). The result for the end user was that the GUI file chooser would pop up errors instead of showing the files. I was forced to back-port the fix for this into old and supposedly stable versions of our programs and thus really disliked this "solution".

If the system call is returning information other than the device number, I would much prefer that the overflow be indicated by filling in the structure with a special illegal device number (0xffff?). Thus the error is deferred until somebody actually uses the number.

OVERFLOW not such a great idea

Posted Sep 25, 2003 23:37 UTC (Thu) by viro (subscriber, #7872) [Link]


... except that stat() on files longer than 2GB *already* gives -EOVERFLOW.
If your code doesn't deal with that gracefully - you already have trouble
in that place, so nothing new has happened.

OVERFLOW not such a great idea

Posted Sep 26, 2003 16:23 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

"If your code doesn't deal with that gracefully"

This isn't about dealing with the failure gracefully. It's about the failure itself. If, when you stat a file and some return value won't fit, your program fails gracefully with the message, "I can't tell you what the file permissions are because the kernel won't tell me," that's not nearly as good as "here are the file permissions."

The point that when you're returning multiple pieces of information, it's better to return partial information than to fail the entire request, is a good one. And doing that for some fields even though you don't do it for others is better than doing it for none of them.

Have you ever read "Worse is Better"?

Posted Sep 26, 2003 16:40 UTC (Fri) by im14u2c (subscriber, #5246) [Link]

Have you ever read the document The Rise of "Worse is Better"? This particular decision path sounds familiar.

The document recounts the following scenario:

Two famous people, one from MIT and another from Berkeley (but working on Unix) once met to discuss operating system issues. The person from MIT was knowledgeable about ITS (the MIT AI Lab operating system) and had been reading the Unix sources. He was interested in how Unix solved the PC loser-ing problem. The PC loser-ing problem occurs when a user program invokes a system routine to perform a lengthy operation that might have significant state, such as IO buffers. If an interrupt occurs during the operation, the state of the user program must be saved. Because the invocation of the system routine is usually a single instruction, the PC of the user program does not adequately capture the state of the process. The system routine must either back out or press forward. The right thing is to back out and restore the user program PC to the instruction that invoked the system routine so that resumption of the user program after the interrupt, for example, re-enters the system routine. It is called ``PC loser-ing'' because the PC is being coerced into ``loser mode,'' where ``loser'' is the affectionate name for ``user'' at MIT.

The MIT guy did not see any code that handled this case and asked the New Jersey guy how the problem was handled. The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again. The MIT guy did not like this solution because it was not the right thing.

The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex. Besides, programmers could easily insert this extra test and loop. The MIT guy pointed out that the implementation was simple but the interface to the functionality was complex. The New Jersey guy said that the right tradeoff has been selected in Unix--namely, implementation simplicity was more important than interface simplicity.

The MIT guy then muttered that sometimes it takes a tough man to make a tender chicken, but the New Jersey guy didn't understand (I'm not sure I do either).

Sound familiar? "Push the problem to user space."

In this particular case, I think -EOVERFLOW has better potential to be right than filling in certain fields with magic values, since an old app is more likely to be looking for generic error messages than recently ordained Magic Numbers. Much of this adaptation could be hidden inside libc, theoretically, since libc should be using the updated system calls anyway. Someone running binary-only with a too-old libc should expect some breakage. A minimal patch to libc could provide the desired behavior if needed as a crutch for some old binary-only application.

OVERFLOW not such a great idea

Posted Sep 27, 2003 5:19 UTC (Sat) by viro (subscriber, #7872) [Link]

Sigh... RTF{S,posting you are replying to}, please. Again, the same
stat() variants that can give you -EOVERFLOW due to st_dev and
st_rdev being too large *already* gave you -EOVERFLOW if st_size
was too large. Nothing new here - exact same spot in your program
had already been a source of -EOVERFLOW for some files.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds