
O_HOT and O_COLD

By Jonathan Corbet
April 24, 2012
While storage devices are billed as being "random access" in nature, the truth of the matter is that operations to some parts of the device can be faster than operations to others. Rotating storage has a larger speed differential than flash, while hybrid devices may show a large difference indeed. Given that differences exist, it is natural to want to place more frequently-accessed data on the faster part of the device. But a recent proposal to allow applications to influence this placement has met with mixed reviews; the problem, it seems, is a bit more complicated than it appears.

The idea, as posted by Ted Ts'o, is to create a couple of new flags to be provided by applications at the time a file is created. A file expected to be accessed frequently would be created with O_HOT, while a file that will see traffic only rarely would be marked with O_COLD. It is assumed that the filesystem would, if possible, place O_HOT files in the fastest part of the underlying device.

The implementation requires a change to the create() inode operation; a new parameter is added to allow the VFS layer to pass down the flags passed by the application. That change is the most intrusive part of the patch, requiring tweaks to most filesystems (43 files changed in all). The only filesystem actually implementing these flags at the outset is, naturally, ext4. In that implementation, O_HOT files will be placed in low-numbered blocks, while O_COLD files occupy the high-numbered blocks, but only if the filesystem is stored on a rotating device. Requesting O_HOT placement requires the CAP_SYS_RESOURCE capability or the ability to dip into the reserved block pool.
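As a rough illustration of the placement policy just described, here is a toy model in Python. It is a sketch under stated assumptions, not the ext4 code: the names ToyDevice and pick_goal_block, the "goal block" framing, and the string hints standing in for O_HOT and O_COLD are all hypothetical. The article does not say what happens to an unprivileged O_HOT request, so falling back to the default placement is an assumption as well.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the proposed O_HOT / O_COLD creation flags.
HOT = "hot"
COLD = "cold"


@dataclass
class ToyDevice:
    """Minimal description of a block device for this sketch."""
    total_blocks: int
    rotational: bool


def pick_goal_block(dev: ToyDevice, hint: Optional[str],
                    privileged: bool, default_goal: int) -> int:
    """Return the block number near which a new file should be allocated."""
    # The hints only matter on rotating storage.
    if not dev.rotational:
        return default_goal
    if hint == HOT:
        # Assumption: an unprivileged hot request is quietly ignored.
        return 0 if privileged else default_goal
    if hint == COLD:
        # Cold files go toward the high-numbered end of the device.
        return dev.total_blocks - 1
    return default_goal


disk = ToyDevice(total_blocks=1000, rotational=True)
print(pick_goal_block(disk, HOT, privileged=True, default_goal=500))
print(pick_goal_block(disk, COLD, privileged=False, default_goal=500))
```

The model makes one point from the discussion visible: the policy equates "low-numbered" with "fast", which only holds for certain devices.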

A lot of people seem to like the core idea, but there were a lot of questions about the specifics. What happens when the storage device is an array of rotating devices? Why assume that a file is all "hot" or all "cold" when some parts of a given file may be rather hotter than others? If an application is using both hot and cold files, will the (long) seeks between them reduce performance overall? What about files whose "hotness" varies over time? Should this concept be tied into the memory management subsystem's notion of hot and cold pages? And what do "hot" and "cold" really mean, anyway?

With regard to the more general question, Ted responded that, while it would be possible to rigorously define the meanings of "hot" and "cold" in this context, it's not what he would prefer to do:

The other approach is to leave things roughly undefined, and accept the fact that applications which use this will probably be specialized applications that are very much aware of what file system they are using, and just need to pass minimal hints to the application in a general way, and that's the approach I went with in this O_HOT/O_COLD proposal.

In other words, this proposal seems well suited to the needs of, say, a large search engine company that is trying to get the most out of its massive array of compute nodes. That is certainly a valid use case, but a focus on that case may make it hard to generalize the feature for wider use.

Generalizing the feature may also not be helped by placing the decision on who can mark files as "hot" at the individual filesystem level. That design could lead to different policies provided by different filesystems; indeed, Ted expects that to happen. Filesystem-level policy will allow for experimentation, but it will push the feature further into an area where it is only useful for specific applications where the developers have full control over the underlying system. One would not expect to see O_HOT showing up in random applications, since developers would have no real way to know what using that flag would do for them. And that, arguably, is just as well; otherwise, it would not be surprising to see the majority of files eventually designated as "hot."

Interestingly, there is an alternative approach which was not discussed here. In 2010, a set of "data temperature" patches was posted to the btrfs list. This code watched accesses to files and determined, on the fly, which blocks were most in demand. The idea was that btrfs could then migrate the "hot" data to the faster parts of the storage device, improving overall performance. That work would appear to have stalled; no new versions of those patches have appeared for some time. But, for the general case, it would stand to reason that actual observations of access patterns would be likely to be more accurate than various developers' ideas of which files might be "hot."
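The observation-based approach can be sketched in a few lines of Python. This is only a minimal illustration that assumes "temperature" is a plain access count per block; the article does not describe the heuristics the btrfs patches actually used, and the TemperatureTracker class and its methods are hypothetical names.

```python
from collections import Counter
from typing import List


class TemperatureTracker:
    """Count accesses per block and report the most frequently used ones."""

    def __init__(self) -> None:
        self.hits: Counter = Counter()

    def record_access(self, block: int) -> None:
        # Called for every read or write that touches the block.
        self.hits[block] += 1

    def hottest(self, n: int) -> List[int]:
        # Blocks ordered from most to least accessed, limited to n entries.
        return [block for block, _ in self.hits.most_common(n)]


tracker = TemperatureTracker()
for block in [7, 7, 7, 3, 3, 9]:
    tracker.record_access(block)
print(tracker.hottest(2))
```

A real implementation would presumably also need to age old counts so that data which was once hot can cool down; that is left out here.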

In summary, it seems that, while there is apparent value in the concept of preferential treatment for frequently-accessed data, figuring out how to request and implement that treatment will take some more time. Among other things, any sort of explicit marker (like O_HOT) will quickly become part of the kernel ABI, so it will be difficult to change once people start using it. So it is probably worthwhile to ponder for a while on how this feature can be suitably designed for the long haul, even if some hot files will have to languish in cold storage in the meantime.



O_HOT and O_COLD

Posted Apr 26, 2012 4:39 UTC (Thu) by slashdot (guest, #22014) [Link]

It would seem that a much better approach would be to add a per-inode "block group range" field indicating in which block groups the filesystem should try to allocate file blocks, which would be inherited from the parent directory for new files.

This way, it would be set up by the system administrator depending on his knowledge of the hardware (or by an automated tool with heuristics), it wouldn't pollute the API, and it would be far more flexible.
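A toy model of that inheritance rule, in Python, might look like the following. It is a sketch only: ToyInode, group_range, and create are hypothetical names, and having every new child (subdirectories included) copy the parent's range is an assumption that goes slightly beyond "new files".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyInode:
    """An inode carrying an optional preferred block group range."""
    name: str
    # (first, last) block group, inclusive; None means no preference.
    group_range: Optional[Tuple[int, int]] = None
    children: Dict[str, "ToyInode"] = field(default_factory=dict)

    def create(self, name: str) -> "ToyInode":
        # A new entry starts with whatever range its parent has right now.
        child = ToyInode(name, self.group_range)
        self.children[name] = child
        return child


root = ToyInode("/")
hot_dir = root.create("hot")
hot_dir.group_range = (0, 15)  # set by the administrator, not the application
index = hot_dir.create("index.db")
print(index.group_range)
```

The application never sees the range; it simply creates files in the directory the administrator has prepared.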

O_HOT and O_COLD

Posted Apr 26, 2012 7:56 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

Or more generally some sort of "allocation cookie". This might be implemented, like you say, by allocating from some dedicated block groups on Ext-like filesystems.
But you could also envision a more generic "subvolume" like logic for managing this (allocation pools in zfs, certain block device subsets in btrfs, ...)

O_HOT and O_COLD

Posted Apr 26, 2012 8:40 UTC (Thu) by neilbrown (subscriber, #359) [Link]

This sounds just like the LBA hinting mentioned briefly here: https://lwn.net/Articles/489311/

The idea is to tell your disk drive how you might expect to use the data later, and it will just sort everything out for you. Promise.

Maybe we should just encode the T10 spec in the VFS and allow filesystems to accept exactly the same hints and do the same things. Or maybe not.

I think if you really need that sort of control, you should make it all much more explicit. Create a /hot filesystem and a /cold filesystem and let that be that.

O_HOT and O_COLD

Posted Apr 26, 2012 19:35 UTC (Thu) by ejr (subscriber, #51652) [Link]

And then a unionfs binding of /hot and /cold, along with userspace tools based on common machine learning techniques that manage /hot and /cold.

The question becomes what are the sufficient statistics necessary for control of what file goes where, and how (and how often) does the kernel update those statistics. While that *could* be handled at user level, intercepting open/close/read/write calls is kludgy and inefficient. IMHO, gathering the right statistics is where the kernel can help.

O_HOT and O_COLD

Posted Apr 26, 2012 18:34 UTC (Thu) by gerdesj (subscriber, #5446) [Link]

Surely this is what a sysadmin is for? The app dev or whatever tells me what they want to do and I provide the environment for their app. They probably shouldn't try and eliminate me by trying to speak to the fs devs directly. Especially when that conversation is one way, with rather basic semantics (hot n cold) and with an unknown other end (which fs exactly)!

I haven't even started on looking at RAID, spindle layout, SAN, NAS, iSCSI, FC, RAM, processors, network and all the other factors that will affect your app.

I much prefer the idea of an fs that optimizes itself in the way indicated in the article which looks at the load over time and moves things around. I might specify that and then watch what it does to see if it helps or hinders in the particular case. Perhaps it could be invoked with a flag at certain times. Otherwise I'd do some pre production testing depending on how important it is. Either way, I'll probably beat a simplistic flag.

It reminds me of the mess that MS SQL and Exchange's "memory managers" tend to create on a Windows box. It's particularly hilarious on an SBS ...

Leave file system allocation to the experts (fs devs) and don't try to hinder them with this. Even if the proposal came from one!

Cheers
Jon

O_HOT and O_COLD

Posted Apr 26, 2012 21:11 UTC (Thu) by jengelh (subscriber, #33263) [Link]

>O_HOT files will be placed in low-numbered blocks, while O_COLD files occupy the high-numbered blocks—but only if the filesystem is stored on a rotating device.

That however makes certain assumptions (<- not good) about where low-numbered blocks are stored. Think vinyl :)

O_HOT and O_COLD

Posted Apr 30, 2012 10:16 UTC (Mon) by dmk (subscriber, #50141) [Link]

Why O_HOT and O_COLD? Why not O_IKNOWWHATIMDOING_1 and O_IKNOWWHATIMDOING_2 ?

O_HOT and O_COLD

Posted Apr 28, 2012 19:51 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

>Given that differences exist, it is natural to want to place more frequently-accessed data on the faster part of the device.

The proposal seems to do something rather different: rather than steer data to a faster part of the device, it creates a faster part of the device. It's faster because the head happens to be positioned there most of the time.

Also, it doesn't seem to account for an important reality of temperature: your hottest data should go on the slowest device, because it's accessed on the device rarely (it's accessed in an intervening cache most of the time).

O_HOT and O_COLD

Posted Apr 29, 2012 7:09 UTC (Sun) by jcm (subscriber, #18262) [Link]

Ok. Nobody else mentioned Katy Perry yet, so let me volunteer to send Ted a copy of the single ;)

O_HOT and O_COLD

Posted Apr 29, 2012 7:40 UTC (Sun) by jcm (subscriber, #18262) [Link]

Link sent. I'm sure I can come up with a more complete set of lyrics some other time.

O_HOT and O_COLD

Posted May 3, 2012 9:07 UTC (Thu) by Klavs (subscriber, #10563) [Link]

IMHO it will never work to let developers designate which files are HOT/COLD - as the article also suggests :)

This should be something the OS should handle - and there could be tools for the sysadmin.

I'd very much like it if this feature could also support f.ex. distributing data, so if I f.ex. build 1 logical volume, consisting of 1 types of storage: SLOW (a few 2TB disks) and FAST (several smaller disks) - that way the HOT data could be kept on the FAST part, and the "bulk data" part - which often sits unused, can be put on the SLOW part - without having to shuffle data around (as we do in SANS today) and "intermix" mount points etc.

If the logistics for designating HOT/COLD data got good enough, one could even imagine a case where the sysadmin could get an overview of how "filled" the HOT part was - so he knows when to add more small/fast disks, and when to just add more slow large disks.

It would definitely help a lot of use cases (for server use obviously :)

O_HOT and O_COLD

Posted May 3, 2012 9:08 UTC (Thu) by Klavs (subscriber, #10563) [Link]

Missing that edit button..

s/1 types/2 types/

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds