
msync() and subtle behavioral tweaks

By Jonathan Corbet
June 19, 2012
Some kernel behaviors are determined by standards like POSIX; others are simply a function of what the kernel developers implemented. The latter type of behavior can, in theory, be changed if there is a good reason to do so, but there is always a risk of breaking an application that depended on the previous behavior. Even worse, this kind of problem can be impossible to find during development and, indeed, may lurk until long after the new code has been deployed. A system call patch currently under consideration shows how hard it can be to know when a change is truly safe.

The msync() system call exists to request that a file-backed memory region created with mmap() be written back to persistent storage. Once upon a time, msync() was the only way to guarantee that modified pages would be saved to disk in any reasonable period of time; the kernel could not always detect on its own that they had been changed by the application. That problem has long since been dealt with, but msync() is still a good way to inform the kernel that now would be a good time to flush modified pages to disk.
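For readers who have not used the call, a minimal sketch of the basic pattern follows: map a file with MAP_SHARED, modify it through the mapping, and ask for writeback with msync(). The function name store_via_mapping() and the file path used with it are hypothetical, chosen only for illustration; error handling is kept to the bare minimum.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` to the start of `path` through a shared
 * mapping, then ask the kernel to flush it. Returns 0 on success, -1 on error. */
int store_via_mapping(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    if (len == 0)
        return -1;                      /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file must be at least as long as the region we touch. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);              /* dirties the mapped page */

    /* MS_SYNC blocks until the dirty pages have been written back. */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc;
}
```

A caller would use it as store_via_mapping("/tmp/example.bin", "hello") and check for a zero return.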

Paolo Bonzini recently posted a small patch set making a couple of changes to msync(). The actual API does not change at all, but how the system call implements the API changes in subtle and interesting ways.

There are a few options to msync(), one of which (MS_ASYNC) asks that the writeback of modified pages be "scheduled," but not necessarily completed immediately. It is meant to be a non-blocking system call that sets the necessary actions in motion, but does not wait for them to complete. Current kernels will write back dirty pages as part of the normal writeback process; the system behaves, in other words, as if msync(MS_ASYNC) were being called on a regular basis on every mapping. Writeback of dirty pages is already scheduled as soon as the page is dirtied. Given that, there's not much work for an explicit MS_ASYNC call from user space to do, and, indeed, the kernel essentially ignores such calls.

Paolo's patch causes the kernel to immediately start I/O on modified pages in response to MS_ASYNC calls. The result is to get those pages to persistent storage a bit more quickly than would otherwise happen, but still avoid blocking the calling process. The change seems reasonable, but Andrew Morton worried that this behavioral change might be a problem for some users:

Means that people will find that their msync(MS_ASYNC) call will newly start IO. This may well be undesirable for some. Also, it hardwires into the kernel behaviour which userspace itself could have initiated, with sync_file_range(). ie: reduced flexibility.

Most users are unlikely to notice the change at all. But it's entirely possible that somebody out there has a precisely-tuned system that will choke if the underlying I/O behavior changes. Users complain about exactly this kind of change at times, but usually when the change shows up in a new enterprise kernel, years too late. That said, many patches make behavioral changes that can affect users in surprising ways. The only thing that is different about this one is that the nature of the change is understood from the beginning. Andrew's concerns were not echoed by others and may not be enough to keep this change from being merged.
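Andrew's point about sync_file_range() can be made concrete with a short sketch. The call is Linux-specific and operates on a file descriptor and file offsets rather than on memory addresses. The helper name start_range_writeback() and the fallback to fdatasync() are assumptions made for illustration, not something proposed in the discussion.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* sync_file_range() is a Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Start writeback of dirty pages in the file range [offset, offset + nbytes)
 * without waiting for it to finish. Returns 0 on success, -1 on error. */
int start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0)
        return 0;

    /* Assumed fallback: if the call is not available at all, settle for a
     * blocking flush of the whole file's data. */
    if (errno == ENOSYS)
        return fdatasync(fd);

    return -1;
}
```

An application that wants the proposed MS_ASYNC behavior on a current kernel could call something like this with the file offsets corresponding to the pages it has just modified.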

The other change is potentially more troubling. msync() takes two parameters, a starting address and a length, that identify the memory area (and thus the offset and length within the file) to be written back. But the kernel has always ignored those parameters, choosing instead to just write back all modified pages in the file, and the related metadata as well. Paolo's patch changes the implementation to only synchronize the specific pages requested by the user.

It would be hard to argue that the new behavior breaks the documented API; the offset and length parameters are there for a reason, after all. Still, once again, Andrew worried that applications could break in especially unpleasant ways:

Would be nice, but if applications were previously assuming that an msync() was syncing the whole file, this patch will secretly and subtly break them.

No developer should have written a program with the assumption that msync() would write pages outside of the range it was given. Any such program would clearly be buggy. But, programs written that way will work under current kernels. Changing msync() to not write some pages that it currently writes could cause such programs to fail in strange and difficult-to-reproduce ways.

In general, the kernel tries not to break existing applications, even if those applications can be said to have been written in a buggy manner. If something works now, it should continue to work with future kernels. If the msync() changes described here break those programs, the changes should probably not be merged into the kernel. The problem, of course, is that it can be very difficult to know if a specific change will break somebody's application. Any problems caused by subtle changes are relatively unlikely to turn up before the changes are included in a released kernel. So it is necessary to proceed with care. That said, it is not practical to hold back every change that might break a badly-written program somewhere; kernel development would likely be slowed considerably by such a constraint. So these changes will probably go in unless an affected user happens to notice a problem in the near future.



msync() and subtle behavioral tweaks

Posted Jun 21, 2012 14:39 UTC (Thu) by mikemol (subscriber, #83507) [Link]

Just a guess, but if you change MS_ASYNC to immediately begin syncing, wouldn't that reduce the kernel's ability to defer syncing in, e.g. mobile and low-power environments?

If my spinning platters started spinning more frequently (or failed to spin down as often), I know I'd be annoyed.

msync() and subtle behavioral tweaks

Posted Jul 3, 2012 12:40 UTC (Tue) by philomath (guest, #84172) [Link]

In that case, why have MS_ASYNC at all? Just leave it to the kernel.
In general, it doesn't make sense to have a flag that does nothing, just because it has never done anything.

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 15:51 UTC (Thu) by Yorick (subscriber, #19241) [Link]

It's not unusual to see msync (with MS_ASYNC) used to detect whether a certain address is mapped or not - for instance, inside in-process stack tracers that may need to follow random pointers. Unless I'm mistaken, libunwind does exactly this. It may be undesirable if this suddenly becomes a lot more expensive.

Playing around with SIGSEGV handlers is one alternative, but it's often slower and kind of dangerous in what is typically already a precarious context (inside a signal handler itself, for example), or with multiple threads.
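As an illustration of the trick described above (a sketch, not code taken from libunwind or any other project), the probe can be written in a few lines. It relies on Linux returning ENOMEM from msync() when the range is not mapped; the function name addr_is_mapped_msync() is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is mapped, 0 if it is not,
 * and -1 on any other error. */
int addr_is_mapped_msync(const void *addr)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);

    /* msync() insists on a page-aligned start address. */
    void *start = (void *)((uintptr_t)addr & ~(page - 1));

    if (msync(start, 1, MS_ASYNC) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}
```

On current kernels the call does almost no work, which is why a change that makes MS_ASYNC start I/O could make this kind of probe more expensive when it lands on a dirty file-backed page.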

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 18:23 UTC (Thu) by cmccabe (guest, #60281) [Link]

You could always try to write(2) a byte from the address you're checking to /dev/null.

I also got the impression that libunwind was mostly for crash debugging, at which point performance is not exactly priority number 1...
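A sketch of the write(2) probe follows, with one substitution that should be stated plainly: it writes to a pipe instead of /dev/null. As far as I know, the Linux /dev/null write handler returns the byte count without ever reading the user buffer, so an unmapped address would not be detected there; a pipe has to copy the byte. The function name addr_is_readable_pipe() is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Returns 1 if one byte at addr can be read, 0 if the kernel reports
 * EFAULT, and -1 on any other error. */
int addr_is_readable_pipe(const void *addr)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;

    /* The kernel must copy the byte into the pipe buffer, so a bad
     * address shows up as EFAULT from write(). */
    ssize_t n = write(fds[1], addr, 1);
    int saved = errno;

    close(fds[0]);
    close(fds[1]);

    if (n == 1)
        return 1;
    return saved == EFAULT ? 0 : -1;
}
```

Creating and closing a pipe per probe is slow; a real user would keep one pipe open and drain it after each successful write.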

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 19:31 UTC (Thu) by nix (subscriber, #2304) [Link]

Other uses listed on libunwind's own webpage include 'exception handling' and 'efficient setjmp()', for which performance is surely more of a priority.

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 19:41 UTC (Thu) by Yorick (subscriber, #19241) [Link]

Maybe. There is no guarantee that write(2) from an invalid address wouldn't cause a SIGSEGV instead of returning EFAULT. I could be mistaken, but I believe Linux has actually behaved that way before. msync(2) seems to be more tolerant in general.

Another interesting use of libunwind is for profiling, for which performance matters. (Yes, there are alternatives, but libunwind is fairly portable.)

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 21:15 UTC (Thu) by cmccabe (guest, #60281) [Link]

> Maybe. There is no guarantee that write(2) from an invalid address
> wouldn't cause a SIGSEGV instead of returning EFAULT. I could be mistaken,
> but I believe Linux has actually behaved that way before. msync(2) seems
> to be more tolerant in general.

I'm really curious why you think this. It seems totally bogus to me: the kernel is the one doing the address space checking, not userspace. You would need to add extra code to get the weird and (I think) non-POSIX behavior of delivering a signal to userspace. What gave you the idea that a signal might be delivered?

msync() and subtle behavioral tweaks

Posted Jun 21, 2012 22:03 UTC (Thu) by Yorick (subscriber, #19241) [Link]

I may have misunderstood it entirely, but I was under the impression from old discussions on linux-kernel (that I wish I could find now; sorry) that POSIX actually would allow this. Consider an implementation that implements all or parts of write() in user space, for instance.

You would then rightly ask why msync() would be exempt from such a behaviour, and to that I have no good answer. I have seen that syscall being used for this purpose (address checking) in a variety of places, however, misguided or not.

msync() and subtle behavioral tweaks

Posted Jun 22, 2012 3:40 UTC (Fri) by cmccabe (guest, #60281) [Link]

I think glusterfs has a shim library that will intercept some calls to glibc and interpose its own functions. So you might call write() and end up getting a userspace version. So I guess it's not outside the realm of possibility. Of course, glusterfs could theoretically intercept msync as well-- I don't know if their shim library does or not.

I would guess that the people using msync to check whether an address is valid are using it more because it doesn't require you to have any open file descriptors, than because they're being "careful." In fact, it's not even clear according to the man page that you can use msync on memory that wasn't allocated with mmap. I really have no idea what msync is "supposed" to do on memory allocated with brk, for example. So it's just another case of people relying on some pretty hairy implementation details.

As far as I can see, your best bet for checking address validity probably is "mincore." It definitely doesn't make any sense for a shim library to intercept that function.
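For comparison, here is a minimal sketch of the mincore() variant, assuming the Linux behavior of failing with ENOMEM when the range contains unmapped memory. The function name addr_is_mapped_mincore() is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* mincore() is not part of POSIX */
#endif
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is mapped, 0 if it is not,
 * and -1 on any other error. */
int addr_is_mapped_mincore(const void *addr)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);

    /* mincore() also requires a page-aligned start address. */
    void *start = (void *)((uintptr_t)addr & ~(page - 1));
    unsigned char vec;          /* one status byte per page examined */

    if (mincore(start, 1, &vec) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}
```

The residency information returned in vec is ignored here; only the success or failure of the call is used.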

msync() and subtle behavioral tweaks

Posted Jun 22, 2012 8:40 UTC (Fri) by nix (subscriber, #2304) [Link]

Anyone intercepting relatively-bare syscalls and converting them into library functions like that had better trap SIGSEGV during the call and convert it into an -EFAULT return. It's not like that's terribly hard (though it does require flipping signal dispositions twice, that's fast as syscalls go).

EFAULT vs SIGSEGV on write()

Posted Jun 22, 2012 17:59 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> Anyone intercepting relatively-bare syscalls and converting them into library functions like that had better trap SIGSEGV during the call and convert it into an -EFAULT return.

But do the standards or conventional architecture really call for that? I don't think the POSIX definition of write() uses the word "kernel" and I believe the general understanding for any library is that if you pass an invalid address to a subroutine, it might generate a SIGSEGV.

Or are you just making a practicality argument, since people might be depending on EFAULT. I think it would be a pretty unusual program that passes invalid addresses to write() when the program isn't broken.

EFAULT vs SIGSEGV on write()

Posted Jun 22, 2012 23:48 UTC (Fri) by nix (subscriber, #2304) [Link]

It's practicality. If you're trying to transparently replace a function that normally EFAULTs on events that would cause userspace to SIGSEGV, it behooves you to behave the same way, lest you break some weird program that really depends on this. (I wrote one once. It does happen.)

msync() and subtle behavioral tweaks

Posted Jun 22, 2012 2:21 UTC (Fri) by buck (subscriber, #55985) [Link]

MS_REALLY_ASYNC_MEANINGFUL_PARAMS (MS_RAMP), anybody?

msync() and subtle behavioral tweaks

Posted Jun 22, 2012 17:51 UTC (Fri) by jmorris42 (subscriber, #2203) [Link]

Maybe I'm missing something important but it sounds like these two changes make the system call behave more like the documentation. That should be a good thing and a no brainer.

The first change, causing it to actually get busy writing when the call is made but without blocking is kinda what you should have been expecting from explicitly making that call in the first place. The existing kernel behavior of basically ignoring the whole request is the error. If you could live with it getting written eventually when the system got around to it you could have skipped the call and got the same result.

Same for the range. It is actually implementing behavior that has long been documented in the API but left unwritten. Even if you have a bug and are specifying the range wrong you still get the same result you used to get, a write when the kernel gets a spare roundtuit. So all this is doing is increasing the odds that a range that is correctly specified will get written in the case of a crash. It could have happened before though and it might not, and it still might not get written. So all one can say is if you have buggy code it is possible you will get bitten a little more often with this change but probably all that will happen is you won't get as much of an improvement in reliability.

msync() and subtle behavioral tweaks

Posted Jun 26, 2012 19:14 UTC (Tue) by liljencrantz (guest, #28458) [Link]

I am surprised this did not come up in the article. Maintaining backward compatibility with possibly existing old programs is relevant, but is it really more important than making sure the kernel behavior matches the API documentation so that *future* programs will behave as expected?

The other alternative would of course be to deprecate the old calls and replace them with newer ones that do what you'd expect from the documentation and then carefully document all the ways in which the old syscalls don't actually do what they advertise. This alternative sounds horrible in a very Redmondesque way.

msync() and subtle behavioral tweaks

Posted Jul 3, 2012 12:36 UTC (Tue) by philomath (guest, #84172) [Link]

Exactly what I would have written. Thank you!

msync() and subtle behavioral tweaks

Posted Jun 25, 2012 22:01 UTC (Mon) by cesarb (subscriber, #6266) [Link]

About the second concern: how about a MS_DO_NOT_IGNORE_OFFSET_AND_LENGTH flag?

msync() and subtle behavioral tweaks

Posted Jun 28, 2012 7:19 UTC (Thu) by slashdot (guest, #22014) [Link]

Probably needs to become the addition of two new flags called MS_IMMEDIATE and MS_RANGE.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds