
Trading off safety and performance in the kernel

By Jonathan Corbet
May 12, 2015
The kernel community ordinarily tries to avoid letting users get into a position where the integrity of their data might be compromised. There are exceptions, though; consider, for example, the ability to explicitly flush important data to disk (or more importantly, to avoid flushing at any given time). Buffering I/O in this manner can significantly improve disk write I/O throughput, but if application developers are careless, the result can be data loss should the system go down at an inopportune time. Recently there have been a couple of proposed performance-oriented changes that have tested the community's willingness to let users put themselves into danger.

O_NOMTIME

A file's "mtime" tracks the last modification time of the file's contents; it is typically updated when the file is written to. Zach Brown recently posted a patch creating a new open() flag called O_NOMTIME; if that flag is present, the filesystem will not update mtime when the file is changed. This change is wanted by the developers of the Ceph filesystem, which has no use for mtime updates:

The ceph servers don't use mtime at all. They're using the local file system as a backing store and any backups would be driven by their upper level ceph metadata. For ceph, slow IO from mtime updates in the file system is as daft as if we had block devices slowing down IO for per-block write timestamps that file systems never use.

Disabling mtime updates, Zach said, can reduce total I/O associated with a write operation by a factor of two or more.
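For context, the update that O_NOMTIME would suppress is the ordinary mtime bump on every write; a minimal demonstration of that default behavior (standard library only):

```python
import os
import tempfile
import time

# Every write to a file normally advances st_mtime; this is the metadata
# update (and its associated journal I/O) that O_NOMTIME would skip.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"first")
    before = os.stat(path).st_mtime_ns
    time.sleep(0.1)        # let the clock move past the timestamp granularity
    os.write(fd, b"second")
    after = os.stat(path).st_mtime_ns
    assert after > before  # the second write moved mtime forward
finally:
    os.close(fd)
    os.unlink(path)
```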

There are, of course, a couple of problems with turning off mtime updates. Trond Myklebust noted that it would break NFS "pretty catastrophically" to not maintain that information; NFS clients would lose the ability to detect when they have stale cached data, leading to potential data corruption. The biggest concern, though, appears to be the effect on filesystem backups; if a file's mtime is not updated when the file is modified, that file will not be picked up in an incremental backup (assuming the backup scheme uses mtime, which most do). A system administrator might decide to run that risk, but there is the possibility that users may run it for them. As Dave Chinner put it:

The last thing an admin wants when doing disaster recovery is to find out that the app started using O_NOMTIME as a result of the upgrade they did 6 months ago. Hence the last 6 months of production data isn't in the backups despite the backup procedure having been extensively tested and verified when it was first put in place.

Another way of putting it is that the mtime value is often not there for the benefit of the creator of the file; it is often used by others as part of the management of the system. Allowing the creator to disable mtime updates may have implications for those others, who would then have cause to wish that they had been part of that decision before it was made.

Despite the concerns, most developers appear to recognize that there is a real use case for being able to turn off mtime updates. So the discussion shifted quickly to how this capability could be provided without creating unpleasant surprises for system administrators. There appear to be two approaches toward achieving that goal.

The first of those is to not allow applications to disable mtime updates unless the system administrator has agreed to it. That agreement is most likely to take the form of a special mount option; unless a specific filesystem has been mounted with the "allow_nomtime" option, attempts to disable mtime updates on that filesystem will be denied. The second is to hide the option in a place where it does not look like part of the generic POSIX API. In practice, that means that, rather than being a flag for the open() system call, O_NOMTIME will probably become a mode that is enabled with an ioctl() call.

Syncing and suspending

Putting a system into the suspended state is a complicated task with a number of steps; in current kernels, one of those steps is to call sys_sync() to flush all dirty file pages back out to persistent storage. It might seem intuitively obvious that saving the contents of files before suspending is a good thing to do, but that has not stopped Len Brown from posting a patch to remove the sys_sync() call from the suspend path.

Len's contention is that flushing disks can be an expensive operation (it can take multiple seconds) and that this cost should not necessarily be paid every time the system is suspended. Doing the sync unconditionally in the kernel, in other words, is a policy decision that may not match what all users want. Anybody who wants file data to be flushed is free to run sync before suspending the system, so removing the call just increases the flexibility of the system.

This change concerns some; Alan Cox was quick to point out some reasons why it makes sense to flush out file data, including the facts that resume doesn't always work and that users will sometimes disconnect drives from a suspended system. It has also been pointed out that, sometimes, a suspended system will never resume due to running out of battery or the kernel being upgraded. For cases like this, it was argued, removing the sys_sync() call is just asking for data to be lost.

Nobody, of course, is trying to make the kernel more likely to lose data. The driving force here is something different: the meaning of "suspending" a system is changing. A user who suspends a laptop by closing the lid prior to tossing it into a backpack almost certainly wants all data written to disk first. But when a system is using suspend as a power-management mechanism, the case is not quite so clear. If a system is able to suspend itself between every keystroke — as some systems are — it may not make sense to do a bunch of disk I/O every time. That may be doubly true on small mobile devices where the power requirements are strict and the I/O devices are slow. On such systems, it may well make sense to suspend the system without flushing I/O to persistent storage first.

The end result is that most (but not all) developers seem to agree that there is value in being able to suspend the system without syncing the disks first. There is rather less consensus, though, on whether that should be the kernel's default behavior. If this change goes in, it is likely to be controlled by a sysctl knob, and the default value of that knob will probably be to continue to sync files as is done in current kernels.

Index entries for this article
Kernel/Filesystems
Kernel/O_NOMTIME



Trading off safety and performance in the kernel

Posted May 12, 2015 20:28 UTC (Tue) by pj (subscriber, #4506) [Link] (10 responses)

wrt Suspend-without-sync - I suspect whether or not to sync is going to want to be determined on a per-call basis, so it would seem to me that another parameter/flag to the suspend call would be a better solution than a sysctl knob. Otherwise there might be a weird (though I guess not really harmful?) race condition on devices that normally suspend-without-sync but want to sometimes suspend-with-sync: that device that suspends between keystrokes might want to suspend-with-sync on lid-close, for instance.

Trading off safety and performance in the kernel

Posted May 12, 2015 20:36 UTC (Tue) by zuki (subscriber, #41808) [Link] (9 responses)

There's no suspend() call. You just write stuff to /sys/power/disk and /sys/power/state. It's already racy.

Trading off safety and performance in the kernel

Posted May 12, 2015 22:01 UTC (Tue) by neilbrown (subscriber, #359) [Link] (8 responses)

> It's already racy.

Racy in what sense?

Trading off safety and performance in the kernel

Posted May 13, 2015 2:52 UTC (Wed) by krakensden (subscriber, #72039) [Link] (6 responses)

If suspend means "execute this script":

#!/bin/sh
sync
echo -n mem > /sys/power/state

then there's a window between writing to disk and suspending the system where the in-memory filesystem cache can get dirtied.

Trading off safety and performance in the kernel

Posted May 13, 2015 3:03 UTC (Wed) by neilbrown (subscriber, #359) [Link] (1 responses)

Thanks for being specific.

In the enter_state() function in kernel/power/suspend.c we have:

	trace_suspend_resume(TPS("sync_filesystems"), 0, true);
	printk(KERN_INFO "PM: Syncing filesystems ... ");
	sys_sync();
	printk("done.\n");
	trace_suspend_resume(TPS("sync_filesystems"), 0, false);

	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
	error = suspend_prepare(state);
and in suspend_prepare():
	pm_prepare_console();

	error = pm_notifier_call_chain(PM_SUSPEND_PREPARE);
	if (error)
		goto Finish;

	trace_suspend_resume(TPS("freeze_processes"), 0, true);
	error = suspend_freeze_processes();
	trace_suspend_resume(TPS("freeze_processes"), 0, false);
So the freezing of user-space processes happens *after* the sys_sync call. So that race is already present.

Trading off safety and performance in the kernel

Posted May 13, 2015 3:41 UTC (Wed) by chloe_zen (guest, #8258) [Link]

Given fuse, it rather has to be sync-then-suspend, doesn't it?

Trading off safety and performance in the kernel

Posted May 13, 2015 19:12 UTC (Wed) by flussence (guest, #85566) [Link] (3 responses)

A tangentially related aside: "echo -n" no longer works that way in recent versions of dash, which is the default /bin/sh on some distros. You'd need to use printf(1) instead.
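For the record, the portable replacement looks like this (writing to a scratch file here rather than /sys/power/state, which would actually suspend the machine):

```shell
# dash's builtin echo treats -n as a literal argument and prints it;
# printf(1) is specified by POSIX and behaves identically everywhere.
printf '%s' mem > /tmp/state_demo   # exactly "mem", no trailing newline
wc -c < /tmp/state_demo             # 3 bytes
rm -f /tmp/state_demo
```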

Trading off safety and performance in the kernel

Posted May 13, 2015 22:41 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> A tangentially related aside: "echo -n" no longer works that way in recent versions of dash,

I wonder where this habit of using "-n" came from. Just

echo mem > /sys/power/state

If any file in /sys requires "echo -n" then that is a kernel bug. Please report it.

Trading off safety and performance in the kernel

Posted May 14, 2015 0:42 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

-n on echo was never portable.

Trading off safety and performance in the kernel

Posted May 15, 2015 17:27 UTC (Fri) by flussence (guest, #85566) [Link]

I'm aware of that, but there's a huge amount of software in the wild convinced that it is: https://bugs.gentoo.org/show_bug.cgi?id=nonbash

Trading off safety and performance in the kernel

Posted May 13, 2015 12:48 UTC (Wed) by zuki (subscriber, #41808) [Link]

>> It's already racy.
> Racy in what sense?

I should have said "non-atomic". OP was wondering if a syscall option can be added to do configuration+execution in one step. I was just pointing out that the configuration step is already separate (e.g. for hibernation).

Trading off safety and performance in the kernel

Posted May 12, 2015 21:23 UTC (Tue) by caitlinbestler (guest, #32532) [Link] (1 responses)

This is dancing around the edges. The real usage of O_NOMTIME is for "files" that really aren't files; they are chunks of objects that have their own metadata.

An O_THIS_IS_JUST_A_CHUNK option would have been clearer.

Trading off safety and performance in the kernel

Posted May 22, 2015 5:50 UTC (Fri) by scientes (guest, #83068) [Link]

The backup program example still applies to chunks, however. How about having mtime still update every 24 hours if the file is modified (like how noatime currently works)?

Trading off safety and performance in the kernel

Posted May 12, 2015 21:36 UTC (Tue) by pr1268 (guest, #24648) [Link]

> Allowing the creator to disable mtime updates may have implications for those others,

Am I to understand that opening a file with O_NOMTIME that creates the file means that the mtime can never be updated for that file, even if it's opened sometime later for read/write?

> attempts to disable mtime updates on that filesystem will be denied

Would open(2) silently ignore the O_NOMTIME flag in such a situation, or return -1 instead of a valid file descriptor?

Trading off safety and performance in the kernel

Posted May 12, 2015 22:29 UTC (Tue) by zblaxell (subscriber, #26385) [Link] (82 responses)

> A user who suspends a laptop by closing the lid prior to tossing it into a backpack almost certainly wants all data written to disk first.

No, I really do not, *especially* in that specific case. Delaying the suspend for a synchronous operation with unbounded running time increases the likelihood that the disk is still spinning and the CPU is still dissipating a crapton of heat when the laptop lands in the backpack. Excess heat and decreased resistance to shock will lose far more data than any of the other things might.

Sync on suspend made sense in the 1990's, when the best firmware was terrible, journalled filesystems were new and scary experimental things, and applications didn't have toolkits like sqlite for local data persistence. There was a high probability that the system wouldn't come back from suspend, there would be a lengthy fsck on the inevitable reboot, and various other downstream problems from the crash would have to be fixed by hand.

The 1990's were almost two decades ago. Now we have filesystems with not just journalling but also CoW, many laptops have properly working firmware(*), and applications use fsync() to ensure that there is never data in RAM to lose. The probability of a suspend/resume cycle failure is now on par with--or even lower than--the probability of catastrophic hard disk failure, battery failure, or ordinary crashing kernel bugs (i.e. those crashes that occur at times other than suspend or resume).

Today, sync on suspend is an anachronism. Worse than that, it introduces entirely unnecessary failure modes for users of network filesystem clients and systems with big RAM and slow disks (especially the common case of SD/MMC devices where a sync might exceed the 20-second suspend failure timeout).

(*) Yes, there is still crappy firmware; however, such firmware tends to make itself obvious after a handful of suspend/resume cycles, and can be treated as a special case requiring kernel workarounds like sync-on-suspend. Normal firmware can do hundreds of suspend/resume cycles without dropping a single bit.
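The fsync() discipline the comment above refers to is usually the write-fsync-rename pattern; a sketch in Python (the function name is illustrative):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Durably replace path's contents: fsync the temp file, rename it
    into place, then fsync the directory so the rename itself persists."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        os.write(fd, data)
        os.fsync(fd)          # data reaches storage before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic replace on POSIX filesystems
    dirfd = os.open(d, os.O_RDONLY)
    try:
        os.fsync(dirfd)       # persist the directory entry
    finally:
        os.close(dirfd)
```

An application that commits its state this way after every logical update has nothing of value sitting dirty in the page cache, which is the premise of the argument above.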

Trading off safety and performance in the kernel

Posted May 12, 2015 23:37 UTC (Tue) by wahern (subscriber, #37304) [Link] (32 responses)

Speaking of anachronisms, how common are spinning disks in laptops these days?

Trading off safety and performance in the kernel

Posted May 13, 2015 1:46 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (7 responses)

> Speaking of anachronisms, how common are spinning disks in laptops these days?

Most laptop drives spin by default, and they'll likely continue to do so as long as SSDs insist on pricing over $70/TB.

Trading off safety and performance in the kernel

Posted May 13, 2015 20:25 UTC (Wed) by zlynx (guest, #2285) [Link] (6 responses)

This idea that price per gigabyte makes SSDs too expensive just keeps moving the goalposts.

I remember when people were claiming $1 per GB was the magic price point. Then it was $0.50 per GB.

As long as spinning hard drives are cheaper per GB there will always be people claiming SSD is too expensive. But at some point it gets to be like claiming your laptop needs a tape drive or a floppy.

Trading off safety and performance in the kernel

Posted May 13, 2015 21:27 UTC (Wed) by reubenhwk (guest, #75803) [Link] (4 responses)

Please still using tape drives should be recycled along with their crappy electronics.

Trading off safety and performance in the kernel

Posted May 13, 2015 21:28 UTC (Wed) by reubenhwk (guest, #75803) [Link] (3 responses)

Please -> *People*

Trading off safety and performance in the kernel

Posted May 13, 2015 21:43 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

tape drives still have a better cost profile when dealing with very large data volumes

Trading off safety and performance in the kernel

Posted May 13, 2015 22:01 UTC (Wed) by zlynx (guest, #2285) [Link]

And yet you still don't want one in your laptop, which is why I compared spinning hard disks to tape drives. A spinning disk is rapidly approaching a tape drive in speed and general usefulness.

Trading off safety and performance in the kernel

Posted May 13, 2015 22:26 UTC (Wed) by drag (guest, #31333) [Link]

> tape drives still have a better cost profile when dealing with very large data volumes

Sorta, sometimes, and not really.. depending on your use case.

If you actually want to be able to access your data in any sort of reasonable time frame, then throwing together the cheapest servers possible stuffed full of 3.5 inch 7200rpm drives is a far, far better option. For money, time, and sanity. :)

SSHD

Posted May 14, 2015 16:22 UTC (Thu) by marcH (subscriber, #57642) [Link]

> As long as spinning hard drives are cheaper per GB there will always be people claiming SSD is too expensive. But at some point it gets to be like claiming your laptop needs a tape drive or a floppy.

As long as you can have more storage for the same price, why would you not? Pictures and movies don't need SSD performance at all.

Asking whether SSDs will win over spinning drives is like asking whether L1 caches will win over L2 caches.

Desktop users have solved this problem long ago: they get one small and cheap SSD for the system and applications + one big and cheap HD for pure storage. For laptops SSHDs look interesting. Two drives in a single enclosure.

There is however something entirely different which is killing spinning drives much faster than SSD price points: the cloud. Making the idea of local storage itself obsolete. Laptops with a small and dirt-cheap eMMC used mainly as a network cache.

Trading off safety and performance in the kernel

Posted May 13, 2015 13:22 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (23 responses)

CPUs still produce a lot of heat if not idle though.

Trading off safety and performance in the kernel

Posted May 13, 2015 20:28 UTC (Wed) by zlynx (guest, #2285) [Link] (22 responses)

Yeah. If designers were thinking of the full consumer experience they'd put a "force CPU to slowest clock" as the first thing in the suspend path. Then it could take as long as it needed to flush buffers and hibernate, or whatever.

Even waking up inside a laptop bag to check email or alert for calendar events is not too bad when the CPU is locked to 800 MHz.

Trading off safety and performance in the kernel

Posted May 14, 2015 12:18 UTC (Thu) by hmh (subscriber, #3838) [Link] (21 responses)

Race to idle at top clock speeds very often generates less heat than staying active for a longer period due to a slower clock.

The CPU should be idle most of the time during a sync: it is the kind of operation that is supposed to be IO-bound...

Trading off safety and performance in the kernel

Posted May 14, 2015 15:21 UTC (Thu) by cesarb (subscriber, #6266) [Link]

> The CPU should be idle most of the time during a sync: it is the kind of operation that is supposed to be IO-bound...

Not if you're using dm-crypt without AESNI.

Trading off safety and performance in the kernel

Posted May 14, 2015 18:01 UTC (Thu) by zlynx (guest, #2285) [Link] (4 responses)

Race to idle while in a bag with no heat dissipation could work, but only if the device includes temperature in its speed calculations. Most laptops don't. Not until the CPU has reached an excessively high temp, which also starts to cook its surrounding components like the battery and GPU.

The simplest way to ensure that overheat doesn't happen is to use the slowest possible CPU speed. But the best way to implement it would be with temperature sensors so that after running 10 seconds at 3 GHz it realizes there's no airflow and slows down.

Sadly, with how ignored and haphazard sensor support is in Linux, I don't think that will work.

If Linux users really cared about sensors, sensor data would show up in desktop tools like Gnome System Monitor, with watts used and CPU temperature shown alongside CPU usage and clock speed. Common hardware would be recognized and their sensors properly labeled instead of "temp1" and "fan2."

Trading off safety and performance in the kernel

Posted May 14, 2015 18:48 UTC (Thu) by pizza (subscriber, #46) [Link] (3 responses)

> Common hardware would be recognized and their sensors properly labeled instead of "temp1" and "fan2."

Aside from on-die CPU sensors, there is no such thing as "common hardware" when it comes to sensors.

All Linux can do is provide the raw data and any exposed hooks to control the system. Which it already does. The rest is purely policy, and that's ultimately up to the user, with initial policy configured by the system builder/integrator.

As for displaying sensor data, I've had that data displayed on my desktop for fifteen years or so. And that's all that can be done without customizing the policy for each and every special snowflake of a system.

Trading off safety and performance in the kernel

Posted May 14, 2015 21:29 UTC (Thu) by zlynx (guest, #2285) [Link] (2 responses)

> Aside from on-die CPU sensors, there is no such thing as "common hardware" when it comes to sensors.

Somebody should -- I know that means that I should do it but no time and not enough interest -- should build a CDDB type system for Linux hardware so that for each machine type it is enough for ONE person to label everything in the system. Sensors, audio ports, etc.

Then on system setup the distro could look up all of that stuff.

Trading off safety and performance in the kernel

Posted May 14, 2015 23:34 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

good luck. How are you going to reliably identify what system you are running on? With some vendors even the exact model number isn't enough because they change the internals without changing the model number

Trading off safety and performance in the kernel

Posted May 15, 2015 4:33 UTC (Fri) by marcH (subscriber, #57642) [Link]

CDDB is not 100% reliable but it's useful and popular anyway.

At the start it was very incomplete yet it became popular very quickly. Same as many other crowd-sourced services.

Trading off safety and performance in the kernel

Posted May 14, 2015 18:30 UTC (Thu) by dlang (guest, #313) [Link] (14 responses)

race to idle assumes that it's cpu processing that determines when you can get to idle.

If you are saving the contents of RAM to disk, then you aren't going to finish any sooner at 5GHz clock than at 500MHz, the limiting factor is going to be your disk I/O performance. So if you can do this at 500MHz rather than 5GHz, you generate significantly less heat.

Also, even in a backpack, there is some heat dissipation going on, so you may not overheat if you are running slowly enough.

Trading off safety and performance in the kernel

Posted May 14, 2015 19:29 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (13 responses)

> So if you can do this at 500MHz rather than 5GHz, you generate significantly less heat.

Really? Doing it at 500MHz means

(a) that the CPU is going to spend more time in C0, and that's going to limit your ability to get into the deeper C states.
(b) that your memory bus is probably going to be clocked lower, which is going to have a pretty significant impact on the length of time the CPU is going to spend awake

I'm not saying that it's impossible, but it's certainly not obvious.

Trading off safety and performance in the kernel

Posted May 14, 2015 19:42 UTC (Thu) by dlang (guest, #313) [Link] (8 responses)

If you are writing memory to disk, clocking the memory slower isn't going to slow down how quickly the disk can write data.

As for the CPU spending more time in C0, do you really think that it is going into various sleep states while it is writing data to disk as fast as it can?

Trading off safety and performance in the kernel

Posted May 14, 2015 19:48 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (7 responses)

> If you are writing memory to disk, clocking the memory slower isn't going to slow down how quickly the disk can write data.

It's going to increase the amount of time the package has to stay awake to have the memory controller powered.

> do you really think that it is going into various sleep states while it is writing data to disk as fast as it can?

Your argument is that you're not CPU bound. If you're not CPU bound, the CPU is spending time idle. If the CPU is idle, it enters C states.

Trading off safety and performance in the kernel

Posted May 14, 2015 20:01 UTC (Thu) by dlang (guest, #313) [Link] (6 responses)

> It's going to increase the amount of time the package has to stay awake to have the memory controller powered.

the memory doesn't get powered off between accesses. It can only be powered off once the data has been written to disk. If the limiting factor is the time it takes to write it to disk, slowing down the memory clock is not going to require that it remain powered on longer.

> Your argument is that you're not CPU bound. If you're not CPU bound, the CPU is spending time idle. If the CPU is idle, it enters C states.

Ok, but how deep a C state is it going to be able to go into if it's in the middle of writing to disk as quickly as it can? and do the shallow C states really save that much power over a lower clock speed? race-to-idle really assumes that you can stop powering everything when you hit idle. If the vast majority of the system (and even CPU) is still having to run to be able to respond to interrupts and manage I/O, you aren't really idle yet.

Trading off safety and performance in the kernel

Posted May 14, 2015 20:06 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (5 responses)

> the memory doesn't get powered off between accesses

I didn't say it did. I said that the memory controller gets powered down between accesses, and the memory goes into self refresh.

> how deep a C state is it going to be able to go into if it's in the middle of writing to disk as quickly as it can?

That's going to depend on a bunch of factors, including I/O latency. There's no single answer.

> and do the shallow C states really save that much power over a lower clock speed?

Yes. Even the most shallow C state will unclock the core, and running at 0MHz is somewhat cheaper than running at 500MHz.

> race-to-idle really assumes that you can stop powering everything when you hit idle.

No it doesn't.

Trading off safety and performance in the kernel

Posted May 14, 2015 23:42 UTC (Thu) by dlang (guest, #313) [Link] (4 responses)

>> and do the shallow C states really save that much power over a lower clock speed?

> Yes. Even the most shallow C state will unclock the core, and running at 0MHz is somewhat cheaper than running at 500MHz.

remember that switching C states isn't free (in either energy or time), so it may not be a win if you don't stay there very long.

We obviously have very different expectations in how the hardware is going to behave at the different states. But keep in mind that I'm not saying that reducing the clock speed is always the right thing to do, I am just unconvinced that it's never the right thing to do the way that you seem to be.

Trading off safety and performance in the kernel

Posted May 14, 2015 23:56 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

Shallow C states are basically free on modern CPUs. Deeper ones will drop cache, but that's basically irrelevant in the case we're discussing.

Trading off safety and performance in the kernel

Posted May 15, 2015 4:46 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> But keep in mind that I'm not saying that [X] is always the right thing to do, I am just unconvinced that it's never the right thing to do the way that you seem to be.

I had to waste 5 minutes reading the entire thread again to make sure I did not dream and that the exact opposite happened.

Trading off safety and performance in the kernel

Posted May 15, 2015 19:55 UTC (Fri) by bronson (subscriber, #4806) [Link]

Me too. That was surreal.

Trading off safety and performance in the kernel

Posted May 15, 2015 4:51 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Oh, right. Yes.

> But keep in mind that I'm not saying that reducing the clock speed is always the right thing to do, I am just unconvinced that it's never the right thing to do the way that you seem to be.

From https://lwn.net/Articles/644541/ (written by you)

> If you are saving the contents of RAM to disk, then you aren't going to finish any sooner at 5GHz clock than at 500MHz, the limiting factor is going to be your disk I/O performance. So if you can do this at 500MHz rather than 5GHz, you generate significantly less heat.

From https://lwn.net/Articles/644549/ (written by me)

> I'm not saying that it's impossible, but it's certainly not obvious.


Trading off safety and performance in the kernel

Posted May 14, 2015 21:34 UTC (Thu) by zlynx (guest, #2285) [Link] (3 responses)

I have actual, although anecdotal, data that a laptop allowed to clock to 2.5 GHz will overheat in a bag while a laptop locked to the lowest speed, 800 MHz in my case, will not overheat. It can sit in that bag until the battery runs down, happily processing things.

So, while in theory race to idle might be the way to go in practice a laptop that is running user-space while waiting to sync to disk is going to burn itself up at 2.5 GHz.

Trading off safety and performance in the kernel

Posted May 14, 2015 21:42 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (2 responses)

If the system is in any kind of state where it has an effectively unbounded amount of work to perform then the situation changes pretty significantly. There are various cases where apps behave badly when they lose network connectivity and spin trying to reconnect, for instance.

Trading off safety and performance in the kernel

Posted May 14, 2015 23:47 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

That's not the issue. The example given is that a machine running at max speed for 10 min will overheat while one running for several hours at the low speed will not.

for this example, race to idle fails if it takes too long because the system will overheat, while running at a lower speed, even if it takes a lot more time and power, will succeed and not damage things.

race-to-idle requires a very specific combination of power/performance at the different states (full speed, partial speed, and idle). That combination has not always been the case and there's no reason to believe that it is going to continue to always be the case. Idle does not always mean that it requires zero power (even for the component that's idled, let alone for the entire system)

Trading off safety and performance in the kernel

Posted May 15, 2015 5:04 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

> The example given is that a machine running at max speed for 10 min will overheat while one running for several hours at the low speed will not.

Uh? I'm possibly missing something here, but I don't see any references to that example.

Trading off safety and performance in the kernel

Posted May 12, 2015 23:39 UTC (Tue) by dlang (guest, #313) [Link] (34 responses)

that still doesn't address the problem of the battery running out while it's in the backpack.

Far better to generate some extra heat for a little bit than to lose hours of data because it didn't get flushed out.

Trading off safety and performance in the kernel

Posted May 13, 2015 1:32 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (8 responses)

> that still doesn't address the problem of the battery running out while it's in the backpack.

...because it's *not* a problem.

Really, it's not. It's been years since I had a healthy laptop run out of battery. They last for hours at full load and days on suspend.

> Far better to generate some extra heat for a little bit than loosing hours of data because it didn't get flushed out.

No, it's not better.

If the sync takes longer than 20 seconds, the suspend fails completely and the laptop stays on (unless you've set up your ACPI scripts to forcibly kill the power at that point).

While the laptop is on, it's damaging its battery, reducing the charge it can hold *forever*. This also conveniently breaks the battery charge estimation function, so you get to be surprised when your battery abruptly shuts down at "40% charge" in the future.

There's no "hours of uncommitted data" either. There's one filesystem commit interval at most. If you're sane that's not more than 30 seconds or so. If you're not sane, you can configure laptop-mode-tools to run sync() from userspace.

Trading off safety and performance in the kernel

Posted May 13, 2015 2:01 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Really, it's not. It's been years since I had a healthy laptop run out of battery. They last for hours at full load and days on suspend.
Oh, what BS.

I've had resume problems with ALL the laptops I've owned. Including a MacBook Pro with OS X. Linux and Windows laptops tend to be even crashier.

Trading off safety and performance in the kernel

Posted May 13, 2015 3:02 UTC (Wed) by pizza (subscriber, #46) [Link] (3 responses)

> Really, it's not. It's been years since I had a healthy laptop run out of battery. They last for hours at full load and days on suspend.

Until you leave it in your backpack for *four* days instead of three, thanks to a long weekend.

Or the battery gets jostled. Or the battery isn't so healthy any more (do you get a nice email when it crosses that threshold?) Or you suspend it when the battery is relatively low. Or it doesn't resume properly because something plugged in was unplugged. Or one of many, many, many failure modes.

Or when the "low battery" threshold wakes the system up and the thing dies hard when trying to write the buffers back out.

Every single one of these situations has happened to me. (Funnily enough, in my experience, Linux suspend/resume is actually *more* reliable than Windows on the last couple of Tier-1 laptops I've owned.)

I absolutely *want* any dirty buffers to be flushed to disk and the filesystems synced into a safe state before a system suspends. More than want; that is a hard requirement. Data loss is never acceptable, but when it's so damn easily preventable there is simply no excuse.

Trading off safety and performance in the kernel

Posted May 13, 2015 6:14 UTC (Wed) by tpo (subscriber, #25713) [Link] (2 responses)

My experience is similar to pizza's.

The most frequent case of "data loss" here is when I'm working with the laptop unplugged for some reason, notice that the battery is nearly exhausted, and close the lid.

At some point my laptop will wake up by itself and try to suspend to disk, which has never worked here, and it will then run out of power hanging in that state.

I had managed to disable this behavior somehow in the past but some well meaning part of the system switched that on again and I am currently unwilling to spend my time to find out how to disable it again.

And the loss for me usually isn't the "data" but the state and context of the desktop: which shells did I have open, and with what content? Which files was I editing? Which applications were running? This can be more than annoying when I had my laptop prepared with all the needed material open and in the right place for a presentation, only to find out when opening the laptop in front of people that it's dead.

Of course, syncing file buffers out to disk or not will not change anything with respect to the problem described in the last paragraph.

And it doesn't help that XFce's power indicator is too badly designed for me to be able to notice that the battery is running too low.

Trading off safety and performance in the kernel

Posted May 13, 2015 13:40 UTC (Wed) by jospoortvliet (guest, #33164) [Link]

at least syncing buffers to disks means you don't have to lose any data you wrote or worked on... ideally. For me, that's a huge deal and I certainly wouldn't like to lose stuff.

Trading off safety and performance in the kernel

Posted May 13, 2015 16:03 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

> I had managed to disable this behavior somehow in the past but some well meaning part of the system switched that on again and I am currently unwilling to spend my time to find out how to disable it again.

I fired my distro's ACPI event-handling code. Several years ago it started being not merely useless, but an active source of failure. After several rounds of patches that consisted only of deletions, I gave up and replaced the entire thing with:

#!/bin/sh
# Suspend to RAM directly, bypassing the distro's ACPI machinery.
echo mem > /sys/power/state

I have "find out where the acpi-support package is hiding today and kill it" on my to-do list for every dist-upgrade because the machine can be physically damaged if I don't.

Trading off safety and performance in the kernel

Posted May 13, 2015 19:56 UTC (Wed) by kleptog (subscriber, #1183) [Link]

> If the sync takes longer than 20 seconds, the suspend fails completely and the laptop stays on (unless you've set up your ACPI scripts to forcibly kill the power at that point).

Aah, so that's what happens. So what I need is a script that does: if suspend fails and laptop lid is closed, start playing an alarm at maximum volume. And cuts power if no response within a few minutes.

Much better than hearing a loud whirring noise a few hours later and then pulling an overcooked laptop out of your bag.

> There's no "hours of uncommitted data" either. There's one filesystem commit interval at most.

Well, not everything is saved on disk. If you have a document open that isn't saved, then sync won't help anyway. It'd be great if there were a way to announce to running programs that the system is being suspended so they can dump their state, but that doesn't exist or isn't widely supported. Currently suspend (for me) is primarily a way to avoid the startup time. It's not dependable enough to rely on.

Mind you, I just found basic-pm-debugging.txt in the kernel documentation, which describes steps that can be used to debug these issues. My current problem is that ext4 tries to read a directory inode on resume while the disk is not yet ready, and it remounts the rootfs read-only. The machine is then essentially unrecoverable (neither su nor sudo work with a read-only fs).

Trading off safety and performance in the kernel

Posted Jun 12, 2015 17:32 UTC (Fri) by bluefoxicy (guest, #25366) [Link] (1 responses)

> If the sync takes longer than 20 seconds, the suspend fails completely and the laptop stays on

SATA 3 6Gb/s: 15 gigabytes of recently-written dirty data to write

SATA 1 1.5Gb/s: 3.75 gigabytes

ATA100 100MB/s: 2 gigabytes

I'm pretty sure this is a non-issue for any hardware made since 2003. /proc/meminfo shows 624 kB of dirty pages on a big-ass database server, 2712 kB on a busy Web server in a cluster. It's rare to have several gigabytes of unflushed disk data just hanging around in memory; I've never seen more than a few megabytes.

Trading off safety and performance in the kernel

Posted Jun 12, 2015 17:50 UTC (Fri) by raven667 (subscriber, #5198) [Link]

That's just the link speed between the storage device and the main system and has little bearing on how fast you can actually write data to disk, especially if the disk is spinning rust. If the disk has to seek between writes then you aren't going to see more than around 100 writes per second which can be just a handful of megabytes, regardless of what the link speed is.

Your examples include a read-heavy, write-light web server and a DB server that is probably explicitly flushing every IO to disk, so there is little data to write back in either case; neither is representative of how a laptop is used. It's easy to create a bunch of buffered writes by copying a DVD image, compiling software, or copying memory to disk for suspend, and on a laptop you may delay writes longer than normal to keep the disk subsystem in a low-power state for as long as possible, leading to a storm of activity.

Trading off safety and performance in the kernel

Posted May 13, 2015 2:40 UTC (Wed) by neilbrown (subscriber, #359) [Link] (24 responses)

> Far better to generate some extra heat for a little bit than to lose hours of data because it didn't get flushed out.

What "hours of data" are you talking about?

Dirty pages get flushed after about 30 seconds, so if you don't 'sync' before suspend, then the most you could lose if resume fails is data that was written by an app in the 30 seconds before suspend.

I really don't think that is any significant data. Any app that cares about its data will 'fsync' at an appropriate time. Data that isn't fsynced doesn't really matter, e.g. logs.

Is there really *any* important data that becomes at-risk because of this change?

I think the greatest risk of data loss when resume fails is data in some application that hasn't been written to the filesystem yet, like the file you are in the middle of editing. That data isn't helped by sys_sync at all. It requires pre-suspend notifications to apps so they can auto-save.

Trading off safety and performance in the kernel

Posted May 13, 2015 6:59 UTC (Wed) by marcH (subscriber, #57642) [Link] (15 responses)

> so if you don't 'sync' before suspend, then the most you could lose if resume fails is data that was written by an app in the 30 seconds before suspend.

How long can it take to sync that little?

Trading off safety and performance in the kernel

Posted May 13, 2015 7:21 UTC (Wed) by neilbrown (subscriber, #359) [Link] (10 responses)

> How long can it take to sync that little?

My "Open Phoenux" phone (www.gta04.org) runs a fairly ordinary Debian distro and sometimes has lots of kernel logging enabled.

When the logging is enabled there is a very obvious lag on the way to suspend, such that trying to wake the phone again takes an annoyingly long time (though probably less than 2 seconds).

I realise that might not be a common circumstance, but I'm also sure I'm not the only one who has a flash of insight the moment I suspend the phone (or close the laptop lid, or submit the comment, or seal the envelope).

But the point isn't really how long it takes. The point is that the 'sync' call really doesn't belong there. The benefit it provides is much more superstitious than scientific. If wanted, a sync-before-suspend is trivially performed in user-space, and if not wanted it currently requires a code edit to disable.

Trading off safety and performance in the kernel

Posted May 14, 2015 15:53 UTC (Thu) by marcH (subscriber, #57642) [Link] (5 responses)

> But the point isn't really how long it takes.

Yes it is, otherwise this entire discussion would not even exist.

No matter how you look at it, if syncing takes more than a few seconds on a system that regularly syncs every 30 seconds anyway, then there is something seriously weird or at the least very unusual about it.

> The benefit it provides is much more superstitious than scientific.

Except the suspend experience sometimes feels as safe as crossing the Atlantic in the 19th century. There are genuine reasons why so many people carry their Windows or Linux laptop open around the office despite safety rules (not Macs interestingly enough).

Power management has always been complex and does not look like it's getting simpler any time soon.

> If wanted, a sync-before-suspend is trivially performed in user-space,

Why not, if this solves a problem for 1% of users or 1% of the time?

Why bother with the change and disruption if it's only for 0.001%?

Trading off safety and performance in the kernel

Posted May 14, 2015 22:27 UTC (Thu) by neilbrown (subscriber, #359) [Link] (1 responses)

> Yes it is, otherwise this entire discussion would not even exist.

I'll be more precise. The point isn't the amount of time it takes, it is the fact that it takes any time at all.

You seem to be saying that it can't take enough time to bother you, and I suspect you are correct. Len Brown, by submitting the patch, is saying that it *does* take enough time to bother him. Are you saying he is wrong?

> Power management has always been complex and does not look like it's getting simpler any time soon.

Undoubtedly true. This has no effect on whether placing a 'sys_sync' at that point in the code actually provides any benefit. At all.

> Why bother the change and disruption if it's only for 0.001%.

Laying aside for the moment that 1% of 1% is 0.01%, these numbers are meaningless.

The vast majority of users get sync called (at least) twice on suspend - once by some user-space tooling and once by the kernel. Most (possibly all) distros already do this.
So removing the sys_sync call from suspend in the kernel is only going to affect a few of the users you are probably thinking of as the 99.99%.

But one of the drivers for this change is, apparently, android. I think the user-base there is a little more than 0.001%.

Trading off safety and performance in the kernel

Posted May 15, 2015 5:19 UTC (Fri) by marcH (subscriber, #57642) [Link]

> You seem to be saying that it can't take enough time to bother you, and I suspect you are correct. Len Brown, by submitting the patch, is saying that it *does* take enough time to bother him.

Whatever two particular individuals experience does not matter much; can we please go back to statistics?

> Are you saying he is wrong?

I really wonder where that comes from... are you related, maybe? :-)

> Undoubtedly true. This has no effect on whether placing a 'sys_sync' at that point in the code actually provides any benefit. At all.

The connection is: reliability is inversely related to complexity and users simply want their data saved before their computer susp...crashes. See various horror stories in the other comments.

> Laying aside for the moment that 1% of 1% is 0.01%,

Agreed! Let's also lay aside that 0.1% of 1% is 0.001%. And a few others?!

> these numbers are meaningless.

They're semi-random examples, but not completely meaningless. The very simple point I was trying to make (and hoping not to have to spell out) is just this: the kernel is never going to please every single use case for every single user. Proof: practically no hardware device ships with a totally unpatched mainline kernel. Mainline only has code that has a significant number of actual users. So I think we all agree it's all about how [un]common this or that use case is. Statistics and trade-offs.

> So removing the the sys_sync call from suspend in the kernel is already only going to affect few of the users that your are probably thinking of as the 99.99%.
> But one of the drivers for this change is, apparently, android.

Thanks for the info and also the reminder; I got distracted by the laptop stories filling almost the entire comments space.

Trading off safety and performance in the kernel

Posted May 15, 2015 18:23 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (1 responses)

> There are genuine reasons why so many people carry their Windows or Linux laptop open around the office despite safety rules (not Macs interestingly enough).

In my case it's company policy that VPN sessions don't survive a suspend. When you open the laptop back up, even after 30 seconds, the VPN client reminds you of the policy and prompts you to start over. You need to find your RSA dongle, go through the authentication again, reconnect, you'll get a new IP address and so of course all existing connections are dropped.

My company is a complete disaster for IT policy, but then, so are thousands of other large employers around the world. So this is a real scenario, even though it's an unnecessary and obnoxious one.

Trading off safety and performance in the kernel

Posted May 15, 2015 20:35 UTC (Fri) by marcH (subscriber, #57642) [Link]

> In my case it's company policy that VPN sessions don't survive a suspend.

Well I feel sorry for you but that's not our case; suspend issues are why they do it here.

Trading off safety and performance in the kernel

Posted May 18, 2015 19:32 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> There are genuine reasons why so many people carry their Windows or Linux laptop open around the office despite safety rules (not Macs interestingly enough).

Am I the only person who disables suspend-on-lid-close? For all OS variants. Never did like that behavior…

Trading off safety and performance in the kernel

Posted May 16, 2015 15:52 UTC (Sat) by ghane (guest, #1805) [Link] (3 responses)

neilbrown wrote:
> But the point isn't really how long it takes. The point is that the 'sync' call really doesn't belong there. The benefit it provides is much more superstitious than scientific.

Back on the late 80s, I was taught to type:
sync ; sync ; sync

before a shutdown or reboot. I assume this was so that the SVR5 kernel would know I really wanted to sync.

Trading off safety and performance in the kernel

Posted May 16, 2015 20:14 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)

From what I understand, the issue was that a single sync could return to the command line before the data was actually on disk, but a second one couldn't start until the first finished.

that still doesn't justify three invocations, but would justify two.

Trading off safety and performance in the kernel

Posted May 16, 2015 20:42 UTC (Sat) by neilbrown (subscriber, #359) [Link] (1 responses)

Yes, the "sync" semantics are "wait for any pending writeout to complete, then start writeout on any dirty data", so 2 is sensible and 3 is superstitious.

http://pubs.opengroup.org/onlinepubs/7908799/xsh/sync.html

This is part of why calling sys_sync() once in the suspend path is wrong (twice has been suggested), though I'm not certain if the Linux implementation exactly matches the specification.

Trading off safety and performance in the kernel

Posted May 17, 2015 4:21 UTC (Sun) by neilbrown (subscriber, #359) [Link]

> Yes, the "sync" semantics are "wait for any pending writeout to complete, then start writeout on any dirty data",

Actually, I'll have to backtrack on this. I can find no evidence in historical Unix, all the way up to 4.3BSD, to suggest that the 'sync' system call would wait. It just initiated IO. So maybe calling it 3 times makes sense.

Linux (roughly) followed that approach until Linux 1.3.20. That version introduced the change:

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -228,7 +228,7 @@ int fsync_dev(dev_t dev)
 
 asmlinkage int sys_sync(void)
 {
-       sync_dev(0);
+       fsync_dev(0);
        return 0;
 }
 
To give the fuller context:
void sync_dev(dev_t dev)
{
        sync_buffers(dev, 0);
        sync_supers(dev);
        sync_inodes(dev);
        sync_buffers(dev, 0);
}

int fsync_dev(dev_t dev)
{
        sync_buffers(dev, 0);
        sync_supers(dev);
        sync_inodes(dev);
        return sync_buffers(dev, 1);
}

asmlinkage int sys_sync(void)
{
        fsync_dev(0);
        return 0;
}
The second arg to sync_buffers() says whether it should 'wait'. So fsync_dev() waits, sync_dev() doesn't.

Exactly what is waited for, and when, is hard to track; I'd need to read a lot more code to see what sys_sync waits for today. It does wait for something, but before 1.3.20, and all through "Unix", it certainly didn't wait for everything.

Trading off safety and performance in the kernel

Posted May 13, 2015 15:35 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (3 responses)

> How long can it take to sync that little?

A class 4 SD/MMC card inside a laptop that has 16GB of RAM: one hour, four minutes.

A network filesystem that is broken because the network interface is down due to the suspend event: forever.

The latter case is the reason why I've not had that SYS_sync() line in my kernels for almost 10 years.

Trading off safety and performance in the kernel

Posted May 13, 2015 16:53 UTC (Wed) by pizza (subscriber, #46) [Link] (2 responses)

> A class 4 SD/MMC card inside a laptop that has 16GB of RAM: one hour, four minutes.

Come now, this use case is rather facetious.

Or are you seriously saying that someone who will configure a system with 16GB of RAM (presumably for performance) will be so cheap as to use a class 4 SD card for backing storage? (a 16GB UHS-1 card can be easily found for $11!)

> A network filesystem that is broken because the network interface is down due to the suspend event: forever.

Heh. Given that I haven't had this happen in literally over a decade, this makes me wonder what broken-ass distro you're using that included such screwed-up suspend scripts. (And for the record, I do this multiple times a day with both CIFS and NFS mounts.)

Then again, you did say that you replaced your distro's suspend/resume scripts with your own stuff, so I suppose you only have yourself to blame.

> The latter case is the reason why I've not had that SYS_sync() line in my kernels for almost 10 years.

Good for you; you worked around your own mistakes...Do you want a cookie or something?

Trading off safety and performance in the kernel

Posted May 13, 2015 19:29 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (1 responses)

> Or are you seriously saying that someone who will configure a system with 16GB of RAM (presumably for performance) will be so cheap as to use a class 4 SD card for backing storage?

I'm saying that someone on a client site will use whatever horrible SD card they are given to transfer a crapton of data, then unexpectedly have to complete a suspend/resume cycle because they need to pack up the laptop so they can move to another building for a meeting with someone important who is only available now--not in an hour, when it's finally possible to remove the SD card without data loss.

This is a common case for me. I've modified both kernel and userspace to mitigate the problem, but anyone using default Linux distro and kernel behavior is screwed.

> Heh. Given that I literally haven't had this happen in literally over a decade, this makes me wonder what broken-ass distro you're using that included such screwed up suspend scripts. (And for the record, I do this multiple times a day with both CIFS and NFS mounts)

Debian gave a choice of two bad behaviors: fail to suspend, or forcibly umount filesystems to avoid suspend blocking on sync() (or read, or any other filesystem operation for that matter).

What I want is for the filesystem to stay mounted. If the suspend/resume is for a short walk to another building, there is no need to disrupt userspace with a umount. Any reads or writes in progress can be completed after resume using the network filesystem client's existing code for dealing with ordinary network interruptions. There should be no blocking code paths on suspend for such filesystems *at all*.

Trading off safety and performance in the kernel

Posted May 14, 2015 16:06 UTC (Thu) by marcH (subscriber, #57642) [Link]

> If the suspend/resume is for a short walk to another building,

Is there a user interface to express the difference between short walks versus long commute home + change of network?

Among the many things NFS was really not designed for, mobility must be very near the top of the list.

All traditional network filesystems suck and are on their way to the dustbin of history since they ignored Fallacy of Distributed Computing number 1 (and a few others). NFS just sucks more.

Trading off safety and performance in the kernel

Posted May 13, 2015 19:17 UTC (Wed) by dlang (guest, #313) [Link] (7 responses)

> What "hours of data" are you talking about?
>
> Dirty pages get flushed after about 30 seconds, so if you don't 'sync' before suspend, then the most you could lose if resume fails is data that was written by an app in the 30 seconds before suspend.

so you save the document you've worked on for hours, and close the lid.

the data was only written by the app a few seconds before suspending, but it represents hours of data for the user.

Trading off safety and performance in the kernel

Posted May 13, 2015 22:30 UTC (Wed) by neilbrown (subscriber, #359) [Link] (6 responses)

> the data was only written by the app a few seconds before suspending, but it represents hours of data for the user.

Any app worthy of the name will have called 'fsync' before giving you a visual indication that the save has completed. emacs certainly does.

If you close the lid before getting that visual notification, then you only have yourself to blame. In that case a sys_sync in the suspend path may not help anyway as the app may not have finished writing.

And of course, any real app would have auto-saved every few minutes so even in a disaster you wouldn't lose more than a few minutes work.

There is definitely a place to call fsync (rarely sys_sync) to make sure data is safe. The suspend path is not that place.

Trading off safety and performance in the kernel

Posted May 14, 2015 17:13 UTC (Thu) by marcH (subscriber, #57642) [Link] (5 responses)

> Any app worthy of the name will have called 'fsync' before giving you a visual indication that the save has completed. emacs certainly does. [...]
> And of course, any real app would have auto-saved every few minutes so even in a disaster you wouldn't lose more than a few minutes work.

"Don't break userspace" - even userspace bugs.

Trading off safety and performance in the kernel

Posted May 14, 2015 22:31 UTC (Thu) by neilbrown (subscriber, #359) [Link] (4 responses)

> "Don't break userspace" - even userspace bugs.

If userspace needs the kernel to call sync before crashing then the user-space is already broken. Systems can crash without entering suspend first.

But that is a big "if". Are there actually any non-trivial apps which don't save their data properly?

Trading off safety and performance in the kernel

Posted May 21, 2015 21:58 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Are there actually any non-trivial apps which don't save their data properly?

Okay, it's not linux, but ... MS Word ?

(maybe it's changed, but I OFTEN lose data if I'm working on a document and it crashes - often it's the attempted auto-save that causes the crash :-( )

Cheers,
Wol

Trading off safety and performance in the kernel

Posted May 23, 2015 16:20 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

> If userspace needs the kernel to call sync before crashing then the user-space is already broken.

No, the file system is broken.

> Are there actually any non-trivial apps which don't save their data properly?

No, there are just file systems (e.g., ext4) which do not provide decent guarantees and use this kind of rhetoric to justify their poor behaviour. I expect that pretty much all non-trivial applications do not jump through all the hoops that some file-system developers expect of them; that's because they have no good way to test that they meet those developers' expectations, and most application developers probably have many more urgent things to care about.

Anyway, one example of a broken file system losing data of a popular application (including the autosave files that the application produces regularly) is here.

Trading off safety and performance in the kernel

Posted May 25, 2015 6:52 UTC (Mon) by neilbrown (subscriber, #359) [Link] (1 responses)

> No, the file system is broken.

That makes no sense.

I agree that "If a filesystem needs the kernel to call sync before crashing then the filesystem in already broken" with the understanding that "needs" means "needs in order to protect the data that it is responsible for."
Data that has not yet been written to the filesystem is certainly not the filesystem's responsibility.
Data that has been written but hasn't been the subject of 'fsync' is also not completely the filesystem's responsibility (unless you mount with '-o sync').

> one example of a broken file system losing data of a popular application

That is a filesystem from decades ago. Yes, it was broken, no question. Linux filesystems aren't like that: all non-trivial Linux filesystems journal their metadata, which is much safer than synchronous metadata updates. I cannot promise they are all 100% bug-free in every release, but I am certain that calling 'sync' in the suspend path isn't going to usefully fix any bug they might have.

It also sounds like that "popular application", which was emacs, wasn't calling 'fsync' as it should and as it certainly now does.
Yes - bugs should be fixed. But let's not scatter "sys_sync" calls around and pretend that fixes them.

Trading off safety and performance in the kernel

Posted May 25, 2015 9:00 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> That is a filesystems from decades ago. Yes it was broken, no question. Linux filesystems aren't like that. All non-trivial Linux filesystems do journalling of metadata, which is much safer than synchronous metadata updates. I cannot promise they are all 100% bug free in every release, but I am certain that calling 'sync' in the suspend path isn't going to usefully fix any bug that they might have.
As if on cue, today my laptop corrupted my filesystem during suspend/resume. I started synchronization of several large (~200G) directories with lots of small files from our local network and then totally forgot about it. Then I closed the laptop's lid and went home.

Resume failed (yet again) and after reboot my BTRFS filesystem refused to mount.

Trading off safety and performance in the kernel

Posted May 13, 2015 4:00 UTC (Wed) by imunsie (guest, #68550) [Link] (6 responses)

> The probability of a suspend/resume cycle failure is now on par with--or even lower than--the probability of catastrophic hard disk failure, battery failure, or ordinary crashing kernel bugs (i.e. those crashes that occur at times other than suspend or resume).

Wow, at the rate at which my laptop fails to suspend and/or resume, you must be going through at least one hard disk *AND* battery every single week!

Trading off safety and performance in the kernel

Posted May 13, 2015 19:30 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (5 responses)

> Wow, at the rate at which my laptop fails to suspend and/or resume you must be going through at least one hard disk *AND* battery every single week!

We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures. I recommend you look into what your distro is doing wrong, or try another laptop--*any* other laptop.

The last 7 laptops in my care have had one resume failure in 8 years. That's close to 4300 successful suspend-to-RAM+resume cycles in the field. Contrast with ~170 crashes at other times on those same laptops, most of which occurred while the laptop was sitting on an office desk in front of a user with stable AC power. Also in that reporting interval there were 3 hard disk failures requiring drive replacement and over a dozen UNC sector data loss events (i.e. incidents when restore from backups was required because the disk had spontaneously lost some previously stable data, but remained otherwise healthy). 3 batteries had to be replaced as well, contributing several at-full-power crashes each to the total when the power failed without warning.

By my count, on a given day when data is lost, it is more than twelve times more likely that the cause will be something that is not suspend/resume failure. The probability of suspend/resume failure is so tiny, and its impact relative to other common failure modes so small, that it's not worth the time and risk to prepare for it. The cure is worse than the disease.

Speaking of disease: six of those hundreds of crashes were due to consequences of failing to enter the suspend state (i.e. remaining on full power until the battery was exhausted or shutdown was forced by the thermal protection circuitry). Another dozen or more potential crashes were avoided because the user noticed the suspend failure and corrected the proximate cause manually (e.g. by unmounting network filesystems or disconnecting slow removable media devices). The negative impact of sync-before-suspend could have been significantly worse without those modifications and user attention, and there would be many more such incidents had I not disabled the problematic kernel behaviors early on. The relatively high frequency of this mode of failure was the trigger that drove me to modify code to fix the problem--and also to set up years of data logging to prove it had made things quantitatively better.

Trading off safety and performance in the kernel

Posted May 13, 2015 20:20 UTC (Wed) by scottwood (guest, #74349) [Link] (1 responses)

"We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures" sounds a bit odd juxtaposed with, "I fired my distro's ACPI event-handling code. Several years ago it started being not merely useless, but an active source of failure."

Keep in mind that most users have no idea how to "fire their distro's ACPI event-handling code". FWIW my laptop (which was sold specifically as being meant for running Linux!) is pretty terrible at suspend/resume, often exhibiting similar behavior to what tpo described -- short term suspend is usually OK, but leave it for hours and I'll be presented with a cold boot.

As for the sync issue, I think part of the problem is that sync is too blunt an instrument. A reasonable compromise might be to flush whatever can be done in 2-3 seconds, focusing on devices that look like normal local high-speed storage. Plus, a write delay of 30 seconds seems pretty high. Maybe it makes sense on servers with UPSes and software that is careful about when data actually hits the disk, but on a laptop running apps of varying quality? Where is the point of diminishing returns on delaying writes, especially with SSDs?

Trading off safety and performance in the kernel

Posted May 13, 2015 22:36 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

> "We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures" sounds a bit odd juxtaposed with, "I fired my distro's ACPI event-handling code. Several years ago it started being not merely useless, but an active source of failure."

Several years ago...well, about 8 years ago, come to think of it, right before I started collecting reliability data. ;)

I also don't count the first few hundred suspend/resume cycles on a new laptop on the bench, since those are used to test and debug the acpi-support scripts before the laptop does any important work. On the other hand, the result of that testing is usually the end of the acpi-support scripts. On the last three laptops I've just skipped the testing phase and not installed the acpi-support scripts in the first place.

> Keep in mind that most users have no idea how to "fire their distro's ACPI event-handling code".

I do keep that in mind, but I have no better practical advice to offer. ACPI lid-event-handling userspace code in most distros is a byzantine nightmare of overlapping and mutually exclusive workarounds that are no longer necessary because the kernel (and X.org) has long been capable of easily and reliably handling the suspend process itself. There's nothing left to do but hold down the delete key until all the broken code goes away.

Trading off safety and performance in the kernel

Posted May 13, 2015 23:37 UTC (Wed) by imunsie (guest, #68550) [Link] (1 responses)

I think you probably missed my (admittedly subtle) point - while suspend/resume may work reliably for you, it is still a complete mess for other people, so removing the sync seems like a terrible idea (a tunable would be OK, as long as it still does a sync by default).

As a side note, suspend/resume was actually working very reliably on my laptop until about three or four months ago when *something* (Kernel? Debian? systemd?) updated and completely broke it after the laptop had been in the dock (I really CBF tracking down yet another regression because I have better things to do with my time). But that's beside the point - if it doesn't work for me, it stands to reason that it doesn't work for a lot of other people, so it's not reliable.

Trading off safety and performance in the kernel

Posted May 14, 2015 0:07 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

The best outcome is going to be a tunable. Good or bad, changing the default sync behavior will take years to be fully accepted, and even after all the major userspaces catch up, there will probably always be a few people who are stuck with buggy legacy userspace and firmware at the same time.

A tunable lets individual users choose when they make the transition. Look at how long it took for atime updates to stop being the default to get an idea of how long such a change can take.

There are always kernel regressions and crashes that lose some uncommitted data. We don't run filesystems in sync mode all the time because the performance (and wear and tear on disks, rotating or otherwise) is a price too high for the negligible benefit of less data lost on a crash. At some point that sync on suspend *must* go away.

Trading off safety and performance in the kernel

Posted May 16, 2015 11:54 UTC (Sat) by faramir (subscriber, #2327) [Link]

My Dell Latitude D530 fails to resume about 1/3 of the time. I suspect a software bug, as I think it worked back in the days of Ubuntu 10.04. I keep hoping some random software update will fix the problem.

Trading off safety and performance in the kernel

Posted May 13, 2015 21:31 UTC (Wed) by javispedro (guest, #83660) [Link] (6 responses)

I would say that yes, the 1990's were almost two decades ago, and so is the entirely artificial construction of having to "suspend". In fact, almost all of the devices sold these days (and a huge amount of all 2015 laptops) do not even have the ability to enter S3 state or anything that would remotely resemble that.

Thus, many current users of suspend are actually using it for its "freeze user space, make some drivers do a bunch of other unrelated stuff" aspect. And it is because you're (ab)using suspend like this that you find yourself making a fuss about a few seconds of delay during suspend.

If you want to freeze user space, use the proper kernel functionality for that. If cgroups is too complicated, fix it. If your drivers do weird stuff during suspend, fix the drivers so that they also work with dynpm.

Suspend these days is 99% policy, 1% hw/firmware stuff. Therefore: move as many decisions as possible to userspace. At some point, it will be clear the entire concept is just a relic required to support a glitchy hardware feature that happened to be very useful during the 90s and 00s.

Trading off safety and performance in the kernel

Posted May 24, 2015 6:50 UTC (Sun) by simoncion (guest, #41674) [Link] (5 responses)

> In fact, almost all of the devices sold these days (and a huge amount of all 2015 laptops) do not even have the ability to enter S3 state or anything that would remotely resemble that.

Citation *seriously* needed. S3 is Suspend To Ram. Page 12 of this 2014 Intel datasheet [0] indicates that a large number of mobile i3, i5, and i7 processors support S0, S3, S4, and S5 mode. (Check page 10 for the list of processors.)

[0] http://www.intel.com/content/dam/www/public/us/en/documen...

Trading off safety and performance in the kernel

Posted May 30, 2015 14:43 UTC (Sat) by javispedro (guest, #83660) [Link] (4 responses)

This very website: http://lwn.net/Articles/580451/

Trading off safety and performance in the kernel

Posted May 30, 2015 16:08 UTC (Sat) by cesarb (subscriber, #6266) [Link] (3 responses)

> This very website: http://lwn.net/Articles/580451/

From that link: "They expect this mode to be used, as can be seen by the fact that machines that do not support the ACPI "S3" sleep state (a.k.a. suspend) at all will start shipping soon."

IMHO "machines [...] will start shipping soon" != "almost all of the devices sold these days (and a huge amount of all 2015 laptops)". So yeah, citation still needed.

Trading off safety and performance in the kernel

Posted May 30, 2015 16:34 UTC (Sat) by javispedro (guest, #83660) [Link] (2 responses)

Let's integrate by parts: "almost all of the devices sold these days" do not support S3, since currently the only non-negligible architecture where S3 is supported is x86, and x86 is clearly a minority of "all devices sold these days", which includes ARM smartphones and tablets, microcontrollers, and other embedded devices.

The other part, "a huge amount of all 2015 laptops" do not support S3, is what should be apparent from the link I sent. The Microsoft logo requirements for Windows 8 laptops/tablets/hybrids recommend Connected Standby support. When Connected Standby support is present, the firmware _must disable_ S3 support; this is a Microsoft logo _requirement_ ( e.g. http://www.anandtech.com/show/8038/windows-81-x64-connect... ) and therefore every system out there which supports Connected Standby and Windows 8 _does not support S3_.

In 2014 the first systems with that started to ship (hybrids -- including the Surface Pro, the Dell Venue Pros, last Sony Vaios, the new Helix, etc.). _All_ hybrids and tablets from 2014 already shipped with Connected Standby, but only a few laptops. If you know of any hybrid/tablet that shipped in late 2014 without CS/AOAC please let me know.

Now it's 2015, and, even within laptops, it's hard to find a system not shipping with connected standby. The Dell XPS, the new Yoga and LaVie ThinkPads, the Asus Zenbooks all ship with Connected Standby.

However, it is a fact that there are some laptops left without Connected Standby, which (supposedly) still support S3. Which is why I said "a huge amount of all 2015 laptops", and not "all of 2015 laptops". Two examples: the ThinkPad X250 and the HP Spectre x360 do not support CS, to the best of my knowledge.

Hope this explains it.

Trading off safety and performance in the kernel

Posted Sep 11, 2015 6:18 UTC (Fri) by mcortese (guest, #52099) [Link] (1 responses)

> Now it's 2015, and, even within laptops, it's hard to find a system not shipping with connected standby

Funny. I see that most laptops still come with spinning disk: they shouldn't support Connected Standby as SSD is listed as a requirement. What am I missing?

Trading off safety and performance in the kernel

Posted Oct 20, 2015 22:35 UTC (Tue) by javispedro (guest, #83660) [Link]

> I see that most laptops still come with spinning disk

At least in the country where I live, only 2 of the 20 best-selling laptops on Amazon contain a spinning disk. Those 2, peculiarly enough, ship with Windows 7.

In fact, there are more laptops sold with eMMCs than spinning drives. I always learn something...

The missing option

Posted May 13, 2015 2:18 UTC (Wed) by bojan (subscriber, #14302) [Link]

http://en.wikipedia.org/wiki/Abort,_Retry,_Fail%3F

Now, that would be a real improvement if we really want to lose data. ;-)

No battery [was: Trading off safety and performance in the kernel]

Posted May 13, 2015 14:50 UTC (Wed) by cesarb (subscriber, #6266) [Link] (5 responses)

> It has also been pointed out that, sometimes, a suspended system will never resume due to running out of battery or the kernel being upgraded.

Not all systems being suspended have a battery. For instance, at home we have a desktop which is often suspended while not in use. Whenever the power fails while it's suspended, it has to cold boot instead of resuming.

No battery [was: Trading off safety and performance in the kernel]

Posted May 13, 2015 17:33 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (4 responses)

Wasn't there a wishlist, linked a while ago from GNOME or somewhere, that had "hybrid suspend" (to disk and to RAM, so that if power goes out it comes back from hibernation) on it? Does anyone know its status?

No battery [was: Trading off safety and performance in the kernel]

Posted May 13, 2015 21:17 UTC (Wed) by bojan (subscriber, #14302) [Link] (3 responses)

Yeah, suspend to both has been available in the kernel for a while now. You just put suspend into /sys/power/disk before you hibernate.
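For the record, a minimal sketch of that sequence (it assumes root privileges and a kernel built with hibernation support; the sysfs paths are the standard kernel power-management interface):

```shell
# Select the hybrid "suspend to both" mode: the hibernation image is
# written to disk, then the machine suspends to RAM instead of powering off.
echo suspend > /sys/power/disk

# Trigger hibernation; on success the system resumes quickly from RAM,
# but can still come back from the on-disk image if the battery ran out.
echo disk > /sys/power/state
```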

No battery [was: Trading off safety and performance in the kernel]

Posted May 15, 2015 0:07 UTC (Fri) by bnorris (subscriber, #92090) [Link] (2 responses)

Thanks for the tip. BTW, the "suspend" option is not documented in Documentation/ABI/testing/sysfs-power.

No battery [was: Trading off safety and performance in the kernel]

Posted May 15, 2015 0:48 UTC (Fri) by bojan (subscriber, #14302) [Link] (1 responses)

No battery [was: Trading off safety and performance in the kernel]

Posted May 15, 2015 6:12 UTC (Fri) by bnorris (subscriber, #92090) [Link]

I saw that, but it should also be at least mentioned in Documentation/ABI/testing/sysfs-power alongside all the other /sys/power/disk options. Maybe I'll send a patch, if you don't.

Trading off safety and performance in the kernel

Posted May 13, 2015 17:28 UTC (Wed) by riteshsarraf (subscriber, #11138) [Link] (1 responses)

It is interesting to see various views from users here.

All of the use cases mentioned have a legitimate need. What the kernel needs to do (and mostly does) is remain flexible.

The point about removing the sys_sync() call is fair, given the performance benefits it may provide.

How to guard the safety of the data should be delegated to the user, and default behaviors should be the responsibility of userspace.

I'm surprised nobody has talked about uswsusp. I use laptop-mode-tools with a commit interval of 600 when on battery, but at the same time my suspend task is: `sync && s2both`.

It gives me the safety of my data, faster resumption when I return, and the possibility of resuming the hibernated state in case the battery runs out.

Keeping the functionality flexible and liberating userspace helps keep everyone happy and sane.

This is quite similar to what happened when ext4 introduced delayed allocation.

Trading off safety and performance in the kernel

Posted May 14, 2015 12:31 UTC (Thu) by hmh (subscriber, #3838) [Link]

Calling sys_sync() on filesystems that userspace has already sync()'d should be fast enough that there is no point in removing it at all.

Therefore, either something strange is going on (sys_sync() being slow on already-sync'd or nearly-sync'd filesystems), or the goal of userspace is to not sync at all before suspend-to-RAM, and everything else is a red herring.

And "not sync at all" cannot be the kernel's default behavior for system-wide deep suspend: using opt-out for this would be irresponsible. If someone wants to opt in at runtime to no-sync-before-suspend, that is a different matter.
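A userspace suspend hook along those lines might look like this (a sketch, not any particular distribution's actual script; it assumes root privileges and the usual /sys/power/state interface):

```shell
# Flush dirty data from userspace before asking the kernel to suspend;
# if the kernel ever drops its own sys_sync() call, this opt-in sync remains.
sync

# Suspend to RAM. With the filesystems already flushed, any sync the
# kernel performs at this point should be close to a no-op.
echo mem > /sys/power/state
```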

Trading off safety and performance in the kernel

Posted May 26, 2015 7:31 UTC (Tue) by toyotabedzrock (guest, #88005) [Link] (1 responses)

There is a simple solution to this that Microsoft uses: a file attribute that is set when the file is modified, and cleared when a backup occurs.

Trading off safety and performance in the kernel

Posted May 27, 2015 22:31 UTC (Wed) by nix (subscriber, #2304) [Link]

This works great as long as you only have one generation of backups. More than that and bad things happen.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds