Trading off safety and performance in the kernel

Posted May 13, 2015 19:30 UTC (Wed) by zblaxell (subscriber, #26385)
In reply to: Trading off safety and performance in the kernel by imunsie
Parent article: Trading off safety and performance in the kernel

> Wow, at the rate at which my laptop fails to suspend and/or resume you must be going through at least one hard disk *AND* battery every single week!

We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures. I recommend you look into what your distro is doing wrong, or try another laptop--*any* other laptop.

The last 7 laptops in my care have had one resume failure in 8 years. That's close to 4300 successful suspend-to-RAM+resume cycles in the field. Contrast with ~170 crashes at other times on those same laptops, most of which occurred while the laptop was sitting on an office desk in front of a user with stable AC power. Also in that reporting interval there were 3 hard disk failures requiring drive replacement and over a dozen UNC sector data loss events (i.e. incidents when restore from backups was required because the disk had spontaneously lost some previously stable data, but remained otherwise healthy). 3 batteries had to be replaced as well, contributing several at-full-power crashes each to the total when the power failed without warning.

By my count, on a given day when data is lost, it is more than twelve times more likely that the cause will be something that is not suspend/resume failure. The probability of suspend/resume failure is so tiny, and its impact relative to other common failure modes so small, that it's not worth the time and risk to prepare for it. The cure is worse than the disease.

Speaking of disease: six of those hundreds of crashes were due to consequences of failing to enter the suspend state (i.e. remaining on full power until the battery was exhausted or shutdown was forced by the thermal protection circuitry). Another dozen or more potential crashes were avoided because the user noticed the suspend failure and corrected the proximate cause manually (e.g. by unmounting network filesystems or disconnecting slow removable media devices). The negative impact of sync-before-suspend could have been significantly worse without those modifications and user attention, and there would be many more such incidents had I not disabled the problematic kernel behaviors early on. The relatively high frequency of this mode of failure was the trigger that drove me to modify code to fix the problem--and also to set up years of data logging to prove it had made things quantitatively better.
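For anyone who wants to redo the arithmetic, here is a rough back-of-envelope sketch using only the figures quoted in this comment. It is not the poster's own calculation. Which incidents were counted toward "more than twelve times" is not stated, so the two groupings below are assumptions, and "over a dozen" is taken as a lower bound of 12.

```python
# Figures quoted in the comment above.
resume_failures = 1
successful_cycles = 4300   # "close to 4300"
crashes = 170              # "~170", includes the battery-related crashes
disk_failures = 3
unc_events = 12            # "over a dozen", treated as a lower bound

# Fraction of suspend/resume attempts that failed.
per_cycle_failure_rate = resume_failures / (successful_cycles + resume_failures)

# Assumption: only disk failures and UNC events are certain data-loss events.
data_loss_other = disk_failures + unc_events

# Assumption: every other crash is also counted as an incident.
all_other = data_loss_other + crashes

print(f"resume failure rate per cycle: {per_cycle_failure_rate:.4%}")
print(f"storage data-loss events per resume failure: {data_loss_other / resume_failures:.0f}")
print(f"all other incidents per resume failure: {all_other / resume_failures:.0f}")
```

Under either grouping the ratio exceeds twelve to one, which is consistent with the claim as worded.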

Posted May 13, 2015 20:20 UTC (Wed) by scottwood (guest, #74349)

"We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures" sounds a bit odd juxtaposed with, "I fired my distro's ACPI event-handling code. Several years ago it started being not merely useless, but an active source of failure."

Keep in mind that most users have no idea how to "fire their distro's ACPI event-handling code". FWIW my laptop (which was sold specifically as being meant for running Linux!) is pretty terrible at suspend/resume, often exhibiting similar behavior to what tpo described -- short term suspend is usually OK, but leave it for hours and I'll be presented with a cold boot.

As for the sync issue, I think part of the problem is that sync is too blunt of an instrument. A reasonable compromise might be to flush out whatever can be done in 2-3 seconds, focusing on devices that look like normal local high-speed storage. Plus, a write delay of 30 seconds seems pretty high. Maybe it makes sense on servers with UPSes and software that is careful about when data actually hits the disk, but on a laptop running apps of varying quality? Where is the point of diminishing returns on delaying writes, especially with SSDs?
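The "flush whatever can be done in 2-3 seconds" idea can be approximated from userspace as a sync with a deadline. This is only a sketch under stated assumptions: `bounded_flush` is a made-up helper name, it uses Python's `os.sync()` as the flush, and it does not prioritize fast local devices the way the proposal suggests. It simply stops waiting once the deadline passes, while the flush itself keeps running in the background.

```python
import os
import threading
import time


def bounded_flush(flush=os.sync, deadline_s=3.0):
    """Run flush() in a background thread and wait at most deadline_s seconds.

    Returns (finished, elapsed_seconds). finished is False when the flush
    was still running at the deadline; it is not cancelled in that case.
    """
    start = time.monotonic()
    # Daemon thread: a stuck flush should not keep the caller's process alive.
    worker = threading.Thread(target=flush, daemon=True)
    worker.start()
    worker.join(timeout=deadline_s)
    return (not worker.is_alive(), time.monotonic() - start)
```

A caller would use something like `finished, elapsed = bounded_flush(deadline_s=3.0)` and go ahead with the suspend either way, perhaps logging when `finished` is false.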

Posted May 13, 2015 22:36 UTC (Wed) by zblaxell (subscriber, #26385)

> "We go through hard disks and batteries (and even displays) 2-3 times faster than suspend/resume failures" sounds a bit odd juxtaposed with, "I fired my distro's ACPI event-handling code. Several years ago it started being not merely useless, but an active source of failure."

Several years ago...well, about 8 years ago, come to think of it, right before I started collecting reliability data. ;)

I also don't count the first few hundred suspend/resume cycles on a new laptop on the bench, since those are used to test and debug the acpi-support scripts before the laptop does any important work. On the other hand, the result of that testing is usually the end of the acpi-support scripts. On the last three laptops I've just skipped the testing phase and not installed the acpi-support scripts in the first place.

> Keep in mind that most users have no idea how to "fire their distro's ACPI event-handling code".

I do keep that in mind, but I have no better practical advice to offer. ACPI lid-event-handling userspace code in most distros is a byzantine nightmare of overlapping and mutually exclusive workarounds that are no longer necessary because the kernel (and X.org) has long been capable of easily and reliably handling the suspend process itself. There's nothing left to do but hold down the delete key until all the broken code goes away.

Posted May 13, 2015 23:37 UTC (Wed) by imunsie (guest, #68550)

I think you probably missed my (admittedly subtle) point - while suspend/resume may work reliably for you, it is still a complete mess for other people, so removing the sync seems like a terrible idea (a tunable would be OK, so long as it still does a sync by default).

As a side note, suspend/resume was actually working very reliably on my laptop until about three or four months ago when *something* (Kernel? Debian? systemd?) updated and completely broke it after the laptop had been in the dock (I really CBF tracking down yet another regression because I have better things to do with my time). But that's beside the point - if it doesn't work for me, it stands to reason that it doesn't work for a lot of other people, so it's not reliable.

Posted May 14, 2015 0:07 UTC (Thu) by zblaxell (subscriber, #26385)

The best outcome is going to be a tunable. Good or bad, changing the default sync behavior will take years to be fully accepted, and even after all the major userspaces catch up, there will probably always be a few people who are stuck with buggy legacy userspace and firmware at the same time.

A tunable lets individual users choose when they make the transition. Look how long it took for atime to stop being the default to get an idea how long such a change can take.

There are always kernel regressions and crashes that lose some uncommitted data. We don't run filesystems in sync mode all the time because the performance (and wear and tear on disks, rotating or otherwise) is a price too high for the negligible benefit of less data lost on a crash. At some point that sync on suspend *must* go away.
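A tunable of this kind could be checked from userspace roughly as follows. This is a sketch that assumes the running kernel exposes a sysfs file at `/sys/power/sync_on_suspend` (to my knowledge added to mainline well after this thread, so older kernels will not have it). The helper name is made up for illustration.

```python
from pathlib import Path

# Assumption: the running kernel provides this file; older kernels do not.
SYNC_ON_SUSPEND = Path("/sys/power/sync_on_suspend")


def sync_on_suspend_enabled(path=SYNC_ON_SUSPEND):
    """Return True or False for the tunable, or None if it cannot be read."""
    try:
        value = path.read_text().strip()
    except OSError:
        # File missing or unreadable: the kernel has no such tunable here.
        return None
    return value == "1"
```

Returning `None` instead of guessing keeps "no tunable on this kernel" distinct from "sync disabled", which matters for anyone still on older userspace or firmware.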

Posted May 16, 2015 11:54 UTC (Sat) by faramir (subscriber, #2327)

My Dell Latitude D530 fails to resume about 1/3 of the time. I suspect a software bug as I think it worked back in the days of Ubuntu 10.04. I keep hoping some random software update will fix the problem.