Removing the kthread freezer?

By Jake Edge
April 25, 2018

LSFMM

The kernel thread (kthread) freezer has been a longtime problem for a variety of reasons. It is meant as a way to suspend kthreads on the way toward system suspend, but it has proved troublesome enough in practice that it came up at both the 2015 and 2016 Kernel Summits (as well as on the mailing lists over the years); the intent is to remove the kthread freezer entirely. To that end, Luis Rodriguez led a discussion in the filesystem track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit on the problems and possible solutions.

Rodriguez has picked up the work that Jiri Kosina was doing to eliminate the kthread freezer, but is moving more cautiously than Kosina originally planned. One problem is that the kernel does not want to freeze kthreads in unexpected places, so there is a mechanism that allows the threads to block the freezing process. Part of the thinking there is that there should not be DMA in flight while the suspend is going on, Kent Overstreet said. He asked, wouldn't it be better if the drivers put themselves in a sane state for suspend?

Dave Chinner said that even if the devices are ready to suspend, the filesystems can still be making in-memory changes. A recurring problem is that suspend would sync a filesystem to make it stable, but the filesystem would still have threads and work on workqueues that were operating on the in-memory data. That led to an inconsistent state between what was on disk and what was in the memory image used by the suspend.

In general, Rodriguez said, the kernel should not be freezing kthreads. The threads want full control of where they can be frozen; it is hard to get it all right if it is imposed on them. But trying to address this problem in a generic form is "really hard"; phasing out the kthread freezer will be difficult, so he suggested a divide-and-conquer approach.

For filesystems that implement the freeze_fs() method, it should be straightforward, but there is still a problem in getting the order right. The current mechanism freezes the most recently mounted filesystems first and thaws them in the order in which they were mounted. That is simple to do using iterate_supers(), but does it work in all cases?
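
The ordering itself can be sketched in a few lines. This is a user-space toy model in Python, not kernel code: the `Superblock` class and the `freeze_all()` and `thaw_all()` helpers are hypothetical names, and the plain list merely stands in for the kernel's list of superblocks, assumed here to be kept in mount order.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Toy stand-in for a mounted filesystem (hypothetical, not a kernel type)."""
    name: str
    frozen: bool = False


def freeze_all(mounts):
    """Freeze the most recently mounted filesystem first; return the order used."""
    order = []
    for sb in reversed(mounts):
        sb.frozen = True
        order.append(sb.name)
    return order


def thaw_all(mounts):
    """Thaw in the order in which the filesystems were mounted."""
    order = []
    for sb in mounts:
        sb.frozen = False
        order.append(sb.name)
    return order


# Assumed mount order: root first, then /home, then a loop-mounted image.
mounts = [Superblock("/"), Superblock("/home"), Superblock("/mnt/image")]
freeze_order = freeze_all(mounts)
thaw_order = thaw_all(mounts)
```

The sketch only holds together if the list really reflects the dependencies between the filesystems, which is exactly the assumption that the question above puts to the test.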

Al Viro said that it does not. There is a "nasty ioctl()", which he is sorry for implementing, that can break the ordering. It is quite possible that a filesystem that was mounted later shows up earlier in the list. The ordering described is also not sufficient for FUSE filesystems, Jan Kara said, though Chinner suggested those simply be skipped in the walk.

But there are filesystems that talk to several devices, such as those hosted on a RAID device or with their journal on a separate device, Viro said. These topologies can also change at run time, so he does not recommend relying on any kind of ordering.

In fact, a directed acyclic graph (DAG) could describe these relationships, Kara said. It would have nodes for filesystems and devices, with edges that describe the dependencies between them. It would be nice to build that DAG in the kernel, but it is not done today. Viro agreed that it is probably needed at some point.
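
A minimal sketch of what such a graph could look like, written as user-space Python rather than kernel code. The storage stack below is invented for illustration, and `sits_on` and `freeze_order()` are hypothetical names; the only real API used is `graphlib.TopologicalSorter` from the Python standard library (3.9 and later). An edge from A to B means "A sits on top of B", so A has to be frozen before B.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: an XFS image on a loop device whose backing file lives
# on an ext4 root; the ext4 root sits on a RAID array (md0) and keeps its
# journal on a separate disk (sdc).
sits_on = {
    "xfs:/mnt/image": ["loop0"],
    "loop0": ["ext4:/"],
    "ext4:/": ["md0", "sdc"],
    "md0": ["sda", "sdb"],
}


def freeze_order(sits_on):
    """Return an order in which every node comes before what it sits on."""
    # TopologicalSorter wants, for each node, the nodes that must come first.
    # For freezing, those are the nodes stacked on top of it.
    must_freeze_first = {}
    for upper, lowers in sits_on.items():
        must_freeze_first.setdefault(upper, set())
        for lower in lowers:
            must_freeze_first.setdefault(lower, set()).add(upper)
    return list(TopologicalSorter(must_freeze_first).static_order())


order = freeze_order(sits_on)
thaw_order = list(reversed(order))
```

Several valid orders may exist for a given graph; any of them freezes a filesystem before the devices underneath it, and reversing the list gives a matching thaw order. `static_order()` raises `graphlib.CycleError` if the relationships turn out not to be acyclic.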

Rodriguez wondered whether the DAG generation was required before making any progress on eliminating the kthread freezer. As long as the existence of the problem is kept in mind, Viro said, work can proceed. If these problem configurations can be detected, suspend could be prohibited for those systems, Rodriguez said. But that will be difficult to detect without the graph, Kara said.

There are a number of problem areas that came up in the discussion: freezing races with automounting, the control group (cgroup) freezer is "completely broken", freezing FUSE filesystems is problematic, and so on. It was noted that applications would like to know if the filesystem they are using is about to freeze so they can quiesce their own data to keep it consistent. Rodriguez was surprised to find out that there is no generic framework for the kernel to notify user space about an upcoming suspend: "That's insane!"

No real conclusions came out of the discussion. Rodriguez plans to post his notes to the mailing list for feedback. There was also talk about discussing it more later in the summit, though that has not been scheduled as of this writing.


Index entries for this article
Kernel: Kernel threads
Conference: Storage, Filesystem, and Memory-Management Summit/2018


Removing the kthread freezer?

Posted Apr 25, 2018 21:01 UTC (Wed) by neilbrown (subscriber, #359) [Link] (4 responses)

> It would have nodes for filesystems and devices, with edges that describe the dependencies between them

Maybe we will, at last, get individual filesystems appearing in /sys/devices - with symlinks for dependencies. That would be nice.

Removing the kthread freezer?

Posted Apr 25, 2018 22:40 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (3 responses)

Nice except for the naming problem, and the information leak.

/sys does a reasonable job for hardware but once we get into software abstractions it can be a real drag on maintenance.

Placing filesystem instances in sysfs does not seem like a good idea at all.

Removing the kthread freezer?

Posted Apr 26, 2018 3:03 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

> Nice except for the naming problem, and the information leak.

Every filesystem has a bdi, and every bdi has a unique name. Maybe some filesystems have multiple bdi, but they can choose one. Actually, every filesystem has a unique st_dev, does it not?

> /sys does a reasonable job for hardware but once we get into software abstractions it can be a real drag on maintenance.

Seems to work well enough for md, which can be seen as an extremely simple filesystem - certainly not hardware. All of /sys/devices/virtual isn't hardware (though some bits are closer than other bits).
Now I confess that md is represented in /sys/devices in an unfortunate way - it should have its own bus rather than just hanging off block devices - so it shouldn't serve as a model. It can serve as an existence proof though.

Can you say more about the "information leak" issue?

Thanks.

Removing the kthread freezer?

Posted Apr 26, 2018 8:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I would guess that /sys is not namespace-aware, so containers can get leaks from the parent's domain.

Perhaps /proc is a better place?

Removing the kthread freezer?

Posted Apr 27, 2018 2:41 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> so containers can get leaks from the parent's domain.

Leaks of what, exactly?
Leaks of the list of existing devices?
Leaks of a list of which devices are mounted - and maybe which filesystem? Don't need to expose options in sysfs, they already appear in /proc/mounts.

Maybe there would be leaks, but without being specific they are hard to reason about.

In sysfs there is a file I can write to which removes a disk drive from the system. Does that mean someone in a container can already unplug someone else's disk drive?

nasty ioctl()

Posted Apr 29, 2018 8:31 UTC (Sun) by amir73il (subscriber, #66165) [Link]

FYI, the "nasty ioctl()" is LOOP_CHANGE_FD:
https://mirrors.edge.kernel.org/pub/linux/kernel/people/a...

Not only can one use this ioctl to change the filesystem dependency graph, but it could also be used by an evil privileged user to loop a device into a backing file that is created inside the filesystem that is mounted on the loop device itself. Don't try this at home.

If we had a dependency graph, LOOP_CHANGE_FD could be fixed to not allow creating loops in the graph.
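
As a rough illustration of that kind of check, here is a toy user-space model in Python. It is not how the loop driver works, and `reachable()`, `try_add_edge()`, and the node names are all hypothetical. An edge from A to B means "A sits on top of B"; a new edge is refused if the upper node can already be reached from the lower one, because adding it would close a cycle.

```python
def reachable(sits_on, start, target):
    """Depth-first search: can 'target' be reached from 'start'?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(sits_on.get(node, ()))
    return False


def try_add_edge(sits_on, upper, lower):
    """Record 'upper sits on lower' unless that would create a cycle."""
    if reachable(sits_on, lower, upper):
        return False  # lower already depends on upper: refuse the change
    sits_on.setdefault(upper, set()).add(lower)
    return True


# A filesystem mounted on loop0, whose backing file lives on another filesystem.
graph = {"fs-on-loop0": {"loop0"}, "loop0": {"backing-fs"}}

# Pointing loop0 at a file inside the filesystem mounted on loop0 is refused.
rejected = try_add_edge(graph, "loop0", "fs-on-loop0")

# Pointing it at a file on an unrelated filesystem is accepted.
accepted = try_add_edge(graph, "loop0", "other-fs")
```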

Removing the kthread freezer?

Posted May 8, 2018 18:57 UTC (Tue) by mcgrof (subscriber, #25917) [Link]


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds