LWN: Comments on "A block layer introduction part 1: the bio layer"

A block layer introduction part 1: the bio layer

willy — Mon, 30 Oct 2017 15:19:47 +0000

> If one device becomes 'dead' (unresponsive), any process accessing it falls into 'D' state (TASK_UNINTERRUPTIBLE) and cannot be terminated (even with kill -9) until the server is rebooted. Moreover, it cannot even be suspended with Ctrl-Z or SIGSTOP, locking the console forever.

This is heritage from traditional Unix. About 10 years ago, Linux added a new state -- TASK_KILLABLE. Of course, it takes a long time to go through and change all the UNINTERRUPTIBLE sleeps into KILLABLE sleeps. Every time you want to do that, you need to add error handling and back out of the operation appropriately. It's really hard. It requires a lot of knowledge of the code you're changing and a lot of thinking about what might have gone wrong in order for this operation to have failed and what the appropriate response is. Some people have done sterling work to make tasks more killable, but the work will probably never be completed.

A block layer introduction part 1: the bio layer

neilbrown — Sun, 29 Oct 2017 04:56:26 +0000

Hi,
the problem you describe here is not directly related to the block layer. It is probably a driver bug, possibly a SCSI-layer bug, but definitely not a block-layer problem.
Reporting bugs against Linux is, unfortunately, a bit of a hit-and-miss affair. Some developers refuse to touch bugzilla, some love it, and some (like me) only use it begrudgingly.
The alternative is to send email. For that you need to choose the right list and maybe the right developer, and you need to catch them when they are in a good mood or aren't too busy or not on holidays. Some people will make an effort to respond to everything, others are completely unpredictable - and that is for me who usually sends a patch with any bug report. If you just have a bug that you barely understand yourself, your expected response rate is probably lower. Sad, but true.

Lots of bugs do get responded to and dealt with, but lots do not.

I don't think it is fair to say that nobody cares, but it probably is true that nobody sees it as being as important as you do. If you want a solution, then you need to drive it. One way to drive it is to spend money on a consultant or with a support contract from a distributor. I suspect that isn't possible in your situation. Another way is to learn how the code works and find a solution yourself. Lots of people do that, but again it might not be an option for you. Another way is to keep raising the issue on different relevant forums until you get a response. Persistence can bear fruit. You would need to be prepared to perform whatever testing is asked of you, possibly including building a new kernel to test.

If you are able to reproduce this problem on a recent kernel (4.12 or later) I suggest that you email a report to
linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, and me (neilb@suse.com) (note that you do not need to subscribe to these lists to send mail, just send it). Describe the hardware and how to trigger the problem.
Include the stack trace of any process in "D" state. You can get this with
cat /proc/$PID/stack
where "$PID" is the pid of the process.
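
If several processes are stuck, collecting the traces by hand gets tedious. The following is a minimal sketch, not an established tool, that walks /proc, picks out tasks whose state field in /proc/$PID/stat is "D", and reads /proc/$PID/stack for each. The function names parse_state and d_state_stacks are made up for this example. It assumes a kernel that exposes /proc/$PID/stack, and reading that file typically requires root.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def d_state_stacks(proc="/proc"):
    """Yield (pid, comm, stack) for every process in uninterruptible sleep."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_state(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc, entry, "stack")) as f:
                stack = f.read()
        except OSError:
            continue  # process exited, or permission denied
        yield pid, comm, stack


if __name__ == "__main__":
    for pid, comm, stack in d_state_stacks():
        print(f"--- {pid} ({comm}) ---\n{stack}")
```

Run as root, the output can be pasted straight into the body of the report.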

Be sure to avoid complaining or saying how this has been broken for years and how it is grossly inadequate. Nobody cares about that. We do care about bugs and generally want to fix them. So just report the relevant facts.
Try to include all facts in the mail rather than via links to somewhere else. Sometimes links are necessary, but in the case of your script, it is 8 lines long so just include it in the email (and avoid descriptions like "fuckup"; just call it "broken" or similar). Also make sure your email isn't sent as HTML. We like just plain text. HTML is rejected by all @vger.kernel.org mailing lists. You might need to configure your email program to not send HTML.
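
If your mail client cannot be persuaded to send plain text, one option is to compose the message with a small script. This is only a sketch using Python's standard email package: the sender address, subject, and body are placeholders, build_report is a name invented for this example, and actually sending the message (for instance with smtplib through your own mail server) is left out.

```python
from email.message import EmailMessage

# Recipients suggested above.
RECIPIENTS = [
    "linux-kernel@vger.kernel.org",
    "linux-scsi@vger.kernel.org",
    "neilb@suse.com",
]


def build_report(sender, subject, body):
    """Build a single-part text/plain message (no HTML alternative)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(RECIPIENTS)
    msg["Subject"] = subject
    # A str passed to set_content() becomes a text/plain part.
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # Placeholder values; replace with the real report.
    report = build_report(
        "reporter@example.org",
        "Placeholder subject describing the hang",
        "Hardware description, reproducer script, and stack traces go here.\n",
    )
    print(report)
```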

A block layer introduction part 1: the bio layer

amarao — Fri, 27 Oct 2017 10:18:54 +0000

I somewhat dislike the current block layer in Linux.

Many years ago I found an easily reproducible bug in the Linux kernel. It requires just two SATA drives plugged together into a SAS enclosure and connected to an LSI HBA (a pretty common configuration for many servers).

I found that a very simple shell script (3 lines, literally) can cause a whole enclosure of disks to become unresponsive. This script is available here: https://github.com/amarao/lsi-sata-fuckup

I reported it to the upstream bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=98121 for some other special case), and I reported it to LSI. No fixes or reaction so far (5 years!).

One may think that this is a 'one driver issue'. Maybe. But I found that other parts of the kernel handle this situation really badly. If one device becomes 'dead' (unresponsive), any process accessing it falls into 'D' state (TASK_UNINTERRUPTIBLE) and cannot be terminated (even with kill -9) until the server is rebooted. Moreover, it cannot even be suspended with Ctrl-Z or SIGSTOP, locking the console forever.

Why? Why does no one care about such grossly inadequate behavior?

A block layer introduction part 1: the bio layer

neilbrown — Fri, 27 Oct 2017 02:01:01 +0000

> Can the abovementioned scenario deadlock with the current bio layer?

No, hence the parenthetical comment (newer code will have sorted this to the end of the list, to help avoid the deadlock).
Provided that drivers which split bios process only one of the pieces themselves and submit the other directly to generic_make_request(), there should be no deadlock (of this sort).

A block layer introduction part 1: the bio layer

edos — Thu, 26 Oct 2017 08:49:52 +0000

Nice example, thank you!

A block layer introduction part 1: the bio layer

Cyberax — Thu, 26 Oct 2017 06:04:58 +0000

I'm confused. Can the abovementioned scenario deadlock with the current bio layer?

A block layer introduction part 1: the bio layer

neilbrown — Thu, 26 Oct 2017 05:57:32 +0000

> Crawl out of the out-of-memory situation to resolve the deadlock.

Surely it is better to design the code to be deadlock-free. It isn't that hard once the problem is understood. (And if the problem isn't understood, then a workaround like that might not be a complete solution.)

A block layer introduction part 1: the bio layer

Cyberax — Thu, 26 Oct 2017 04:06:30 +0000

Crawl out of the out-of-memory situation to resolve the deadlock.

A block layer introduction part 1: the bio layer

neilbrown — Thu, 26 Oct 2017 04:03:38 +0000

What would be the purpose, or value, of this "limp along" mode? I don't understand...

A block layer introduction part 1: the bio layer

Cyberax — Wed, 25 Oct 2017 22:03:26 +0000

I've been looking at the bio layer and I'm wondering if BIO can have a "limp along" mode where it stops all threads and does synchronous submission from one thread? It then can either use a "last reserve" mempool or unsplit pending BIOs.

A block layer introduction part 1: the bio layer

neilbrown — Wed, 25 Oct 2017 21:34:39 +0000

A simple, though extremely unlikely, scenario that could cause a deadlock is:
- Suppose I have a RAID1 array where each of the member devices is a RAID0 array with a 4K chunk size.
- An 8K write BIO arrives for the RAID1 array. raid1 code allocates two bios from a private pool and sends an 8K bio to each of the RAID0 devices. These two bios get queued by generic_make_request.
- Then generic_make_request starts processing the first RAID0 bio. raid0 code needs to split it into 2 4K bios and so allocates a bio from a private pool and submits the new bio and the old bio (now reduced in size) to the underlying devices. These two bios get queued by generic_make_request.
- Then generic_make_request starts processing the second RAID0 bio (newer code will have sorted this to the end of the list, to help avoid the deadlock). Again raid0 code needs to split the bio.

Now, suppose there is no free memory, suppose the private mempool has 16 preallocated entries, and suppose 16 threads all perform exactly this 8K write submission (to different addresses in the RAID1) at the same time.
We will end up with 16 threads all trying to allocate a second bio from the same private pool, while the 16 preallocated entries are each trapped, one per thread, in the generic_make_request queue. The allocations will wait for a previously allocated bio to complete, and those previous bios won't be processed by generic_make_request() until after the allocation completes.
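
To make the arithmetic concrete, here is a toy model of that scenario. It is not kernel code and makes several simplifying assumptions: the threads advance in lock-step (the worst-case interleaving), a bio that reaches a real disk completes immediately and returns its pool entry, and only the raid0 split pool is modelled. The name simulate and the lower_first flag are invented for this sketch. lower_first=False queues the split pieces behind the second RAID0 bio, as in the steps above; lower_first=True handles the lower-level pieces first, roughly in the spirit of the newer ordering.

```python
from collections import deque


def simulate(num_threads=16, pool_size=16, lower_first=False):
    """Return True if the modelled submission pattern deadlocks."""
    free = pool_size  # entries left in the shared split pool
    # Each thread starts with the two 8K bios that raid1 queued for it.
    queues = [deque(["raid0", "raid0"]) for _ in range(num_threads)]
    progress = True
    while progress and any(queues):
        progress = False
        for q in queues:  # one step per thread per pass (lock-step)
            if not q:
                continue
            if q[0] == "raid0":
                if free == 0:
                    continue  # the split allocation would block here
                free -= 1
                q.popleft()
                # One piece uses the pool entry, the other is the shrunken original.
                children = ["disk-pooled", "disk"]
                if lower_first:
                    q.extendleft(reversed(children))
                else:
                    q.extend(children)
            else:
                # The bio reaches a real device and completes at once;
                # a pooled bio gives its entry back to the pool.
                if q.popleft() == "disk-pooled":
                    free += 1
            progress = True
    # Bios still queued with no thread able to advance means deadlock.
    return any(queues)


if __name__ == "__main__":
    print("old ordering deadlocks:", simulate(lower_first=False))
    print("lower-level first deadlocks:", simulate(lower_first=True))
```

In this model the old ordering gets stuck with exactly 16 threads and 16 pool entries, while 15 threads against the same pool still make progress because one spare entry lets a thread finish and release what it holds.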

There are other scenarios that are more complex, but are likely enough to actually happen in practice.

A block layer introduction part 1: the bio layer

edos — Wed, 25 Oct 2017 19:12:20 +0000

I didn't completely understand the deadlock described in the article.
How can stacked block devices produce a deadlock based on interdependency? It is still not clear to me.

A block layer introduction part 1: the bio layer

unixbhaskar — Wed, 25 Oct 2017 10:53:32 +0000

Sure. Will follow.

A block layer introduction part 1: the bio layer

javigon — Wed, 25 Oct 2017 10:42:04 +0000

On recursion avoidance, it would be relevant to mention direct_make_request, which is being pushed by Christoph ("block: provide a direct_make_request helper").

A block layer introduction part 1: the bio layer

corbet — Wed, 25 Oct 2017 10:30:26 +0000

Fixed, thanks.

For future reference, this sort of typo report is best sent via email so that readers don't need to plow through them.

A block layer introduction part 1: the bio layer

unixbhaskar — Wed, 25 Oct 2017 09:55:11 +0000

Jon, I think a small change is needed in this line:

"The remainder if this article will take us down into the former while the latter will be left for a subsequent article."

Should be:

"The remainder of this article will take us down into the former while the latter will be left for a subsequent article."

Diff if>>of