2019-04-30 21:42:43 +03:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
* Copyright (C) 1994, Karl Keyte: Added support for disk statistics
|
|
|
|
* Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
|
|
|
|
* Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
|
2008-01-31 15:03:55 +03:00
|
|
|
* kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
|
|
|
|
* - July2000
|
2005-04-17 02:20:36 +04:00
|
|
|
* bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This handles all read/write requests to block devices
|
|
|
|
*/
|
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/blkdev.h>
|
2020-12-09 08:29:51 +03:00
|
|
|
#include <linux/blk-pm.h>
|
2021-09-20 15:33:27 +03:00
|
|
|
#include <linux/blk-integrity.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/mm.h>
|
mm: move readahead prototypes from mm.h
Patch series "Change readahead API", v11.
This series adds a readahead address_space operation to replace the
readpages operation. The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache. It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
The only unconverted filesystems are those which use fscache. Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier. This should be completed by the end of
the year.
I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.
These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).
This patch (of 25):
The readahead code is part of the page cache so should be found in the
pagemap.h file. force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead. Remove the parameter names where they
add no value, and rename the ones which were actively misleading.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 07:46:07 +03:00
|
|
|
#include <linux/pagemap.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/kernel_stat.h>
|
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/completion.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/writeback.h>
|
2006-12-10 13:19:35 +03:00
|
|
|
#include <linux/task_io_accounting_ops.h>
|
2006-12-08 13:39:46 +03:00
|
|
|
#include <linux/fault-inject.h>
|
2011-03-08 15:19:51 +03:00
|
|
|
#include <linux/list_sort.h>
|
2011-10-19 16:32:38 +04:00
|
|
|
#include <linux/delay.h>
|
2012-04-20 03:29:22 +04:00
|
|
|
#include <linux/ratelimit.h>
|
2013-03-23 07:42:26 +04:00
|
|
|
#include <linux/pm_runtime.h>
|
2019-09-16 18:44:29 +03:00
|
|
|
#include <linux/t10-pi.h>
|
2017-02-01 01:53:20 +03:00
|
|
|
#include <linux/debugfs.h>
|
2018-02-07 01:05:39 +03:00
|
|
|
#include <linux/bpf.h>
|
2019-08-08 22:03:00 +03:00
|
|
|
#include <linux/psi.h>
|
2021-11-23 21:53:12 +03:00
|
|
|
#include <linux/part_stat.h>
|
2020-05-14 11:45:09 +03:00
|
|
|
#include <linux/sched/sysctl.h>
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
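The data unit number (DUN) rule described in the message above can be illustrated with a small user-space sketch. Everything named demo_* is hypothetical and is not the kernel's struct bio_crypt_ctx or blk-merge code; the sketch also assumes a single 64-bit DUN and ignores key comparison, which a real merge check would also need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an encryption context. */
struct demo_crypt_ctx {
	uint64_t first_dun;      /* DUN of the first data unit */
	uint32_t data_unit_size; /* bytes per data unit, non-zero */
};

/* Hypothetical stand-in for a bio carrying an encryption context. */
struct demo_bio {
	struct demo_crypt_ctx ctx;
	uint32_t len;            /* payload bytes, a multiple of data_unit_size */
};

/* DUN of the data unit that starts at byte offset 'off' in the payload. */
static uint64_t demo_dun_at(const struct demo_crypt_ctx *ctx, uint64_t off)
{
	assert(ctx->data_unit_size != 0);
	return ctx->first_dun + off / ctx->data_unit_size;
}

/* 'next' may follow 'prev' only if its DUNs continue where 'prev' ends. */
static bool demo_dun_contiguous(const struct demo_bio *prev,
				const struct demo_bio *next)
{
	if (prev->ctx.data_unit_size != next->ctx.data_unit_size)
		return false;
	return next->ctx.first_dun == demo_dun_at(&prev->ctx, prev->len);
}
```

Because merged bios are contiguous in this sense, keeping only the first DUN is enough: the DUN of any later data unit follows from its byte offset, which is what demo_dun_at() computes.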
2020-05-14 03:37:18 +03:00
|
|
|
#include <linux/blk-crypto.h>
|
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print.
By contrast, blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 09:43:05 +04:00
|
|
|
|
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/block.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2008-01-29 16:51:59 +03:00
|
|
|
#include "blk.h"
|
2021-11-23 21:53:08 +03:00
|
|
|
#include "blk-mq-sched.h"
|
2018-09-27 00:01:03 +03:00
|
|
|
#include "blk-pm.h"
|
2022-02-11 13:11:49 +03:00
|
|
|
#include "blk-cgroup.h"
|
2021-10-05 18:11:56 +03:00
|
|
|
#include "blk-throttle.h"
|
2022-03-14 07:30:18 +03:00
|
|
|
#include "blk-rq-qos.h"
|
2008-01-29 16:51:59 +03:00
|
|
|
|
2017-02-01 01:53:20 +03:00
|
|
|
struct dentry *blk_debugfs_root;
|
|
|
|
|
2010-11-16 14:52:38 +03:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
|
2009-10-01 23:16:13 +04:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
|
2013-04-18 20:00:26 +04:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
|
2014-04-28 22:30:52 +04:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
|
2012-12-14 23:49:27 +04:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
|
2021-02-22 08:29:59 +03:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_insert);
|
2008-11-26 13:59:56 +03:00
|
|
|
|
2011-12-14 03:33:37 +04:00
|
|
|
DEFINE_IDA(blk_queue_ida);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* For queue allocation
|
|
|
|
*/
|
2008-01-31 15:03:55 +03:00
|
|
|
struct kmem_cache *blk_requestq_cachep;
|
2021-12-03 16:15:32 +03:00
|
|
|
struct kmem_cache *blk_requestq_srcu_cachep;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Controlling structure to kblockd
|
|
|
|
*/
|
2006-01-09 18:02:34 +03:00
|
|
|
static struct workqueue_struct *kblockd_workqueue;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-03-08 04:10:04 +03:00
|
|
|
/**
|
|
|
|
* blk_queue_flag_set - atomically set a queue flag
|
|
|
|
* @flag: flag to be set
|
|
|
|
* @q: request queue
|
|
|
|
*/
|
|
|
|
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2018-11-14 19:02:07 +03:00
|
|
|
set_bit(flag, &q->queue_flags);
|
2018-03-08 04:10:04 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_queue_flag_set);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_queue_flag_clear - atomically clear a queue flag
|
|
|
|
* @flag: flag to be cleared
|
|
|
|
* @q: request queue
|
|
|
|
*/
|
|
|
|
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2018-11-14 19:02:07 +03:00
|
|
|
clear_bit(flag, &q->queue_flags);
|
2018-03-08 04:10:04 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_queue_flag_clear);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_queue_flag_test_and_set - atomically test and set a queue flag
|
|
|
|
* @flag: flag to be set
|
|
|
|
* @q: request queue
|
|
|
|
*
|
|
|
|
* Returns the previous value of @flag - 0 if the flag was not set and 1 if
|
|
|
|
* the flag was already set.
|
|
|
|
*/
|
|
|
|
bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2018-11-14 19:02:07 +03:00
|
|
|
return test_and_set_bit(flag, &q->queue_flags);
|
2018-03-08 04:10:04 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
|
|
|
|
|
2019-06-20 20:59:16 +03:00
|
|
|
#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
|
|
|
|
static const char *const blk_op_name[] = {
|
|
|
|
REQ_OP_NAME(READ),
|
|
|
|
REQ_OP_NAME(WRITE),
|
|
|
|
REQ_OP_NAME(FLUSH),
|
|
|
|
REQ_OP_NAME(DISCARD),
|
|
|
|
REQ_OP_NAME(SECURE_ERASE),
|
|
|
|
REQ_OP_NAME(ZONE_RESET),
|
2019-08-01 20:26:36 +03:00
|
|
|
REQ_OP_NAME(ZONE_RESET_ALL),
|
2019-10-27 17:05:45 +03:00
|
|
|
REQ_OP_NAME(ZONE_OPEN),
|
|
|
|
REQ_OP_NAME(ZONE_CLOSE),
|
|
|
|
REQ_OP_NAME(ZONE_FINISH),
|
2020-05-12 11:55:47 +03:00
|
|
|
REQ_OP_NAME(ZONE_APPEND),
|
2019-06-20 20:59:16 +03:00
|
|
|
REQ_OP_NAME(WRITE_ZEROES),
|
|
|
|
REQ_OP_NAME(DRV_IN),
|
|
|
|
REQ_OP_NAME(DRV_OUT),
|
|
|
|
};
|
|
|
|
#undef REQ_OP_NAME
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_op_str - Return string XXX in the REQ_OP_XXX.
|
|
|
|
* @op: REQ_OP_XXX.
|
|
|
|
*
|
|
|
|
* Description: Centralize block layer function to convert REQ_OP_XXX into
|
|
|
|
* string format. Useful for debugging and tracing a bio or request. For
|
|
|
|
* invalid REQ_OP_XXX it returns string "UNKNOWN".
|
|
|
|
*/
|
|
|
|
inline const char *blk_op_str(unsigned int op)
|
|
|
|
{
|
|
|
|
const char *op_str = "UNKNOWN";
|
|
|
|
|
|
|
|
if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
|
|
|
|
op_str = blk_op_name[op];
|
|
|
|
|
|
|
|
return op_str;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_op_str);
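The name-table pattern used by blk_op_name[] and blk_op_str() above can be tried outside the kernel with a minimal sketch. The demo_* names and the three-entry enum are made up for illustration; only the technique (designated initializers plus the # stringify operator, then a bounds and NULL check on lookup) mirrors the code above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical opcodes; value 2 is left unused on purpose. */
enum demo_op {
	DEMO_OP_READ  = 0,
	DEMO_OP_WRITE = 1,
	DEMO_OP_FLUSH = 3,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* [DEMO_OP_READ] = "READ", and so on; unnamed slots stay NULL. */
#define DEMO_OP_NAME(name) [DEMO_OP_##name] = #name
static const char *const demo_op_name[] = {
	DEMO_OP_NAME(READ),
	DEMO_OP_NAME(WRITE),
	DEMO_OP_NAME(FLUSH),
};
#undef DEMO_OP_NAME

/* Return the name for 'op', or "UNKNOWN" for gaps and out-of-range values. */
static const char *demo_op_str(unsigned int op)
{
	if (op < DEMO_ARRAY_SIZE(demo_op_name) && demo_op_name[op])
		return demo_op_name[op];
	return "UNKNOWN";
}
```

Both checks matter: the size check rejects values past the end of the table, and the NULL check catches holes such as index 2 here, which the designated initializers leave zero-filled.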
|
|
|
|
|
2017-06-03 10:38:04 +03:00
|
|
|
static const struct {
|
|
|
|
int errno;
|
|
|
|
const char *name;
|
|
|
|
} blk_errors[] = {
|
|
|
|
[BLK_STS_OK] = { 0, "" },
|
|
|
|
[BLK_STS_NOTSUPP] = { -EOPNOTSUPP, "operation not supported" },
|
|
|
|
[BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
|
|
|
|
[BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
|
|
|
|
[BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" },
|
|
|
|
[BLK_STS_TARGET] = { -EREMOTEIO, "critical target" },
|
|
|
|
[BLK_STS_NEXUS] = { -EBADE, "critical nexus" },
|
|
|
|
[BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
|
|
|
|
[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
|
|
|
|
[BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
|
2018-01-31 06:04:57 +03:00
|
|
|
[BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
|
2017-06-20 15:05:46 +03:00
|
|
|
[BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
|
2022-02-03 22:28:26 +03:00
|
|
|
[BLK_STS_OFFLINE] = { -ENODEV, "device offline" },
|
2017-06-03 10:38:04 +03:00
|
|
|
|
2017-06-03 10:38:06 +03:00
|
|
|
/* device mapper special case, should not leak out: */
|
|
|
|
[BLK_STS_DM_REQUEUE] = { -EREMCHG, "dm internal retry" },
|
|
|
|
|
2020-09-24 23:53:28 +03:00
|
|
|
/* zone device specific errors */
|
|
|
|
[BLK_STS_ZONE_OPEN_RESOURCE] = { -ETOOMANYREFS, "open zones exceeded" },
|
|
|
|
[BLK_STS_ZONE_ACTIVE_RESOURCE] = { -EOVERFLOW, "active zones exceeded" },
|
|
|
|
|
2017-06-03 10:38:04 +03:00
|
|
|
/* everything else not covered above: */
|
|
|
|
[BLK_STS_IOERR] = { -EIO, "I/O" },
|
|
|
|
};
|
|
|
|
|
|
|
|
blk_status_t errno_to_blk_status(int errno)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
|
|
|
|
if (blk_errors[i].errno == errno)
|
|
|
|
return (__force blk_status_t)i;
|
|
|
|
}
|
|
|
|
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(errno_to_blk_status);
|
|
|
|
|
|
|
|
int blk_status_to_errno(blk_status_t status)
|
|
|
|
{
|
|
|
|
int idx = (__force int)status;
|
|
|
|
|
2017-06-21 20:55:46 +03:00
|
|
|
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
|
2017-06-03 10:38:04 +03:00
|
|
|
return -EIO;
|
|
|
|
return blk_errors[idx].errno;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_status_to_errno);
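The two conversions above are asymmetric: status to errno is a bounds-checked array index, while errno to status is a linear scan that falls back to the generic I/O error. A reduced user-space sketch of the same shape follows. The demo_* names and the four-entry table are illustrative assumptions, and the struct field is called err because errno is a macro in user space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical status codes, a small subset for illustration only. */
enum demo_status {
	DEMO_STS_OK,
	DEMO_STS_NOSPC,
	DEMO_STS_AGAIN,
	DEMO_STS_IOERR,
};

static const struct {
	int err;
	const char *name;
} demo_errors[] = {
	[DEMO_STS_OK]    = { 0,       "" },
	[DEMO_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
	[DEMO_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
	[DEMO_STS_IOERR] = { -EIO,    "I/O" },
};

#define DEMO_NR_ERRORS (sizeof(demo_errors) / sizeof(demo_errors[0]))

/* Linear scan; any errno without a table entry maps to the generic error. */
static enum demo_status demo_errno_to_status(int err)
{
	size_t i;

	for (i = 0; i < DEMO_NR_ERRORS; i++) {
		if (demo_errors[i].err == err)
			return (enum demo_status)i;
	}
	return DEMO_STS_IOERR;
}

/* Direct index; out-of-range status values are reported as -EIO. */
static int demo_status_to_errno(enum demo_status status)
{
	size_t idx = (size_t)status;

	if (idx >= DEMO_NR_ERRORS)
		return -EIO;
	return demo_errors[idx].err;
}
```

The round trip is lossy by design: an errno with no entry, such as -EPERM in this sketch, becomes the generic status and converts back to -EIO.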
|
|
|
|
|
2021-11-17 09:14:03 +03:00
|
|
|
const char *blk_status_to_str(blk_status_t status)
|
2017-06-03 10:38:04 +03:00
|
|
|
{
|
|
|
|
int idx = (__force int)status;
|
|
|
|
|
2017-06-21 20:55:46 +03:00
|
|
|
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
|
2021-11-17 09:14:03 +03:00
|
|
|
return "<null>";
|
|
|
|
return blk_errors[idx].name;
|
2017-06-03 10:38:04 +03:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/**
|
|
|
|
* blk_sync_queue - cancel any pending callbacks on a queue
|
|
|
|
* @q: the queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* The block layer may perform asynchronous callback activity
|
|
|
|
* on a queue, such as calling the unplug function after a timeout.
|
|
|
|
* A block device may call blk_sync_queue to ensure that any
|
|
|
|
* such activity is cancelled, thus allowing it to release resources
|
2007-05-09 10:57:56 +04:00
|
|
|
* that the callbacks might use. The caller must already have made sure
|
2020-07-01 11:59:43 +03:00
|
|
|
* that its ->submit_bio will not re-add plugging prior to calling
|
2005-04-17 02:20:36 +04:00
|
|
|
* this function.
|
|
|
|
*
|
2011-03-03 03:05:33 +03:00
|
|
|
* This function does not cancel any asynchronous activity arising
|
2014-09-08 20:27:23 +04:00
|
|
|
* out of elevator or throttling code. That would require elevator_exit()
|
2012-03-06 01:15:12 +04:00
|
|
|
* and blkcg_exit_queue() to be called with queue lock initialized.
|
2011-03-03 03:05:33 +03:00
|
|
|
*
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
void blk_sync_queue(struct request_queue *q)
|
|
|
|
{
|
2008-11-19 16:38:39 +03:00
|
|
|
del_timer_sync(&q->timeout);
|
2017-10-19 20:00:48 +03:00
|
|
|
cancel_work_sync(&q->timeout_work);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_sync_queue);
|
|
|
|
|
2017-11-09 21:49:57 +03:00
|
|
|
/**
|
2018-09-27 00:01:04 +03:00
|
|
|
* blk_set_pm_only - increment pm_only counter
|
2017-11-09 21:49:57 +03:00
|
|
|
* @q: request queue pointer
|
|
|
|
*/
|
2018-09-27 00:01:04 +03:00
|
|
|
void blk_set_pm_only(struct request_queue *q)
|
2017-11-09 21:49:57 +03:00
|
|
|
{
|
2018-09-27 00:01:04 +03:00
|
|
|
atomic_inc(&q->pm_only);
|
2017-11-09 21:49:57 +03:00
|
|
|
}
|
2018-09-27 00:01:04 +03:00
|
|
|
EXPORT_SYMBOL_GPL(blk_set_pm_only);
|
2017-11-09 21:49:57 +03:00
|
|
|
|
2018-09-27 00:01:04 +03:00
|
|
|
void blk_clear_pm_only(struct request_queue *q)
|
2017-11-09 21:49:57 +03:00
|
|
|
{
|
2018-09-27 00:01:04 +03:00
|
|
|
int pm_only;
|
|
|
|
|
|
|
|
pm_only = atomic_dec_return(&q->pm_only);
|
|
|
|
WARN_ON_ONCE(pm_only < 0);
|
|
|
|
if (pm_only == 0)
|
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
2017-11-09 21:49:57 +03:00
|
|
|
}
|
2018-09-27 00:01:04 +03:00
|
|
|
EXPORT_SYMBOL_GPL(blk_clear_pm_only);
|
2017-11-09 21:49:57 +03:00
|
|
|
|
2020-06-19 23:47:23 +03:00
|
|
|
/**
|
|
|
|
* blk_put_queue - decrement the request_queue refcount
|
|
|
|
* @q: the request_queue structure to decrement the refcount for
|
|
|
|
*
|
|
|
|
* Decrements the refcount of the request_queue kobject. When this reaches 0
|
|
|
|
* we'll have blk_release_queue() called.
|
2020-06-19 23:47:25 +03:00
|
|
|
*
|
|
|
|
* Context: Any context, but the last reference must not be dropped from
|
|
|
|
* atomic context.
|
2020-06-19 23:47:23 +03:00
|
|
|
*/
|
2007-07-24 11:28:11 +04:00
|
|
|
void blk_put_queue(struct request_queue *q)
|
2006-03-19 02:34:37 +03:00
|
|
|
{
|
|
|
|
kobject_put(&q->kobj);
|
|
|
|
}
|
2011-05-27 09:44:43 +04:00
|
|
|
EXPORT_SYMBOL(blk_put_queue);
|
2006-03-19 02:34:37 +03:00
|
|
|
|
2021-09-29 10:12:40 +03:00
|
|
|
void blk_queue_start_drain(struct request_queue *q)
|
2014-12-23 00:04:42 +03:00
|
|
|
{
|
2017-03-27 15:06:58 +03:00
|
|
|
/*
|
|
|
|
* When queue DYING flag is set, we need to block new req
|
|
|
|
* entering queue, so we call blk_freeze_queue_start() to
|
|
|
|
* prevent I/O from crossing blk_queue_enter().
|
|
|
|
*/
|
|
|
|
blk_freeze_queue_start(q);
|
2018-11-15 22:22:51 +03:00
|
|
|
if (queue_is_mq(q))
|
2014-12-23 00:04:42 +03:00
|
|
|
blk_mq_wake_waiters(q);
|
2017-11-09 21:49:53 +03:00
|
|
|
/* Make blk_queue_enter() reexamine the DYING flag. */
|
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
2014-12-23 00:04:42 +03:00
|
|
|
}
|
2021-09-29 10:12:40 +03:00
|
|
|
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shut down whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
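The shutdown versus release split described above can be sketched in a few lines of single-threaded user-space C. This is only an illustration of the idea: struct demo_queue and every demo_* function are hypothetical, there is no locking, and draining is simulated by completing requests in a loop rather than waiting for real I/O.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical queue object; not the kernel's struct request_queue. */
struct demo_queue {
	int refcount;  /* references held by the owner and other users */
	int in_flight; /* requests currently inside the queue */
	bool dead;     /* set by shutdown; later submissions fail */
};

static struct demo_queue *demo_queue_alloc(void)
{
	struct demo_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->refcount = 1; /* the owner's reference */
	return q;
}

static void demo_queue_get(struct demo_queue *q)
{
	q->refcount++;
}

/* Release: frees the queue on the last put and reports whether it did. */
static bool demo_queue_put(struct demo_queue *q)
{
	if (--q->refcount > 0)
		return false;
	free(q);
	return true;
}

/* Reference holders may still call this after shutdown; it just fails. */
static int demo_queue_submit(struct demo_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->in_flight++;
	return 0;
}

static void demo_queue_complete(struct demo_queue *q)
{
	assert(q->in_flight > 0);
	q->in_flight--;
}

/* Shutdown: drain what is pending, then mark dead. Frees nothing. */
static void demo_queue_shutdown(struct demo_queue *q)
{
	while (q->in_flight > 0)
		demo_queue_complete(q); /* stand-in for waiting on completions */
	q->dead = true;
}
```

The point of the split is that demo_queue_shutdown() never frees memory, so a second reference holder can keep calling demo_queue_submit() safely and simply gets -ENODEV until its own demo_queue_put() drops the last reference.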
2011-10-19 16:42:16 +04:00
|
|
|
/**
|
|
|
|
* blk_cleanup_queue - shutdown a request queue
|
|
|
|
* @q: request queue to shutdown
|
|
|
|
*
|
2012-12-06 17:32:01 +04:00
|
|
|
* Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
|
|
|
|
* put it. All future requests will be failed immediately with -ENODEV.
|
2020-06-19 23:47:25 +03:00
|
|
|
*
|
|
|
|
* Context: can sleep
|
2011-03-03 03:04:42 +03:00
|
|
|
*/
|
2008-01-31 15:03:55 +03:00
|
|
|
void blk_cleanup_queue(struct request_queue *q)
|
2006-03-19 02:34:37 +03:00
|
|
|
{
|
2020-06-19 23:47:25 +03:00
|
|
|
/* cannot be called from atomic context */
|
|
|
|
might_sleep();
|
|
|
|
|
2019-10-01 02:00:43 +03:00
|
|
|
WARN_ON_ONCE(blk_queue_registered(q));
|
|
|
|
|
2012-11-28 16:42:38 +04:00
|
|
|
/* mark @q DYING, no new request or merges will be allowed afterwards */
|
2022-02-17 10:52:31 +03:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DYING, q);
|
|
|
|
blk_queue_start_drain(q);
|
2012-03-06 01:14:59 +04:00
|
|
|
|
2018-11-14 19:02:07 +03:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
|
2011-10-19 16:42:16 +04:00
|
|
|
|
2012-12-06 17:32:01 +04:00
|
|
|
/*
|
|
|
|
* Drain all requests queued before DYING marking. Set DEAD flag to
|
2019-08-02 01:39:55 +03:00
|
|
|
* prevent blk_mq_run_hw_queues() from accessing the hardware queues
|
|
|
|
* after draining finished.
|
2012-12-06 17:32:01 +04:00
|
|
|
*/
|
2015-10-21 20:20:12 +03:00
|
|
|
blk_freeze_queue(q);
|
2018-10-24 16:18:09 +03:00
|
|
|
|
2022-03-14 07:30:18 +03:00
|
|
|
/* cleanup rq qos structures for queue without disk */
|
|
|
|
rq_qos_exit(q);
|
|
|
|
|
2018-11-14 19:02:07 +03:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
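The shutdown/release split described above can be sketched in user-space C. This is only an illustration under simplifying assumptions: it is single-threaded, ignores locking entirely, and every toy_* name is hypothetical rather than a kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy queue; none of these names exist in the kernel. */
struct toy_queue {
	int refcount;	/* one reference per holder, owner included */
	int pending;	/* requests accepted but not yet completed */
	bool dead;	/* set by shutdown, never cleared */
};

/* Counts release calls so a caller can observe when teardown finishes. */
static int toy_released;

static struct toy_queue *toy_queue_alloc(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->refcount = 1;	/* the owner's reference */
	return q;
}

static void toy_queue_get(struct toy_queue *q)
{
	q->refcount++;
}

/* Release: the rest of the teardown runs when the last reference goes. */
static void toy_queue_put(struct toy_queue *q)
{
	assert(q->refcount > 0);
	if (--q->refcount == 0) {
		toy_released++;
		free(q);
	}
}

/* Submitting to a dead queue fails at once instead of touching freed state. */
static int toy_queue_submit(struct toy_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->pending++;
	return 0;
}

/* Shutdown: drain, mark dead, then drop only the owner's reference. */
static void toy_queue_shutdown(struct toy_queue *q)
{
	while (q->pending > 0)
		q->pending--;	/* stand-in for completing one request */
	q->dead = true;
	toy_queue_put(q);
}
```

A second holder that took a reference with toy_queue_get() before shutdown would see toy_queue_submit() return -ENODEV afterwards, and the structure would only be freed on that holder's final toy_queue_put().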
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and frees td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 16:42:16 +04:00
|
|
|
|
|
|
|
blk_sync_queue(q);
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
To avoid slowing down queue destruction, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling
dispatch work is deferred to blk_release_queue().
However, this approach has caused a kernel oops [1], reported by Changhui. The log
shows that scsi_device can be freed before blk_release_queue() runs,
which is expected, since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() runs, the disk has been closed and any sync
dispatch activity has finished, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) blk_cleanup_queue() only has to deal with passthrough requests, and
a passthrough request is always explicitly allocated and freed by
its caller, so once the queue is frozen, all sync dispatch activity
for passthrough requests has finished, and it is enough to cancel
dispatch work to avoid any further dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 04:43:43 +03:00
|
|
|
if (queue_is_mq(q)) {
|
|
|
|
blk_mq_cancel_work_sync(q);
|
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run the queue during queue cleanup if the queue's
kobj refcount was held. After that commit, the queue can't be run during
queue cleanup; otherwise an oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways of addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But these still can't cover all cases; recently James reported another
issue of this kind:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue is quite hard to address in the previous way, given that
scsi_run_queue() may run requeues for other LUNs.
Fix the above issue by freeing hctx's resources in its release handler. This
is safe because tags aren't needed for freeing such hctx resources.
This approach follows the typical design pattern for a kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 04:52:25 +03:00
|
|
|
blk_mq_exit_queue(q);
|
2021-11-16 04:43:43 +03:00
|
|
|
}
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-29 19:23:51 +03:00
|
|
|
|
block: free sched's request pool in blk_cleanup_queue
In theory, the IO scheduler belongs to the request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags are definitely host wide
and don't belong to any request queue; the same goes for their request pool.
So we need the tagset instance for freeing requests of sched tags.
Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
non-BLK_MQ_F_TAG_SHARED case, which requires the request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moved blk_exit_queue into __blk_release_queue to simplify the fast
path in generic_make_request(), which then causes an oops when freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by moving the freeing of the sched tags request pool into
blk_cleanup_queue(). This is safe because the queue has been frozen and there are no
in-queue requests at that time. Freeing the sched tags themselves has to be kept in the queue's
release handler because there might be uncompleted dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 16:08:02 +03:00
|
|
|
/*
|
|
|
|
* In theory, request pool of sched_tags belongs to request queue.
|
|
|
|
* However, the current implementation requires tag_set for freeing
|
|
|
|
* requests, so free the pool now.
|
|
|
|
*
|
|
|
|
* Queue has become frozen, there can't be any in-queue requests, so
|
|
|
|
* it is safe to free requests now.
|
|
|
|
*/
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
if (q->elevator)
|
2021-10-05 13:23:31 +03:00
|
|
|
blk_mq_sched_free_rqs(q);
|
2019-06-04 16:08:02 +03:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
2011-10-19 16:42:16 +04:00
|
|
|
/* @q is and will stay empty, shutdown and put */
|
2006-03-19 02:34:37 +03:00
|
|
|
blk_put_queue(q);
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
EXPORT_SYMBOL(blk_cleanup_queue);
|
|
|
|
|
2017-11-09 21:49:58 +03:00
|
|
|
/**
|
|
|
|
* blk_queue_enter() - try to increase q->q_usage_counter
|
|
|
|
* @q: request queue pointer
|
2020-12-09 08:29:50 +03:00
|
|
|
* @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
|
2017-11-09 21:49:58 +03:00
|
|
|
*/
|
2017-11-09 21:49:59 +03:00
|
|
|
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
|
2015-10-21 20:20:12 +03:00
|
|
|
{
|
2020-12-09 08:29:50 +03:00
|
|
|
const bool pm = flags & BLK_MQ_REQ_PM;
|
2017-11-09 21:49:58 +03:00
|
|
|
|
2021-09-29 10:12:38 +03:00
|
|
|
while (!blk_try_enter_queue(q, pm)) {
|
2017-11-09 21:49:58 +03:00
|
|
|
if (flags & BLK_MQ_REQ_NOWAIT)
|
2015-10-21 20:20:12 +03:00
|
|
|
return -EBUSY;
|
|
|
|
|
2017-03-27 15:06:56 +03:00
|
|
|
/*
|
2021-09-29 10:12:38 +03:00
|
|
|
* read pair of barrier in blk_freeze_queue_start(), we need to
|
|
|
|
* order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
|
|
|
|
* reading .mq_freeze_depth or queue dying flag, otherwise the
|
|
|
|
* following wait may never return if the two reads are
|
|
|
|
* reordered.
|
2017-03-27 15:06:56 +03:00
|
|
|
*/
|
|
|
|
smp_rmb();
|
2018-04-12 21:11:58 +03:00
|
|
|
wait_event(q->mq_freeze_wq,
|
2019-05-21 06:25:55 +03:00
|
|
|
(!q->mq_freeze_depth &&
|
2020-12-09 08:29:51 +03:00
|
|
|
blk_pm_resume_queue(pm, q)) ||
|
2018-04-12 21:11:58 +03:00
|
|
|
blk_queue_dying(q));
|
2015-10-21 20:20:12 +03:00
|
|
|
if (blk_queue_dying(q))
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
2021-09-29 10:12:38 +03:00
|
|
|
|
|
|
|
return 0;
|
2015-10-21 20:20:12 +03:00
|
|
|
}
|
|
|
|
|
2021-11-04 21:45:51 +03:00
|
|
|
int __bio_queue_enter(struct request_queue *q, struct bio *bio)
|
2020-04-28 14:27:56 +03:00
|
|
|
{
|
2021-09-29 10:12:39 +03:00
|
|
|
while (!blk_try_enter_queue(q, false)) {
|
2021-10-14 17:03:29 +03:00
|
|
|
struct gendisk *disk = bio->bi_bdev->bd_disk;
|
|
|
|
|
2021-09-29 10:12:39 +03:00
|
|
|
if (bio->bi_opf & REQ_NOWAIT) {
|
2021-09-29 10:12:40 +03:00
|
|
|
if (test_bit(GD_DEAD, &disk->state))
|
2021-09-29 10:12:39 +03:00
|
|
|
goto dead;
|
2020-04-28 14:27:56 +03:00
|
|
|
bio_wouldblock_error(bio);
|
2021-09-29 10:12:39 +03:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* read pair of barrier in blk_freeze_queue_start(), we need to
|
|
|
|
* order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
|
|
|
|
* reading .mq_freeze_depth or queue dying flag, otherwise the
|
|
|
|
* following wait may never return if the two reads are
|
|
|
|
* reordered.
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
wait_event(q->mq_freeze_wq,
|
|
|
|
(!q->mq_freeze_depth &&
|
|
|
|
blk_pm_resume_queue(false, q)) ||
|
2021-09-29 10:12:40 +03:00
|
|
|
test_bit(GD_DEAD, &disk->state));
|
|
|
|
if (test_bit(GD_DEAD, &disk->state))
|
2021-09-29 10:12:39 +03:00
|
|
|
goto dead;
|
2020-04-28 14:27:56 +03:00
|
|
|
}
|
|
|
|
|
2021-09-29 10:12:39 +03:00
|
|
|
return 0;
|
|
|
|
dead:
|
|
|
|
bio_io_error(bio);
|
|
|
|
return -ENODEV;
|
2020-04-28 14:27:56 +03:00
|
|
|
}
|
|
|
|
|
2015-10-21 20:20:12 +03:00
|
|
|
void blk_queue_exit(struct request_queue *q)
|
|
|
|
{
|
|
|
|
percpu_ref_put(&q->q_usage_counter);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_queue_usage_counter_release(struct percpu_ref *ref)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
|
|
|
container_of(ref, struct request_queue, q_usage_counter);
|
|
|
|
|
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
|
|
|
}
|
|
|
|
|
2017-08-29 01:03:41 +03:00
|
|
|
static void blk_rq_timed_out_timer(struct timer_list *t)
|
2015-10-30 15:57:30 +03:00
|
|
|
{
|
2017-08-29 01:03:41 +03:00
|
|
|
struct request_queue *q = from_timer(q, t, timeout);
|
2015-10-30 15:57:30 +03:00
|
|
|
|
|
|
|
kblockd_schedule_work(&q->timeout_work);
|
|
|
|
}
|
|
|
|
|
2019-01-30 16:21:45 +03:00
|
|
|
static void blk_timeout_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2021-12-03 16:15:32 +03:00
|
|
|
struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
|
2005-06-23 11:08:19 +04:00
|
|
|
{
|
2007-07-24 11:28:11 +04:00
|
|
|
struct request_queue *q;
|
2018-05-21 01:25:47 +03:00
|
|
|
int ret;
|
2005-06-23 11:08:19 +04:00
|
|
|
|
2021-12-03 16:15:32 +03:00
|
|
|
q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
|
|
|
|
GFP_KERNEL | __GFP_ZERO, node_id);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!q)
|
|
|
|
return NULL;
|
|
|
|
|
2021-12-03 16:15:32 +03:00
|
|
|
if (alloc_srcu) {
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
|
|
|
|
if (init_srcu_struct(q->srcu) != 0)
|
|
|
|
goto fail_q;
|
|
|
|
}
|
|
|
|
|
2018-05-31 20:11:36 +03:00
|
|
|
q->last_merge = NULL;
|
|
|
|
|
2020-03-27 11:30:11 +03:00
|
|
|
q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
|
2011-12-14 03:33:37 +04:00
|
|
|
if (q->id < 0)
|
2021-12-03 16:15:32 +03:00
|
|
|
goto fail_srcu;
|
2011-12-14 03:33:37 +04:00
|
|
|
|
2021-01-11 06:05:53 +03:00
|
|
|
ret = bioset_init(&q->bio_split, BIO_POOL_SIZE, 0, 0);
|
2018-05-21 01:25:47 +03:00
|
|
|
if (ret)
|
2015-04-24 08:37:18 +03:00
|
|
|
goto fail_id;
|
|
|
|
|
2017-03-22 02:20:01 +03:00
|
|
|
q->stats = blk_alloc_queue_stats();
|
|
|
|
if (!q->stats)
|
2021-08-09 17:17:43 +03:00
|
|
|
goto fail_split;
|
2017-03-22 02:20:01 +03:00
|
|
|
|
2011-11-23 13:59:13 +04:00
|
|
|
q->node = node_id;
|
2009-06-12 16:42:56 +04:00
|
|
|
|
2021-10-05 13:23:39 +03:00
|
|
|
atomic_set(&q->nr_active_requests_shared_tags, 0);
|
2020-08-19 18:20:26 +03:00
|
|
|
|
2017-08-29 01:03:41 +03:00
|
|
|
timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
|
2019-01-30 16:21:45 +03:00
|
|
|
INIT_WORK(&q->timeout_work, blk_timeout_work);
|
2011-12-14 03:33:41 +04:00
|
|
|
INIT_LIST_HEAD(&q->icq_list);
|
2006-03-19 02:34:37 +03:00
|
|
|
|
2008-01-29 16:51:59 +03:00
|
|
|
kobject_init(&q->kobj, &blk_queue_ktype);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2020-06-19 23:47:30 +03:00
|
|
|
mutex_init(&q->debugfs_mutex);
|
2006-03-19 02:34:37 +03:00
|
|
|
mutex_init(&q->sysfs_lock);
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too, see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
either blk_mq_unregister_dev() or elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers sync .store, the other one is named .sysfs_dir_lock
and covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 14:01:48 +03:00
|
|
|
mutex_init(&q->sysfs_dir_lock);
|
2018-11-15 22:17:28 +03:00
|
|
|
spin_lock_init(&q->queue_lock);
|
2011-03-03 03:04:42 +03:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
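The 1:1 versus N:M relationship between per-cpu queues and hardware submission queues described above can be pictured with a minimal sketch. The function name toy_map_queue and the modulo policy are assumptions made for illustration only; they are not the mapping blk-mq itself uses.

```c
#include <assert.h>

/*
 * Hypothetical mapping from a per-cpu software queue to a hardware
 * submission queue. With nr_hw >= number of CPUs this is 1:1; with
 * fewer hardware queues several CPUs share one (N:M).
 */
static unsigned int toy_map_queue(unsigned int cpu, unsigned int nr_hw)
{
	assert(nr_hw > 0);	/* a device needs at least one hardware queue */
	return cpu % nr_hw;
}
```

With 8 hardware queues each of CPUs 0..7 gets its own queue; with 4, CPUs 1 and 5 land on the same one.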
2013-10-24 12:20:05 +04:00
|
|
|
init_waitqueue_head(&q->mq_freeze_wq);
|
2019-05-21 06:25:55 +03:00
|
|
|
mutex_init(&q->mq_freeze_lock);
|
2013-10-24 12:20:05 +04:00
|
|
|
|
2015-10-21 20:20:12 +03:00
|
|
|
/*
|
|
|
|
* Init percpu_ref in atomic mode so that it's faster to shutdown.
|
|
|
|
* See blk_register_queue() for details.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_init(&q->q_usage_counter,
|
|
|
|
blk_queue_usage_counter_release,
|
|
|
|
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
|
2021-08-09 17:17:43 +03:00
|
|
|
goto fail_stats;
|
2012-03-06 01:15:05 +04:00
|
|
|
|
2020-03-27 11:30:11 +03:00
|
|
|
blk_queue_dma_alignment(q, 511);
|
|
|
|
blk_set_default_limits(&q->limits);
|
2021-10-05 13:23:27 +03:00
|
|
|
q->nr_requests = BLKDEV_DEFAULT_RQ;
|
2020-03-27 11:30:11 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return q;
|
2011-12-14 03:33:37 +04:00
|
|
|
|
2017-03-22 02:20:01 +03:00
|
|
|
fail_stats:
|
2021-08-09 17:17:43 +03:00
|
|
|
blk_free_queue_stats(q->stats);
|
2015-04-24 08:37:18 +03:00
|
|
|
fail_split:
|
2018-05-21 01:25:47 +03:00
|
|
|
bioset_exit(&q->bio_split);
|
2011-12-14 03:33:37 +04:00
|
|
|
fail_id:
|
|
|
|
ida_simple_remove(&blk_queue_ida, q->id);
|
2021-12-03 16:15:32 +03:00
|
|
|
fail_srcu:
|
|
|
|
if (alloc_srcu)
|
|
|
|
cleanup_srcu_struct(q->srcu);
|
2011-12-14 03:33:37 +04:00
|
|
|
fail_q:
|
2021-12-03 16:15:32 +03:00
|
|
|
kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
|
2011-12-14 03:33:37 +04:00
|
|
|
return NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2020-06-19 23:47:23 +03:00
|
|
|
/**
|
|
|
|
* blk_get_queue - increment the request_queue refcount
|
|
|
|
* @q: the request_queue structure to increment the refcount for
|
|
|
|
*
|
|
|
|
* Increment the refcount of the request_queue kobject.
|
2020-06-19 23:47:24 +03:00
|
|
|
*
|
|
|
|
* Context: Any context.
|
2020-06-19 23:47:23 +03:00
|
|
|
*/
|
2011-12-14 03:33:38 +04:00
|
|
|
bool blk_get_queue(struct request_queue *q)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-11-28 16:42:38 +04:00
|
|
|
if (likely(!blk_queue_dying(q))) {
|
2011-12-14 03:33:38 +04:00
|
|
|
__blk_get_queue(q);
|
|
|
|
return true;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-12-14 03:33:38 +04:00
|
|
|
return false;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2011-05-27 09:44:43 +04:00
|
|
|
EXPORT_SYMBOL(blk_get_queue);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2006-12-08 13:39:46 +03:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
|
|
|
|
|
|
|
static DECLARE_FAULT_ATTR(fail_make_request);
|
|
|
|
|
|
|
|
static int __init setup_fail_make_request(char *str)
|
|
|
|
{
|
|
|
|
return setup_fault_attr(&fail_make_request, str);
|
|
|
|
}
|
|
|
|
__setup("fail_make_request=", setup_fail_make_request);
|
|
|
|
|
2021-11-17 09:13:58 +03:00
|
|
|
bool should_fail_request(struct block_device *part, unsigned int bytes)
|
2006-12-08 13:39:46 +03:00
|
|
|
{
|
2020-11-24 11:36:54 +03:00
|
|
|
return part->bd_make_it_fail && should_fail(&fail_make_request, bytes);
|
2006-12-08 13:39:46 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __init fail_make_request_debugfs(void)
|
|
|
|
{
|
2011-08-04 03:21:01 +04:00
|
|
|
struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
|
|
|
|
NULL, &fail_make_request);
|
|
|
|
|
2014-04-11 11:58:56 +04:00
|
|
|
return PTR_ERR_OR_ZERO(dir);
|
2006-12-08 13:39:46 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
late_initcall(fail_make_request_debugfs);
|
|
|
|
#endif /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2021-01-24 13:02:35 +03:00
|
|
|
static inline bool bio_check_ro(struct bio *bio)
|
2018-01-11 16:09:11 +03:00
|
|
|
{
|
2021-01-24 13:02:35 +03:00
|
|
|
if (op_is_write(bio_op(bio)) && bdev_read_only(bio->bi_bdev)) {
|
2018-09-06 01:14:36 +03:00
|
|
|
if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
|
|
|
|
return false;
|
2022-03-04 21:00:56 +03:00
|
|
|
pr_warn("Trying to write to read-only block-device %pg\n",
|
|
|
|
bio->bi_bdev);
|
Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.
The reason is that the lvm snapshotting will continue to write to the
snapshot COW volume, even after the volume has been marked read-only.
End result: snapshot failure.
This has actually been fixed in newer versions of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=200439
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442
The lvm2 fix is here
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3
but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check. It now allows the
write to happen, but causes a warning, something like this:
generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
Workqueue: ksnaphd do_metadata
RIP: 0010:generic_make_request_checks+0x4ac/0x600
...
Call Trace:
generic_make_request+0x64/0x400
submit_bio+0x6c/0x140
dispatch_io+0x287/0x430
sync_io+0xc3/0x120
dm_io+0x1f8/0x220
do_metadata+0x1d/0x30
process_one_work+0x1b9/0x3e0
worker_thread+0x2b/0x3c0
kthread+0x113/0x130
ret_from_fork+0x35/0x40
Note that this is a "revert" in behavior only. I'm leaving alone the
actual code cleanups in commit 721c7fc701c7, but letting the previously
uncaught request go through with a warning instead of stopping it.
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-03 22:22:09 +03:00
|
|
|
/* Older lvm-tools actually trigger this */
|
|
|
|
return false;
|
2018-01-11 16:09:11 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2018-02-07 01:05:39 +03:00
|
|
|
static noinline int should_fail_bio(struct bio *bio)
|
|
|
|
{
|
2021-01-24 13:02:34 +03:00
|
|
|
if (should_fail_request(bdev_whole(bio->bi_bdev), bio->bi_iter.bi_size))
|
2018-02-07 01:05:39 +03:00
|
|
|
return -EIO;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
|
|
|
|
|
2018-03-14 18:56:53 +03:00
|
|
|
/*
|
|
|
|
* Check whether this bio extends beyond the end of the device or partition.
|
|
|
|
* This may well happen - the kernel calls bread() without checking the size of
|
|
|
|
* the device, e.g., when mounting a file system.
|
|
|
|
*/
|
2021-01-24 13:02:35 +03:00
|
|
|
static inline int bio_check_eod(struct bio *bio)
|
2018-03-14 18:56:53 +03:00
|
|
|
{
|
2021-01-24 13:02:35 +03:00
|
|
|
sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
|
2018-03-14 18:56:53 +03:00
|
|
|
unsigned int nr_sectors = bio_sectors(bio);
|
|
|
|
|
|
|
|
if (nr_sectors && maxsector &&
|
|
|
|
(nr_sectors > maxsector ||
|
|
|
|
bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
|
2022-03-04 21:00:57 +03:00
|
|
|
pr_info_ratelimited("%s: attempt to access beyond end of device\n"
|
|
|
|
"%pg: rw=%d, want=%llu, limit=%llu\n",
|
|
|
|
current->comm,
|
|
|
|
bio->bi_bdev, bio->bi_opf,
|
|
|
|
bio_end_sector(bio), maxsector);
|
2018-03-14 18:56:53 +03:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
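The bound check in bio_check_eod() compares the start sector against maxsector - nr_sectors rather than computing start + length, which avoids wrap-around for very large start sectors. A minimal userspace sketch of that idea follows; the toy_* names and the naive variant are hypothetical and exist only to show the difference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [start, start + nr) runs past a device of max sectors. */
static bool toy_beyond_eod(uint64_t start, uint64_t nr, uint64_t max)
{
	/* Empty requests and devices of unknown (zero) size are not rejected. */
	if (!nr || !max)
		return false;
	/* Subtract on the right-hand side so start + nr is never computed. */
	return nr > max || start > max - nr;
}

/* Naive variant for comparison: start + nr can wrap around to a small value. */
static bool toy_beyond_eod_naive(uint64_t start, uint64_t nr, uint64_t max)
{
	return nr && max && start + nr > max;
}
```

For a start sector close to UINT64_MAX the naive form wraps and accepts the request, while the subtracting form rejects it.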
|
|
|
|
|
2017-08-23 20:10:32 +03:00
|
|
|
/*
|
|
|
|
* Remap block n of partition p to block n+start(p) of the disk.
|
|
|
|
*/
|
2021-01-24 13:02:35 +03:00
|
|
|
static int blk_partition_remap(struct bio *bio)
|
2017-08-23 20:10:32 +03:00
|
|
|
{
|
2021-01-24 13:02:34 +03:00
|
|
|
struct block_device *p = bio->bi_bdev;
|
2017-08-23 20:10:32 +03:00
|
|
|
|
2018-03-14 18:56:53 +03:00
|
|
|
if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
|
2021-01-24 13:02:35 +03:00
|
|
|
return -EIO;
|
2019-11-11 05:39:25 +03:00
|
|
|
if (bio_sectors(bio)) {
|
2020-11-24 11:36:54 +03:00
|
|
|
bio->bi_iter.bi_sector += p->bd_start_sect;
|
2020-12-03 19:21:38 +03:00
|
|
|
trace_block_bio_remap(bio, p->bd_dev,
|
2020-11-24 11:34:24 +03:00
|
|
|
bio->bi_iter.bi_sector -
|
2020-11-24 11:36:54 +03:00
|
|
|
p->bd_start_sect);
|
2018-03-14 18:56:53 +03:00
|
|
|
}
|
2021-01-24 13:02:36 +03:00
|
|
|
bio_set_flag(bio, BIO_REMAPPED);
|
2021-01-24 13:02:35 +03:00
|
|
|
return 0;
|
2017-08-23 20:10:32 +03:00
|
|
|
}
|
|
|
|
|
2020-05-12 11:55:47 +03:00
|
|
|
/*
|
|
|
|
* Check write append to a zoned block device.
|
|
|
|
*/
|
|
|
|
static inline blk_status_t blk_check_zone_append(struct request_queue *q,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
sector_t pos = bio->bi_iter.bi_sector;
|
|
|
|
int nr_sectors = bio_sectors(bio);
|
|
|
|
|
|
|
|
/* Only applicable to zoned block devices */
|
|
|
|
if (!blk_queue_is_zoned(q))
|
|
|
|
return BLK_STS_NOTSUPP;
|
|
|
|
|
|
|
|
/* The bio sector must point to the start of a sequential zone */
|
|
|
|
if (pos & (blk_queue_zone_sectors(q) - 1) ||
|
|
|
|
!blk_queue_zone_is_seq(q, pos))
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not allowed to cross zone boundaries. Otherwise, the BIO will be
|
|
|
|
* split and could result in non-contiguous sectors being written in
|
|
|
|
* different zones.
|
|
|
|
*/
|
|
|
|
if (nr_sectors > q->limits.chunk_sectors)
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
/* Make sure the BIO is small enough and will not get split */
|
|
|
|
if (nr_sectors > q->limits.max_zone_append_sectors)
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
bio->bi_opf |= REQ_NOMERGE;
|
|
|
|
|
|
|
|
return BLK_STS_OK;
|
|
|
|
}
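The alignment test in blk_check_zone_append() uses a mask, which only works when the zone size is a power of two. The sketch below restates the start-of-zone and size checks in userspace under that assumption; struct toy_limits and toy_zone_append_ok are hypothetical names, and the sequential-zone lookup is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits; the field names mirror those used in the code above. */
struct toy_limits {
	uint64_t zone_sectors;			/* assumed to be a power of two */
	uint32_t chunk_sectors;
	uint32_t max_zone_append_sectors;
};

static bool toy_zone_append_ok(const struct toy_limits *lim,
			       uint64_t pos, uint32_t nr_sectors)
{
	/* The mask test below is only valid for a power-of-two zone size. */
	assert(lim->zone_sectors &&
	       (lim->zone_sectors & (lim->zone_sectors - 1)) == 0);

	if (pos & (lim->zone_sectors - 1))
		return false;	/* not at the start of a zone */
	if (nr_sectors > lim->chunk_sectors)
		return false;	/* would cross a zone boundary */
	if (nr_sectors > lim->max_zone_append_sectors)
		return false;	/* too large, would have to be split */
	return true;
}
```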
|
|
|
|
|
2021-11-03 14:47:09 +03:00
|
|
|
static void __submit_bio(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = bio->bi_bdev->bd_disk;
|
2021-09-29 10:12:37 +03:00
|
|
|
|
2022-02-16 07:45:08 +03:00
|
|
|
if (unlikely(!blk_crypto_bio_prep(&bio)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!disk->fops->submit_bio) {
|
2021-10-12 14:12:24 +03:00
|
|
|
blk_mq_submit_bio(bio);
|
2022-02-16 07:45:08 +03:00
|
|
|
} else if (likely(bio_queue_enter(bio) == 0)) {
|
|
|
|
disk->fops->submit_bio(bio);
|
|
|
|
blk_queue_exit(disk->queue);
|
|
|
|
}
|
2020-05-16 21:28:01 +03:00
|
|
|
}
|
|
|
|
|
2020-07-01 11:59:45 +03:00
|
|
|
/*
|
|
|
|
* The loop in this function may be a bit non-obvious, and so deserves some
|
|
|
|
* explanation:
|
|
|
|
*
|
|
|
|
* - Before entering the loop, bio->bi_next is NULL (as all callers ensure
|
|
|
|
* that), so we have a list with a single bio.
|
|
|
|
* - We pretend that we have just taken it off a longer list, so we assign
|
|
|
|
* bio_list to a pointer to the bio_list_on_stack, thus initialising the
|
|
|
|
* bio_list of new bios to be added. ->submit_bio() may indeed add some more
|
|
|
|
* bios through a recursive call to submit_bio_noacct. If it did, we find a
|
|
|
|
* non-NULL value in bio_list and re-enter the loop from the top.
|
|
|
|
* - In this case we really did just take the bio off the top of the list (no
|
|
|
|
* pretending) and so remove it from bio_list, and call into ->submit_bio()
|
|
|
|
* again.
|
|
|
|
*
|
|
|
|
* bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
|
|
|
|
* bio_list_on_stack[1] contains bios that were submitted before the current
|
2022-03-05 05:08:03 +03:00
|
|
|
* ->submit_bio, but that haven't been processed yet.
|
2020-07-01 11:59:45 +03:00
|
|
|
*/
|
2021-10-12 14:12:24 +03:00
|
|
|
static void __submit_bio_noacct(struct bio *bio)
|
2020-07-01 11:59:45 +03:00
|
|
|
{
|
|
|
|
struct bio_list bio_list_on_stack[2];
|
|
|
|
|
|
|
|
BUG_ON(bio->bi_next);
|
|
|
|
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
current->bio_list = bio_list_on_stack;
|
|
|
|
|
|
|
|
do {
|
2021-10-14 17:03:29 +03:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
2020-07-01 11:59:45 +03:00
|
|
|
struct bio_list lower, same;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a fresh bio_list for all subordinate requests.
|
|
|
|
*/
|
|
|
|
bio_list_on_stack[1] = bio_list_on_stack[0];
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
|
2021-10-12 14:12:24 +03:00
|
|
|
__submit_bio(bio);
|
2020-07-01 11:59:45 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Sort new bios into those for a lower level and those for the
|
|
|
|
* same level.
|
|
|
|
*/
|
|
|
|
bio_list_init(&lower);
|
|
|
|
bio_list_init(&same);
|
|
|
|
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
|
2021-10-14 17:03:29 +03:00
|
|
|
if (q == bdev_get_queue(bio->bi_bdev))
|
2020-07-01 11:59:45 +03:00
|
|
|
bio_list_add(&same, bio);
|
|
|
|
else
|
|
|
|
bio_list_add(&lower, bio);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now assemble so we handle the lowest level first.
|
|
|
|
*/
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &lower);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &same);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
|
|
|
|
} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
|
|
|
|
|
|
|
|
current->bio_list = NULL;
|
|
|
|
}
|
|
|
|
|
2021-10-12 14:12:24 +03:00
|
|
|
static void __submit_bio_noacct_mq(struct bio *bio)
|
2020-07-01 11:59:46 +03:00
|
|
|
{
|
2020-07-02 22:21:25 +03:00
|
|
|
struct bio_list bio_list[2] = { };
|
2020-07-01 11:59:46 +03:00
|
|
|
|
2020-07-02 22:21:25 +03:00
|
|
|
current->bio_list = bio_list;
|
2020-07-01 11:59:46 +03:00
|
|
|
|
|
|
|
do {
|
2021-10-12 14:12:24 +03:00
|
|
|
__submit_bio(bio);
|
2020-07-02 22:21:25 +03:00
|
|
|
} while ((bio = bio_list_pop(&bio_list[0])));
|
2020-07-01 11:59:46 +03:00
|
|
|
|
|
|
|
current->bio_list = NULL;
|
|
|
|
}
|
|
|
|
|
2022-02-16 07:45:10 +03:00
|
|
|
void submit_bio_noacct_nocheck(struct bio *bio)
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
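The deferral scheme described in this commit message can be sketched in userspace C. Everything below is a simplified assumption for illustration: struct toy_bio, the single FIFO pending list, and the toy_* helpers are hypothetical stand-ins, and a plain global replaces the per-thread state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio with only the fields needed here. */
struct toy_bio {
	struct toy_bio *next;	/* plays the role of ->bi_next */
	int depth;		/* stacking levels still to traverse */
	int id;
};

/* "Per-thread" state, kept global because the sketch is single-threaded. */
static struct toy_bio *pending_head, *pending_tail;
static int active;			/* set while a submission is running */
static int nesting, max_nesting;	/* measures real call depth */
static int order[16], nr_done;		/* order in which bios were handled */
static struct toy_bio children[16];
static int nr_children;

static void toy_submit(struct toy_bio *bio);

/* A stacking driver: handles the bio, then resubmits one level down. */
static void toy_make_request(struct toy_bio *bio)
{
	if (++nesting > max_nesting)
		max_nesting = nesting;
	order[nr_done++] = bio->id;
	if (bio->depth > 0) {
		struct toy_bio *child = &children[nr_children++];

		child->next = NULL;
		child->depth = bio->depth - 1;
		child->id = bio->id + 1;
		toy_submit(child);	/* looks recursive, but only queues */
	}
	nesting--;
}

static void toy_submit(struct toy_bio *bio)
{
	if (active) {
		/* Already inside a submission: link the bio and return. */
		bio->next = NULL;
		if (pending_tail)
			pending_tail->next = bio;
		else
			pending_head = bio;
		pending_tail = bio;
		return;
	}

	active = 1;
	do {
		toy_make_request(bio);
		/* Pop the next deferred bio, clearing its link first. */
		bio = pending_head;
		if (bio) {
			pending_head = bio->next;
			if (!pending_head)
				pending_tail = NULL;
			bio->next = NULL;
		}
	} while (bio);
	active = 0;
}
```

Submitting a bio with depth 3 handles four bios in submission order while toy_make_request() never nests more than one level deep, which is the stack-usage property the message argues for.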
2007-05-01 11:53:42 +04:00
|
|
|
{
|
2011-09-15 16:01:40 +04:00
|
|
|
/*
|
2020-07-01 11:59:45 +03:00
|
|
|
* We only want one ->submit_bio to be active at a time, else stack
|
|
|
|
* usage with stacked devices could be a problem. Use current->bio_list
|
|
|
|
* to collect a list of requests submitted by a ->submit_bio method while
|
|
|
|
* it is active, and then process them after it returned.
|
2011-09-15 16:01:40 +04:00
|
|
|
*/
|
2021-10-12 14:12:24 +03:00
|
|
|
if (current->bio_list)
|
2017-03-10 09:00:47 +03:00
|
|
|
bio_list_add(&current->bio_list[0], bio);
|
2021-10-12 14:12:24 +03:00
|
|
|
else if (!bio->bi_bdev->bd_disk->fops->submit_bio)
|
|
|
|
__submit_bio_noacct_mq(bio);
|
|
|
|
else
|
|
|
|
__submit_bio_noacct(bio);
|
2007-05-01 11:53:42 +04:00
|
|
|
}
|
2022-02-16 07:45:10 +03:00
|
|
|
|
|
|
|
/**
|
|
|
|
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
|
|
|
|
* @bio: The bio describing the location in memory and on the device.
|
|
|
|
*
|
|
|
|
* This is a version of submit_bio() that shall only be used for I/O that is
|
|
|
|
* resubmitted to lower level drivers by stacking block drivers. All file
|
|
|
|
* systems and other upper level users of the block layer should use
|
|
|
|
* submit_bio() instead.
|
|
|
|
*/
|
|
|
|
void submit_bio_noacct(struct bio *bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2021-01-24 13:02:34 +03:00
|
|
|
struct block_device *bdev = bio->bi_bdev;
|
2021-10-14 17:03:29 +03:00
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
2017-06-03 10:38:06 +03:00
|
|
|
blk_status_t status = BLK_STS_IOERR;
|
2020-06-04 20:23:39 +03:00
|
|
|
struct blk_plug *plug;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
2020-06-04 20:23:39 +03:00
|
|
|
plug = blk_mq_plug(q, bio);
|
|
|
|
if (plug && plug->nowait)
|
|
|
|
bio->bi_opf |= REQ_NOWAIT;
|
|
|
|
|
2017-06-20 15:05:46 +03:00
|
|
|
/*
|
2020-05-28 22:19:29 +03:00
|
|
|
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
|
2020-09-23 23:06:51 +03:00
|
|
|
* if queue does not support NOWAIT.
|
2017-06-20 15:05:46 +03:00
|
|
|
*/
|
2020-09-23 23:06:51 +03:00
|
|
|
if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
|
2020-05-28 22:19:29 +03:00
|
|
|
goto not_supported;
|
2017-06-20 15:05:46 +03:00
|
|
|
|
2018-02-07 01:05:39 +03:00
|
|
|
if (should_fail_bio(bio))
|
2011-09-12 14:12:01 +04:00
|
|
|
goto end_io;
|
2021-01-24 13:02:35 +03:00
|
|
|
if (unlikely(bio_check_ro(bio)))
|
|
|
|
goto end_io;
|
2021-01-25 21:39:57 +03:00
|
|
|
if (!bio_flagged(bio, BIO_REMAPPED)) {
|
|
|
|
if (unlikely(bio_check_eod(bio)))
|
|
|
|
goto end_io;
|
|
|
|
if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
|
|
|
|
goto end_io;
|
|
|
|
}
|
2006-03-23 22:00:26 +03:00
|
|
|
|
2011-09-12 14:12:01 +04:00
|
|
|
/*
|
2020-07-01 11:59:44 +03:00
|
|
|
* Filter flush bios early so that bio based drivers without flush
|
|
|
|
* support don't have to worry about them.
|
2011-09-12 14:12:01 +04:00
|
|
|
*/
|
2017-01-27 19:08:23 +03:00
|
|
|
if (op_is_flush(bio->bi_opf) &&
|
2016-04-13 22:33:19 +03:00
|
|
|
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
|
2016-08-06 00:35:16 +03:00
|
|
|
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
|
2020-07-01 11:59:42 +03:00
|
|
|
if (!bio_sectors(bio)) {
|
2017-06-03 10:38:06 +03:00
|
|
|
status = BLK_STS_OK;
|
2007-11-02 10:49:08 +03:00
|
|
|
goto end_io;
|
|
|
|
}
|
2011-09-12 14:12:01 +04:00
|
|
|
}
|
2006-10-31 09:07:21 +03:00
|
|
|
|
2018-12-14 19:21:22 +03:00
|
|
|
if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
|
2021-10-12 14:12:21 +03:00
|
|
|
bio_clear_polled(bio);
|
2018-12-14 19:21:22 +03:00
|
|
|
|
2016-06-09 17:00:36 +03:00
|
|
|
switch (bio_op(bio)) {
|
|
|
|
case REQ_OP_DISCARD:
|
2022-04-15 07:52:55 +03:00
|
|
|
if (!bdev_max_discard_sectors(bdev))
|
2016-06-09 17:00:36 +03:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
2022-04-15 07:52:57 +03:00
|
|
|
if (!bdev_max_secure_erase_sectors(bdev))
|
2016-06-09 17:00:36 +03:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
2020-05-12 11:55:47 +03:00
|
|
|
case REQ_OP_ZONE_APPEND:
|
|
|
|
status = blk_check_zone_append(q, bio);
|
|
|
|
if (status != BLK_STS_OK)
|
|
|
|
goto end_io;
|
|
|
|
break;
|
2016-10-18 09:40:32 +03:00
|
|
|
case REQ_OP_ZONE_RESET:
|
2019-10-27 17:05:45 +03:00
|
|
|
case REQ_OP_ZONE_OPEN:
|
|
|
|
case REQ_OP_ZONE_CLOSE:
|
|
|
|
case REQ_OP_ZONE_FINISH:
|
2017-08-23 20:10:32 +03:00
|
|
|
if (!blk_queue_is_zoned(q))
|
2016-10-18 09:40:32 +03:00
|
|
|
goto not_supported;
|
2016-06-09 17:00:36 +03:00
|
|
|
break;
|
2019-08-01 20:26:36 +03:00
|
|
|
case REQ_OP_ZONE_RESET_ALL:
|
|
|
|
if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-11-30 23:28:59 +03:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
2017-08-23 20:10:32 +03:00
|
|
|
if (!q->limits.max_write_zeroes_sectors)
|
2016-11-30 23:28:59 +03:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-06-09 17:00:36 +03:00
|
|
|
default:
|
|
|
|
break;
|
2011-09-12 14:12:01 +04:00
|
|
|
}
|
2009-09-08 23:56:38 +04:00
|
|
|
|
2021-11-12 12:33:54 +03:00
|
|
|
if (blk_throtl_bio(bio))
|
2022-02-16 07:45:10 +03:00
|
|
|
return;
|
2020-06-27 10:31:58 +03:00
|
|
|
|
|
|
|
blk_cgroup_bio_start(bio);
|
|
|
|
blkcg_bio_issue_init(bio);
|
2011-09-15 16:01:40 +04:00
|
|
|
|
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 18:40:52 +03:00
|
|
|
if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
|
2020-12-03 19:21:36 +03:00
|
|
|
trace_block_bio_queue(bio);
|
2017-04-07 18:40:52 +03:00
|
|
|
/* Now that enqueuing has been traced, we need to trace
|
|
|
|
* completion as well.
|
|
|
|
*/
|
|
|
|
bio_set_flag(bio, BIO_TRACE_COMPLETION);
|
|
|
|
}
|
2022-02-16 07:45:10 +03:00
|
|
|
submit_bio_noacct_nocheck(bio);
|
2022-02-16 07:45:11 +03:00
|
|
|
return;
|
2008-11-28 07:32:03 +03:00
|
|
|
|
2016-06-09 17:00:36 +03:00
|
|
|
not_supported:
|
2017-06-03 10:38:06 +03:00
|
|
|
status = BLK_STS_NOTSUPP;
|
2008-11-28 07:32:03 +03:00
|
|
|
end_io:
|
2017-06-03 10:38:06 +03:00
|
|
|
bio->bi_status = status;
|
2015-07-20 16:29:37 +03:00
|
|
|
bio_endio(bio);
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 11:53:42 +04:00
|
|
|
}
|
2020-07-01 11:59:44 +03:00
|
|
|
EXPORT_SYMBOL(submit_bio_noacct);
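The commit message above describes turning recursive submission on stacked devices into iteration over a per-thread pending list linked through bi_next. The following is a minimal user-space sketch of that idea, not the kernel implementation: `demo_bio`, `demo_ctx`, `demo_handle()` and `demo_submit()` are hypothetical names, and the "driver" simply remaps each bio one level down until it reaches level 0.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 16

/* Hypothetical stand-in for a bio; only the link field matters here. */
struct demo_bio {
	int level;               /* 0 = bottom device, >0 = stacked device */
	struct demo_bio *next;   /* plays the role of bi_next */
};

/* Hypothetical per-thread state. */
struct demo_ctx {
	struct demo_bio *head;   /* pending list for this thread */
	struct demo_bio **tail;  /* NULL while no submission is active */
	int depth, max_depth;    /* nesting of the driver handler */
	int handled;             /* bios that reached a handler */
	struct demo_bio pool[DEMO_MAX];
	int used;
};

static void demo_submit(struct demo_ctx *c, struct demo_bio *bio);

/* Stacked "driver": remaps the bio one level down and resubmits it. */
static void demo_handle(struct demo_ctx *c, struct demo_bio *bio)
{
	c->depth++;
	if (c->depth > c->max_depth)
		c->max_depth = c->depth;
	c->handled++;
	if (bio->level > 0) {
		struct demo_bio *child;

		assert(c->used < DEMO_MAX);
		child = &c->pool[c->used++];
		child->level = bio->level - 1;
		child->next = NULL;
		demo_submit(c, child);   /* the would-be recursive call */
	}
	c->depth--;
}

static void demo_submit(struct demo_ctx *c, struct demo_bio *bio)
{
	if (c->tail) {
		/* A submission is already active: queue the bio and return. */
		bio->next = NULL;
		*c->tail = bio;
		c->tail = &bio->next;
		return;
	}

	/* Outermost call: drain the pending list iteratively. */
	c->head = NULL;
	c->tail = &c->head;
	do {
		bio->next = NULL;        /* clear the link before the real call */
		demo_handle(c, bio);
		bio = c->head;
		if (bio) {
			c->head = bio->next;
			if (!c->head)
				c->tail = &c->head;
		}
	} while (bio);
	c->tail = NULL;              /* deactivate */
}
```

Submitting a level-3 bio in this sketch runs the handler four times, but the handler never nests deeper than one frame, which is the stack-usage property the commit message is after. Bios are also handled in the order they were queued.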
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/**
|
2008-08-19 22:13:11 +04:00
|
|
|
* submit_bio - submit a bio to the block device layer for I/O
|
2005-04-17 02:20:36 +04:00
|
|
|
* @bio: The &struct bio which describes the I/O
|
|
|
|
*
|
2020-04-28 14:27:53 +03:00
|
|
|
* submit_bio() is used to submit I/O requests to block devices. It is passed a
|
|
|
|
* fully set up &struct bio that describes the I/O that needs to be done. The
|
2021-01-24 13:02:34 +03:00
|
|
|
* bio will be sent to the device described by the bi_bdev field.
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
2020-04-28 14:27:53 +03:00
|
|
|
* The success/failure status of the request, along with notification of
|
|
|
|
* completion, is delivered asynchronously through the ->bi_end_io() callback
|
|
|
|
* in @bio. The bio must NOT be touched by the caller until ->bi_end_io() has
|
|
|
|
* been called.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2021-10-12 14:12:24 +03:00
|
|
|
void submit_bio(struct bio *bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2019-06-27 23:39:52 +03:00
|
|
|
if (blkcg_punt_bio_submit(bio))
|
2021-10-12 14:12:24 +03:00
|
|
|
return;
|
2019-06-27 23:39:52 +03:00
|
|
|
|
2007-09-27 15:01:25 +04:00
|
|
|
/*
|
|
|
|
* If it's a regular read/write or a barrier with data attached,
|
|
|
|
* go through the normal accounting stuff before submission.
|
|
|
|
*/
|
2012-09-18 20:19:25 +04:00
|
|
|
if (bio_has_data(bio)) {
|
2022-02-09 11:28:28 +03:00
|
|
|
unsigned int count = bio_sectors(bio);
|
2012-09-18 20:19:27 +04:00
|
|
|
|
2016-06-05 22:31:45 +03:00
|
|
|
if (op_is_write(bio_op(bio))) {
|
2007-09-27 15:01:25 +04:00
|
|
|
count_vm_events(PGPGOUT, count);
|
|
|
|
} else {
|
2013-10-12 02:44:27 +04:00
|
|
|
task_io_account_read(bio->bi_iter.bi_size);
|
2007-09-27 15:01:25 +04:00
|
|
|
count_vm_events(PGPGIN, count);
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2019-08-08 22:03:00 +03:00
|
|
|
/*
|
2020-04-28 14:27:54 +03:00
|
|
|
* If we're reading data that is part of the userspace workingset, count
|
|
|
|
* submission time as memory stall. When the device is congested, or
|
|
|
|
* the submitting cgroup IO-throttled, submission can be a significant
|
|
|
|
* part of overall IO time.
|
2019-08-08 22:03:00 +03:00
|
|
|
*/
|
2020-04-28 14:27:54 +03:00
|
|
|
if (unlikely(bio_op(bio) == REQ_OP_READ &&
|
|
|
|
bio_flagged(bio, BIO_WORKINGSET))) {
|
|
|
|
unsigned long pflags;
|
2019-08-08 22:03:00 +03:00
|
|
|
|
2020-04-28 14:27:54 +03:00
|
|
|
psi_memstall_enter(&pflags);
|
2021-10-12 14:12:24 +03:00
|
|
|
submit_bio_noacct(bio);
|
2019-08-08 22:03:00 +03:00
|
|
|
psi_memstall_leave(&pflags);
|
2021-10-12 14:12:24 +03:00
|
|
|
return;
|
2020-04-28 14:27:54 +03:00
|
|
|
}
|
|
|
|
|
2021-10-12 14:12:24 +03:00
|
|
|
submit_bio_noacct(bio);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(submit_bio);
|
|
|
|
|
2021-10-12 14:12:24 +03:00
|
|
|
/**
|
|
|
|
* bio_poll - poll for BIO completions
|
|
|
|
* @bio: bio to poll for
|
2021-11-25 19:20:55 +03:00
|
|
|
* @iob: batches of IO
|
2021-10-12 14:12:24 +03:00
|
|
|
* @flags: BLK_POLL_* flags that control the behavior
|
|
|
|
*
|
|
|
|
* Poll for completions on queue associated with the bio. Returns number of
|
|
|
|
* completed entries found.
|
|
|
|
*
|
|
|
|
* Note: the caller must either be the context that submitted @bio, or
|
|
|
|
* be in a RCU critical section to prevent freeing of @bio.
|
|
|
|
*/
|
2021-10-12 18:24:29 +03:00
|
|
|
int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
|
2021-10-12 14:12:24 +03:00
|
|
|
{
|
2021-10-20 00:24:11 +03:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
2021-10-12 14:12:24 +03:00
|
|
|
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
|
2022-03-05 05:08:03 +03:00
|
|
|
int ret = 0;
|
2021-10-12 14:12:24 +03:00
|
|
|
|
|
|
|
if (cookie == BLK_QC_T_NONE ||
|
|
|
|
!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
|
|
|
|
return 0;
|
|
|
|
|
2022-01-27 10:05:49 +03:00
|
|
|
blk_flush_plug(current->plug, false);
|
2021-10-12 14:12:24 +03:00
|
|
|
|
|
|
|
if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
|
|
|
|
return 0;
|
2022-03-05 05:08:03 +03:00
|
|
|
if (queue_is_mq(q)) {
|
2021-10-12 18:24:29 +03:00
|
|
|
ret = blk_mq_poll(q, cookie, iob, flags);
|
2022-03-05 05:08:03 +03:00
|
|
|
} else {
|
|
|
|
struct gendisk *disk = q->disk;
|
|
|
|
|
|
|
|
if (disk && disk->fops->poll_bio)
|
|
|
|
ret = disk->fops->poll_bio(bio, iob, flags);
|
|
|
|
}
|
2021-10-12 14:12:24 +03:00
|
|
|
blk_queue_exit(q);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(bio_poll);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper to implement file_operations.iopoll. Requires the bio to be stored
|
|
|
|
* in iocb->private, and cleared before freeing the bio.
|
|
|
|
*/
|
2021-10-12 18:24:29 +03:00
|
|
|
int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
|
|
|
|
unsigned int flags)
|
2021-10-12 14:12:24 +03:00
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: the bio cache only uses SLAB_TYPESAFE_BY_RCU, so bio can
|
|
|
|
* point to a freshly allocated bio at this point. If that happens
|
|
|
|
* we have a few cases to consider:
|
|
|
|
*
|
|
|
|
* 1) the bio is being initialized and bi_bdev is NULL. We can just
|
|
|
|
* do nothing in this case
|
|
|
|
* 2) the bio points to a device that is not poll enabled. bio_poll will catch
|
|
|
|
* this and return 0
|
|
|
|
* 3) the bio points to a poll capable device, including but not
|
|
|
|
* limited to the one that the original bio pointed to. In this
|
|
|
|
* case we will call into the actual poll method and poll for I/O,
|
|
|
|
* even if we don't need to, but it won't cause harm either.
|
|
|
|
*
|
|
|
|
* For cases 2) and 3) above the RCU grace period ensures that bi_bdev
|
|
|
|
* is still allocated. Because partitions hold a reference to the whole
|
|
|
|
* device bdev and thus disk, the disk is also still valid. Grabbing
|
|
|
|
* a reference to the queue in bio_poll() ensures the hctxs and requests
|
|
|
|
* are still valid as well.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
bio = READ_ONCE(kiocb->private);
|
|
|
|
if (bio && bio->bi_bdev)
|
2021-10-12 18:24:29 +03:00
|
|
|
ret = bio_poll(bio, iob, flags);
|
2021-10-12 14:12:24 +03:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
|
|
|
|
|
2021-11-17 09:14:01 +03:00
|
|
|
void update_io_ticks(struct block_device *part, unsigned long now, bool end)
|
2020-05-27 08:24:13 +03:00
|
|
|
{
|
|
|
|
unsigned long stamp;
|
|
|
|
again:
|
2020-11-24 11:36:54 +03:00
|
|
|
stamp = READ_ONCE(part->bd_stamp);
|
2021-07-06 00:47:26 +03:00
|
|
|
if (unlikely(time_after(now, stamp))) {
|
2020-11-24 11:36:54 +03:00
|
|
|
if (likely(cmpxchg(&part->bd_stamp, stamp, now) == stamp))
|
2020-05-27 08:24:13 +03:00
|
|
|
__part_stat_add(part, io_ticks, end ? now - stamp : 1);
|
|
|
|
}
|
2020-11-24 11:36:54 +03:00
|
|
|
if (part->bd_partno) {
|
|
|
|
part = bdev_whole(part);
|
2020-05-27 08:24:13 +03:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
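The compare-and-exchange pattern in update_io_ticks() can be tried outside the kernel. The sketch below is a user-space approximation using C11 atomics: `demo_part`, `demo_time_after()` and `demo_update_io_ticks()` are hypothetical names, the `whole` pointer stands in for bdev_whole(), and a plain counter replaces the per-CPU statistics the kernel uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block_device with statistics. */
struct demo_part {
	_Atomic unsigned long stamp;   /* plays the role of bd_stamp */
	unsigned long io_ticks;        /* plain counter; the kernel uses per-CPU stats */
	struct demo_part *whole;       /* NULL for the whole device */
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool demo_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void demo_update_io_ticks(struct demo_part *part, unsigned long now,
				 bool end)
{
	assert(part != NULL);

	/* Walk from the partition up to the whole device. */
	while (part) {
		unsigned long stamp = atomic_load(&part->stamp);

		/*
		 * Only the caller that wins the exchange accounts the tick,
		 * so concurrent updaters do not count the same interval twice.
		 */
		if (demo_time_after(now, stamp) &&
		    atomic_compare_exchange_strong(&part->stamp, &stamp, now))
			part->io_ticks += end ? now - stamp : 1;

		part = part->whole;
	}
}
```

At I/O start (`end == false`) the sketch charges a single tick; at I/O end it charges the whole interval since the last stamp. A second call within the same tick changes nothing, because `now` is no longer after the stored stamp.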
|
|
|
|
|
2020-11-24 11:36:54 +03:00
|
|
|
static unsigned long __part_start_io_acct(struct block_device *part,
|
2022-01-28 18:58:39 +03:00
|
|
|
unsigned int sectors, unsigned int op,
|
|
|
|
unsigned long start_time)
|
2020-05-27 08:24:04 +03:00
|
|
|
{
|
|
|
|
const int sgrp = op_stat_group(op);
|
|
|
|
|
|
|
|
part_stat_lock();
|
2022-01-28 18:58:39 +03:00
|
|
|
update_io_ticks(part, start_time, false);
|
2020-05-27 08:24:04 +03:00
|
|
|
part_stat_inc(part, ios[sgrp]);
|
|
|
|
part_stat_add(part, sectors[sgrp], sectors);
|
|
|
|
part_stat_local_inc(part, in_flight[op_is_write(op)]);
|
|
|
|
part_stat_unlock();
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
|
|
|
|
2022-01-28 18:58:39 +03:00
|
|
|
return start_time;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* bio_start_io_acct_time - start I/O accounting for bio based drivers
|
|
|
|
* @bio: bio to start accounting for
|
|
|
|
* @start_time: start time that should be passed back to bio_end_io_acct().
|
|
|
|
*/
|
|
|
|
void bio_start_io_acct_time(struct bio *bio, unsigned long start_time)
|
|
|
|
{
|
|
|
|
__part_start_io_acct(bio->bi_bdev, bio_sectors(bio),
|
|
|
|
bio_op(bio), start_time);
|
2020-05-27 08:24:04 +03:00
|
|
|
}
|
2022-01-28 18:58:39 +03:00
|
|
|
EXPORT_SYMBOL_GPL(bio_start_io_acct_time);
|
2020-09-01 01:27:23 +03:00
|
|
|
|
2021-01-24 13:02:37 +03:00
|
|
|
/**
|
|
|
|
* bio_start_io_acct - start I/O accounting for bio based drivers
|
|
|
|
* @bio: bio to start accounting for
|
|
|
|
*
|
|
|
|
* Returns the start time that should be passed back to bio_end_io_acct().
|
|
|
|
*/
|
|
|
|
unsigned long bio_start_io_acct(struct bio *bio)
|
2020-09-01 01:27:23 +03:00
|
|
|
{
|
2022-01-28 18:58:39 +03:00
|
|
|
return __part_start_io_acct(bio->bi_bdev, bio_sectors(bio),
|
|
|
|
bio_op(bio), jiffies);
|
2020-09-01 01:27:23 +03:00
|
|
|
}
|
2021-01-24 13:02:37 +03:00
|
|
|
EXPORT_SYMBOL_GPL(bio_start_io_acct);
|
2020-09-01 01:27:23 +03:00
|
|
|
|
|
|
|
unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
|
|
|
|
unsigned int op)
|
|
|
|
{
|
2022-01-28 18:58:39 +03:00
|
|
|
return __part_start_io_acct(disk->part0, sectors, op, jiffies);
|
2020-09-01 01:27:23 +03:00
|
|
|
}
|
2020-05-27 08:24:04 +03:00
|
|
|
EXPORT_SYMBOL(disk_start_io_acct);
|
|
|
|
|
2020-11-24 11:36:54 +03:00
|
|
|
static void __part_end_io_acct(struct block_device *part, unsigned int op,
|
2020-09-01 01:27:23 +03:00
|
|
|
unsigned long start_time)
|
2020-05-27 08:24:04 +03:00
|
|
|
{
|
|
|
|
const int sgrp = op_stat_group(op);
|
|
|
|
unsigned long now = READ_ONCE(jiffies);
|
|
|
|
unsigned long duration = now - start_time;
|
2018-12-06 19:41:19 +03:00
|
|
|
|
2020-05-27 08:24:04 +03:00
|
|
|
part_stat_lock();
|
|
|
|
update_io_ticks(part, now, true);
|
|
|
|
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
|
|
|
|
part_stat_local_dec(part, in_flight[op_is_write(op)]);
|
2013-10-24 12:20:05 +04:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
2020-09-01 01:27:23 +03:00
|
|
|
|
2021-01-24 13:02:37 +03:00
|
|
|
void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time,
|
|
|
|
struct block_device *orig_bdev)
|
2020-09-01 01:27:23 +03:00
|
|
|
{
|
2021-01-24 13:02:37 +03:00
|
|
|
__part_end_io_acct(orig_bdev, bio_op(bio), start_time);
|
2020-09-01 01:27:23 +03:00
|
|
|
}
|
2021-01-24 13:02:37 +03:00
|
|
|
EXPORT_SYMBOL_GPL(bio_end_io_acct_remapped);
|
2020-09-01 01:27:23 +03:00
|
|
|
|
|
|
|
void disk_end_io_acct(struct gendisk *disk, unsigned int op,
|
|
|
|
unsigned long start_time)
|
|
|
|
{
|
2020-11-24 11:36:54 +03:00
|
|
|
__part_end_io_acct(disk->part0, op, start_time);
|
2020-09-01 01:27:23 +03:00
|
|
|
}
|
2020-05-27 08:24:04 +03:00
|
|
|
EXPORT_SYMBOL(disk_end_io_acct);
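The start/end accounting helpers are meant to be called as a pair, with the value returned by the start call handed back to the end call. The sketch below models that contract in user space. It is a simplification under stated assumptions: `demo_stats`, `demo_clock`, `demo_start_io_acct()` and `demo_end_io_acct()` are hypothetical names, statistics are indexed only by read or write, and durations are kept in clock ticks rather than converted to nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical statistics block; index 0 is reads, index 1 is writes. */
struct demo_stats {
	unsigned long ios[2];
	unsigned long sectors[2];
	unsigned long ticks[2];      /* accumulated duration in clock ticks */
	long in_flight[2];
};

static unsigned long demo_clock;     /* hypothetical stand-in for jiffies */

/* Record the start of an I/O and return the start time to the caller. */
static unsigned long demo_start_io_acct(struct demo_stats *s,
					unsigned int sectors, bool write)
{
	s->ios[write]++;
	s->sectors[write] += sectors;
	s->in_flight[write]++;
	return demo_clock;
}

/* Record completion; start_time must come from the matching start call. */
static void demo_end_io_acct(struct demo_stats *s, bool write,
			     unsigned long start_time)
{
	assert(s->in_flight[write] > 0);
	s->ticks[write] += demo_clock - start_time;
	s->in_flight[write]--;
}
```

A caller would typically stash the returned start time next to the in-progress I/O and pass it back from its completion path, so that the in-flight counter returns to zero once every started I/O has ended.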
|
2013-10-24 12:20:05 +04:00
|
|
|
|
2008-10-01 18:12:15 +04:00
|
|
|
/**
|
|
|
|
* blk_lld_busy - Check if underlying low-level drivers of a device are busy
|
|
|
|
* @q : the queue of the device being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Check if underlying low-level drivers of a device are busy.
|
|
|
|
* If the drivers want to export their busy state, they must set their own
|
|
|
|
* exporting function using blk_queue_lld_busy() first.
|
|
|
|
*
|
|
|
|
* Basically, this function is used only by request stacking drivers
|
|
|
|
* to stop dispatching requests to underlying devices when underlying
|
|
|
|
* devices are busy. This behavior helps more I/O merging on the queue
|
|
|
|
* of the request stacking driver and prevents I/O throughput regression
|
|
|
|
* on burst I/O load.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - Not busy (The request stacking driver should dispatch request)
|
|
|
|
* 1 - Busy (The request stacking driver should stop dispatching request)
|
|
|
|
*/
|
|
|
|
int blk_lld_busy(struct request_queue *q)
|
|
|
|
{
|
2018-11-15 22:22:51 +03:00
|
|
|
if (queue_is_mq(q) && q->mq_ops->busy)
|
2018-10-29 19:15:10 +03:00
|
|
|
return q->mq_ops->busy(q);
|
2008-10-01 18:12:15 +04:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_lld_busy);
|
|
|
|
|
2014-04-08 19:15:35 +04:00
|
|
|
int kblockd_schedule_work(struct work_struct *work)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
return queue_work(kblockd_workqueue, work);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_work);
|
|
|
|
|
2017-04-10 18:54:55 +03:00
|
|
|
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
|
|
|
|
|
2021-10-06 15:34:11 +03:00
|
|
|
void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this is a nested plug, don't actually assign it.
|
|
|
|
*/
|
|
|
|
if (tsk->plug)
|
|
|
|
return;
|
|
|
|
|
2021-10-18 19:12:12 +03:00
|
|
|
plug->mq_list = NULL;
|
2021-10-06 15:34:11 +03:00
|
|
|
plug->cached_rq = NULL;
|
|
|
|
plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
|
|
|
|
plug->rq_count = 0;
|
|
|
|
plug->multiple_queues = false;
|
2021-10-19 15:02:30 +03:00
|
|
|
plug->has_elevator = false;
|
2021-10-06 15:34:11 +03:00
|
|
|
plug->nowait = false;
|
|
|
|
INIT_LIST_HEAD(&plug->cb_list);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Store ordering should not be needed here, since a potential
|
|
|
|
* preempt will imply a full memory barrier
|
|
|
|
*/
|
|
|
|
tsk->plug = plug;
|
|
|
|
}
|
|
|
|
|
2011-09-21 12:00:16 +04:00
|
|
|
/**
|
|
|
|
* blk_start_plug - initialize blk_plug and track it inside the task_struct
|
|
|
|
* @plug: The &struct blk_plug that needs to be initialized
|
|
|
|
*
|
|
|
|
* Description:
|
2019-01-09 00:57:34 +03:00
|
|
|
* blk_start_plug() indicates to the block layer an intent by the caller
|
|
|
|
* to submit multiple I/O requests in a batch. The block layer may use
|
|
|
|
* this hint to defer submitting I/Os from the caller until blk_finish_plug()
|
|
|
|
* is called. However, the block layer may choose to submit requests
|
|
|
|
* before a call to blk_finish_plug() if the number of queued I/Os
|
|
|
|
* exceeds %BLK_MAX_REQUEST_COUNT, or if the size of the I/O is larger than
|
|
|
|
* %BLK_PLUG_FLUSH_SIZE. The queued I/Os may also be submitted early if
|
|
|
|
* the task schedules (see below).
|
|
|
|
*
|
2011-09-21 12:00:16 +04:00
|
|
|
* Tracking blk_plug inside the task_struct will help with auto-flushing the
|
|
|
|
* pending I/O should the task end up blocking between blk_start_plug() and
|
|
|
|
* blk_finish_plug(). This is important from a performance perspective, but
|
|
|
|
* also ensures that we don't deadlock. For instance, if the task is blocking
|
|
|
|
* for a memory allocation, memory reclaim could end up wanting to free a
|
|
|
|
* page belonging to that request that is currently residing in our private
|
|
|
|
* plug. By flushing the pending I/O when the process goes to sleep, we avoid
|
|
|
|
* this kind of deadlock.
|
|
|
|
*/
|
2011-03-08 15:19:51 +03:00
|
|
|
void blk_start_plug(struct blk_plug *plug)
|
|
|
|
{
|
2021-10-06 15:34:11 +03:00
|
|
|
blk_start_plug_nr_ios(plug, 1);
|
2011-03-08 15:19:51 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_plug);
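The nested-plug rule in blk_start_plug_nr_ios() (a plug is only installed when the task has none) can be illustrated with a small user-space model. All names below (`demo_plug`, `demo_task`, `demo_start_plug_nr_ios()`, `demo_finish_plug()`) are hypothetical, the cap of 32 is an assumed stand-in for BLK_MAX_REQUEST_COUNT, and the finish helper only models the ownership check, not the flushing of queued I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cap standing in for BLK_MAX_REQUEST_COUNT. */
#define DEMO_MAX_REQUEST_COUNT 32

struct demo_plug {
	unsigned short nr_ios;
	unsigned short rq_count;
};

/* Hypothetical stand-in for the plug pointer kept in task_struct. */
struct demo_task {
	struct demo_plug *plug;
};

static void demo_start_plug_nr_ios(struct demo_task *tsk,
				   struct demo_plug *plug,
				   unsigned short nr_ios)
{
	assert(plug != NULL);

	/* Nested plug: leave the outer plug in charge. */
	if (tsk->plug)
		return;

	plug->nr_ios = nr_ios < DEMO_MAX_REQUEST_COUNT ?
			nr_ios : DEMO_MAX_REQUEST_COUNT;
	plug->rq_count = 0;
	tsk->plug = plug;
}

/* Only the plug that was actually installed may remove itself. */
static void demo_finish_plug(struct demo_task *tsk, struct demo_plug *plug)
{
	if (plug == tsk->plug)
		tsk->plug = NULL;
}
```

With this shape, inner start and finish calls become no-ops while an outer plug is active, so batching spans the outermost pair.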
|
|
|
|
|
2012-07-31 11:08:15 +04:00
|
|
|
static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
|
2011-04-18 11:52:22 +04:00
|
|
|
{
|
|
|
|
LIST_HEAD(callbacks);
|
|
|
|
|
2012-07-31 11:08:15 +04:00
|
|
|
while (!list_empty(&plug->cb_list)) {
|
|
|
|
list_splice_init(&plug->cb_list, &callbacks);
|
2011-04-18 11:52:22 +04:00
|
|
|
|
2012-07-31 11:08:15 +04:00
|
|
|
while (!list_empty(&callbacks)) {
|
|
|
|
struct blk_plug_cb *cb = list_first_entry(&callbacks,
|
2011-04-18 11:52:22 +04:00
|
|
|
struct blk_plug_cb,
|
|
|
|
list);
|
2012-07-31 11:08:15 +04:00
|
|
|
list_del(&cb->list);
|
2012-07-31 11:08:15 +04:00
|
|
|
cb->callback(cb, from_schedule);
|
2012-07-31 11:08:15 +04:00
|
|
|
}
|
2011-04-18 11:52:22 +04:00
|
|
|
}
|
|
|
|
}
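The nested loops in flush_plug_callbacks() first move the whole callback list onto a private list and then drain it. One plausible reading is that the outer loop exists because a callback may queue further callbacks on the plug while it runs. The userspace sketch below models that case with a plain singly linked list; all `model_*` names are hypothetical, and the kernel's `list_head` helpers are deliberately replaced by hand-written pointer handling.

```c
#include <assert.h>
#include <stddef.h>

struct model_cb;

/* Hypothetical stand-in for the plug: a callback list plus a run counter. */
struct model_plug {
	struct model_cb *cb_list;	/* callbacks waiting to run */
	int nr_run;			/* callbacks executed so far */
};

struct model_cb {
	struct model_cb *next;
	void (*callback)(struct model_cb *cb);
	struct model_plug *plug;	/* plug to re-register on */
	struct model_cb *child;		/* optional callback queued while running */
};

static void model_add_cb(struct model_plug *plug, struct model_cb *cb)
{
	cb->next = plug->cb_list;
	plug->cb_list = cb;
}

/* Counts the run and, if a child is set, queues it on the same plug. */
static void model_run(struct model_cb *cb)
{
	cb->plug->nr_run++;
	if (cb->child)
		model_add_cb(cb->plug, cb->child);
}

static void model_flush_callbacks(struct model_plug *plug)
{
	/* Outer loop: repeat while callbacks keep adding new entries. */
	while (plug->cb_list) {
		/* Take the whole list private, leaving the plug's list empty. */
		struct model_cb *batch = plug->cb_list;

		plug->cb_list = NULL;
		while (batch) {
			struct model_cb *cb = batch;

			batch = cb->next;	/* unlink before calling */
			cb->next = NULL;
			cb->callback(cb);
		}
	}
}

/* A parent callback queues a child during the flush; both must run. */
static int model_demo(void)
{
	struct model_plug plug = { NULL, 0 };
	struct model_cb child = { NULL, model_run, &plug, NULL };
	struct model_cb parent = { NULL, model_run, &plug, &child };

	model_add_cb(&plug, &parent);
	model_flush_callbacks(&plug);
	assert(plug.cb_list == NULL);
	return plug.nr_run;
}
```

Without the outer loop, the child queued by the parent would remain on the list after the flush returned.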
|
|
|
|
|
2012-07-31 11:08:14 +04:00
|
|
|
struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
|
|
|
|
int size)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = current->plug;
|
|
|
|
struct blk_plug_cb *cb;
|
|
|
|
|
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
list_for_each_entry(cb, &plug->cb_list, list)
|
|
|
|
if (cb->callback == unplug && cb->data == data)
|
|
|
|
return cb;
|
|
|
|
|
|
|
|
/* Not currently on the callback list */
|
|
|
|
BUG_ON(size < sizeof(*cb));
|
|
|
|
cb = kzalloc(size, GFP_ATOMIC);
|
|
|
|
if (cb) {
|
|
|
|
cb->data = data;
|
|
|
|
cb->callback = unplug;
|
|
|
|
list_add(&cb->list, &plug->cb_list);
|
|
|
|
}
|
|
|
|
return cb;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_check_plugged);
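blk_check_plugged() follows a find-or-create pattern: it returns the existing callback for a given (function, data) pair, or allocates a zeroed one of the caller-requested size so that the caller can embed the callback in a larger structure. The following userspace sketch models that pattern. All `model_*` names are hypothetical, `calloc()` stands in for the kernel allocation, and the embedding structure places the callback first so that a simple cast suffices in place of `container_of()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_cb;
typedef void (*model_cb_fn)(struct model_cb *cb);

struct model_cb {
	struct model_cb *next;
	model_cb_fn callback;
	void *data;
};

struct model_plug {
	struct model_cb *cb_list;
};

/* Hypothetical caller state with the callback embedded as first member. */
struct model_driver_cb {
	struct model_cb cb;
	int queued;
};

static struct model_cb *model_check_plugged(struct model_plug *plug,
					    model_cb_fn unplug, void *data,
					    size_t size)
{
	struct model_cb *cb;

	if (!plug)
		return NULL;

	/* Reuse an entry already registered for this (callback, data) pair. */
	for (cb = plug->cb_list; cb; cb = cb->next)
		if (cb->callback == unplug && cb->data == data)
			return cb;

	/* Not on the list yet: allocate zeroed storage of the caller's size. */
	assert(size >= sizeof(*cb));
	cb = calloc(1, size);
	if (cb) {
		cb->data = data;
		cb->callback = unplug;
		cb->next = plug->cb_list;
		plug->cb_list = cb;
	}
	return cb;
}

static void model_unplug(struct model_cb *cb)
{
	(void)cb;
}

/* Returns 1 if lookups deduplicate per (callback, data) pair, else 0. */
static int model_demo(void)
{
	struct model_plug plug = { NULL };
	int dev_a = 0, dev_b = 0;
	struct model_cb *a1, *a2, *b;
	int ok;

	a1 = model_check_plugged(&plug, model_unplug, &dev_a,
				 sizeof(struct model_driver_cb));
	a2 = model_check_plugged(&plug, model_unplug, &dev_a,
				 sizeof(struct model_driver_cb));
	b = model_check_plugged(&plug, model_unplug, &dev_b,
				sizeof(struct model_driver_cb));

	ok = a1 && b && a1 == a2 && a1 != b &&
	     ((struct model_driver_cb *)a1)->queued == 0;

	/* Release every entry that was allocated above. */
	while (plug.cb_list) {
		struct model_cb *cb = plug.cb_list;

		plug.cb_list = cb->next;
		free(cb);
	}
	return ok;
}
```

Passing the size of the outer structure is what lets the caller keep private state, such as the `queued` counter here, next to the callback entry.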
|
|
|
|
|
2022-01-27 10:05:49 +03:00
|
|
|
void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
|
2011-03-08 15:19:51 +03:00
|
|
|
{
|
2021-10-20 17:41:18 +03:00
|
|
|
if (!list_empty(&plug->cb_list))
|
|
|
|
flush_plug_callbacks(plug, from_schedule);
|
2021-10-18 19:12:12 +03:00
|
|
|
if (!rq_list_empty(plug->mq_list))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
|
|
|
blk_mq_flush_plug_list(plug, from_schedule);
|
2021-11-03 14:49:07 +03:00
|
|
|
/*
|
|
|
|
* Unconditionally flush out cached requests, even if the unplug
|
|
|
|
* event came from schedule. Since we now hold references to the
|
|
|
|
* queue for cached requests, we don't want a blocked task holding
|
|
|
|
* up a queue freeze/quiesce event.
|
|
|
|
*/
|
|
|
|
if (unlikely(!rq_list_empty(plug->cached_rq)))
|
2021-10-06 15:34:11 +03:00
|
|
|
blk_mq_free_plug_rqs(plug);
|
2011-03-08 15:19:51 +03:00
|
|
|
}
|
|
|
|
|
2019-01-09 00:57:34 +03:00
|
|
|
/**
|
|
|
|
* blk_finish_plug - mark the end of a batch of submitted I/O
|
|
|
|
* @plug: The &struct blk_plug passed to blk_start_plug()
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Indicate that a batch of I/O submissions is complete. This function
|
|
|
|
* must be paired with an initial call to blk_start_plug(). The intent
|
|
|
|
* is to allow the block layer to optimize I/O submission. See the
|
|
|
|
* documentation for blk_start_plug() for more information.
|
|
|
|
*/
|
2011-03-08 15:19:51 +03:00
|
|
|
void blk_finish_plug(struct blk_plug *plug)
|
|
|
|
{
|
2021-10-20 17:41:19 +03:00
|
|
|
if (plug == current->plug) {
|
2022-01-27 10:05:49 +03:00
|
|
|
__blk_flush_plug(plug, false);
|
2021-10-20 17:41:19 +03:00
|
|
|
current->plug = NULL;
|
|
|
|
}
|
2011-03-08 15:19:51 +03:00
|
|
|
}
|
2011-04-15 17:20:10 +04:00
|
|
|
EXPORT_SYMBOL(blk_finish_plug);
|
2011-03-08 15:19:51 +03:00
|
|
|
|
2020-05-14 11:45:09 +03:00
|
|
|
void blk_io_schedule(void)
|
|
|
|
{
|
|
|
|
/* Prevent hang_check timer from firing at us during very long I/O */
|
|
|
|
unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;
|
|
|
|
|
|
|
|
if (timeout)
|
|
|
|
io_schedule_timeout(timeout);
|
|
|
|
else
|
|
|
|
io_schedule();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_io_schedule);
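The timeout arithmetic in blk_io_schedule() can be checked in isolation. The sketch below uses hypothetical `model_*` helpers and illustrative inputs: a 120-second hung-task timeout and a tick rate of 250 are assumptions chosen for the example, not values taken from this file. It shows that the wait is capped at half the hung-task timeout, and that a timeout of zero selects the untimed wait.

```c
#include <assert.h>

/* Half of the hung-task timeout, expressed in ticks. */
static unsigned long model_io_wait_ticks(unsigned long hung_task_timeout_secs,
					 unsigned long hz)
{
	return hung_task_timeout_secs * hz / 2;
}

/* Nonzero if the timed wait would be chosen, zero for the untimed wait. */
static int model_uses_timed_wait(unsigned long hung_task_timeout_secs,
				 unsigned long hz)
{
	return model_io_wait_ticks(hung_task_timeout_secs, hz) != 0;
}
```

With the illustrative inputs, 120 * 250 / 2 gives 15000 ticks, that is, 60 seconds, so the task wakes up before a 120-second hung-task check would report it.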
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
int __init blk_dev_init(void)
|
|
|
|
{
|
2016-10-28 17:48:16 +03:00
|
|
|
BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
|
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-09 21:31:43 +03:00
|
|
|
sizeof_field(struct request, cmd_flags));
|
2016-10-28 17:48:16 +03:00
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-09 21:31:43 +03:00
|
|
|
sizeof_field(struct bio, bi_opf));
|
2021-12-03 16:15:32 +03:00
|
|
|
BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
|
|
|
|
__alignof__(struct request_queue)) !=
|
|
|
|
sizeof(struct request_queue));
|
2009-04-27 16:53:54 +04:00
|
|
|
|
2011-01-03 17:01:47 +03:00
|
|
|
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
|
|
|
|
kblockd_workqueue = alloc_workqueue("kblockd",
|
2014-06-12 01:43:54 +04:00
|
|
|
WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!kblockd_workqueue)
|
|
|
|
panic("Failed to create kblockd\n");
|
|
|
|
|
2015-11-21 00:16:46 +03:00
|
|
|
blk_requestq_cachep = kmem_cache_create("request_queue",
|
2007-07-24 11:28:11 +04:00
|
|
|
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2021-12-03 16:15:32 +03:00
|
|
|
blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
|
|
|
|
sizeof(struct request_queue) +
|
|
|
|
sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);
|
|
|
|
|
2017-02-01 01:53:20 +03:00
|
|
|
blk_debugfs_root = debugfs_create_dir("block", NULL);
|
|
|
|
|
2008-01-24 10:53:35 +03:00
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
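The BUILD_BUG_ON() checks at the top of blk_dev_init() guard a bit-packing layout at compile time: the operation code and the flags must fit together in one field. A userspace analogue using C11 `_Static_assert` is sketched below. The `MODEL_*` constants, the 32-bit field, and the `model_*` helpers are hypothetical and chosen only for illustration; they are not claimed to match the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Illustrative layout: low bits hold the operation, high bits the flags. */
enum {
	MODEL_OP_BITS = 8,
	MODEL_FLAG_BITS = 24,
	MODEL_OP_LAST = 36,	/* one past the largest operation code */
};

struct model_bio {
	uint32_t opf;		/* packed operation and flags */
};

/* Every operation code must be representable in MODEL_OP_BITS bits. */
_Static_assert(MODEL_OP_LAST < (1 << MODEL_OP_BITS),
	       "operation codes do not fit in MODEL_OP_BITS");

/* Operation and flag bits together must fit in the packed field. */
_Static_assert(MODEL_OP_BITS + MODEL_FLAG_BITS <=
	       CHAR_BIT * sizeof(((struct model_bio *)0)->opf),
	       "operation and flag bits exceed the packed field");

#define MODEL_OP_MASK ((1u << MODEL_OP_BITS) - 1)

/* Combine an operation code and flag bits into one packed value. */
static uint32_t model_pack(uint32_t op, uint32_t flags)
{
	return (flags << MODEL_OP_BITS) | (op & MODEL_OP_MASK);
}

static uint32_t model_op(uint32_t opf)
{
	return opf & MODEL_OP_MASK;
}

static uint32_t model_flags(uint32_t opf)
{
	return opf >> MODEL_OP_BITS;
}
```

If either constant were raised so that the sum exceeded the field width, compilation would fail instead of the packed value being silently truncated at run time.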
|