License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           270
GPL-2.0+ WITH Linux-syscall-note                          169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
LGPL-2.1+ WITH Linux-syscall-note                          15
GPL-1.0+ WITH Linux-syscall-note                           14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
LGPL-2.0+ WITH Linux-syscall-note                           4
LGPL-2.1 WITH Linux-syscall-note                            3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
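The two simplest heuristics above can be sketched in C as follows. This is only an illustration, not the tooling that was actually used: struct scan_result, is_uapi() and conclude_license() are hypothetical names, and the sketch leaves out the case where a uapi file with an existing GPL-family license gets the Linux-syscall-note appended.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input: what each scanner detected, NULL if nothing. */
struct scan_result {
	const char *path;
	const char *scanner_a;	/* e.g. the ScanCode result, or NULL */
	const char *scanner_b;	/* e.g. the Windriver result, or NULL */
};

/* True if the path has a "/uapi/" component. */
static int is_uapi(const char *path)
{
	return strstr(path, "/uapi/") != NULL;
}

/*
 * Returns the concluded identifier, or NULL when the scanners disagree
 * and the file needs manual inspection.
 */
static const char *conclude_license(const struct scan_result *r)
{
	assert(r != NULL && r->path != NULL);

	/* No license traces at all: the top level COPYING license applies. */
	if (!r->scanner_a && !r->scanner_b)
		return is_uapi(r->path) ?
			"GPL-2.0 WITH Linux-syscall-note" : "GPL-2.0";

	/* Both scanners agree: that becomes the concluded license. */
	if (r->scanner_a && r->scanner_b &&
	    strcmp(r->scanner_a, r->scanner_b) == 0)
		return r->scanner_a;

	return NULL;
}
```

A NULL return corresponds to the "manual inspection" branch; the lawyer confirmation and "flag for later" steps are human decisions and are not modelled.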
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
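The header/source distinction mentioned above could look roughly like the sketch below. It is not the script Thomas and Greg wrote; spdx_tag_line() and has_suffix() are hypothetical helpers. The assumption, taken from the kernel's documented convention, is that headers carry the tag in a /* */ comment while .c files use a // comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if name ends with suffix. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Formats the SPDX tag line for a file into buf.
 * Returns 0 on success, -1 for an unhandled file type or a short buffer.
 */
static int spdx_tag_line(const char *file, const char *id,
			 char *buf, size_t len)
{
	int n;

	assert(file != NULL && id != NULL && buf != NULL);

	if (has_suffix(file, ".h"))
		n = snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	else if (has_suffix(file, ".c"))
		n = snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
	else
		return -1;	/* other file types are not handled here */

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For include/linux/blkdev.h with "GPL-2.0" this yields the same first line that this file carries.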
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
/* SPDX-License-Identifier: GPL-2.0 */
2005-04-17 02:20:36 +04:00
#ifndef _LINUX_BLKDEV_H
#define _LINUX_BLKDEV_H

2012-05-14 10:29:23 +04:00
#include <linux/sched.h>
2017-02-01 18:36:40 +03:00
#include <linux/sched/clock.h>
2012-05-14 10:29:23 +04:00

2007-09-21 11:19:54 +04:00
#ifdef CONFIG_BLOCK

2005-04-17 02:20:36 +04:00
#include <linux/major.h>
#include <linux/genhd.h>
#include <linux/list.h>
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
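As a rough illustration of the per-cpu to hardware queue mapping described above, here is a minimal sketch. map_cpu_to_hw_queue() is a hypothetical helper, and the plain modulo is an assumption made for brevity; the mapping blk-mq really uses may take CPU topology into account.

```c
#include <assert.h>

/*
 * Hypothetical mapping of a per-cpu software queue to one of
 * nr_hw_queues hardware submission queues. With as many hardware
 * queues as CPUs this is 1:1; with fewer, several CPUs share a queue.
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
					unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);
	return cpu % nr_hw_queues;
}
```

With 8 CPUs and 8 hardware queues every CPU gets its own queue; with 2 hardware queues, CPUs 0, 2, 4 and 6 share queue 0.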
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
#include <linux/llist.h>
2005-04-17 02:20:36 +04:00
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
2015-05-23 00:13:32 +03:00
#include <linux/backing-dev-defs.h>
2005-04-17 02:20:36 +04:00
#include <linux/wait.h>
#include <linux/mempool.h>
2016-01-16 03:56:14 +03:00
#include <linux/pfn.h>
2005-04-17 02:20:36 +04:00
#include <linux/bio.h>
#include <linux/stringify.h>
2008-09-11 12:57:55 +04:00
#include <linux/gfp.h>
2007-07-09 14:40:35 +04:00
#include <linux/bsg.h>
2008-09-13 22:26:01 +04:00
#include <linux/smp.h>
2013-01-09 20:05:13 +04:00
#include <linux/rcupdate.h>
2014-07-01 20:34:38 +04:00
#include <linux/percpu-refcount.h>
2015-05-01 13:46:15 +03:00
#include <linux/scatterlist.h>
2016-10-18 09:40:33 +03:00
#include <linux/blkzoned.h>
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
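The four steps above can be modelled in a single-threaded sketch. This is not the kernel code: struct sketch_request and the rq_*() helpers are hypothetical, and RCU and the seqcount are left out entirely, so it only shows how a generation number lets the timeout path tell recycled instances of a request apart.

```c
#include <assert.h>
#include <stdint.h>

enum rq_state { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2 };

#define STATE_BITS	2
#define STATE_MASK	((UINT64_C(1) << STATE_BITS) - 1)

struct sketch_request {
	uint64_t gstate;		/* (generation << STATE_BITS) | state */
	uint64_t aborted_gstate;	/* instance the timeout path wants */
};

static void rq_init(struct sketch_request *rq)
{
	rq->gstate = RQ_IDLE;
	/* State bits of 3 are not a valid state, so this never matches. */
	rq->aborted_gstate = UINT64_MAX;
}

/* Step 1: going in-flight bumps the generation. */
static void rq_start(struct sketch_request *rq)
{
	uint64_t gen = (rq->gstate >> STATE_BITS) + 1;

	assert((rq->gstate & STATE_MASK) != RQ_IN_FLIGHT);
	rq->gstate = (gen << STATE_BITS) | RQ_IN_FLIGHT;
}

/* Step 2: the timeout path records the instance it decided to abort. */
static void rq_mark_aborted(struct sketch_request *rq, uint64_t seen_gstate)
{
	rq->aborted_gstate = seen_gstate;
}

/* Step 3: completion yields (returns 0) if this instance was marked. */
static int rq_complete(struct sketch_request *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return 0;
	rq->gstate = (rq->gstate & ~STATE_MASK) | RQ_COMPLETE;
	return 1;
}

/* Step 4: terminate only if the marked instance is still the current one. */
static int rq_timeout_should_terminate(const struct sketch_request *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

A stale timeout verdict, taken against an earlier instance, no longer matches once the request has been recycled, so the later instance completes normally.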
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 19:29:48 +03:00
#include <linux/seqlock.h>
#include <linux/u64_stats_sync.h>
2005-04-17 02:20:36 +04:00

2011-05-26 21:46:22 +04:00
struct module;
2006-03-22 19:52:04 +03:00
struct scsi_ioctl_command;

2005-04-17 02:20:36 +04:00
struct request_queue;
struct elevator_queue;
2006-03-23 22:00:26 +03:00
struct blk_trace;
2007-07-09 14:38:05 +04:00
struct request;
struct sg_io_hdr;
2011-08-01 00:05:09 +04:00
struct bsg_job;
2012-04-17 00:57:25 +04:00
struct blkcg_gq;
2014-09-25 19:23:43 +04:00
struct blk_flush_queue;
2015-10-15 15:10:48 +03:00
struct pr_ops;
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
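The "minimum latency over a window" check can be sketched as below. This is a simplified model, not the blk-wb code: struct wb_window and the wb_*() helpers are hypothetical, the rule of halving the depth per scale step is an assumption, and only the case where the target is exceeded is modelled.

```c
#include <assert.h>
#include <stdint.h>

struct wb_window {
	uint64_t min_lat_usec;	/* minimum latency seen in this window */
	uint64_t target_usec;	/* latency target, e.g. 2000 for 2 msec */
	int scale_count;
	unsigned int max_depth;	/* hypothetical unthrottled queue depth */
};

static void wb_window_reset(struct wb_window *w)
{
	w->min_lat_usec = UINT64_MAX;	/* "no sample yet" */
}

static void wb_record_latency(struct wb_window *w, uint64_t lat_usec)
{
	if (lat_usec < w->min_lat_usec)
		w->min_lat_usec = lat_usec;
}

/* Assumed shrink rule: halve the depth per scale step, never below 1. */
static unsigned int wb_depth(const struct wb_window *w)
{
	unsigned int depth = w->max_depth;
	int i;

	for (i = 0; i < w->scale_count && depth > 1; i++)
		depth /= 2;
	return depth;
}

/* End of a monitoring window: returns the queue depth for the next one. */
static unsigned int wb_window_end(struct wb_window *w)
{
	assert(w->max_depth > 0);

	/* Even the fastest request missed the target: throttle harder. */
	if (w->min_lat_usec != UINT64_MAX && w->min_lat_usec > w->target_usec)
		w->scale_count++;

	wb_window_reset(w);
	return wb_depth(w);
}
```

Using the minimum rather than the average means one slow request does not trigger throttling; the window only counts as bad when every request in it exceeded the target.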
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 22:38:14 +03:00
struct rq_wb;
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
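A userspace sketch of the callback-plus-bucket idea follows. The names (struct stat_callback, stat_add(), stat_window_end(), rw_bucket(), save_read_mean()) are hypothetical and do not reproduce the blk-stat API; per-cpu buffers, the RCU list and the timer are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BUCKETS 4

struct io_sample {
	int is_write;
	uint64_t lat_nsec;
};

struct stat_bucket {
	uint64_t nr;
	uint64_t sum;
};

struct stat_callback {
	/* Picks the bucket for a sample, or returns -1 to ignore it. */
	int (*bucket_fn)(const struct io_sample *);
	/* Called at the end of the window with the gathered statistics. */
	void (*timer_fn)(struct stat_callback *);
	unsigned int nr_buckets;
	struct stat_bucket buckets[MAX_BUCKETS];
	void *data;
};

/* Read vs. write bucketing, as wbt and polling are described to use. */
static int rw_bucket(const struct io_sample *s)
{
	return s->is_write ? 1 : 0;
}

/* Example consumer: stores the mean read latency into *data. */
static void save_read_mean(struct stat_callback *cb)
{
	uint64_t *mean = cb->data;

	if (cb->buckets[0].nr)
		*mean = cb->buckets[0].sum / cb->buckets[0].nr;
}

/* Completion side: account one I/O in the bucket chosen by the user. */
static void stat_add(struct stat_callback *cb, const struct io_sample *s)
{
	int b = cb->bucket_fn(s);

	assert(cb->nr_buckets <= MAX_BUCKETS);
	if (b < 0 || (unsigned int)b >= cb->nr_buckets)
		return;
	cb->buckets[b].nr++;
	cb->buckets[b].sum += s->lat_nsec;
}

/* Window end: hand the statistics to the user, then clear them. */
static void stat_window_end(struct stat_callback *cb)
{
	unsigned int i;

	cb->timer_fn(cb);
	for (i = 0; i < cb->nr_buckets; i++) {
		cb->buckets[i].nr = 0;
		cb->buckets[i].sum = 0;
	}
}
```

Because the window-end function both reports and clears the buckets, a consumer never sees statistics left over from an earlier window.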
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 18:56:08 +03:00
struct blk_queue_stats;
struct blk_stat_callback;
2005-04-17 02:20:36 +04:00

#define BLKDEV_MIN_RQ	4
#define BLKDEV_MAX_RQ	128	/* Default maximum */

2017-04-21 01:59:11 +03:00
/* Must be consistent with blk_mq_poll_stats_bkt() */
#define BLK_MQ_POLL_STATS_BKTS 16

2012-04-14 00:11:28 +04:00
/*
 * Maximum number of blkcg policies allowed to be registered concurrently.
 * Defined here to simplify include dependency.
 */
block, bfq: add full hierarchical scheduling and cgroups support
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
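The top-down selection described in the last sentence can be sketched as a walk along next-to-serve pointers. struct entity and pick_leaf() are hypothetical names, and the timestamp recomputation that keeps next_to_serve up to date is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct entity {
	const char *name;
	/* Cached next-to-serve child; NULL for a leaf (a bfq_queue). */
	struct entity *next_to_serve;
};

/*
 * Starting from the highest-level entity, follow the next-to-serve
 * child at each level until a leaf is reached.
 */
static struct entity *pick_leaf(struct entity *root)
{
	struct entity *e = root;

	assert(e != NULL);
	while (e->next_to_serve)
		e = e->next_to_serve;
	return e;
}
```

The upward pass in the text is what keeps each next_to_serve pointer current, so the downward walk needs no comparisons of its own.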
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-12 19:23:08 +03:00
#define BLKCG_MAX_POLS		3
2012-04-14 00:11:28 +04:00

2017-06-03 10:38:04 +03:00
typedef void (rq_end_io_fn)(struct request *, blk_status_t);
2005-04-17 02:20:36 +04:00

2012-06-05 07:40:59 +04:00
#define BLK_RL_SYNCFULL		(1U << 0)
#define BLK_RL_ASYNCFULL	(1U << 1)

2005-04-17 02:20:36 +04:00
struct request_list {
2012-06-05 07:40:59 +04:00
	struct request_queue	*q;	/* the queue this rl belongs to */
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
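A minimal sketch of picking the request_list for an IO, including the fall back to the root list that the v3 note in this commit message describes. The types and get_rl() here are simplified stand-ins, not the kernel's blk_get_rl().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct rl {
	int id;
};

struct blkg {
	struct rl rl;		/* per-blkg request_list */
};

struct queue {
	struct rl root_rl;	/* request_list used by the root blkcg */
};

/*
 * Returns the request_list to allocate from. A NULL blkg models a
 * failed blkg lookup, in which case the root list is used.
 */
static struct rl *get_rl(struct queue *q, struct blkg *blkg)
{
	assert(q != NULL);
	return blkg ? &blkg->rl : &q->root_rl;
}
```

Since each blkg allocates from its own list, one cgroup using up its pool does not starve allocations in another.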
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
#ifdef CONFIG_BLK_CGROUP
	struct blkcg_gq		*blkg;	/* blkg this request pool belongs to */
#endif
2009-04-06 16:48:01 +04:00
	/*
	 * count[], starved[], and wait[] are indexed by
	 * BLK_RW_SYNC/BLK_RW_ASYNC
	 */
2012-06-05 07:40:58 +04:00
	int			count[2];
	int			starved[2];
	mempool_t		*rq_pool;
	wait_queue_head_t	wait[2];
2012-06-05 07:40:59 +04:00
	unsigned int		flags;
2005-04-17 02:20:36 +04:00
};

2016-10-20 16:12:13 +03:00
/*
 * request flags */
typedef __u32 __bitwise req_flags_t;

/* elevator knows about this request */
#define RQF_SORTED		((__force req_flags_t)(1 << 0))
/* drive already may have started this one */
#define RQF_STARTED		((__force req_flags_t)(1 << 1))
/* uses tagged queueing */
#define RQF_QUEUED		((__force req_flags_t)(1 << 2))
/* may not be passed by ioscheduler */
#define RQF_SOFTBARRIER		((__force req_flags_t)(1 << 3))
/* request for flush sequence */
#define RQF_FLUSH_SEQ		((__force req_flags_t)(1 << 4))
/* merge of different types, fail separately */
#define RQF_MIXED_MERGE		((__force req_flags_t)(1 << 5))
/* track inflight for MQ */
#define RQF_MQ_INFLIGHT		((__force req_flags_t)(1 << 6))
/* don't call prep for this one */
#define RQF_DONTPREP		((__force req_flags_t)(1 << 7))
/* set for "ide_preempt" requests and also for requests for which the SCSI
   "quiesce" state must be ignored. */
#define RQF_PREEMPT		((__force req_flags_t)(1 << 8))
/* contains copies of user pages */
#define RQF_COPY_USER		((__force req_flags_t)(1 << 9))
/* vaguely specified driver internal error. Ignored by the block layer */
#define RQF_FAILED		((__force req_flags_t)(1 << 10))
/* don't warn about errors */
#define RQF_QUIET		((__force req_flags_t)(1 << 11))
/* elevator private data attached */
#define RQF_ELVPRIV		((__force req_flags_t)(1 << 12))
/* account I/O stat */
#define RQF_IO_STAT		((__force req_flags_t)(1 << 13))
/* request came from our alloc pool */
#define RQF_ALLOCED		((__force req_flags_t)(1 << 14))
/* runtime pm request */
#define RQF_PM			((__force req_flags_t)(1 << 15))
/* on IO scheduler merge hash */
#define RQF_HASHED		((__force req_flags_t)(1 << 16))
2016-11-08 07:32:37 +03:00
/* IO stats tracking on */
#define RQF_STATS		((__force req_flags_t)(1 << 17))
2016-12-09 01:20:32 +03:00
/* Look at ->special_vec for the actual data payload instead of the
   bio chain. */
#define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
2017-12-21 09:43:38 +03:00
/* The per-zone write lock is held for this request */
#define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
2018-01-09 19:29:51 +03:00
/* timeout is expired */
#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
2018-01-10 21:30:56 +03:00
/* already slept for hybrid poll */
#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
2016-10-20 16:12:13 +03:00

/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
2016-12-09 01:20:32 +03:00
	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
2016-10-20 16:12:13 +03:00

2005-04-17 02:20:36 +04:00
/*
2014-05-06 14:12:45 +04:00
 * Try to put the fields that are referenced together in the same cacheline.
 *
 * If you modify this structure, make sure to update blk_rq_init() and
 * especially blk_mq_rq_ctx_init() to take care of the added fields.
2005-04-17 02:20:36 +04:00
 */
struct request {
2007-07-24 11:28:11 +04:00
	struct request_queue *q;
2013-10-24 12:20:05 +04:00
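The mapping between per-cpu software queues and hardware submission queues described in the commit message above can be pictured with a small sketch. The helper name `map_cpu_to_hw_queue` and the modulo policy are assumptions made for illustration only; they are not taken from the blk-mq implementation, which may distribute CPUs differently.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): pick a hardware submission
 * queue for a CPU. When nr_hw_queues equals the number of CPUs this is
 * a 1:1 mapping; with fewer hardware queues several CPUs share one,
 * which gives the N:M case.
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
                                        unsigned int nr_hw_queues)
{
        assert(nr_hw_queues > 0);       /* a device needs at least one queue */
        return cpu % nr_hw_queues;
}
```

With four hardware queues, CPUs 1 and 5 would both submit to hardware queue 1 under this assumed policy, while eight hardware queues on an eight-CPU machine would give each CPU its own queue.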
|
|
|
struct blk_mq_ctx *mq_ctx;
|
2006-08-10 11:00:21 +04:00
|
|
|
|
2016-06-09 17:00:35 +03:00
|
|
|
int cpu;
|
2016-10-28 17:48:16 +03:00
|
|
|
unsigned int cmd_flags; /* op and common flags */
|
2016-10-20 16:12:13 +03:00
|
|
|
req_flags_t rq_flags;
|
2017-01-31 22:34:41 +03:00
|
|
|
|
|
|
|
int internal_tag;
|
|
|
|
|
2009-05-07 17:24:44 +04:00
|
|
|
/* the following two fields are internal, NEVER access directly */
|
|
|
|
unsigned int __data_len; /* total data len */
|
2017-01-17 16:03:22 +03:00
|
|
|
int tag;
|
2010-03-19 10:58:16 +03:00
|
|
|
sector_t __sector; /* sector cursor */
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
struct bio *bio;
|
|
|
|
struct bio *biotail;
|
|
|
|
|
2018-01-10 21:46:39 +03:00
|
|
|
struct list_head queuelist;
|
|
|
|
|
2014-04-10 06:27:01 +04:00
|
|
|
/*
|
|
|
|
* The hash is used inside the scheduler, and killed once the
|
|
|
|
* request reaches the dispatch list. The ipi_list is only used
|
|
|
|
* to queue the request for softirq completion, which is long
|
|
|
|
* after the request has been unhashed (and even removed from
|
|
|
|
* the dispatch list).
|
|
|
|
*/
|
|
|
|
union {
|
|
|
|
struct hlist_node hash; /* merge hash */
|
|
|
|
struct list_head ipi_list;
|
|
|
|
};
|
|
|
|
|
2006-08-10 11:00:21 +04:00
|
|
|
/*
|
|
|
|
* The rb_node is only used inside the io scheduler, requests
|
|
|
|
* are pruned when moved to the dispatch queue. So let the
|
2011-02-11 13:08:00 +03:00
|
|
|
* completion_data share space with the rb_node.
|
2006-08-10 11:00:21 +04:00
|
|
|
*/
|
|
|
|
union {
|
|
|
|
struct rb_node rb_node; /* sort/lookup */
|
2016-12-09 01:20:32 +03:00
|
|
|
struct bio_vec special_vec;
|
2011-02-11 13:08:00 +03:00
|
|
|
void *completion_data;
|
2017-04-20 17:03:11 +03:00
|
|
|
int error_count; /* for legacy drivers, don't use */
|
2006-08-10 11:00:21 +04:00
|
|
|
};
|
2006-07-28 11:23:08 +04:00
|
|
|
|
2006-07-12 16:04:37 +04:00
|
|
|
/*
|
2010-04-21 19:44:16 +04:00
|
|
|
* Three pointers are available for the IO schedulers, if they need
|
2011-02-11 13:08:00 +03:00
|
|
|
* more they have to dynamically allocate it. Flush requests are
|
|
|
|
* never put on the IO scheduler. So let the flush fields share
|
2011-12-14 03:33:41 +04:00
|
|
|
* space with the elevator data.
|
2006-07-12 16:04:37 +04:00
|
|
|
*/
|
2011-02-11 13:08:00 +03:00
|
|
|
union {
|
2011-12-14 03:33:41 +04:00
|
|
|
struct {
|
|
|
|
struct io_cq *icq;
|
|
|
|
void *priv[2];
|
|
|
|
} elv;
|
|
|
|
|
2011-02-11 13:08:00 +03:00
|
|
|
struct {
|
|
|
|
unsigned int seq;
|
|
|
|
struct list_head list;
|
block: fix flush machinery for stacking drivers with differing flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
         struct request *rq;
         while (1) {
-                while (!list_empty(&q->queue_head)) {
+                if (!list_empty(&q->queue_head)) {
                         rq = list_entry_rq(q->queue_head.next);
-                        if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-                            (rq->cmd_flags & REQ_FLUSH_SEQ))
-                                return rq;
-                        rq = blk_do_flush(q, rq);
-                        if (rq)
-                                return rq;
+                        return rq;
                 }
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q->flush_flags; /* may change, cache it */
        bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
        bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
                                                      REQ_FUA);
        unsigned skip = 0;
        ...
        if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                rq->cmd_flags &= ~REQ_FLUSH;
                if (!has_fua)
                        rq->cmd_flags &= ~REQ_FUA;
                return rq;
        }
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15 23:37:25 +04:00
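The bypass case quoted from blk_do_flush() in the commit message above can be isolated as a small pure function. This is only a sketch: the flag values are made up for the example, `strip_unsupported_flush_flags` is a hypothetical name, and the request is assumed to carry data (the `blk_rq_sectors(rq)` check in the quoted code is not modelled).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real kernel definitions differ. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

/*
 * Hypothetical helper mirroring the quoted bypass logic: return the
 * flags a data request should carry once the bits the queue cannot
 * honour have been removed. If a pre- or post-flush is needed, the
 * flags are left alone for the flush machinery to handle.
 */
static unsigned int strip_unsupported_flush_flags(unsigned int cmd_flags,
                                                  unsigned int queue_flush_flags)
{
        bool has_flush = queue_flush_flags & REQ_FLUSH;
        bool has_fua = queue_flush_flags & REQ_FUA;
        bool do_preflush = has_flush && (cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua && (cmd_flags & REQ_FUA);

        /* This sketch models only the two flush-related bits. */
        assert((cmd_flags & ~(REQ_FLUSH | REQ_FUA)) == 0);

        if (!do_preflush && !do_postflush) {
                cmd_flags &= ~REQ_FLUSH;
                if (!has_fua)
                        cmd_flags &= ~REQ_FUA;
        }
        return cmd_flags;
}
```

A queue that advertises neither capability (flush flags of 0) ends up with both bits cleared, which is the situation the commit message describes for the underlying device beneath dm-multipath.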
|
|
|
rq_end_io_fn *saved_end_io;
|
2011-02-11 13:08:00 +03:00
|
|
|
} flush;
|
|
|
|
};
|
2006-07-12 16:04:37 +04:00
|
|
|
|
2006-06-13 11:02:34 +04:00
|
|
|
struct gendisk *rq_disk;
|
2011-01-05 18:57:38 +03:00
|
|
|
struct hd_struct *part;
|
2005-04-17 02:20:36 +04:00
|
|
|
unsigned long start_time;
|
2016-11-08 07:32:37 +03:00
|
|
|
struct blk_issue_stat issue_stat;
|
2005-04-17 02:20:36 +04:00
|
|
|
/* Number of scatter-gather DMA addr+len pairs after
|
|
|
|
* physical address coalescing is performed.
|
|
|
|
*/
|
|
|
|
unsigned short nr_phys_segments;
|
2018-01-10 21:46:39 +03:00
|
|
|
|
2010-09-10 22:50:10 +04:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
unsigned short nr_integrity_segments;
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-01-10 21:46:39 +03:00
|
|
|
unsigned short write_hint;
|
2006-06-13 11:02:34 +04:00
|
|
|
unsigned short ioprio;
|
|
|
|
|
2017-04-05 21:16:38 +03:00
|
|
|
unsigned int timeout;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2009-04-23 06:05:20 +04:00
|
|
|
void *special; /* opaque pointer available for LLD use */
|
2006-07-28 11:32:07 +04:00
|
|
|
|
2008-03-04 13:17:11 +03:00
|
|
|
unsigned int extra_len; /* length of alignment and padding */
|
2005-04-17 02:20:36 +04:00
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dance around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with an RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 19:29:48 +03:00
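The generation plus state value described in the commit message above can be sketched as a packed 64-bit integer. The two-bit state width, the enum values and all function names below are assumptions chosen for illustration; only the idea (state in the low bits, a generation number in the high bits that is bumped whenever the request goes in-flight) comes from the text. The seqcount and RCU parts of the scheme are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: 2 low bits of state, the rest generation. */
#define STATE_BITS 2
#define STATE_MASK ((UINT64_C(1) << STATE_BITS) - 1)

enum rq_state { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2 };

/* Combine a generation number and a state into one value. */
static uint64_t gstate_pack(uint64_t gen, enum rq_state state)
{
        assert(((uint64_t)state & ~STATE_MASK) == 0);   /* state must fit */
        return (gen << STATE_BITS) | (uint64_t)state;
}

static enum rq_state gstate_state(uint64_t gstate)
{
        return (enum rq_state)(gstate & STATE_MASK);
}

static uint64_t gstate_gen(uint64_t gstate)
{
        return gstate >> STATE_BITS;
}

/* Marking a request in-flight also bumps its generation. */
static uint64_t gstate_start(uint64_t gstate)
{
        return gstate_pack(gstate_gen(gstate) + 1, RQ_IN_FLIGHT);
}

/*
 * The timeout path records the gstate it wants to abort; a later
 * recycled instance has a different generation and no longer matches.
 */
static int gstate_matches_abort(uint64_t gstate, uint64_t aborted_gstate)
{
        return gstate == aborted_gstate;
}
```

Because the comparison covers both generation and state, a timeout verdict taken against one instance of a request cannot terminate a later instance that happens to reuse the same struct request.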
|
|
|
/*
|
|
|
|
* On blk-mq, the lower bits of ->gstate (generation number and
|
|
|
|
* state) carry the MQ_RQ_* state value and the upper bits the
|
|
|
|
* generation number which is monotonically incremented and used to
|
|
|
|
* distinguish the reuse instances.
|
|
|
|
*
|
|
|
|
* ->gstate_seq allows updates to ->gstate and other fields
|
|
|
|
* (currently ->deadline) during request start to be read
|
|
|
|
* atomically from the timeout path, so that it can operate on a
|
|
|
|
* coherent set of information.
|
|
|
|
*/
|
|
|
|
seqcount_t gstate_seq;
|
|
|
|
u64 gstate;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ->aborted_gstate is used by the timeout to claim a specific
|
|
|
|
* recycle instance of this request. See blk_mq_timeout_work().
|
|
|
|
*/
|
|
|
|
struct u64_stats_sync aborted_gstate_sync;
|
|
|
|
u64 aborted_gstate;
|
|
|
|
|
2018-01-10 00:23:42 +03:00
|
|
|
/* access through blk_rq_set_deadline, blk_rq_deadline */
|
|
|
|
unsigned long __deadline;
|
2017-06-27 18:22:02 +03:00
|
|
|
|
2008-09-14 16:55:09 +04:00
|
|
|
struct list_head timeout_list;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-01-10 21:46:39 +03:00
|
|
|
union {
|
Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removal
of rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's always been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it got reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
2018-01-29 22:51:49 +03:00
|
|
|
struct __call_single_data csd;
|
2018-01-10 21:46:39 +03:00
|
|
|
u64 fifo_time;
|
|
|
|
};
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2006-09-30 22:29:12 +04:00
|
|
|
* completion callback.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
rq_end_io_fn *end_io;
|
|
|
|
void *end_io_data;
|
2007-07-16 10:52:14 +04:00
|
|
|
|
|
|
|
/* for bidi */
|
|
|
|
struct request *next_rq;
|
2018-01-10 21:46:39 +03:00
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
|
|
struct request_list *rl; /* rl this rq is alloced from */
|
|
|
|
unsigned long long start_time_ns;
|
|
|
|
unsigned long long io_start_time_ns; /* when passed to hardware */
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2017-12-18 10:40:43 +03:00
|
|
|
static inline bool blk_op_is_scsi(unsigned int op)
|
|
|
|
{
|
|
|
|
return op == REQ_OP_SCSI_IN || op == REQ_OP_SCSI_OUT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_op_is_private(unsigned int op)
|
|
|
|
{
|
|
|
|
return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
|
|
|
|
}
|
|
|
|
|
2017-01-31 18:57:31 +03:00
|
|
|
static inline bool blk_rq_is_scsi(struct request *rq)
|
|
|
|
{
|
2017-12-18 10:40:43 +03:00
|
|
|
return blk_op_is_scsi(req_op(rq));
|
2017-01-31 18:57:31 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_rq_is_private(struct request *rq)
|
|
|
|
{
|
2017-12-18 10:40:43 +03:00
|
|
|
return blk_op_is_private(req_op(rq));
|
2017-01-31 18:57:31 +03:00
|
|
|
}
|
|
|
|
|
2017-01-31 18:57:29 +03:00
|
|
|
static inline bool blk_rq_is_passthrough(struct request *rq)
|
|
|
|
{
|
2017-01-31 18:57:31 +03:00
|
|
|
return blk_rq_is_scsi(rq) || blk_rq_is_private(rq);
|
2017-01-31 18:57:29 +03:00
|
|
|
}
|
|
|
|
|
2017-12-18 10:40:43 +03:00
|
|
|
static inline bool bio_is_passthrough(struct bio *bio)
|
|
|
|
{
|
|
|
|
unsigned op = bio_op(bio);
|
|
|
|
|
|
|
|
return blk_op_is_scsi(op) || blk_op_is_private(op);
|
|
|
|
}
|
|
|
|
|
2008-08-14 11:59:13 +04:00
|
|
|
static inline unsigned short req_get_ioprio(struct request *req)
|
|
|
|
{
|
|
|
|
return req->ioprio;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/elevator.h>
|
|
|
|
|
2013-10-24 12:20:05 +04:00
|
|
|
struct blk_queue_ctx;
|
|
|
|
|
2007-07-24 11:28:11 +04:00
|
|
|
typedef void (request_fn_proc) (struct request_queue *q);
|
2015-11-05 20:41:16 +03:00
|
|
|
typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
|
2017-11-02 21:29:54 +03:00
|
|
|
typedef bool (poll_q_fn) (struct request_queue *q, blk_qc_t);
|
2007-07-24 11:28:11 +04:00
|
|
|
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
|
2010-07-01 14:49:17 +04:00
|
|
|
typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
struct bio_vec;
|
2006-01-09 18:02:34 +03:00
|
|
|
typedef void (softirq_done_fn)(struct request *);
|
2008-02-19 13:36:53 +03:00
|
|
|
typedef int (dma_drain_needed_fn)(struct request *);
|
2008-10-01 18:12:15 +04:00
|
|
|
typedef int (lld_busy_fn) (struct request_queue *q);
|
2011-08-01 00:05:09 +04:00
|
|
|
typedef int (bsg_job_fn) (struct bsg_job *);
|
2017-01-27 19:51:45 +03:00
|
|
|
typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
|
|
|
|
typedef void (exit_rq_fn)(struct request_queue *, struct request *);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2008-09-14 16:55:09 +04:00
|
|
|
enum blk_eh_timer_return {
|
|
|
|
BLK_EH_NOT_HANDLED,
|
|
|
|
BLK_EH_HANDLED,
|
|
|
|
BLK_EH_RESET_TIMER,
|
|
|
|
};
|
|
|
|
|
|
|
|
typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
enum blk_queue_state {
|
|
|
|
Queue_down,
|
|
|
|
Queue_up,
|
|
|
|
};
|
|
|
|
|
|
|
|
struct blk_queue_tag {
|
|
|
|
struct request **tag_index; /* map of busy tags */
|
|
|
|
unsigned long *tag_map; /* bit map of free/busy tags */
|
|
|
|
int max_depth; /* what we will send to device */
|
2005-08-06 00:28:11 +04:00
|
|
|
int real_max_depth; /* what the array can hold */
|
2005-04-17 02:20:36 +04:00
|
|
|
atomic_t refcnt; /* map can be shared */
|
2015-01-16 04:32:25 +03:00
|
|
|
int alloc_policy; /* tag allocation policy */
|
|
|
|
int next_tag; /* next tag */
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
2015-01-16 04:32:25 +03:00
|
|
|
#define BLK_TAG_ALLOC_FIFO 0 /* allocate starting from 0 */
|
|
|
|
#define BLK_TAG_ALLOC_RR 1 /* allocate starting from last allocated tag */
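The difference between the two allocation policies can be sketched with a toy tag pool. Everything here apart from the two policy constants is hypothetical (`struct tag_pool`, `MAX_TAGS`, `tag_alloc`, `tag_free`), and the real block layer uses a bitmap rather than a bool array; the sketch only shows where the search for a free tag starts under each policy.

```c
#include <assert.h>
#include <stdbool.h>

#define BLK_TAG_ALLOC_FIFO 0 /* allocate starting from 0 */
#define BLK_TAG_ALLOC_RR   1 /* allocate starting from last allocated tag */

#define MAX_TAGS 8 /* hypothetical pool size for this sketch */

struct tag_pool {
        bool busy[MAX_TAGS];    /* true while a tag is in use */
        int alloc_policy;       /* BLK_TAG_ALLOC_FIFO or BLK_TAG_ALLOC_RR */
        int next_tag;           /* where a round-robin search starts */
};

/* Returns the allocated tag, or -1 if every tag is busy. */
static int tag_alloc(struct tag_pool *p)
{
        int start = (p->alloc_policy == BLK_TAG_ALLOC_RR) ? p->next_tag : 0;

        for (int i = 0; i < MAX_TAGS; i++) {
                int tag = (start + i) % MAX_TAGS;

                if (!p->busy[tag]) {
                        p->busy[tag] = true;
                        p->next_tag = (tag + 1) % MAX_TAGS;
                        return tag;
                }
        }
        return -1;
}

static void tag_free(struct tag_pool *p, int tag)
{
        assert(tag >= 0 && tag < MAX_TAGS);
        p->busy[tag] = false;
}
```

Under the FIFO policy a freed tag 0 is handed out again immediately, whereas under the round-robin policy the next allocation moves on to tag 1, spreading reuse across the tag space.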
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2008-08-16 09:10:05 +04:00
|
|
|
#define BLK_SCSI_MAX_CMDS (256)
|
|
|
|
#define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
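Despite its name, BLK_SCSI_CMD_PER_LONG evaluates to the number of longs needed to hold one bit per command: 256 / 64 = 4 where long is 64 bits wide, and 256 / 32 = 8 where it is 32 bits wide. A minimal sketch of a bitmap sized this way follows; `struct cmd_bitmap`, the local `BITS_PER_LONG` definition and the two helpers are hypothetical names introduced for the example.

```c
#include <limits.h>
#include <stddef.h>

#define BLK_SCSI_MAX_CMDS       (256)
#define BLK_SCSI_CMD_PER_LONG   (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))

/* Local definition for this sketch: number of bits in one long. */
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* Hypothetical bitmap with one bit per possible command opcode. */
struct cmd_bitmap {
        unsigned long bits[BLK_SCSI_CMD_PER_LONG];
};

/* Mark an opcode as present; unsigned char keeps it below 256. */
static void cmd_bitmap_set(struct cmd_bitmap *m, unsigned char opcode)
{
        m->bits[opcode / BITS_PER_LONG] |= 1UL << (opcode % BITS_PER_LONG);
}

/* Returns non-zero if the opcode's bit is set. */
static int cmd_bitmap_test(const struct cmd_bitmap *m, unsigned char opcode)
{
        return (m->bits[opcode / BITS_PER_LONG] >> (opcode % BITS_PER_LONG)) & 1UL;
}
```

The array always spans exactly BLK_SCSI_MAX_CMDS bits on platforms with 8-bit bytes, regardless of the width of long.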
|
|
|
|
|
2016-10-18 09:40:29 +03:00
|
|
|
/*
|
|
|
|
* Zoned block device models (zoned limit).
|
|
|
|
*/
|
|
|
|
enum blk_zoned_model {
|
|
|
|
BLK_ZONED_NONE, /* Regular block device */
|
|
|
|
BLK_ZONED_HA, /* Host-aware zoned block device */
|
|
|
|
BLK_ZONED_HM, /* Host-managed zoned block device */
|
|
|
|
};
|
|
|
|
|
2009-05-23 01:17:51 +04:00
|
|
|
struct queue_limits {
|
|
|
|
unsigned long bounce_pfn;
|
|
|
|
unsigned long seg_boundary_mask;
|
2015-08-20 00:24:05 +03:00
|
|
|
unsigned long virt_boundary_mask;
|
2009-05-23 01:17:51 +04:00
|
|
|
|
|
|
|
unsigned int max_hw_sectors;
|
2015-11-14 00:46:48 +03:00
|
|
|
unsigned int max_dev_sectors;
|
2014-06-05 23:38:39 +04:00
|
|
|
unsigned int chunk_sectors;
|
2009-05-23 01:17:51 +04:00
|
|
|
unsigned int max_sectors;
|
|
|
|
unsigned int max_segment_size;
|
2009-05-23 01:17:53 +04:00
|
|
|
unsigned int physical_block_size;
|
|
|
|
unsigned int alignment_offset;
|
|
|
|
unsigned int io_min;
|
|
|
|
unsigned int io_opt;
|
2009-09-30 15:54:20 +04:00
|
|
|
unsigned int max_discard_sectors;
|
2015-07-16 18:14:26 +03:00
|
|
|
unsigned int max_hw_discard_sectors;
|
2012-09-18 20:19:27 +04:00
|
|
|
unsigned int max_write_same_sectors;
|
2016-11-30 23:28:59 +03:00
|
|
|
unsigned int max_write_zeroes_sectors;
|
2009-11-10 13:50:21 +03:00
|
|
|
unsigned int discard_granularity;
|
|
|
|
unsigned int discard_alignment;
|
2009-05-23 01:17:51 +04:00
|
|
|
|
|
|
|
unsigned short logical_block_size;
|
2010-02-26 08:20:39 +03:00
|
|
|
unsigned short max_segments;
|
2010-09-10 22:50:10 +04:00
|
|
|
unsigned short max_integrity_segments;
|
2017-02-08 16:46:49 +03:00
|
|
|
unsigned short max_discard_segments;
|
2009-05-23 01:17:51 +04:00
|
|
|
|
2009-05-23 01:17:53 +04:00
|
|
|
unsigned char misaligned;
|
2009-11-10 13:50:21 +03:00
|
|
|
unsigned char discard_misaligned;
|
2010-12-01 21:41:49 +03:00
|
|
|
unsigned char cluster;
|
2013-07-12 09:39:53 +04:00
|
|
|
unsigned char raid_partial_stripes_expensive;
|
2016-10-18 09:40:29 +03:00
|
|
|
enum blk_zoned_model zoned;
|
2009-05-23 01:17:51 +04:00
|
|
|
};
|
|
|
|
|
2016-10-18 09:40:33 +03:00
|
|
|
#ifdef CONFIG_BLK_DEV_ZONED
|
|
|
|
|
|
|
|
struct blk_zone_report_hdr {
|
|
|
|
unsigned int nr_zones;
|
|
|
|
u8 padding[60];
|
|
|
|
};
|
|
|
|
|
|
|
|
extern int blkdev_report_zones(struct block_device *bdev,
|
|
|
|
sector_t sector, struct blk_zone *zones,
|
|
|
|
unsigned int *nr_zones, gfp_t gfp_mask);
|
|
|
|
extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
|
|
|
|
sector_t nr_sectors, gfp_t gfp_mask);
|
|
|
|
|
2016-10-18 09:40:35 +03:00
|
|
|
extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
|
|
|
extern int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
|
|
|
|
|
|
|
#else /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
|
|
|
static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int blkdev_reset_zones_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
2016-10-18 09:40:33 +03:00
|
|
|
#endif /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
2011-07-13 23:17:23 +04:00
|
|
|
struct request_queue {
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Together with queue_head for cacheline sharing
|
|
|
|
*/
|
|
|
|
struct list_head queue_head;
|
|
|
|
struct request *last_merge;
|
2008-10-31 12:05:07 +03:00
|
|
|
struct elevator_queue *elevator;
|
2012-06-05 07:40:58 +04:00
|
|
|
int nr_rqs[2]; /* # allocated [a]sync rqs */
|
|
|
|
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-06-21 02:56:13 +03:00
|
|
|
atomic_t shared_hctx_restart;
|
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 18:56:08 +03:00
|
|
|
struct blk_queue_stats *stats;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 22:38:14 +03:00
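The window-based scaling described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the blk-wbt code: the struct, the function names, the halve/double depth policy and the 4x cap are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; field names are illustrative, not the kernel's. */
struct wb_sketch {
	int scale_step;               /* > 0: throttled, < 0: opened up for writers */
	unsigned int base_depth;      /* queue depth used at scale_step == 0 */
	unsigned long long target_ns; /* latency target, e.g. from wb_lat_usec */
};

/* Assumed policy: halve the depth per positive step, double it per
 * negative step, capped at four times the base depth. */
static unsigned int wb_sketch_depth(const struct wb_sketch *wb)
{
	unsigned int depth = wb->base_depth;
	unsigned int cap = 4 * wb->base_depth;
	int step = wb->scale_step;

	assert(wb->base_depth > 0);
	for (; step > 0 && depth > 1; step--)
		depth /= 2;
	for (; step < 0 && depth < cap; step++)
		depth *= 2;
	return depth;
}

/* Called once per monitoring window with the minimum latency seen. */
static void wb_sketch_window_done(struct wb_sketch *wb,
				  unsigned long long min_lat_ns,
				  bool only_writes)
{
	if (min_lat_ns > wb->target_ns)
		wb->scale_step++;        /* target missed: throttle harder */
	else if (wb->scale_step > 0)
		wb->scale_step--;        /* good window: relax one step */
	else if (only_writes)
		wb->scale_step--;        /* mostly writes: allow negative steps */
	else
		wb->scale_step = 0;      /* writers gone: back to stable state */
}
```

A caller would run `wb_sketch_window_done()` at the end of each window and then apply `wb_sketch_depth()` as the new writeback queue depth.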
|
|
|
struct rq_wb *rq_wb;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, putting a program which can issue a lot of random direct IOs
there and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
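The request_list selection this commit describes, including the v3 fallback to the root list when the blkg lookup fails, might look like the sketch below. The `*_sketch` types and `pick_rl()` are hypothetical stand-ins for illustration, not the kernel's `blk_get_rl()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct rl_sketch {
	int count;                 /* requests currently allocated */
};

struct blkg_sketch {
	struct rl_sketch rl;       /* per-blkg request list */
};

struct queue_sketch {
	struct rl_sketch root_rl;  /* list embedded in the queue, used by root */
};

/* Pick the request list for an IO. A NULL blkg models either the root
 * blkcg or a failed blkg lookup; both fall back to the root list. */
static struct rl_sketch *pick_rl(struct queue_sketch *q,
				 struct blkg_sketch *blkg)
{
	assert(q != NULL);
	if (blkg == NULL)
		return &q->root_rl;
	return &blkg->rl;
}
```

Because each non-root group allocates from its own list, exhausting one group's pool in this model cannot starve another group, which is the isolation property the message is after.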
|
|
|
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
|
|
|
|
* is used, root blkg allocates from @q->root_rl and all other
|
|
|
|
* blkgs from their own blkg->rl. Which one to use should be
|
|
|
|
* determined using bio_request_list().
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
blkcg: implement per-blkg request allocation
2012-06-27 02:05:44 +04:00
|
|
|
struct request_list root_rl;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
request_fn_proc *request_fn;
|
|
|
|
make_request_fn *make_request_fn;
|
2017-11-02 21:29:54 +03:00
|
|
|
poll_q_fn *poll_fn;
|
2005-04-17 02:20:36 +04:00
|
|
|
prep_rq_fn *prep_rq_fn;
|
2010-07-01 14:49:17 +04:00
|
|
|
unprep_rq_fn *unprep_rq_fn;
|
2006-01-09 18:02:34 +03:00
|
|
|
softirq_done_fn *softirq_done_fn;
|
2008-09-14 16:55:09 +04:00
|
|
|
rq_timed_out_fn *rq_timed_out_fn;
|
2008-02-19 13:36:53 +03:00
|
|
|
dma_drain_needed_fn *dma_drain_needed;
|
2008-10-01 18:12:15 +04:00
|
|
|
lld_busy_fn *lld_busy_fn;
|
2017-06-20 21:15:40 +03:00
|
|
|
/* Called just after a request is allocated */
|
2017-01-27 19:51:45 +03:00
|
|
|
init_rq_fn *init_rq_fn;
|
2017-06-20 21:15:40 +03:00
|
|
|
/* Called just before a request is freed */
|
2017-01-27 19:51:45 +03:00
|
|
|
exit_rq_fn *exit_rq_fn;
|
2017-06-20 21:15:40 +03:00
|
|
|
/* Called from inside blk_get_request() */
|
|
|
|
void (*initialize_rq_fn)(struct request *rq);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-12-13 19:24:51 +03:00
|
|
|
const struct blk_mq_ops *mq_ops;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
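The funnelling of per-CPU software queues into a smaller number of hardware submission queues can be pictured as a lookup table indexed by CPU, in the spirit of the `mq_map` field. The sketch below uses a plain modulo spread, which is an assumption for illustration only; the function name is hypothetical and a real mapping may also take CPU topology into account.

```c
#include <assert.h>

/* Fill map[cpu] with the index of the hardware queue serving that CPU.
 * With nr_hw_queues == nr_cpus this yields a 1:1 mapping; with fewer
 * hardware queues several CPUs share one (an N:M mapping). */
static void map_sketch_fill(unsigned int *map, unsigned int nr_cpus,
			    unsigned int nr_hw_queues)
{
	unsigned int cpu;

	assert(nr_hw_queues > 0);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = cpu % nr_hw_queues;
}
```

Submission then only needs `map[cpu]` to find the hardware queue, so the hot path never has to search.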
|
|
|
|
|
|
|
unsigned int *mq_map;
|
|
|
|
|
|
|
|
/* sw queues */
|
2014-06-03 07:24:06 +04:00
|
|
|
struct blk_mq_ctx __percpu *queue_ctx;
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 12:20:05 +04:00
|
|
|
unsigned int nr_queues;
|
|
|
|
|
2016-03-30 19:21:08 +03:00
|
|
|
unsigned int queue_depth;
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 12:20:05 +04:00
|
|
|
/* hw dispatch queues */
|
|
|
|
struct blk_mq_hw_ctx **queue_hw_ctx;
|
|
|
|
unsigned int nr_hw_queues;
|
|
|
|
|
2005-10-20 18:23:44 +04:00
|
|
|
/*
|
|
|
|
* Dispatch queue sorting
|
|
|
|
*/
|
2005-10-20 18:37:00 +04:00
|
|
|
sector_t end_sector;
|
2005-10-20 18:23:44 +04:00
|
|
|
struct request *boundary_rq;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2011-03-02 19:08:00 +03:00
|
|
|
* Delayed queue handling
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2011-03-02 19:08:00 +03:00
|
|
|
struct delayed_work delay_work;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-02-02 17:56:50 +03:00
|
|
|
struct backing_dev_info *backing_dev_info;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The queue owner gets to use this for whatever they like.
|
|
|
|
* ll_rw_blk doesn't touch it.
|
|
|
|
*/
|
|
|
|
void *queuedata;
|
|
|
|
|
|
|
|
/*
|
2011-07-13 23:17:23 +04:00
|
|
|
* various queue flags, see QUEUE_* below
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2011-07-13 23:17:23 +04:00
|
|
|
unsigned long queue_flags;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-12-14 03:33:37 +04:00
|
|
|
/*
|
|
|
|
* ida allocated id for this queue. Used to index queues from
|
|
|
|
* ioctx.
|
|
|
|
*/
|
|
|
|
int id;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2011-07-13 23:17:23 +04:00
|
|
|
* queue needs bounce pages for pages above this limit
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2011-07-13 23:17:23 +04:00
|
|
|
gfp_t bounce_gfp;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
2005-04-13 01:22:06 +04:00
|
|
|
* protects queue structures from reentrancy. ->__queue_lock should
|
|
|
|
* _never_ be used directly, it is queue private. always use
|
|
|
|
* ->queue_lock.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2005-04-13 01:22:06 +04:00
|
|
|
spinlock_t __queue_lock;
|
2005-04-17 02:20:36 +04:00
|
|
|
spinlock_t *queue_lock;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* queue kobject
|
|
|
|
*/
|
|
|
|
struct kobject kobj;
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 12:20:05 +04:00
|
|
|
/*
|
|
|
|
* mq queue kobject
|
|
|
|
*/
|
|
|
|
struct kobject mq_kobj;
|
|
|
|
|
2015-10-21 20:20:18 +03:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
struct blk_integrity integrity;
|
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2014-12-04 03:00:23 +03:00
|
|
|
#ifdef CONFIG_PM
|
2013-03-23 07:42:26 +04:00
|
|
|
struct device *dev;
|
|
|
|
int rpm_status;
|
|
|
|
unsigned int nr_pending;
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* queue settings
|
|
|
|
*/
|
|
|
|
unsigned long nr_requests; /* Max # of requests */
|
|
|
|
unsigned int nr_congestion_on;
|
|
|
|
unsigned int nr_congestion_off;
|
|
|
|
unsigned int nr_batching;
|
|
|
|
|
2008-01-10 20:30:36 +03:00
|
|
|
unsigned int dma_drain_size;
|
2011-07-13 23:17:23 +04:00
|
|
|
void *dma_drain_buffer;
|
2008-03-04 13:18:17 +03:00
|
|
|
unsigned int dma_pad_mask;
|
2005-04-17 02:20:36 +04:00
|
|
|
unsigned int dma_alignment;
|
|
|
|
|
|
|
|
struct blk_queue_tag *queue_tags;
|
2007-10-25 12:14:47 +04:00
|
|
|
struct list_head tag_busy_list;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2005-11-10 10:52:05 +03:00
|
|
|
unsigned int nr_sorted;
|
2009-05-20 10:54:31 +04:00
|
|
|
unsigned int in_flight[2];
|
2016-11-08 07:32:37 +03:00
|
|
|
|
2012-11-28 16:46:45 +04:00
|
|
|
/*
|
|
|
|
* Number of active block driver functions for which blk_drain_queue()
|
|
|
|
* must wait. Must be incremented around functions that unlock the
|
|
|
|
* queue_lock internally, e.g. scsi_request_fn().
|
|
|
|
*/
|
|
|
|
unsigned int request_fn_active;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2008-09-14 16:55:09 +04:00
|
|
|
unsigned int rq_timeout;
|
2016-11-14 23:03:03 +03:00
|
|
|
int poll_nsec;
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for either in-tree
user:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 18:56:08 +03:00
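The callback model described above (accumulate into buckets during a window, hand the whole window to the user when the timer fires, then clear) can be sketched as follows. Everything here is hypothetical illustration: the types, the two-bucket read/write split and the single non-percpu buffer are simplifying assumptions, not the blk-stat API.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_BUCKETS 2   /* assumed split: 0 = reads, 1 = writes */

struct stat_sketch {
	unsigned long long nr;   /* samples in this window */
	unsigned long long sum;  /* total latency, ns */
	unsigned long long min;  /* minimum latency, ns */
};

struct stat_cb_sketch {
	int (*bucket_fn)(int is_write);               /* picks a bucket */
	void (*timer_fn)(struct stat_cb_sketch *cb);  /* runs at window end */
	struct stat_sketch stat[SKETCH_BUCKETS];
};

/* Example user: cache the previous window, as the polling case does. */
static struct stat_sketch last_window[SKETCH_BUCKETS];

static void cache_window(struct stat_cb_sketch *cb)
{
	memcpy(last_window, cb->stat, sizeof(last_window));
}

static int rw_bucket(int is_write)
{
	return is_write ? 1 : 0;
}

/* Completion side: account one request into the callback's bucket. */
static void stat_cb_add(struct stat_cb_sketch *cb, int is_write,
			unsigned long long lat_ns)
{
	int b = cb->bucket_fn(is_write);
	struct stat_sketch *s;

	assert(b >= 0 && b < SKETCH_BUCKETS);
	s = &cb->stat[b];
	if (s->nr == 0 || lat_ns < s->min)
		s->min = lat_ns;
	s->nr++;
	s->sum += lat_ns;
}

/* Timer side: deliver the window, then clear so nothing goes stale. */
static void stat_cb_window_end(struct stat_cb_sketch *cb)
{
	cb->timer_fn(cb);
	memset(cb->stat, 0, sizeof(cb->stat));
}
```

Since the timer path is the only place that clears the buckets, a user in this model always sees a complete window, which is the windowing fix the message refers to.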
|
|
|
|
|
|
|
struct blk_stat_callback *poll_cb;
|
2017-04-21 01:59:11 +03:00
|
|
|
struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];
|
blk-stat: convert to callback-based statistics reporting
2017-03-21 18:56:08 +03:00
|
|
|
|
2008-09-14 16:55:09 +04:00
|
|
|
struct timer_list timeout;
|
2015-10-30 15:57:30 +03:00
|
|
|
struct work_struct timeout_work;
|
2008-09-14 16:55:09 +04:00
|
|
|
struct list_head timeout_list;
|
|
|
|
|
2011-12-14 03:33:41 +04:00
|
|
|
struct list_head icq_list;
|
2012-03-06 01:15:18 +04:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2012-04-14 00:11:33 +04:00
|
|
|
DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
|
2012-04-17 00:57:25 +04:00
|
|
|
struct blkcg_gq *root_blkg;
|
2012-03-06 01:15:19 +04:00
|
|
|
struct list_head blkg_list;
|
2012-03-06 01:15:18 +04:00
|
|
|
#endif
|
2011-12-14 03:33:41 +04:00
|
|
|
|
2009-05-23 01:17:51 +04:00
|
|
|
struct queue_limits limits;
|
|
|
|
|
2017-12-21 09:43:38 +03:00
|
|
|
/*
|
|
|
|
* Zoned block device information for request dispatch control.
|
|
|
|
* nr_zones is the total number of zones of the device. This is always
|
|
|
|
* 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones
|
|
|
|
* bits which indicates if a zone is conventional (bit clear) or
|
|
|
|
* sequential (bit set). seq_zones_wlock is a bitmap of nr_zones
|
|
|
|
* bits which indicates if a zone is write locked, that is, if a write
|
|
|
|
* request targeting the zone was dispatched. All three fields are
|
|
|
|
* initialized by the low level device driver (e.g. scsi/sd.c).
|
|
|
|
* Stacking drivers (device mappers) may or may not initialize
|
|
|
|
* these fields.
|
|
|
|
*/
|
|
|
|
unsigned int nr_zones;
|
|
|
|
unsigned long *seq_zones_bitmap;
|
|
|
|
unsigned long *seq_zones_wlock;
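How the two bitmaps documented above could cooperate at dispatch time is sketched below. The helper names and the `zoned_sketch` struct are hypothetical, and the bit helpers are plain userspace code rather than the kernel's atomic bitops.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static bool sketch_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

static void sketch_set_bit(unsigned long *map, unsigned int nr)
{
	map[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

static void sketch_clear_bit(unsigned long *map, unsigned int nr)
{
	map[nr / SKETCH_BITS_PER_LONG] &= ~(1UL << (nr % SKETCH_BITS_PER_LONG));
}

/* Hypothetical mirror of the three zoned fields. */
struct zoned_sketch {
	unsigned int nr_zones;
	unsigned long *seq_zones_bitmap;  /* bit set: sequential zone */
	unsigned long *seq_zones_wlock;   /* bit set: a write is in flight */
};

/* Returns true if a write to zone zno may be dispatched now. */
static bool zone_try_write_lock(struct zoned_sketch *z, unsigned int zno)
{
	assert(zno < z->nr_zones);
	if (!sketch_test_bit(z->seq_zones_bitmap, zno))
		return true;   /* conventional zone: no write locking */
	if (sketch_test_bit(z->seq_zones_wlock, zno))
		return false;  /* another write already dispatched */
	sketch_set_bit(z->seq_zones_wlock, zno);
	return true;
}

/* Called when the write to a sequential zone completes. */
static void zone_write_unlock(struct zoned_sketch *z, unsigned int zno)
{
	assert(zno < z->nr_zones);
	sketch_clear_bit(z->seq_zones_wlock, zno);
}
```

Allowing one write per sequential zone at a time keeps writes to that zone in order, while conventional zones bypass the lock entirely.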
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* sg stuff
|
|
|
|
*/
|
|
|
|
unsigned int sg_timeout;
|
|
|
|
unsigned int sg_reserved_size;
|
2005-06-23 11:08:19 +04:00
|
|
|
int node;
|
2006-09-29 12:59:40 +04:00
|
|
|
#ifdef CONFIG_BLK_DEV_IO_TRACE
|
2006-03-23 22:00:26 +03:00
|
|
|
struct blk_trace *blk_trace;
|
2017-09-20 22:12:20 +03:00
|
|
|
struct mutex blk_trace_mutex;
|
2006-09-29 12:59:40 +04:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2010-09-03 13:56:16 +04:00
|
|
|
* for flush operations
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2014-09-25 19:23:43 +04:00
|
|
|
struct blk_flush_queue *fq;
|
2006-03-19 02:34:37 +03:00
|
|
|
|
2014-05-28 18:08:02 +04:00
|
|
|
struct list_head requeue_list;
|
|
|
|
spinlock_t requeue_lock;
|
2016-09-14 20:28:30 +03:00
|
|
|
struct delayed_work requeue_work;
|
2014-05-28 18:08:02 +04:00
|
|
|
|
2006-03-19 02:34:37 +03:00
|
|
|
struct mutex sysfs_lock;
|
2007-07-09 14:40:35 +04:00
|
|
|
|
2012-03-06 01:14:58 +04:00
|
|
|
int bypass_depth;
|
2015-05-07 10:38:13 +03:00
|
|
|
atomic_t mq_freeze_depth;
|
2012-03-06 01:14:58 +04:00
|
|
|
|
2007-07-09 14:40:35 +04:00
|
|
|
#if defined(CONFIG_BLK_DEV_BSG)
|
2011-08-01 00:05:09 +04:00
|
|
|
bsg_job_fn *bsg_job_fn;
|
2007-07-09 14:40:35 +04:00
|
|
|
struct bsg_class_device bsg_dev;
|
|
|
|
#endif
|
2010-09-16 01:06:35 +04:00
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING
|
|
|
|
/* Throttle data */
|
|
|
|
struct throtl_data *td;
|
|
|
|
#endif
|
2013-01-09 20:05:13 +04:00
|
|
|
struct rcu_head rcu_head;
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 12:20:05 +04:00
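The commit message above describes per-cpu software queues that funnel into a smaller or equal number of hardware submission queues. A minimal sketch of one way such an N:M mapping could be computed is shown below. The function name `map_cpu_to_hw_queue` and the even-spread formula are illustrative assumptions, not the mapping blk-mq actually uses.

```c
#include <assert.h>

/*
 * Hypothetical helper (not kernel code): map a per-CPU software queue
 * index onto one of nr_hw_queues hardware submission queues.
 *
 * With at least as many hardware queues as CPUs the mapping is 1:1;
 * otherwise consecutive CPUs are grouped onto the same hardware queue.
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
					unsigned int nr_cpus,
					unsigned int nr_hw_queues)
{
	assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

	if (nr_hw_queues >= nr_cpus)
		return cpu;	/* 1:1 mapping */

	/* N:M mapping: scale the CPU index into the hardware queue range. */
	return (unsigned int)((unsigned long long)cpu * nr_hw_queues / nr_cpus);
}
```

With 8 CPUs and 2 hardware queues, this sketch sends CPUs 0-3 to queue 0 and CPUs 4-7 to queue 1.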
|
|
|
wait_queue_head_t mq_freeze_wq;
|
2015-10-21 20:20:12 +03:00
|
|
|
struct percpu_ref q_usage_counter;
|
2013-10-24 12:20:05 +04:00
|
|
|
struct list_head all_q_node;
|
2014-05-14 01:10:52 +04:00
|
|
|
|
|
|
|
struct blk_mq_tag_set *tag_set;
|
|
|
|
struct list_head tag_set_list;
|
2015-04-24 08:37:18 +03:00
|
|
|
struct bio_set *bio_split;
|
2015-09-26 20:09:20 +03:00
|
|
|
|
2017-02-01 01:53:18 +03:00
|
|
|
#ifdef CONFIG_BLK_DEBUG_FS
|
2017-01-25 19:06:40 +03:00
|
|
|
struct dentry *debugfs_dir;
|
2017-05-04 17:24:40 +03:00
|
|
|
struct dentry *sched_debugfs_dir;
|
2017-01-25 19:06:40 +03:00
|
|
|
#endif
|
|
|
|
|
2015-09-26 20:09:20 +03:00
|
|
|
bool mq_sysfs_init_done;
|
2017-01-27 19:51:45 +03:00
|
|
|
|
|
|
|
size_t cmd_size;
|
|
|
|
void *rq_alloc_data;
|
2017-06-14 22:27:50 +03:00
|
|
|
|
|
|
|
struct work_struct release_work;
|
2017-06-26 17:15:27 +03:00
|
|
|
|
|
|
|
#define BLK_MAX_WRITE_HINTS 5
|
|
|
|
u64 write_hints[BLK_MAX_WRITE_HINTS];
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2017-08-10 17:25:38 +03:00
|
|
|
#define QUEUE_FLAG_QUEUED 0 /* uses generic tag queueing */
|
|
|
|
#define QUEUE_FLAG_STOPPED 1 /* queue is stopped */
|
|
|
|
#define QUEUE_FLAG_DYING 2 /* queue being torn down */
|
|
|
|
#define QUEUE_FLAG_BYPASS 3 /* act as dumb FIFO queue */
|
|
|
|
#define QUEUE_FLAG_BIDI 4 /* queue supports bidi requests */
|
|
|
|
#define QUEUE_FLAG_NOMERGES 5 /* disable merge attempts */
|
|
|
|
#define QUEUE_FLAG_SAME_COMP 6 /* complete on same CPU-group */
|
|
|
|
#define QUEUE_FLAG_FAIL_IO 7 /* fake timeout */
|
|
|
|
#define QUEUE_FLAG_NONROT 9 /* non-rotational device (SSD) */
|
2008-10-27 12:44:46 +03:00
|
|
|
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
|
2017-08-10 17:25:38 +03:00
|
|
|
#define QUEUE_FLAG_IO_STAT 10 /* do IO stats */
|
|
|
|
#define QUEUE_FLAG_DISCARD 11 /* supports DISCARD */
|
|
|
|
#define QUEUE_FLAG_NOXMERGES 12 /* No extended merges */
|
|
|
|
#define QUEUE_FLAG_ADD_RANDOM 13 /* Contributes to random pool */
|
|
|
|
#define QUEUE_FLAG_SECERASE 14 /* supports secure erase */
|
|
|
|
#define QUEUE_FLAG_SAME_FORCE 15 /* force complete on same CPU */
|
|
|
|
#define QUEUE_FLAG_DEAD 16 /* queue tear-down finished */
|
|
|
|
#define QUEUE_FLAG_INIT_DONE 17 /* queue is initialized */
|
|
|
|
#define QUEUE_FLAG_NO_SG_MERGE 18 /* don't attempt to merge SG segments */
|
|
|
|
#define QUEUE_FLAG_POLL 19 /* IO polling enabled if set */
|
|
|
|
#define QUEUE_FLAG_WC 20 /* Write back caching */
|
|
|
|
#define QUEUE_FLAG_FUA 21 /* device supports FUA writes */
|
|
|
|
#define QUEUE_FLAG_FLUSH_NQ 22 /* flush not queueable */
|
|
|
|
#define QUEUE_FLAG_DAX 23 /* device supports DAX */
|
|
|
|
#define QUEUE_FLAG_STATS 24 /* track rq completion times */
|
|
|
|
#define QUEUE_FLAG_POLL_STATS 25 /* collecting stats for hybrid polling */
|
|
|
|
#define QUEUE_FLAG_REGISTERED 26 /* queue has been registered to a disk */
|
|
|
|
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
|
|
|
|
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
|
2017-11-09 21:49:57 +03:00
|
|
|
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
|
2009-01-23 12:54:44 +03:00
|
|
|
|
|
|
|
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
|
2010-06-09 12:42:09 +04:00
|
|
|
(1 << QUEUE_FLAG_SAME_COMP) | \
|
|
|
|
(1 << QUEUE_FLAG_ADD_RANDOM))
|
2006-01-06 11:51:03 +03:00
|
|
|
|
2013-11-19 20:25:07 +04:00
|
|
|
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
|
2016-03-03 18:04:03 +03:00
|
|
|
(1 << QUEUE_FLAG_SAME_COMP) | \
|
|
|
|
(1 << QUEUE_FLAG_POLL))
|
2013-11-19 20:25:07 +04:00
|
|
|
|
2017-06-20 21:15:44 +03:00
|
|
|
/*
|
|
|
|
* @q->queue_lock is set while a queue is being initialized. Since we know
|
|
|
|
* that no other threads access the queue object before @q->queue_lock has
|
|
|
|
* been set, it is safe to manipulate queue flags without holding the
|
|
|
|
* queue_lock if @q->queue_lock == NULL. See also blk_alloc_queue_node() and
|
|
|
|
* blk_init_allocated_queue().
|
|
|
|
*/
|
2012-03-30 14:33:28 +04:00
|
|
|
static inline void queue_lockdep_assert_held(struct request_queue *q)
|
2008-04-29 21:16:38 +04:00
|
|
|
{
|
2012-03-30 14:33:28 +04:00
|
|
|
if (q->queue_lock)
|
|
|
|
lockdep_assert_held(q->queue_lock);
|
2008-04-29 21:16:38 +04:00
|
|
|
}
|
|
|
|
|
2008-04-29 16:48:33 +04:00
|
|
|
static inline void queue_flag_set_unlocked(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2008-07-03 15:18:54 +04:00
|
|
|
static inline int queue_flag_test_and_clear(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
2012-03-30 14:33:28 +04:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-07-03 15:18:54 +04:00
|
|
|
|
|
|
|
if (test_bit(flag, &q->queue_flags)) {
|
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int queue_flag_test_and_set(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
2012-03-30 14:33:28 +04:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-07-03 15:18:54 +04:00
|
|
|
|
|
|
|
if (!test_bit(flag, &q->queue_flags)) {
|
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2008-04-29 16:48:33 +04:00
|
|
|
static inline void queue_flag_set(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2012-03-30 14:33:28 +04:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-04-29 16:48:33 +04:00
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void queue_flag_clear_unlocked(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2009-05-20 10:54:31 +04:00
|
|
|
static inline int queue_in_flight(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->in_flight[0] + q->in_flight[1];
|
|
|
|
}
|
|
|
|
|
2008-04-29 16:48:33 +04:00
|
|
|
static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2012-03-30 14:33:28 +04:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-04-29 16:48:33 +04:00
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
|
|
|
|
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
|
2012-11-28 16:42:38 +04:00
|
|
|
#define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
|
2012-12-06 17:32:01 +04:00
|
|
|
#define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
|
2012-03-06 01:14:58 +04:00
|
|
|
#define blk_queue_bypass(q) test_bit(QUEUE_FLAG_BYPASS, &(q)->queue_flags)
|
2013-10-24 12:20:05 +04:00
|
|
|
#define blk_queue_init_done(q) test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
|
2008-04-29 16:44:19 +04:00
|
|
|
#define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
|
2010-01-29 11:04:08 +03:00
|
|
|
#define blk_queue_noxmerges(q) \
|
|
|
|
test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
|
2008-09-24 15:03:33 +04:00
|
|
|
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
|
2009-01-23 12:54:44 +03:00
|
|
|
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
|
2010-06-09 12:42:09 +04:00
|
|
|
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
|
2009-09-30 15:52:12 +04:00
|
|
|
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
|
2016-06-09 17:00:36 +03:00
|
|
|
#define blk_queue_secure_erase(q) \
|
|
|
|
(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
|
2016-06-24 00:05:50 +03:00
|
|
|
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
|
2017-06-01 00:43:46 +03:00
|
|
|
#define blk_queue_scsi_passthrough(q) \
|
|
|
|
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-08-07 20:17:56 +04:00
|
|
|
#define blk_noretry_request(rq) \
|
|
|
|
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
|
|
|
|
REQ_FAILFAST_DRIVER))
|
2017-06-18 23:24:27 +03:00
|
|
|
#define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
|
2017-11-09 21:49:57 +03:00
|
|
|
#define blk_queue_preempt_only(q) \
|
|
|
|
test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
|
|
|
|
|
|
|
|
extern int blk_set_preempt_only(struct request_queue *q);
|
|
|
|
extern void blk_clear_preempt_only(struct request_queue *q);
|
2010-08-07 20:17:56 +04:00
|
|
|
|
2017-01-31 18:57:29 +03:00
|
|
|
static inline bool blk_account_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return (rq->rq_flags & RQF_STARTED) && !blk_rq_is_passthrough(rq);
|
|
|
|
}
|
2010-08-07 20:17:56 +04:00
|
|
|
|
2008-08-26 12:25:02 +04:00
|
|
|
#define blk_rq_cpu_valid(rq) ((rq)->cpu != -1)
|
2007-07-16 10:52:14 +04:00
|
|
|
#define blk_bidi_rq(rq) ((rq)->next_rq != NULL)
|
2007-12-12 01:40:30 +03:00
|
|
|
/* rq->queuelist of dequeued request must be list_empty() */
|
|
|
|
#define blk_queued_rq(rq) (!list_empty(&(rq)->queuelist))
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
#define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist)
|
|
|
|
|
2016-06-05 22:32:22 +03:00
|
|
|
#define rq_data_dir(rq) (op_is_write(req_op(rq)) ? WRITE : READ)
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2014-04-16 20:57:18 +04:00
|
|
|
/*
|
|
|
|
* Driver can handle struct request, if it either has an old style
|
|
|
|
* request_fn defined, or is blk-mq based.
|
|
|
|
*/
|
|
|
|
static inline bool queue_is_rq_based(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->request_fn || q->mq_ops;
|
|
|
|
}
|
|
|
|
|
2010-12-01 21:41:49 +03:00
|
|
|
static inline unsigned int blk_queue_cluster(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.cluster;
|
|
|
|
}
|
|
|
|
|
2016-10-18 09:40:29 +03:00
|
|
|
static inline enum blk_zoned_model
|
|
|
|
blk_queue_zoned_model(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.zoned;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_queue_is_zoned(struct request_queue *q)
|
|
|
|
{
|
|
|
|
switch (blk_queue_zoned_model(q)) {
|
|
|
|
case BLK_ZONED_HA:
|
|
|
|
case BLK_ZONED_HM:
|
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-01-12 17:58:32 +03:00
|
|
|
static inline unsigned int blk_queue_zone_sectors(struct request_queue *q)
|
2016-10-18 09:40:33 +03:00
|
|
|
{
|
|
|
|
return blk_queue_is_zoned(q) ? q->limits.chunk_sectors : 0;
|
|
|
|
}
|
|
|
|
|
2017-12-21 09:43:38 +03:00
|
|
|
static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->nr_zones;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_queue_zone_no(struct request_queue *q,
|
|
|
|
sector_t sector)
|
|
|
|
{
|
|
|
|
if (!blk_queue_is_zoned(q))
|
|
|
|
return 0;
|
|
|
|
return sector >> ilog2(q->limits.chunk_sectors);
|
|
|
|
}
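The helper above turns a sector into a zone number with a right shift, which only works when the zone size in sectors is a power of two. A user-space sketch of the same arithmetic follows. The names `ilog2_u32` and `zone_no` are stand-ins chosen for illustration, and the power-of-two zone size is an assumption made explicit by the assertion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(v)), v > 0. */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical user-space version of the zone lookup: divide the sector
 * by the zone size using a shift. Assumes zone_sectors is a power of two.
 */
static unsigned int zone_no(uint64_t sector, uint32_t zone_sectors)
{
	assert(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
	return (unsigned int)(sector >> ilog2_u32(zone_sectors));
}
```

For a zone size of 524288 sectors (256 MiB with 512-byte sectors), sectors 0 through 524287 fall in zone 0 and sector 524288 starts zone 1.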
|
|
|
|
|
|
|
|
static inline bool blk_queue_zone_is_seq(struct request_queue *q,
|
|
|
|
sector_t sector)
|
|
|
|
{
|
|
|
|
if (!blk_queue_is_zoned(q) || !q->seq_zones_bitmap)
|
|
|
|
return false;
|
|
|
|
return test_bit(blk_queue_zone_no(q, sector), q->seq_zones_bitmap);
|
|
|
|
}
|
|
|
|
|
2009-04-06 16:48:01 +04:00
|
|
|
static inline bool rq_is_sync(struct request *rq)
|
|
|
|
{
|
2016-10-28 17:48:16 +03:00
|
|
|
return op_is_sync(rq->cmd_flags);
|
2009-04-06 16:48:01 +04:00
|
|
|
}
|
|
|
|
|
2012-06-05 07:40:59 +04:00
|
|
|
static inline bool blk_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-06-05 07:40:59 +04:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
return rl->flags & flag;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2012-06-05 07:40:59 +04:00
|
|
|
static inline void blk_set_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-06-05 07:40:59 +04:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags |= flag;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2012-06-05 07:40:59 +04:00
|
|
|
static inline void blk_clear_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-06-05 07:40:59 +04:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags &= ~flag;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2012-09-18 20:19:25 +04:00
|
|
|
static inline bool rq_mergeable(struct request *rq)
|
|
|
|
{
|
2017-01-31 18:57:29 +03:00
|
|
|
if (blk_rq_is_passthrough(rq))
|
2012-09-18 20:19:25 +04:00
|
|
|
return false;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-06-05 22:32:23 +03:00
|
|
|
if (req_op(rq) == REQ_OP_FLUSH)
|
|
|
|
return false;
|
|
|
|
|
2016-11-30 23:28:59 +03:00
|
|
|
if (req_op(rq) == REQ_OP_WRITE_ZEROES)
|
|
|
|
return false;
|
|
|
|
|
2012-09-18 20:19:25 +04:00
|
|
|
if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
|
2016-10-20 16:12:13 +03:00
|
|
|
return false;
|
|
|
|
if (rq->rq_flags & RQF_NOMERGE_FLAGS)
|
2012-09-18 20:19:25 +04:00
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-09-18 20:19:27 +04:00
|
|
|
static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
|
|
|
|
{
|
2017-06-19 10:24:41 +03:00
|
|
|
if (bio_page(a) == bio_page(b) &&
|
|
|
|
bio_offset(a) == bio_offset(b))
|
2012-09-18 20:19:27 +04:00
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2016-03-30 19:21:08 +03:00
|
|
|
static inline unsigned int blk_queue_depth(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->queue_depth)
|
|
|
|
return q->queue_depth;
|
|
|
|
|
|
|
|
return q->nr_requests;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* q->prep_rq_fn return values
|
|
|
|
*/
|
2016-02-04 08:52:12 +03:00
|
|
|
enum {
|
|
|
|
BLKPREP_OK, /* serve it */
|
|
|
|
BLKPREP_KILL, /* fatal error, kill, return -EIO */
|
|
|
|
BLKPREP_DEFER, /* leave on queue */
|
|
|
|
BLKPREP_INVALID, /* invalid command, kill, return -EREMOTEIO */
|
|
|
|
};
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
extern unsigned long blk_max_low_pfn, blk_max_pfn;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* standard bounce addresses:
|
|
|
|
*
|
|
|
|
* BLK_BOUNCE_HIGH : bounce all highmem pages
|
|
|
|
* BLK_BOUNCE_ANY : don't bounce anything
|
|
|
|
* BLK_BOUNCE_ISA : bounce pages above ISA DMA boundary
|
|
|
|
*/
|
2008-04-21 11:51:05 +04:00
|
|
|
|
|
|
|
#if BITS_PER_LONG == 32
|
2005-04-17 02:20:36 +04:00
|
|
|
#define BLK_BOUNCE_HIGH ((u64)blk_max_low_pfn << PAGE_SHIFT)
|
2008-04-21 11:51:05 +04:00
|
|
|
#else
|
|
|
|
#define BLK_BOUNCE_HIGH (-1ULL)
|
|
|
|
#endif
|
|
|
|
#define BLK_BOUNCE_ANY (-1ULL)
|
2010-05-31 10:59:03 +04:00
|
|
|
#define BLK_BOUNCE_ISA (DMA_BIT_MASK(24))
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2007-07-09 14:38:05 +04:00
|
|
|
/*
|
|
|
|
* default timeout for SG_IO if none specified
|
|
|
|
*/
|
|
|
|
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
|
2008-12-06 01:49:18 +03:00
|
|
|
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
|
2007-07-09 14:38:05 +04:00
|
|
|
|
2008-08-28 11:17:06 +04:00
|
|
|
struct rq_map_data {
|
|
|
|
struct page **pages;
|
|
|
|
int page_order;
|
|
|
|
int nr_entries;
|
2008-12-18 08:49:37 +03:00
|
|
|
unsigned long offset;
|
2008-12-18 08:49:38 +03:00
|
|
|
int null_mapped;
|
2009-07-09 16:46:53 +04:00
|
|
|
int from_user;
|
2008-08-28 11:17:06 +04:00
|
|
|
};
|
|
|
|
|
2007-09-25 14:35:59 +04:00
|
|
|
struct req_iterator {
|
2013-11-24 05:19:00 +04:00
|
|
|
struct bvec_iter iter;
|
2007-09-25 14:35:59 +04:00
|
|
|
struct bio *bio;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* This should not be used directly - use rq_for_each_segment */
|
2009-02-23 11:03:10 +03:00
|
|
|
#define for_each_bio(_bio) \
|
|
|
|
for (; _bio; _bio = _bio->bi_next)
|
2007-09-25 14:35:59 +04:00
|
|
|
#define __rq_for_each_bio(_bio, rq) \
|
2005-04-17 02:20:36 +04:00
|
|
|
if ((rq)->bio) \
|
|
|
|
for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
|
|
|
|
|
2007-09-25 14:35:59 +04:00
|
|
|
#define rq_for_each_segment(bvl, _rq, _iter) \
|
|
|
|
__rq_for_each_bio(_iter.bio, _rq) \
|
2013-11-24 05:19:00 +04:00
|
|
|
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
|
2007-09-25 14:35:59 +04:00
|
|
|
|
2013-08-08 01:26:21 +04:00
|
|
|
#define rq_iter_last(bvec, _iter) \
|
2013-11-24 05:19:00 +04:00
|
|
|
(_iter.bio->bi_next == NULL && \
|
2013-08-08 01:26:21 +04:00
|
|
|
bio_iter_last(bvec, _iter.iter))
|
2007-09-25 14:35:59 +04:00
|
|
|
|
2009-11-26 11:16:19 +03:00
|
|
|
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
|
|
|
|
#endif
|
|
|
|
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
extern void rq_flush_dcache_pages(struct request *rq);
|
|
|
|
#else
|
|
|
|
static inline void rq_flush_dcache_pages(struct request *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
extern int blk_register_queue(struct gendisk *disk);
|
|
|
|
extern void blk_unregister_queue(struct gendisk *disk);
|
2015-11-05 20:41:16 +03:00
|
|
|
extern blk_qc_t generic_make_request(struct bio *bio);
|
2017-11-02 21:29:50 +03:00
|
|
|
extern blk_qc_t direct_make_request(struct bio *bio);
|
2008-04-29 11:54:36 +04:00
|
|
|
extern void blk_rq_init(struct request_queue *q, struct request *rq);
|
2017-04-20 00:01:24 +03:00
|
|
|
extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
|
2005-04-17 02:20:36 +04:00
|
|
|
extern void blk_put_request(struct request *);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void __blk_put_request(struct request_queue *, struct request *);
|
2017-11-09 21:49:54 +03:00
|
|
|
extern struct request *blk_get_request_flags(struct request_queue *,
|
|
|
|
unsigned int op,
|
2017-11-09 21:49:59 +03:00
|
|
|
blk_mq_req_flags_t flags);
|
2017-06-20 21:15:39 +03:00
|
|
|
extern struct request *blk_get_request(struct request_queue *, unsigned int op,
|
|
|
|
gfp_t gfp_mask);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_requeue_request(struct request_queue *, struct request *);
|
2008-10-01 18:12:15 +04:00
|
|
|
extern int blk_lld_busy(struct request_queue *q);
|
2015-06-26 17:01:13 +03:00
|
|
|
extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data);
|
|
|
|
extern void blk_rq_unprep_clone(struct request *rq);
|
2017-06-03 10:38:04 +03:00
|
|
|
extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
|
2008-09-18 18:45:38 +04:00
|
|
|
struct request *rq);
|
2017-12-18 10:40:44 +03:00
|
|
|
extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
|
2011-03-02 19:08:00 +03:00
|
|
|
extern void blk_delay_queue(struct request_queue *, unsigned long);
|
2017-06-18 07:38:57 +03:00
|
|
|
extern void blk_queue_split(struct request_queue *, struct bio **);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_recount_segments(struct request_queue *, struct bio *);
|
2012-01-12 19:01:28 +04:00
|
|
|
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
|
2012-01-12 19:01:27 +04:00
|
|
|
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2007-08-27 23:38:10 +04:00
|
|
|
extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2008-09-03 01:16:41 +04:00
|
|
|
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
struct scsi_ioctl_command __user *);
|
2006-10-20 10:28:16 +04:00
|
|
|
|
2017-11-09 21:49:59 +03:00
|
|
|
extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
|
2015-11-20 00:29:28 +03:00
|
|
|
extern void blk_queue_exit(struct request_queue *q);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_start_queue(struct request_queue *q);
|
2015-12-28 23:01:22 +03:00
|
|
|
extern void blk_start_queue_async(struct request_queue *q);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_stop_queue(struct request_queue *q);
|
2005-04-17 02:20:36 +04:00
|
|
|
extern void blk_sync_queue(struct request_queue *q);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void __blk_stop_queue(struct request_queue *q);
|
2011-04-18 13:41:33 +04:00
|
|
|
extern void __blk_run_queue(struct request_queue *q);
|
2015-04-17 23:37:20 +03:00
|
|
|
extern void __blk_run_queue_uncond(struct request_queue *q);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_run_queue(struct request_queue *);
|
2011-04-19 15:32:46 +04:00
|
|
|
extern void blk_run_queue_async(struct request_queue *q);
|
2008-08-28 11:17:05 +04:00
|
|
|
extern int blk_rq_map_user(struct request_queue *, struct request *,
|
2008-08-28 11:17:06 +04:00
|
|
|
struct rq_map_data *, void __user *, unsigned long,
|
|
|
|
gfp_t);
|
2006-12-19 13:12:46 +03:00
|
|
|
extern int blk_rq_unmap_user(struct bio *);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
|
|
|
|
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
|
2015-01-18 18:16:31 +03:00
|
|
|
struct rq_map_data *, const struct iov_iter *,
|
|
|
|
gfp_t);
|
2017-04-20 17:02:55 +03:00
|
|
|
extern void blk_execute_rq(struct request_queue *, struct gendisk *,
|
2005-06-20 16:11:09 +04:00
|
|
|
struct request *, int);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
|
2006-01-06 12:00:50 +03:00
|
|
|
struct request *, int, rq_end_io_fn *);
|
2005-11-11 14:30:24 +03:00
|
|
|
|
2017-06-03 10:38:04 +03:00
|
|
|
int blk_status_to_errno(blk_status_t status);
|
|
|
|
blk_status_t errno_to_blk_status(int errno);
|
|
|
|
|
2017-11-02 21:29:54 +03:00
|
|
|
bool blk_poll(struct request_queue *q, blk_qc_t cookie);
|
2015-11-05 20:44:55 +03:00
|
|
|
|
2007-07-24 11:28:11 +04:00
|
|
|
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2014-09-08 03:03:56 +04:00
|
|
|
return bdev->bd_disk->queue; /* this is never NULL */
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2009-04-23 06:05:18 +04:00
|
|
|
/*
|
2009-07-03 12:48:17 +04:00
|
|
|
* blk_rq_pos() : the current sector
|
|
|
|
* blk_rq_bytes() : bytes left in the entire request
|
|
|
|
* blk_rq_cur_bytes() : bytes left in the current segment
|
|
|
|
* blk_rq_err_bytes() : bytes left till the next error boundary
|
|
|
|
* blk_rq_sectors() : sectors left in the entire request
|
|
|
|
* blk_rq_cur_sectors() : sectors left in the current segment
|
2009-04-23 06:05:18 +04:00
|
|
|
*/
|
2009-05-07 17:24:38 +04:00
|
|
|
static inline sector_t blk_rq_pos(const struct request *rq)
|
|
|
|
{
|
2009-05-07 17:24:44 +04:00
|
|
|
return rq->__sector;
|
2009-05-07 17:24:41 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_bytes(const struct request *rq)
|
|
|
|
{
|
2009-05-07 17:24:44 +04:00
|
|
|
return rq->__data_len;
|
2009-05-07 17:24:38 +04:00
|
|
|
}
|
|
|
|
|
2009-05-07 17:24:41 +04:00
|
|
|
static inline int blk_rq_cur_bytes(const struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->bio ? bio_cur_bytes(rq->bio) : 0;
|
|
|
|
}
|
2009-04-23 06:05:18 +04:00
|
|
|
|
2009-07-03 12:48:17 +04:00
|
|
|
extern unsigned int blk_rq_err_bytes(const struct request *rq);
|
|
|
|
|
2009-05-07 17:24:38 +04:00
|
|
|
static inline unsigned int blk_rq_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 17:24:41 +04:00
|
|
|
return blk_rq_bytes(rq) >> 9;
|
2009-05-07 17:24:38 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 17:24:41 +04:00
|
|
|
return blk_rq_cur_bytes(rq) >> 9;
|
2009-05-07 17:24:38 +04:00
|
|
|
}
|
|
|
|
|
2017-12-21 09:43:38 +03:00
|
|
|
static inline unsigned int blk_rq_zone_no(struct request *rq)
|
|
|
|
{
|
|
|
|
return blk_queue_zone_no(rq->q, blk_rq_pos(rq));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
|
|
|
|
{
|
|
|
|
return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
|
|
|
|
}
|
|
|
|
|
2017-01-13 14:29:10 +03:00
|
|
|
/*
|
|
|
|
* Some commands like WRITE SAME have a payload or data transfer size which
|
|
|
|
* is different from the size of the request. Any driver that supports such
|
|
|
|
* commands using the RQF_SPECIAL_PAYLOAD flag needs to use this helper to
|
|
|
|
* calculate the data transfer size.
|
|
|
|
*/
|
|
|
|
static inline unsigned int blk_rq_payload_bytes(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
|
|
|
|
return rq->special_vec.bv_len;
|
|
|
|
return blk_rq_bytes(rq);
|
|
|
|
}
|
|
|
|
|
2012-09-18 20:19:26 +04:00
|
|
|
static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
|
2016-06-05 22:32:15 +03:00
|
|
|
int op)
|
2012-09-18 20:19:26 +04:00
|
|
|
{
|
2016-08-16 10:59:35 +03:00
|
|
|
if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
|
block: fix max discard sectors limit
linux-v3.8-rc1 and later support plugging for blkdev_issue_discard with
commit 0cfbcafcae8b7364b5fa96c2b26ccde7a3a296a9
(block: add plug for blkdev_issue_discard )
For example,
1) DISCARD rq-1 with size 4GB
2) DISCARD rq-2 with size 1GB
If these 2 discard requests get merged, final request size will be 5GB.
In this case, the request's __data_len field may overflow, as it can store
at most 4GB (unsigned int).
This issue was observed while doing mkfs.f2fs on 5GB SD card:
https://lkml.org/lkml/2013/4/1/292
Info: sector size = 512
Info: total sectors = 11370496 (in 512bytes)
Info: zone aligned segment0 blkaddr: 512
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
mkfs process gets stuck in D state and I see the following in the dmesg:
[ 257.789733] __end_that: dev mmcblk0: type=1, flags=122c8081
[ 257.789764] sector 4194304, nr/cnr 2981888/4294959104
[ 257.789764] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
[ 257.794921] request botched: dev mmcblk0: type=1, flags=122c8081
[ 257.794921] sector 4194304, nr/cnr 2981888/4294959104
[ 257.794921] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
This patch fixes this issue.
Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-24 18:52:50 +04:00
|
|
|
return min(q->limits.max_discard_sectors, UINT_MAX >> 9);
|
2012-09-18 20:19:26 +04:00
|
|
|
|
2016-06-05 22:32:15 +03:00
|
|
|
if (unlikely(op == REQ_OP_WRITE_SAME))
|
2012-09-18 20:19:27 +04:00
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
2016-11-30 23:28:59 +03:00
|
|
|
if (unlikely(op == REQ_OP_WRITE_ZEROES))
|
|
|
|
return q->limits.max_write_zeroes_sectors;
|
|
|
|
|
2012-09-18 20:19:26 +04:00
|
|
|
return q->limits.max_sectors;
|
|
|
|
}
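The discard branch above caps the limit at `UINT_MAX >> 9` so that a merged discard, converted to bytes, still fits in the unsigned int `__data_len` field. The sketch below restates that reasoning in userspace C; the `sketch_*` helpers are hypothetical, and the 5GB check assumes a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical model of the discard branch: clamp to UINT_MAX >> 9 sectors. */
static inline unsigned int sketch_max_discard_sectors(unsigned int hw_limit)
{
	unsigned int cap = UINT_MAX >> 9;

	return hw_limit < cap ? hw_limit : cap;
}

/* True when a length of 'sectors' 512-byte sectors fits in an unsigned int. */
static inline int sketch_len_fits(uint64_t sectors)
{
	return (sectors << 9) <= UINT_MAX;
}

/* Self-check: the capped limit always fits; a 5GB merged discard does not. */
static inline void sketch_discard_selfcheck(void)
{
	uint64_t five_gb_sectors = (uint64_t)5 * 1024 * 1024 * 2;

	assert(sketch_len_fits(sketch_max_discard_sectors(UINT_MAX)));
	assert(!sketch_len_fits(five_gb_sectors)); /* assumes 32-bit unsigned int */
}
```

Clamping in sectors rather than bytes keeps the comparison in the same unit as `max_discard_sectors`.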
|
|
|
|
|
2014-06-05 23:38:39 +04:00
|
|
|
/*
|
|
|
|
* Return maximum size of a request at given offset. Only valid for
|
|
|
|
* file system requests.
|
|
|
|
*/
|
|
|
|
static inline unsigned int blk_max_size_offset(struct request_queue *q,
|
|
|
|
sector_t offset)
|
|
|
|
{
|
|
|
|
if (!q->limits.chunk_sectors)
|
2014-06-18 09:09:29 +04:00
|
|
|
return q->limits.max_sectors;
|
2014-06-05 23:38:39 +04:00
|
|
|
|
|
|
|
return q->limits.chunk_sectors -
|
|
|
|
(offset & (q->limits.chunk_sectors - 1));
|
|
|
|
}
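The chunk calculation above returns how many sectors remain before the next chunk boundary. A rough userspace sketch of the same mask arithmetic follows. The function name and flattened parameters are hypothetical. The mask only gives a correct remainder when `chunk_sectors` is a power of two, which the sketch asserts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of blk_max_size_offset(): sectors left in the chunk
 * containing 'offset', or max_sectors when no chunk size is configured.
 */
static inline unsigned int sketch_max_size_offset(unsigned int chunk_sectors,
						  unsigned int max_sectors,
						  uint64_t offset)
{
	if (!chunk_sectors)
		return max_sectors;

	/* The "& (chunk - 1)" trick is a modulo only for powers of two. */
	assert((chunk_sectors & (chunk_sectors - 1)) == 0);

	return chunk_sectors - (unsigned int)(offset & (chunk_sectors - 1));
}
```

For example, with 256-sector chunks, an offset of 1000 lies 232 sectors into its chunk, so 24 sectors remain before the boundary.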
|
|
|
|
|
2016-07-21 06:40:47 +03:00
|
|
|
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
|
|
|
|
sector_t offset)
|
2012-09-18 20:19:26 +04:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2017-01-31 18:57:29 +03:00
|
|
|
if (blk_rq_is_passthrough(rq))
|
2012-09-18 20:19:26 +04:00
|
|
|
return q->limits.max_hw_sectors;
|
|
|
|
|
2016-08-16 10:59:35 +03:00
|
|
|
if (!q->limits.chunk_sectors ||
|
|
|
|
req_op(rq) == REQ_OP_DISCARD ||
|
|
|
|
req_op(rq) == REQ_OP_SECURE_ERASE)
|
2016-06-05 22:32:15 +03:00
|
|
|
return blk_queue_get_max_sectors(q, req_op(rq));
|
2014-06-05 23:38:39 +04:00
|
|
|
|
2016-07-21 06:40:47 +03:00
|
|
|
return min(blk_max_size_offset(q, offset),
|
2016-06-05 22:32:15 +03:00
|
|
|
blk_queue_get_max_sectors(q, req_op(rq)));
|
2012-09-18 20:19:26 +04:00
|
|
|
}
|
|
|
|
|
2013-09-21 23:57:47 +04:00
|
|
|
static inline unsigned int blk_rq_count_bios(struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int nr_bios = 0;
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
__rq_for_each_bio(bio, rq)
|
|
|
|
nr_bios++;
|
|
|
|
|
|
|
|
return nr_bios;
|
|
|
|
}
|
|
|
|
|
2009-05-08 06:54:16 +04:00
|
|
|
/*
|
|
|
|
* Request issue related functions.
|
|
|
|
*/
|
|
|
|
extern struct request *blk_peek_request(struct request_queue *q);
|
|
|
|
extern void blk_start_request(struct request *rq);
|
|
|
|
extern struct request *blk_fetch_request(struct request_queue *q);
|
|
|
|
|
2017-11-02 21:29:51 +03:00
|
|
|
void blk_steal_bios(struct bio_list *list, struct request *rq);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2009-04-23 06:05:18 +04:00
|
|
|
* Request completion related functions.
|
|
|
|
*
|
|
|
|
* blk_update_request() completes given number of bytes and updates
|
|
|
|
* the request without completing it.
|
|
|
|
*
|
2009-04-23 06:05:19 +04:00
|
|
|
* blk_end_request() and friends. __blk_end_request() must be called
|
|
|
|
* with the request queue spinlock acquired.
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
|
|
|
* Several drivers define their own end_request and call
|
2007-12-12 01:52:28 +03:00
|
|
|
* blk_end_request() for parts of the original function.
|
|
|
|
* This prevents code duplication in drivers.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2017-06-03 10:38:04 +03:00
|
|
|
extern bool blk_update_request(struct request *rq, blk_status_t error,
|
2009-04-23 06:05:18 +04:00
|
|
|
unsigned int nr_bytes);
|
2017-06-03 10:38:04 +03:00
|
|
|
extern void blk_finish_request(struct request *rq, blk_status_t error);
|
|
|
|
extern bool blk_end_request(struct request *rq, blk_status_t error,
|
2009-05-11 12:56:09 +04:00
|
|
|
unsigned int nr_bytes);
|
2017-06-03 10:38:04 +03:00
|
|
|
extern void blk_end_request_all(struct request *rq, blk_status_t error);
|
|
|
|
extern bool __blk_end_request(struct request *rq, blk_status_t error,
|
2009-05-11 12:56:09 +04:00
|
|
|
unsigned int nr_bytes);
|
2017-06-03 10:38:04 +03:00
|
|
|
extern void __blk_end_request_all(struct request *rq, blk_status_t error);
|
|
|
|
extern bool __blk_end_request_cur(struct request *rq, blk_status_t error);
|
2009-04-23 06:05:18 +04:00
|
|
|
|
2006-01-09 18:02:34 +03:00
|
|
|
extern void blk_complete_request(struct request *);
|
2008-09-14 16:55:09 +04:00
|
|
|
extern void __blk_complete_request(struct request *);
|
|
|
|
extern void blk_abort_request(struct request *);
|
2010-07-01 14:49:17 +04:00
|
|
|
extern void blk_unprep_request(struct request *);
|
2006-01-09 18:02:34 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Access functions for manipulating queue properties
|
|
|
|
*/
|
2007-07-24 11:28:11 +04:00
|
|
|
extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
|
2005-06-23 11:08:19 +04:00
|
|
|
spinlock_t *lock, int node_id);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
|
2017-01-03 14:52:44 +03:00
|
|
|
extern int blk_init_allocated_queue(struct request_queue *);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_cleanup_queue(struct request_queue *);
|
|
|
|
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
|
|
|
|
extern void blk_queue_bounce_limit(struct request_queue *, u64);
|
2010-02-26 08:20:38 +03:00
|
|
|
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
|
2014-06-05 23:38:39 +04:00
|
|
|
extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
|
2010-02-26 08:20:39 +03:00
|
|
|
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
|
2017-02-08 16:46:49 +03:00
|
|
|
extern void blk_queue_max_discard_segments(struct request_queue *,
|
|
|
|
unsigned short);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
|
2009-09-30 15:54:20 +04:00
|
|
|
extern void blk_queue_max_discard_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_discard_sectors);
|
2012-09-18 20:19:27 +04:00
|
|
|
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_write_same_sectors);
|
2016-11-30 23:28:59 +03:00
|
|
|
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_write_zeroes_sectors);
|
2009-05-23 01:17:49 +04:00
|
|
|
extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
|
2010-10-13 23:18:03 +04:00
|
|
|
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
|
2009-05-23 01:17:53 +04:00
|
|
|
extern void blk_queue_alignment_offset(struct request_queue *q,
|
|
|
|
unsigned int alignment);
|
2009-07-31 19:49:11 +04:00
|
|
|
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
|
2009-05-23 01:17:53 +04:00
|
|
|
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
|
2009-09-11 23:54:52 +04:00
|
|
|
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
|
2009-05-23 01:17:53 +04:00
|
|
|
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
|
2016-03-30 19:21:08 +03:00
|
|
|
extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
|
2009-06-16 10:23:52 +04:00
|
|
|
extern void blk_set_default_limits(struct queue_limits *lim);
|
2012-01-11 19:27:11 +04:00
|
|
|
extern void blk_set_stacking_limits(struct queue_limits *lim);
|
2009-05-23 01:17:53 +04:00
|
|
|
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
|
|
|
|
sector_t offset);
|
2010-01-11 11:21:49 +03:00
|
|
|
extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2009-05-23 01:17:53 +04:00
|
|
|
extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b);
|
2008-03-04 13:18:17 +03:00
|
|
|
extern void blk_queue_dma_pad(struct request_queue *, unsigned int);
|
2008-07-04 11:30:03 +04:00
|
|
|
extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int);
|
2008-02-19 13:36:53 +03:00
|
|
|
extern int blk_queue_dma_drain(struct request_queue *q,
|
|
|
|
dma_drain_needed_fn *dma_drain_needed,
|
|
|
|
void *buf, unsigned int size);
|
2008-10-01 18:12:15 +04:00
|
|
|
extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
|
2015-08-20 00:24:05 +03:00
|
|
|
extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
|
2010-07-01 14:49:17 +04:00
|
|
|
extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_dma_alignment(struct request_queue *, int);
|
2008-01-01 01:37:00 +03:00
|
|
|
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
|
2008-09-14 16:55:09 +04:00
|
|
|
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
|
|
|
|
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
|
2011-05-06 21:34:32 +04:00
|
|
|
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
|
2016-04-12 21:32:46 +03:00
|
|
|
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-02-08 16:46:49 +03:00
|
|
|
/*
|
|
|
|
* Number of physical segments as sent to the device.
|
|
|
|
*
|
|
|
|
* Normally this is the number of discontiguous data segments sent by the
|
|
|
|
* submitter. But for data-less commands like discard we might have no
|
|
|
|
* actual data segments submitted, but the driver might have to add its
|
|
|
|
* own special payload. In that case we still return 1 here so that this
|
|
|
|
* special payload will be mapped.
|
|
|
|
*/
|
2016-12-09 01:20:32 +03:00
|
|
|
static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
|
|
|
|
return 1;
|
|
|
|
return rq->nr_phys_segments;
|
|
|
|
}
|
|
|
|
|
2017-02-08 16:46:49 +03:00
|
|
|
/*
|
|
|
|
* Number of discard segments (or ranges) the driver needs to fill in.
|
|
|
|
* Each discard bio merged into a request is counted as one segment.
|
|
|
|
*/
|
|
|
|
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
|
|
|
|
{
|
|
|
|
return max_t(unsigned short, rq->nr_phys_segments, 1);
|
|
|
|
}
|
|
|
|
|
2007-07-24 11:28:11 +04:00
|
|
|
extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
|
2005-04-17 02:20:36 +04:00
|
|
|
extern void blk_dump_rq_flags(struct request *, char *);
|
|
|
|
extern long nr_blockdev_pages(void);
|
|
|
|
|
2011-12-14 03:33:38 +04:00
|
|
|
bool __must_check blk_get_queue(struct request_queue *);
|
2007-07-24 11:28:11 +04:00
|
|
|
struct request_queue *blk_alloc_queue(gfp_t);
|
|
|
|
struct request_queue *blk_alloc_queue_node(gfp_t, int);
|
|
|
|
extern void blk_put_queue(struct request_queue *);
|
2015-06-05 19:57:37 +03:00
|
|
|
extern void blk_set_queue_dying(struct request_queue *);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-03-23 07:42:26 +04:00
|
|
|
/*
|
|
|
|
* block layer runtime pm functions
|
|
|
|
*/
|
2014-12-04 03:00:23 +03:00
|
|
|
#ifdef CONFIG_PM
|
2013-03-23 07:42:26 +04:00
|
|
|
extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
|
|
|
|
extern int blk_pre_runtime_suspend(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_suspend(struct request_queue *q, int err);
|
|
|
|
extern void blk_pre_runtime_resume(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_resume(struct request_queue *q, int err);
|
2016-02-18 11:54:11 +03:00
|
|
|
extern void blk_set_runtime_active(struct request_queue *q);
|
2013-03-23 07:42:26 +04:00
|
|
|
#else
|
|
|
|
static inline void blk_pm_runtime_init(struct request_queue *q,
|
|
|
|
struct device *dev) {}
|
|
|
|
static inline int blk_pre_runtime_suspend(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
static inline void blk_post_runtime_suspend(struct request_queue *q, int err) {}
|
|
|
|
static inline void blk_pre_runtime_resume(struct request_queue *q) {}
|
|
|
|
static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
|
2016-11-18 17:16:06 +03:00
|
|
|
static inline void blk_set_runtime_active(struct request_queue *q) {}
|
2013-03-23 07:42:26 +04:00
|
|
|
#endif
|
|
|
|
|
2011-07-08 10:19:21 +04:00
|
|
|
/*
|
2011-09-21 12:00:16 +04:00
|
|
|
* blk_plug permits building a queue of related requests by holding the I/O
|
|
|
|
* fragments for a short period. This allows merging of sequential requests
|
|
|
|
* into a single larger request. As the requests are moved from a per-task list to
|
|
|
|
* the device's request_queue in a batch, this results in improved scalability
|
|
|
|
* as the lock contention for request_queue lock is reduced.
|
|
|
|
*
|
|
|
|
* It is ok not to disable preemption when adding the request to the plug list
|
|
|
|
* or when attempting a merge, because blk_schedule_flush_plug() will only flush
|
|
|
|
* the plug list when the task sleeps by itself. For details, please see
|
|
|
|
* schedule() where blk_schedule_flush_plug() is called.
|
2011-07-08 10:19:21 +04:00
|
|
|
*/
|
2011-03-08 15:19:51 +03:00
|
|
|
struct blk_plug {
|
2011-09-21 12:00:16 +04:00
|
|
|
struct list_head list; /* requests */
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
|
|
|
struct list_head mq_list; /* blk-mq requests */
|
2011-09-21 12:00:16 +04:00
|
|
|
struct list_head cb_list; /* md requires an unplug callback */
|
2011-03-08 15:19:51 +03:00
|
|
|
};
|
2011-07-08 10:19:20 +04:00
|
|
|
#define BLK_MAX_REQUEST_COUNT 16
|
2016-11-04 03:03:53 +03:00
|
|
|
#define BLK_PLUG_FLUSH_SIZE (128 * 1024)
|
2011-07-08 10:19:20 +04:00
|
|
|
|
2012-07-31 11:08:14 +04:00
|
|
|
struct blk_plug_cb;
|
2012-07-31 11:08:15 +04:00
|
|
|
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
|
2011-04-18 11:52:22 +04:00
|
|
|
struct blk_plug_cb {
|
|
|
|
struct list_head list;
|
2012-07-31 11:08:14 +04:00
|
|
|
blk_plug_cb_fn callback;
|
|
|
|
void *data;
|
2011-04-18 11:52:22 +04:00
|
|
|
};
|
2012-07-31 11:08:14 +04:00
|
|
|
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
|
|
|
|
void *data, int size);
|
2011-03-08 15:19:51 +03:00
|
|
|
extern void blk_start_plug(struct blk_plug *);
|
|
|
|
extern void blk_finish_plug(struct blk_plug *);
|
2011-04-15 17:49:07 +04:00
|
|
|
extern void blk_flush_plug_list(struct blk_plug *, bool);
|
2011-03-08 15:19:51 +03:00
|
|
|
|
|
|
|
static inline void blk_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-16 15:27:55 +04:00
|
|
|
if (plug)
|
|
|
|
blk_flush_plug_list(plug, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_schedule_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-15 17:20:10 +04:00
|
|
|
if (plug)
|
2011-04-15 17:49:07 +04:00
|
|
|
blk_flush_plug_list(plug, true);
|
2011-03-08 15:19:51 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2013-10-24 12:20:05 +04:00
|
|
|
return plug &&
|
|
|
|
(!list_empty(&plug->list) ||
|
|
|
|
!list_empty(&plug->mq_list) ||
|
|
|
|
!list_empty(&plug->cb_list));
|
2011-03-08 15:19:51 +03:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* tag stuff
|
|
|
|
*/
|
2007-07-24 11:28:11 +04:00
|
|
|
extern int blk_queue_start_tag(struct request_queue *, struct request *);
|
|
|
|
extern struct request *blk_queue_find_tag(struct request_queue *, int);
|
|
|
|
extern void blk_queue_end_tag(struct request_queue *, struct request *);
|
2015-01-16 04:32:25 +03:00
|
|
|
extern int blk_queue_init_tags(struct request_queue *, int, struct blk_queue_tag *, int);
|
2007-07-24 11:28:11 +04:00
|
|
|
extern void blk_queue_free_tags(struct request_queue *);
|
|
|
|
extern int blk_queue_resize_tags(struct request_queue *, int);
|
|
|
|
extern void blk_queue_invalidate_tags(struct request_queue *);
|
2015-01-16 04:32:25 +03:00
|
|
|
extern struct blk_queue_tag *blk_init_tags(int, int);
|
2006-08-30 23:48:45 +04:00
|
|
|
extern void blk_free_tags(struct blk_queue_tag *);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2006-10-04 10:27:25 +04:00
|
|
|
static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
|
|
|
|
int tag)
|
|
|
|
{
|
|
|
|
if (unlikely(bqt == NULL || tag >= bqt->real_max_depth))
|
|
|
|
return NULL;
|
|
|
|
return bqt->tag_index[tag];
|
|
|
|
}
|
2010-09-16 22:51:46 +04:00
|
|
|
|
2017-04-05 20:21:08 +03:00
|
|
|
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
|
|
|
|
extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct page *page);
|
2016-07-19 12:23:33 +03:00
|
|
|
|
|
|
|
#define BLKDEV_DISCARD_SECURE (1 << 0) /* issue a secure erase */
|
2010-09-16 22:51:46 +04:00
|
|
|
|
2010-04-28 17:55:06 +04:00
|
|
|
extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
|
2016-04-16 21:55:28 +03:00
|
|
|
extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
2016-06-09 17:00:36 +03:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, int flags,
|
2016-06-05 22:31:49 +03:00
|
|
|
struct bio **biop);
|
2017-04-05 20:21:08 +03:00
|
|
|
|
|
|
|
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
|
2017-04-05 20:21:10 +03:00
|
|
|
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
|
2017-04-05 20:21:08 +03:00
|
|
|
|
2016-11-30 23:28:58 +03:00
|
|
|
extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
|
2017-04-05 20:21:08 +03:00
|
|
|
unsigned flags);
|
2010-04-28 17:55:09 +04:00
|
|
|
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
2017-04-05 20:21:08 +03:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
|
|
|
|
|
2010-08-18 13:29:10 +04:00
|
|
|
static inline int sb_issue_discard(struct super_block *sb, sector_t block,
|
|
|
|
sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
|
2008-08-05 21:01:53 +04:00
|
|
|
{
|
2010-08-18 13:29:10 +04:00
|
|
|
return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
|
|
|
gfp_mask, flags);
|
2008-08-05 21:01:53 +04:00
|
|
|
}
|
2010-10-28 05:30:04 +04:00
|
|
|
static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
|
2010-10-28 07:44:47 +04:00
|
|
|
sector_t nr_blocks, gfp_t gfp_mask)
|
2010-10-28 05:30:04 +04:00
|
|
|
{
|
|
|
|
return blkdev_issue_zeroout(sb->s_bdev,
|
|
|
|
block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
2017-04-05 20:21:08 +03:00
|
|
|
gfp_mask, 0);
|
2010-10-28 05:30:04 +04:00
|
|
|
}
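Both superblock helpers above convert filesystem block numbers to 512-byte sectors by shifting left by `s_blocksize_bits - 9`. The sketch below isolates that conversion; the function name is hypothetical and the block size is passed in directly rather than read from a `struct super_block`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the conversion in sb_issue_discard() and
 * sb_issue_zeroout(): filesystem block number -> 512-byte sector number.
 * blocksize_bits is log2 of the block size (12 for 4096-byte blocks).
 */
static inline uint64_t sketch_fs_block_to_sector(uint64_t block,
						 unsigned int blocksize_bits)
{
	/* Blocks smaller than one sector would make the shift negative. */
	assert(blocksize_bits >= 9);

	return block << (blocksize_bits - 9);
}
```

With 4096-byte blocks the shift is 3, so block 10 starts at sector 80 and a 10-block range covers 80 sectors.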
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-11-05 10:36:31 +03:00
|
|
|
extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
|
2008-06-26 15:48:27 +04:00
|
|
|
|
2010-02-26 08:20:37 +03:00
|
|
|
enum blk_default_limits {
|
|
|
|
BLK_MAX_SEGMENTS = 128,
|
|
|
|
BLK_SAFE_MAX_SECTORS = 255,
|
2015-08-13 21:57:57 +03:00
|
|
|
BLK_DEF_MAX_SECTORS = 2560,
|
2010-02-26 08:20:37 +03:00
|
|
|
BLK_MAX_SEGMENT_SIZE = 65536,
|
|
|
|
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
|
|
|
|
};
|
2008-12-03 14:55:08 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#define blkdev_entry_to_request(entry) list_entry((entry), struct request, queuelist)
|
|
|
|
|
2009-05-23 01:17:50 +04:00
|
|
|
static inline unsigned long queue_segment_boundary(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 01:17:51 +04:00
|
|
|
return q->limits.seg_boundary_mask;
|
2009-05-23 01:17:50 +04:00
|
|
|
}
|
|
|
|
|
2015-08-20 00:24:05 +03:00
|
|
|
static inline unsigned long queue_virt_boundary(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.virt_boundary_mask;
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:50 +04:00
|
|
|
static inline unsigned int queue_max_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 01:17:51 +04:00
|
|
|
return q->limits.max_sectors;
|
2009-05-23 01:17:50 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int queue_max_hw_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 01:17:51 +04:00
|
|
|
return q->limits.max_hw_sectors;
|
2009-05-23 01:17:50 +04:00
|
|
|
}
|
|
|
|
|
2010-02-26 08:20:39 +03:00
|
|
|
static inline unsigned short queue_max_segments(struct request_queue *q)
|
2009-05-23 01:17:50 +04:00
|
|
|
{
|
2010-02-26 08:20:39 +03:00
|
|
|
return q->limits.max_segments;
|
2009-05-23 01:17:50 +04:00
|
|
|
}
|
|
|
|
|
2017-02-08 16:46:49 +03:00
|
|
|
static inline unsigned short queue_max_discard_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.max_discard_segments;
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:50 +04:00
|
|
|
static inline unsigned int queue_max_segment_size(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 01:17:51 +04:00
|
|
|
return q->limits.max_segment_size;
|
2009-05-23 01:17:50 +04:00
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:49 +04:00
|
|
|
static inline unsigned short queue_logical_block_size(struct request_queue *q)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int retval = 512;
|
|
|
|
|
2009-05-23 01:17:51 +04:00
|
|
|
if (q && q->limits.logical_block_size)
|
|
|
|
retval = q->limits.logical_block_size;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:49 +04:00
|
|
|
static inline unsigned short bdev_logical_block_size(struct block_device *bdev)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2009-05-23 01:17:49 +04:00
|
|
|
return queue_logical_block_size(bdev_get_queue(bdev));
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:53 +04:00
|
|
|
static inline unsigned int queue_physical_block_size(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.physical_block_size;
|
|
|
|
}
|
|
|
|
|
2010-10-13 23:18:03 +04:00
|
|
|
static inline unsigned int bdev_physical_block_size(struct block_device *bdev)
|
2009-10-03 22:52:01 +04:00
|
|
|
{
|
|
|
|
return queue_physical_block_size(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:53 +04:00
|
|
|
static inline unsigned int queue_io_min(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_min;
|
|
|
|
}
|
|
|
|
|
2009-10-03 22:52:01 +04:00
|
|
|
static inline int bdev_io_min(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_min(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:53 +04:00
|
|
|
static inline unsigned int queue_io_opt(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_opt;
|
|
|
|
}
|
|
|
|
|
2009-10-03 22:52:01 +04:00
|
|
|
static inline int bdev_io_opt(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_opt(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 01:17:53 +04:00
|
|
|
static inline int queue_alignment_offset(struct request_queue *q)
|
|
|
|
{
|
2009-10-03 22:52:01 +04:00
|
|
|
if (q->limits.misaligned)
|
2009-05-23 01:17:53 +04:00
|
|
|
return -1;
|
|
|
|
|
2009-10-03 22:52:01 +04:00
|
|
|
return q->limits.alignment_offset;
|
2009-05-23 01:17:53 +04:00
|
|
|
}
|
|
|
|
|
2010-01-11 11:21:51 +03:00
|
|
|
static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t sector)
|
2009-12-29 10:35:35 +03:00
|
|
|
{
|
|
|
|
unsigned int granularity = max(lim->physical_block_size, lim->io_min);
|
2014-10-09 02:26:13 +04:00
|
|
|
unsigned int alignment = sector_div(sector, granularity >> 9) << 9;
|
2009-12-29 10:35:35 +03:00
|
|
|
|
2014-10-09 02:26:13 +04:00
|
|
|
return (granularity + lim->alignment_offset - alignment) % granularity;
|
2009-05-23 01:17:53 +04:00
|
|
|
}
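The arithmetic in queue_limit_alignment_offset() is easier to follow with concrete numbers. What follows is a user-space sketch, not the kernel code: `struct limits_sketch` and `limit_alignment_offset()` are hypothetical stand-ins for `struct queue_limits` and the helper above, a plain `%` replaces `sector_div()`, and the sketch assumes the granularity is at least 512 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three queue_limits fields used above (bytes). */
struct limits_sketch {
    unsigned int physical_block_size;
    unsigned int io_min;
    unsigned int alignment_offset;
};

/*
 * Same computation as queue_limit_alignment_offset(), for a region that
 * starts at the 512-byte sector `sector`. A result of 0 means the start
 * lands on a granularity boundary once the device offset is accounted for.
 */
static unsigned int limit_alignment_offset(const struct limits_sketch *lim,
                                           uint64_t sector)
{
    unsigned int granularity = lim->physical_block_size > lim->io_min ?
                               lim->physical_block_size : lim->io_min;
    unsigned int alignment;

    /* Assumption of this sketch: avoids dividing by zero below. */
    assert(granularity >= 512);

    /* Byte offset of `sector` inside its granularity-sized block. */
    alignment = (unsigned int)(sector % (granularity >> 9)) << 9;

    return (granularity + lim->alignment_offset - alignment) % granularity;
}
```

With 4096-byte physical blocks and a device alignment_offset of 3584, a partition starting at sector 63 yields 0, whereas the same start sector on a device with alignment_offset 0 yields 512.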
|
|
|
|
|
2009-10-03 22:52:01 +04:00
|
|
|
static inline int bdev_alignment_offset(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q->limits.misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->alignment_offset;
|
|
|
|
|
|
|
|
return q->limits.alignment_offset;
|
|
|
|
}
|
|
|
|
|
2009-11-10 13:50:21 +03:00
|
|
|
static inline int queue_discard_alignment(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->limits.discard_misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2010-01-11 11:21:51 +03:00
|
|
|
static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector_t sector)
|
2009-11-10 13:50:21 +03:00
|
|
|
{
|
2012-12-19 19:18:35 +04:00
|
|
|
unsigned int alignment, granularity, offset;
|
2010-01-11 11:21:48 +03:00
|
|
|
|
2011-05-18 12:37:35 +04:00
|
|
|
if (!lim->max_discard_sectors)
|
|
|
|
return 0;
|
|
|
|
|
2012-12-19 19:18:35 +04:00
|
|
|
/* Why are these in bytes, not sectors? */
|
|
|
|
alignment = lim->discard_alignment >> 9;
|
|
|
|
granularity = lim->discard_granularity >> 9;
|
|
|
|
if (!granularity)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Offset of the partition start in 'granularity' sectors */
|
|
|
|
offset = sector_div(sector, granularity);
|
|
|
|
|
|
|
|
/* And why do we do this modulus *again* in blkdev_issue_discard()? */
|
|
|
|
offset = (granularity + alignment - offset) % granularity;
|
|
|
|
|
|
|
|
/* Turn it back into bytes, gaah */
|
|
|
|
return offset << 9;
|
2009-11-10 13:50:21 +03:00
|
|
|
}
|
|
|
|
|
block: split discard into aligned requests
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment. In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.
Another example that helps show the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128. A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded. With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257. The patch will also
take into account the discard_alignment.
At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors. Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
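The splitting rule described in the message above can be sketched in user space as follows. This is an illustration under stated assumptions, not the code from the patch: `struct discard_range` and `split_discard()` are hypothetical names, all quantities are in 512-byte sectors, and plain `%` and `/` stand in for `sector_div()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one discard request, inclusive bounds. */
struct discard_range {
    uint64_t first;
    uint64_t last;
};

/*
 * Split [sector, sector + nr_sects) into requests of at most
 * max_discard_sectors. Any request that is followed by another one is
 * truncated so that it ends on a granularity boundary (shifted by
 * `alignment`). Returns the number of ranges stored in `out`.
 */
static size_t split_discard(uint64_t sector, uint64_t nr_sects,
                            uint32_t granularity, uint32_t alignment,
                            uint32_t max_discard_sectors,
                            struct discard_range *out, size_t max_out)
{
    size_t n = 0;

    assert(out != NULL);
    if (granularity == 0)
        granularity = 1;
    alignment %= granularity;

    /* Full-size requests cover a whole number of granules. */
    max_discard_sectors -= max_discard_sectors % granularity;
    if (max_discard_sectors == 0)
        return 0;

    while (nr_sects && n < max_out) {
        uint64_t req = nr_sects < max_discard_sectors ?
                       nr_sects : max_discard_sectors;
        uint64_t end = sector + req;

        /* More to come and a misaligned end: round the end down. */
        if (req < nr_sects && end % granularity != alignment) {
            end = (end - alignment) / granularity * granularity + alignment;
            req = end - sector;
        }

        out[n].first = sector;
        out[n].last = end - 1;
        n++;

        sector = end;
        nr_sects -= req;
    }
    return n;
}
```

Called with sector 2, 256 sectors, granularity 64, alignment 0 and max_discard_sectors 128, the sketch produces the three ranges quoted in the message: 2..127, 128..255 and 256..257.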
2012-08-02 11:48:50 +04:00
|
|
|
static inline int bdev_discard_alignment(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->discard_alignment;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2012-09-18 20:19:27 +04:00
|
|
|
static inline unsigned int bdev_write_same(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-11-30 23:28:59 +03:00
|
|
|
static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return q->limits.max_write_zeroes_sectors;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-10-18 09:40:29 +03:00
|
|
|
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return blk_queue_zoned_model(q);
|
|
|
|
|
|
|
|
return BLK_ZONED_NONE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool bdev_is_zoned(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return blk_queue_is_zoned(q);
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-01-12 17:58:32 +03:00
|
|
|
static inline unsigned int bdev_zone_sectors(struct block_device *bdev)
|
2016-10-18 09:40:33 +03:00
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
2017-01-12 17:58:32 +03:00
|
|
|
return blk_queue_zone_sectors(q);
|
2017-12-21 09:43:38 +03:00
|
|
|
return 0;
|
|
|
|
}
|
2016-10-18 09:40:33 +03:00
|
|
|
|
2017-12-21 09:43:38 +03:00
|
|
|
static inline unsigned int bdev_nr_zones(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
2016-10-18 09:40:33 +03:00
|
|
|
|
2017-12-21 09:43:38 +03:00
|
|
|
if (q)
|
|
|
|
return blk_queue_nr_zones(q);
|
2016-10-18 09:40:33 +03:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-07-24 11:28:11 +04:00
|
|
|
static inline int queue_dma_alignment(struct request_queue *q)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2008-01-01 18:23:02 +03:00
|
|
|
return q ? q->dma_alignment : 511;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2010-09-15 15:08:27 +04:00
|
|
|
static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
|
2008-08-28 10:05:58 +04:00
|
|
|
unsigned int len)
|
|
|
|
{
|
|
|
|
unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
|
2010-09-15 15:08:27 +04:00
|
|
|
return !(addr & alignment) && !(len & alignment);
|
2008-08-28 10:05:58 +04:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/* assumes size > 256 */
|
|
|
|
static inline unsigned int blksize_bits(unsigned int size)
|
|
|
|
{
|
|
|
|
unsigned int bits = 8;
|
|
|
|
do {
|
|
|
|
bits++;
|
|
|
|
size >>= 1;
|
|
|
|
} while (size > 256);
|
|
|
|
return bits;
|
|
|
|
}
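For power-of-two sizes of 512 and above, the loop in blksize_bits() returns the base-2 logarithm of the size. The user-space copy below (the name `blksize_bits_sketch` is hypothetical) makes the "size > 256" assumption explicit with an assertion, which the kernel helper does not do.

```c
#include <assert.h>

/* User-space copy of the loop above; only meaningful for size > 256. */
static unsigned int blksize_bits_sketch(unsigned int size)
{
    unsigned int bits = 8;

    assert(size > 256);
    do {
        bits++;       /* one more bit for every halving ...  */
        size >>= 1;   /* ... until the size has dropped to 256 */
    } while (size > 256);

    return bits;
}
```

For example, a 512-byte block size gives 9 and a 4096-byte block size gives 12.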
|
|
|
|
|
2005-09-10 11:27:17 +04:00
|
|
|
static inline unsigned int block_size(struct block_device *bdev)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
return bdev->bd_block_size;
|
|
|
|
}
|
|
|
|
|
2011-05-06 21:34:32 +04:00
|
|
|
static inline bool queue_flush_queueable(struct request_queue *q)
|
|
|
|
{
|
2016-04-13 22:33:19 +03:00
|
|
|
return !test_bit(QUEUE_FLAG_FLUSH_NQ, &q->queue_flags);
|
2011-05-06 21:34:32 +04:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
typedef struct {struct page *v;} Sector;
|
|
|
|
|
|
|
|
unsigned char *read_dev_sector(struct block_device *, sector_t, Sector *);
|
|
|
|
|
|
|
|
static inline void put_dev_sector(Sector p)
|
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
put_page(p.v);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2016-02-26 18:40:51 +03:00
|
|
|
static inline bool __bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
|
|
|
return offset ||
|
|
|
|
((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
|
|
|
|
}
|
|
|
|
|
2015-08-20 00:24:05 +03:00
|
|
|
/*
|
|
|
|
* Check if adding a bio_vec after bprv with offset would create a gap in
|
|
|
|
* the SG list. Most drivers don't care about this, but some do.
|
|
|
|
*/
|
|
|
|
static inline bool bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
|
|
|
if (!queue_virt_boundary(q))
|
|
|
|
return false;
|
2016-02-26 18:40:51 +03:00
|
|
|
return __bvec_gap_to_prev(q, bprv, offset);
|
2015-08-20 00:24:05 +03:00
|
|
|
}
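The gap test performed by bvec_gap_to_prev() and __bvec_gap_to_prev() can be tried out in user space. The sketch below is hypothetical: `struct bvec_sketch` and `gap_to_prev()` are illustrative names, and the mask is passed in directly where the kernel code calls queue_virt_boundary(q). A mask of 4095 (a 4 KiB boundary) is used purely as an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two bio_vec fields the check reads. */
struct bvec_sketch {
    unsigned int bv_offset;
    unsigned int bv_len;
};

/*
 * A gap exists when the next vector does not start at offset 0, or when
 * the previous vector does not end on the virt boundary. A zero mask
 * means the queue has no such restriction, so there is never a gap.
 */
static bool gap_to_prev(unsigned long virt_boundary_mask,
                        const struct bvec_sketch *prev,
                        unsigned int next_offset)
{
    assert(prev != NULL);
    if (!virt_boundary_mask)
        return false;
    return next_offset ||
           ((prev->bv_offset + prev->bv_len) & virt_boundary_mask);
}
```

With a mask of 4095, a previous vector covering a full 4096 bytes followed by a vector at offset 0 reports no gap, while a previous vector of 512 bytes, or a following vector at offset 512, reports a gap.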
|
|
|
|
|
2016-12-17 13:49:09 +03:00
|
|
|
/*
|
|
|
|
* Check if the two bvecs from two bios can be merged to one segment.
|
|
|
|
* If yes, no need to check gap between the two bios since the 1st bio
|
|
|
|
* and the 1st bvec in the 2nd bio can be handled in one segment.
|
|
|
|
*/
|
|
|
|
static inline bool bios_segs_mergeable(struct request_queue *q,
|
|
|
|
struct bio *prev, struct bio_vec *prev_last_bv,
|
|
|
|
struct bio_vec *next_first_bv)
|
|
|
|
{
|
|
|
|
if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
|
|
|
|
return false;
|
|
|
|
if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
|
|
|
|
return false;
|
|
|
|
if (prev->bi_seg_back_size + next_first_bv->bv_len >
|
|
|
|
queue_max_segment_size(q))
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2017-04-14 22:58:29 +03:00
|
|
|
static inline bool bio_will_gap(struct request_queue *q,
|
|
|
|
struct request *prev_rq,
|
|
|
|
struct bio *prev,
|
|
|
|
struct bio *next)
|
2015-09-03 19:28:20 +03:00
|
|
|
{
|
2016-02-26 18:40:52 +03:00
|
|
|
if (bio_has_data(prev) && queue_virt_boundary(q)) {
|
|
|
|
struct bio_vec pb, nb;
|
|
|
|
|
2017-04-14 22:58:29 +03:00
|
|
|
/*
|
|
|
|
* don't merge if the 1st bio starts with non-zero
|
|
|
|
* offset, otherwise it is quite difficult to respect
|
|
|
|
* sg gap limit. We work hard to merge a huge number of small
|
|
|
|
* single bios in case of mkfs.
|
|
|
|
*/
|
|
|
|
if (prev_rq)
|
|
|
|
bio_get_first_bvec(prev_rq->bio, &pb);
|
|
|
|
else
|
|
|
|
bio_get_first_bvec(prev, &pb);
|
|
|
|
if (pb.bv_offset)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't need to worry about the situation that the
|
|
|
|
* merged segment ends in unaligned virt boundary:
|
|
|
|
*
|
|
|
|
* - if 'pb' ends aligned, the merged segment ends aligned
|
|
|
|
* - if 'pb' ends unaligned, the next bio must include
|
|
|
|
* one single bvec of 'nb', otherwise the 'nb' can't
|
|
|
|
* merge with 'pb'
|
|
|
|
*/
|
2016-02-26 18:40:52 +03:00
|
|
|
bio_get_last_bvec(prev, &pb);
|
|
|
|
bio_get_first_bvec(next, &nb);
|
2015-09-03 19:28:20 +03:00
|
|
|
|
2016-12-17 13:49:09 +03:00
|
|
|
if (!bios_segs_mergeable(q, prev, &pb, &nb))
|
|
|
|
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
|
2016-02-26 18:40:52 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
2015-09-03 19:28:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
|
|
|
|
{
|
2017-04-14 22:58:29 +03:00
|
|
|
return bio_will_gap(req->q, req, req->biotail, bio);
|
2015-09-03 19:28:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
|
|
|
|
{
|
2017-04-14 22:58:29 +03:00
|
|
|
return bio_will_gap(req->q, NULL, bio, req->bio);
|
2015-09-03 19:28:20 +03:00
|
|
|
}
|
|
|
|
|
2014-04-08 19:15:35 +04:00
|
|
|
int kblockd_schedule_work(struct work_struct *work);
|
2016-08-25 00:52:48 +03:00
|
|
|
int kblockd_schedule_work_on(int cpu, struct work_struct *work);
|
2017-04-10 18:54:55 +03:00
|
|
|
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-04-02 02:01:41 +04:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2010-06-01 14:23:18 +04:00
|
|
|
/*
|
|
|
|
* This should not be using sched_clock(). A real patch is in progress
|
|
|
|
* to fix this up, until that is in place we need to disable preemption
|
|
|
|
* around sched_clock() in this function and set_io_start_time_ns().
|
|
|
|
*/
|
2010-04-02 02:01:41 +04:00
|
|
|
static inline void set_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 14:23:18 +04:00
|
|
|
preempt_disable();
|
2010-04-02 02:01:41 +04:00
|
|
|
req->start_time_ns = sched_clock();
|
2010-06-01 14:23:18 +04:00
|
|
|
preempt_enable();
|
2010-04-02 02:01:41 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_io_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 14:23:18 +04:00
|
|
|
preempt_disable();
|
2010-04-02 02:01:41 +04:00
|
|
|
req->io_start_time_ns = sched_clock();
|
2010-06-01 14:23:18 +04:00
|
|
|
preempt_enable();
|
2010-04-02 02:01:41 +04:00
|
|
|
}
|
2010-04-09 10:31:19 +04:00
|
|
|
|
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->start_time_ns;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->io_start_time_ns;
|
|
|
|
}
|
2010-04-02 02:01:41 +04:00
|
|
|
#else
|
|
|
|
static inline void set_start_time_ns(struct request *req) {}
|
|
|
|
static inline void set_io_start_time_ns(struct request *req) {}
|
2010-04-09 10:31:19 +04:00
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2010-04-02 02:01:41 +04:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#define MODULE_ALIAS_BLOCKDEV(major,minor) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
|
|
|
|
#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-*")
|
|
|
|
|
2008-06-30 22:04:41 +04:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
|
2014-09-27 03:20:02 +04:00
|
|
|
enum blk_integrity_flags {
|
|
|
|
BLK_INTEGRITY_VERIFY = 1 << 0,
|
|
|
|
BLK_INTEGRITY_GENERATE = 1 << 1,
|
2014-09-27 03:20:03 +04:00
|
|
|
BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
|
2014-09-27 03:20:05 +04:00
|
|
|
BLK_INTEGRITY_IP_CHECKSUM = 1 << 3,
|
2014-09-27 03:20:02 +04:00
|
|
|
};
|
2008-06-30 22:04:41 +04:00
|
|
|
|
2014-09-27 03:20:01 +04:00
|
|
|
struct blk_integrity_iter {
|
2008-06-30 22:04:41 +04:00
|
|
|
void *prot_buf;
|
|
|
|
void *data_buf;
|
2014-09-27 03:19:59 +04:00
|
|
|
sector_t seed;
|
2008-06-30 22:04:41 +04:00
|
|
|
unsigned int data_size;
|
2014-09-27 03:19:59 +04:00
|
|
|
unsigned short interval;
|
2008-06-30 22:04:41 +04:00
|
|
|
const char *disk_name;
|
|
|
|
};
|
|
|
|
|
2017-06-03 10:38:06 +03:00
|
|
|
typedef blk_status_t (integrity_processing_fn) (struct blk_integrity_iter *);
|
2008-06-30 22:04:41 +04:00
|
|
|
|
2015-10-21 20:19:33 +03:00
|
|
|
struct blk_integrity_profile {
|
|
|
|
integrity_processing_fn *generate_fn;
|
|
|
|
integrity_processing_fn *verify_fn;
|
|
|
|
const char *name;
|
|
|
|
};
|
2008-06-30 22:04:41 +04:00
|
|
|
|
2015-10-21 20:19:49 +03:00
|
|
|
extern void blk_integrity_register(struct gendisk *, struct blk_integrity *);
|
2008-06-30 22:04:41 +04:00
|
|
|
extern void blk_integrity_unregister(struct gendisk *);
|
2008-10-01 11:38:39 +04:00
|
|
|
extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
|
2010-09-10 22:50:10 +04:00
|
|
|
extern int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
|
|
|
|
struct scatterlist *);
|
|
|
|
extern int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
|
2014-09-27 03:20:06 +04:00
|
|
|
extern bool blk_integrity_merge_rq(struct request_queue *, struct request *,
|
|
|
|
struct request *);
|
|
|
|
extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
|
|
|
|
struct bio *);
|
2008-06-30 22:04:41 +04:00
|
|
|
|
2015-10-21 20:19:49 +03:00
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
2008-10-02 14:53:22 +04:00
|
|
|
{
|
2015-10-21 20:20:18 +03:00
|
|
|
struct blk_integrity *bi = &disk->queue->integrity;
|
2015-10-21 20:19:49 +03:00
|
|
|
|
|
|
|
if (!bi->profile)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return bi;
|
2008-10-02 14:53:22 +04:00
|
|
|
}
|
|
|
|
|
2015-10-21 20:19:49 +03:00
|
|
|
static inline
|
|
|
|
struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
|
2008-10-02 20:47:49 +04:00
|
|
|
{
|
2015-10-21 20:19:49 +03:00
|
|
|
return blk_get_integrity(bdev->bd_disk);
|
2008-10-02 20:47:49 +04:00
|
|
|
}
|
|
|
|
|
2014-09-27 03:19:56 +04:00
|
|
|
static inline bool blk_integrity_rq(struct request *rq)
|
2008-06-30 22:04:41 +04:00
|
|
|
{
|
2014-09-27 03:19:56 +04:00
|
|
|
return rq->cmd_flags & REQ_INTEGRITY;
|
2008-06-30 22:04:41 +04:00
|
|
|
}
|
|
|
|
|
2010-09-10 22:50:10 +04:00
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
q->limits.max_integrity_segments = segs;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned short
|
|
|
|
queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.max_integrity_segments;
|
|
|
|
}
|
|
|
|
|
2015-09-11 18:03:04 +03:00
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(req->bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(next);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
|
|
|
|
2008-06-30 22:04:41 +04:00
|
|
|
#else /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2012-01-12 12:17:30 +04:00
|
|
|
struct bio;
|
|
|
|
struct block_device;
|
|
|
|
struct gendisk;
|
|
|
|
struct blk_integrity;
|
|
|
|
|
|
|
|
static inline int blk_integrity_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_map_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b,
|
|
|
|
struct scatterlist *s)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline struct blk_integrity *bdev_get_integrity(struct block_device *b)
|
|
|
|
{
|
2014-10-10 02:30:17 +04:00
|
|
|
return NULL;
|
2012-01-12 12:17:30 +04:00
|
|
|
}
|
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-10-21 20:19:49 +03:00
|
|
|
static inline void blk_integrity_register(struct gendisk *d,
|
2012-01-12 12:17:30 +04:00
|
|
|
struct blk_integrity *b)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_integrity_unregister(struct gendisk *d)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline unsigned short queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2014-09-27 03:20:06 +04:00
|
|
|
static inline bool blk_integrity_merge_rq(struct request_queue *rq,
|
|
|
|
struct request *r1,
|
|
|
|
struct request *r2)
|
2012-01-12 12:17:30 +04:00
|
|
|
{
|
2014-10-29 05:27:43 +03:00
|
|
|
return true;
|
2012-01-12 12:17:30 +04:00
|
|
|
}
|
2014-09-27 03:20:06 +04:00
|
|
|
static inline bool blk_integrity_merge_bio(struct request_queue *rq,
|
|
|
|
struct request *r,
|
|
|
|
struct bio *b)
|
2012-01-12 12:17:30 +04:00
|
|
|
{
|
2014-10-29 05:27:43 +03:00
|
|
|
return true;
|
2012-01-12 12:17:30 +04:00
|
|
|
}
|
2015-10-21 20:19:49 +03:00
|
|
|
|
2015-09-11 18:03:04 +03:00
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2008-06-30 22:04:41 +04:00
|
|
|
|
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2007-10-08 21:26:20 +04:00
|
|
|
struct block_device_operations {
|
[PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of the old methods remains. The only reason why
we do it this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-02 17:09:22 +03:00
|
|
|
int (*open) (struct block_device *, fmode_t);
|
2013-05-06 05:52:57 +04:00
|
|
|
void (*release) (struct gendisk *, fmode_t);
|
2016-08-05 17:11:04 +03:00
|
|
|
int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
|
2008-03-02 17:09:22 +03:00
|
|
|
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
|
|
|
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows, which only issues
a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such
command sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy; an in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return it if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called concurrently.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0, which means disabled) and
/sys/block/*/events_poll_msecs controls the polling interval for
individual devices (default is -1, meaning use the system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
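The polling rules listed in the message above (per-device interval, system default, and the exemption for devices that report everything asynchronously) can be condensed into a small decision function. This is a user-space sketch with hypothetical names (`struct disk_sketch`, `effective_poll_msecs()`, the `EV_*` bit values); it only mirrors the prose and is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values; the real DISK_EVENT_* constants may differ. */
#define EV_MEDIA_CHANGE  (1u << 0)
#define EV_EJECT_REQUEST (1u << 1)

/* Hypothetical stand-in for the per-disk state the message describes. */
struct disk_sketch {
    unsigned int events;        /* mask of all supported events */
    unsigned int async_events;  /* events reported without polling */
    long poll_msecs;            /* -1: use the system setting */
};

/* Returns the polling interval in milliseconds; 0 means "do not poll". */
static long effective_poll_msecs(const struct disk_sketch *d,
                                 long system_dfl_poll_msecs)
{
    assert(d != NULL);

    /* An explicitly set per-device interval always wins. */
    if (d->poll_msecs >= 0)
        return d->poll_msecs;

    /* Fall back to the system interval only if some event needs polling. */
    if (d->events & ~d->async_events)
        return system_dfl_poll_msecs;

    /* Every supported event is reported asynchronously. */
    return 0;
}
```

A disk that supports both events but reports neither asynchronously follows the system interval, while a disk whose `async_events` equals `events` is not polled unless its own interval is set explicitly.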
2010-12-08 22:57:37 +03:00
|
|
|
unsigned int (*check_events) (struct gendisk *disk,
|
|
|
|
unsigned int clearing);
|
|
|
|
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
|
2007-10-08 21:26:20 +04:00
|
|
|
int (*media_changed) (struct gendisk *);
|
2010-05-15 22:09:29 +04:00
|
|
|
void (*unlock_native_capacity) (struct gendisk *);
|
2007-10-08 21:26:20 +04:00
|
|
|
int (*revalidate_disk) (struct gendisk *);
|
|
|
|
int (*getgeo)(struct block_device *, struct hd_geometry *);
|
2010-05-17 09:32:43 +04:00
|
|
|
/* this callback is called with swap_lock and sometimes the page table lock held */
|
|
|
|
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
|
2007-10-08 21:26:20 +04:00
|
|
|
struct module *owner;
|
2015-10-15 15:10:48 +03:00
|
|
|
const struct pr_ops *pr_ops;
|
2007-10-08 21:26:20 +04:00
|
|
|
};
|
|
|
|
|
2007-08-30 04:34:12 +04:00
|
|
|
extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
|
|
|
|
unsigned long);
|
2014-06-05 03:07:46 +04:00
|
|
|
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
|
|
|
|
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
|
|
|
|
struct writeback_control *);
|
2017-12-21 09:43:38 +03:00
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_ZONED
|
|
|
|
bool blk_req_needs_zone_write_lock(struct request *rq);
|
|
|
|
void __blk_req_zone_write_lock(struct request *rq);
|
|
|
|
void __blk_req_zone_write_unlock(struct request *rq);
|
|
|
|
|
|
|
|
static inline void blk_req_zone_write_lock(struct request *rq)
|
|
|
|
{
|
|
|
|
if (blk_req_needs_zone_write_lock(rq))
|
|
|
|
__blk_req_zone_write_lock(rq);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_req_zone_write_unlock(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->rq_flags & RQF_ZONE_WRITE_LOCKED)
|
|
|
|
__blk_req_zone_write_unlock(rq);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_req_zone_is_write_locked(struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->q->seq_zones_wlock &&
|
|
|
|
test_bit(blk_rq_zone_no(rq), rq->q->seq_zones_wlock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
|
|
|
|
{
|
|
|
|
if (!blk_req_needs_zone_write_lock(rq))
|
|
|
|
return true;
|
|
|
|
return !blk_req_zone_is_write_locked(rq);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool blk_req_needs_zone_write_lock(struct request *rq)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_req_zone_write_lock(struct request *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_req_zone_write_unlock(struct request *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline bool blk_req_zone_is_write_locked(struct request *rq)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]

Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.
 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer. This includes:
     (*) Block I/O tracing.
     (*) Disk partition code.
     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
     (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
         block layer to do scheduling. Some drivers that use SCSI facilities -
         such as USB storage - end up disabled indirectly from this.
     (*) Various block-based device drivers, such as IDE and the old CDROM
         drivers.
     (*) MTD blockdev handling and FTL.
     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
         taking a leaf out of JFFS2's book.
 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
     however, still used in places, and so is still available.
 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.
 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.
 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:
     (*) Default blockdev file operations (to give error ENODEV on opening).
 (*) Makes some /proc changes:
     (*) /proc/devices does not list any blockdevs.
     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.
 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).
 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 22:45:40 +04:00
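
The sys_quotactl() rule in the commit message above can be read as a small decision function. The following is a user-space sketch of that rule only, not the kernel's code: `quotactl_without_block()`, the `demo_quota_cmd` enum and its values are hypothetical names introduced for illustration, and the real command numbers are defined by the kernel's quota headers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for quota commands; not the kernel's values. */
enum demo_quota_cmd { DEMO_Q_SYNC, DEMO_Q_GETQUOTA };

/*
 * Models the behaviour described for a build without CONFIG_BLOCK:
 * any command other than Q_SYNC, or any call that names a special
 * device, is rejected with -ENODEV.
 */
static int quotactl_without_block(enum demo_quota_cmd cmd, const char *special)
{
	if (cmd != DEMO_Q_SYNC)
		return -ENODEV;	/* only Q_SYNC is allowed */
	if (special != NULL)
		return -ENODEV;	/* no block device can be looked up */
	return 0;		/* the sync itself is not modelled here */
}
```

Under these assumptions, a Q_SYNC call with no device is the only combination that is not rejected.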
#else /* CONFIG_BLOCK */
2014-06-05 03:06:27 +04:00
struct block_device;

2006-09-30 22:45:40 +04:00
/*
 * stubs for when the block layer is configured out
 */
#define buffer_heads_over_limit 0

static inline long nr_blockdev_pages(void)
{
	return 0;
}

2011-03-11 22:17:08 +03:00
struct blk_plug {
};

static inline void blk_start_plug(struct blk_plug *plug)
{
}

2011-03-11 22:17:08 +03:00
static inline void blk_finish_plug(struct blk_plug *plug)
{
}

2011-03-11 22:17:08 +03:00
static inline void blk_flush_plug(struct task_struct *task)
{
}

2011-04-16 15:27:55 +04:00
static inline void blk_schedule_flush_plug(struct task_struct *task)
{
}

2011-03-08 15:19:51 +03:00
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
{
	return false;
}

2014-06-05 03:06:27 +04:00
static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
				     sector_t *error_sector)
{
	return 0;
}

2006-09-30 22:45:40 +04:00
#endif /* CONFIG_BLOCK */

2005-04-17 02:20:36 +04:00
#endif
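
The stubs in the `#else /* CONFIG_BLOCK */` branch above follow a common pattern: the disabled configuration provides empty inline functions with the same signatures, so callers need no `#ifdef` of their own. The sketch below illustrates that pattern in plain user-space C. Every name in it (`DEMO_CONFIG_BLOCK`, `demo_plug`, `demo_start_plug()`, `demo_needs_flush()`, `demo_submit()`) is hypothetical and chosen for the example; it is not the kernel's plugging implementation.

```c
#include <stdbool.h>

/* Build with -DDEMO_CONFIG_BLOCK to select the "enabled" branch. */
struct demo_plug {
#ifdef DEMO_CONFIG_BLOCK
	int pending;	/* hypothetical count of queued requests */
#else
	char unused;	/* ISO C requires a member; the kernel header uses an empty struct */
#endif
};

#ifdef DEMO_CONFIG_BLOCK
static inline void demo_start_plug(struct demo_plug *plug)
{
	plug->pending = 0;
}

static inline bool demo_needs_flush(const struct demo_plug *plug)
{
	return plug->pending > 0;
}
#else
/* Stubs: same signatures, no work, so callers compile unchanged. */
static inline void demo_start_plug(struct demo_plug *plug)
{
	(void)plug;
}

static inline bool demo_needs_flush(const struct demo_plug *plug)
{
	(void)plug;
	return false;
}
#endif

/* The caller is identical in both configurations. */
static inline int demo_submit(void)
{
	struct demo_plug plug;

	demo_start_plug(&plug);
	return demo_needs_flush(&plug) ? 1 : 0;
}
```

Because the stubs are `static inline` and empty, a compiler can typically discard the calls entirely in the disabled configuration, which is one reason this layout is preferred over scattering conditionals through the callers.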