License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                             # files
   ----------------------------------------------------|------
   GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results of that were:
   SPDX license identifier                             # files
   ----------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                             # files
   ----------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                         270
   GPL-2.0+ WITH Linux-syscall-note                        169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
   LGPL-2.1+ WITH Linux-syscall-note                        15
   GPL-1.0+ WITH Linux-syscall-note                         14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
   LGPL-2.0+ WITH Linux-syscall-note                         4
   LGPL-2.1 WITH Linux-syscall-note                          3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
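The commit message above notes that header and source .c files need different comment types for the SPDX tag. The actual script is not shown here, so the following is only a minimal sketch of that one decision, assuming the usual kernel convention of a `//` comment in .c files and a `/* */` comment in headers. The helper names `has_suffix` and `spdx_line` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return nonzero if `name` ends with `suffix`. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Hypothetical helper: write the SPDX line for `filename` into `out`.
 * Returns 0 on success, -1 for a file type this sketch does not handle.
 */
static int spdx_line(const char *filename, const char *license,
		     char *out, size_t len)
{
	if (has_suffix(filename, ".c"))
		snprintf(out, len, "// SPDX-License-Identifier: %s", license);
	else if (has_suffix(filename, ".h"))
		snprintf(out, len, "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;
	return 0;
}
```

For a header such as blk.h this yields the same first line that the file below carries.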
2017-11-01 17:07:57 +03:00
/* SPDX-License-Identifier: GPL-2.0 */
2008-01-29 16:51:59 +03:00
#ifndef BLK_INTERNAL_H
#define BLK_INTERNAL_H

2011-12-14 03:33:37 +04:00
#include <linux/idr.h>
2014-09-25 19:23:47 +04:00
#include <linux/blk-mq.h>
2018-09-25 23:30:08 +03:00
#include <xen/xen.h>
2014-09-25 19:23:47 +04:00
#include "blk-mq.h"
2011-12-14 03:33:37 +04:00

2008-01-29 16:53:40 +03:00
/* Amount of time in which a process may batch requests */
#define BLK_BATCH_TIME	(HZ/50UL)

/* Number of requests a "batching" process may submit */
#define BLK_BATCH_REQ	32

2014-05-14 01:10:52 +04:00
/* Max future timer expiry for timeouts */
#define BLK_MAX_TIMEOUT		(5 * HZ)

2017-02-01 01:53:20 +03:00
#ifdef CONFIG_DEBUG_FS
extern struct dentry *blk_debugfs_root;
#endif

2014-09-25 19:23:43 +04:00
struct blk_flush_queue {
	unsigned int		flush_queue_delayed:1;
	unsigned int		flush_pending_idx:1;
	unsigned int		flush_running_idx:1;
	unsigned long		flush_pending_since;
	struct list_head	flush_queue[2];
	struct list_head	flush_data_in_flight;
	struct request		*flush_rq;
2015-08-09 10:41:51 +03:00

	/*
	 * flush_rq shares tag with this rq, both can't be active
	 * at the same time
	 */
	struct request		*orig_rq;
2014-09-25 19:23:43 +04:00
	spinlock_t		mq_flush_lock;
};

2008-01-29 16:51:59 +03:00
extern struct kmem_cache *blk_requestq_cachep;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
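The commit message above describes per-cpu queues that funnel into some number of hardware submission queues, either 1:1 or N:M. As a rough illustration only, the sketch below uses a simple modulo to pick a hardware queue for a CPU. This is an assumption made for the example, not how blk-mq builds its mapping, and `map_cpu_to_hw_queue` is a hypothetical name.

```c
/*
 * Hypothetical helper: choose a hardware submission queue for the
 * software queue of `cpu`. With as many hardware queues as CPUs this
 * is a 1:1 mapping; with fewer, several CPUs share one hardware queue.
 * `nr_hw_queues` must be non-zero.
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
					unsigned int nr_hw_queues)
{
	return cpu % nr_hw_queues;
}
```

With 8 CPUs and 8 hardware queues each CPU gets its own queue, while with 2 hardware queues CPUs 0, 2, 4, 6 share queue 0 and CPUs 1, 3, 5, 7 share queue 1.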
2013-10-24 12:20:05 +04:00
extern struct kmem_cache *request_cachep;
2008-01-29 16:51:59 +03:00
extern struct kobj_type blk_queue_ktype;
2011-12-14 03:33:37 +04:00
extern struct ida blk_queue_ida;
2008-01-29 16:51:59 +03:00

2018-03-08 04:10:12 +03:00
/*
 * @q->queue_lock is set while a queue is being initialized. Since we know
 * that no other threads access the queue object before @q->queue_lock has
 * been set, it is safe to manipulate queue flags without holding the
 * queue_lock if @q->queue_lock == NULL. See also blk_alloc_queue_node() and
 * blk_init_allocated_queue().
 */
static inline void queue_lockdep_assert_held(struct request_queue *q)
{
	if (q->queue_lock)
		lockdep_assert_held(q->queue_lock);
}

static inline void queue_flag_set_unlocked(unsigned int flag,
					   struct request_queue *q)
{
	if (test_bit(QUEUE_FLAG_INIT_DONE, &q->queue_flags) &&
	    kref_read(&q->kobj.kref))
		lockdep_assert_held(q->queue_lock);
	__set_bit(flag, &q->queue_flags);
}

static inline void queue_flag_clear_unlocked(unsigned int flag,
					     struct request_queue *q)
{
	if (test_bit(QUEUE_FLAG_INIT_DONE, &q->queue_flags) &&
	    kref_read(&q->kobj.kref))
		lockdep_assert_held(q->queue_lock);
	__clear_bit(flag, &q->queue_flags);
}

static inline int queue_flag_test_and_clear(unsigned int flag,
					    struct request_queue *q)
{
	queue_lockdep_assert_held(q);

	if (test_bit(flag, &q->queue_flags)) {
		__clear_bit(flag, &q->queue_flags);
		return 1;
	}

	return 0;
}

static inline int queue_flag_test_and_set(unsigned int flag,
					  struct request_queue *q)
{
	queue_lockdep_assert_held(q);

	if (!test_bit(flag, &q->queue_flags)) {
		__set_bit(flag, &q->queue_flags);
		return 0;
	}

	return 1;
}

static inline void queue_flag_set(unsigned int flag, struct request_queue *q)
{
	queue_lockdep_assert_held(q);
	__set_bit(flag, &q->queue_flags);
}

static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
{
	queue_lockdep_assert_held(q);
	__clear_bit(flag, &q->queue_flags);
}
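
The test-and-set and test-and-clear helpers above return the previous state of the bit, which is easy to misread. The following is a user-space sketch of just that return-value contract, assuming a plain `unsigned long` in place of `q->queue_flags` and leaving out the lock assertions. The names `flag_test_and_set` and `flag_test_and_clear` are hypothetical.

```c
/*
 * Hypothetical user-space stand-in for queue_flag_test_and_set():
 * returns 0 if the bit was clear (and sets it), 1 if it was already set.
 * Not atomic; the kernel helpers rely on the caller holding queue_lock.
 */
static int flag_test_and_set(unsigned int flag, unsigned long *flags)
{
	unsigned long mask = 1UL << flag;

	if (!(*flags & mask)) {
		*flags |= mask;
		return 0;
	}
	return 1;
}

/*
 * Hypothetical stand-in for queue_flag_test_and_clear():
 * returns 1 if the bit was set (and clears it), 0 if it was already clear.
 */
static int flag_test_and_clear(unsigned int flag, unsigned long *flags)
{
	unsigned long mask = 1UL << flag;

	if (*flags & mask) {
		*flags &= ~mask;
		return 1;
	}
	return 0;
}
```

A caller can therefore use a return of 0 from the set variant to mean "this call was the one that made the transition".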

2014-09-25 19:23:43 +04:00
static inline struct blk_flush_queue *blk_get_flush_queue(
2014-09-25 19:23:46 +04:00
		struct request_queue *q, struct blk_mq_ctx *ctx)
2014-09-25 19:23:43 +04:00
{
2016-09-14 17:18:54 +03:00
	if (q->mq_ops)
		return blk_mq_map_queue(q, ctx->cpu)->fq;
	return q->fq;
2014-09-25 19:23:43 +04:00
}

2011-12-14 03:33:38 +04:00
static inline void __blk_get_queue(struct request_queue *q)
{
	kobject_get(&q->kobj);
}

2014-09-25 19:23:47 +04:00
struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
2018-10-12 13:07:26 +03:00
		int node, int cmd_size, gfp_t flags);
2014-09-25 19:23:47 +04:00
void blk_free_flush_queue(struct blk_flush_queue *q);
2014-09-25 19:23:40 +04:00

2012-06-05 07:40:59 +04:00
int blk_init_rl(struct request_list *rl, struct request_queue *q,
		gfp_t gfp_mask);
2017-06-01 00:43:45 +03:00
void blk_exit_rl(struct request_queue *q, struct request_list *rl);
2018-08-09 17:53:37 +03:00
void blk_exit_queue(struct request_queue *q);
2008-01-29 16:53:40 +03:00
void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
			struct bio *bio);
2012-03-06 01:14:58 +04:00
void blk_queue_bypass_start(struct request_queue *q);
void blk_queue_bypass_end(struct request_queue *q);
2008-01-29 16:51:59 +03:00
void __blk_queue_free_tags(struct request_queue *q);
2015-10-21 20:20:12 +03:00
void blk_freeze_queue(struct request_queue *q);

static inline void blk_queue_enter_live(struct request_queue *q)
{
	/*
	 * Given that running in generic_make_request() context
	 * guarantees that a live reference against q_usage_counter has
	 * been established, further references under that same context
	 * need not check that the queue has been frozen (marked dead).
	 */
	percpu_ref_get(&q->q_usage_counter);
}
2008-01-29 16:51:59 +03:00

2018-09-24 10:43:52 +03:00
static inline bool biovec_phys_mergeable(struct request_queue *q,
		struct bio_vec *vec1, struct bio_vec *vec2)
2018-09-24 10:43:50 +03:00
{
2018-09-24 10:43:52 +03:00
	unsigned long mask = queue_segment_boundary(q);
2018-09-24 10:43:53 +03:00
	phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
	phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
2018-09-24 10:43:52 +03:00

	if (addr1 + vec1->bv_len != addr2)
2018-09-24 10:43:50 +03:00
		return false;
2018-09-25 23:30:08 +03:00
	if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2))
2018-09-24 10:43:50 +03:00
		return false;
2018-09-24 10:43:52 +03:00
	if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
		return false;
2018-09-24 10:43:50 +03:00
	return true;
}
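
The two address checks in biovec_phys_mergeable() can be tried out in user space. The sketch below assumes a hypothetical `struct seg` that already carries a physical address, in place of a `struct bio_vec`, and takes the segment boundary mask as a plain argument. The Xen check is left out. It is an illustration of the arithmetic, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio_vec whose physical address is known. */
struct seg {
	uint64_t addr;	/* physical start address */
	uint32_t len;	/* length in bytes */
};

static bool segs_phys_mergeable(uint64_t boundary_mask,
				const struct seg *s1, const struct seg *s2)
{
	/* The second segment must start exactly where the first one ends. */
	if (s1->addr + s1->len != s2->addr)
		return false;
	/*
	 * OR-ing with the mask maps every address to the top of its
	 * boundary window, so the first byte of s1 and the last byte of
	 * s2 compare equal only if both lie in the same window.
	 */
	if ((s1->addr | boundary_mask) !=
	    ((s2->addr + s2->len - 1) | boundary_mask))
		return false;
	return true;
}
```

With a 64 KiB boundary (mask 0xFFFF), segments at 0x1000 and 0x2000 of 0x1000 bytes each can merge, while contiguous segments at 0xF000 and 0x10000 cannot, because the merged range would cross the 0x10000 boundary.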

2018-09-24 10:43:49 +03:00
static inline bool __bvec_gap_to_prev(struct request_queue *q,
		struct bio_vec *bprv, unsigned int offset)
{
	return offset ||
		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}

/*
 * Check if adding a bio_vec after bprv with offset would create a gap in
 * the SG list. Most drivers don't care about this, but some do.
 */
static inline bool bvec_gap_to_prev(struct request_queue *q,
		struct bio_vec *bprv, unsigned int offset)
{
	if (!queue_virt_boundary(q))
		return false;
	return __bvec_gap_to_prev(q, bprv, offset);
}
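
The gap check above reduces to two conditions: the new vector must start at offset 0, and the previous vector must end on the virtual boundary. Here is a small user-space sketch of that rule, assuming the mask and the previous vector's offset and length are passed in directly. The name `gap_to_prev` is hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space version of the check: returns true if placing
 * a vector that starts at `next_offset` after a vector described by
 * `prev_offset`/`prev_len` would leave a gap with respect to
 * `virt_boundary_mask`. A mask of 0 means the device has no such limit.
 */
static bool gap_to_prev(unsigned long virt_boundary_mask,
			unsigned int prev_offset, unsigned int prev_len,
			unsigned int next_offset)
{
	if (!virt_boundary_mask)
		return false;
	return next_offset ||
		((prev_offset + prev_len) & virt_boundary_mask);
}
```

With a 4 KiB boundary (mask 0xFFF), a full 4096-byte vector followed by one at offset 0 has no gap, whereas a 512-byte vector followed by anything does, because it ends in the middle of a page.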

2015-10-21 20:20:23 +03:00
#ifdef CONFIG_BLK_DEV_INTEGRITY
void blk_flush_integrity(void);
2017-07-04 01:58:43 +03:00
bool __bio_integrity_endio(struct bio *);
static inline bool bio_integrity_endio(struct bio *bio)
{
	if (bio_integrity(bio))
		return __bio_integrity_endio(bio);
	return true;
}
2018-09-24 10:43:47 +03:00

static inline bool integrity_req_gap_back_merge(struct request *req,
		struct bio *next)
{
	struct bio_integrity_payload *bip = bio_integrity(req->bio);
	struct bio_integrity_payload *bip_next = bio_integrity(next);

	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
				bip_next->bip_vec[0].bv_offset);
}

static inline bool integrity_req_gap_front_merge(struct request *req,
		struct bio *bio)
{
	struct bio_integrity_payload *bip = bio_integrity(bio);
	struct bio_integrity_payload *bip_next = bio_integrity(req->bio);

	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
				bip_next->bip_vec[0].bv_offset);
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static inline bool integrity_req_gap_back_merge(struct request *req,
		struct bio *next)
{
	return false;
}
static inline bool integrity_req_gap_front_merge(struct request *req,
		struct bio *bio)
{
	return false;
}

2015-10-21 20:20:23 +03:00
static inline void blk_flush_integrity(void)
{
}
2017-07-04 01:58:43 +03:00
static inline bool bio_integrity_endio(struct bio *bio)
{
	return true;
}
2018-09-24 10:43:47 +03:00
#endif /* CONFIG_BLK_DEV_INTEGRITY */
2008-01-29 16:51:59 +03:00

2015-10-30 15:57:30 +03:00
void blk_timeout_work(struct work_struct *work);
2014-05-14 01:10:52 +04:00
unsigned long blk_rq_timeout(unsigned long timeout);
2014-04-24 18:51:47 +04:00
void blk_add_timer(struct request *req);
2008-09-14 16:55:09 +04:00
void blk_delete_timer(struct request *);

2013-10-24 12:20:05 +04:00

bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
			     struct bio *bio);
bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
			    struct bio *bio);
2017-02-08 16:46:49 +03:00
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
		struct bio *bio);
2013-10-24 12:20:05 +04:00
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
2015-05-08 20:51:33 +03:00
			    unsigned int *request_count,
			    struct request **same_queue_rq);
2015-10-20 18:13:51 +03:00
unsigned int blk_plug_queued_count(struct request_queue *q);
2013-10-24 12:20:05 +04:00

void blk_account_io_start(struct request *req, bool new_io);
void blk_account_io_completion(struct request *req, unsigned int bytes);
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 12:08:53 +03:00
void blk_account_io_done(struct request *req, u64 now);
2013-10-24 12:20:05 +04:00
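The 1:1 versus N:M relationship between per-cpu queues and hardware submission queues described above can be pictured with a small sketch. This is not the blk-mq mapping code; `sketch_map_cpu_to_hwq` and its modulo policy are hypothetical and only illustrate how several submitting CPUs can funnel into fewer hardware queues.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of blk-mq: pick a hardware queue for a
 * submitting CPU. Assumes nr_hw_queues >= 1. When nr_hw_queues equals
 * the number of CPUs the mapping is 1:1; with fewer hardware queues
 * several CPUs share one, giving an N:M mapping.
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu,
					  unsigned int nr_hw_queues)
{
	return cpu % nr_hw_queues;
}
```

With 8 hardware queues, CPU 5 gets its own queue 5; with only 2 hardware queues, CPUs 1, 3, 5 and 7 all share queue 1.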
|
|
|
|
2008-09-14 16:55:09 +04:00
|
|
|
/*
|
|
|
|
* EH timer and IO completion will both attempt to 'grab' the request, make
|
2018-01-10 21:34:25 +03:00
|
|
|
* sure that only one of them succeeds. Steal the bottom bit of the
|
|
|
|
* __deadline field for this.
|
2008-09-14 16:55:09 +04:00
|
|
|
*/
|
|
|
|
static inline int blk_mark_rq_complete(struct request *rq)
|
|
|
|
{
|
2018-01-10 21:34:25 +03:00
|
|
|
return test_and_set_bit(0, &rq->__deadline);
|
2008-09-14 16:55:09 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_clear_rq_complete(struct request *rq)
|
|
|
|
{
|
2018-01-10 21:34:25 +03:00
|
|
|
clear_bit(0, &rq->__deadline);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_rq_is_complete(struct request *rq)
|
|
|
|
{
|
|
|
|
return test_bit(0, &rq->__deadline);
|
2008-09-14 16:55:09 +04:00
|
|
|
}
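The "only one of them succeeds" property of the three helpers above comes from an atomic test-and-set on bit 0. A minimal user-space sketch, assuming C11 atomics as a stand-in for the kernel bitops; `struct sketch_rq`, `sketch_grab` and `sketch_release` are illustrative names, not kernel API. Note that `blk_mark_rq_complete()` returns the old bit value, so a caller wins when it returns 0, whereas the sketch returns true for the winner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_rq {
	atomic_ulong deadline;	/* bit 0 doubles as the "complete" marker */
};

/* Returns true only for the caller that flipped bit 0 from 0 to 1. */
static bool sketch_grab(struct sketch_rq *rq)
{
	unsigned long old = atomic_fetch_or(&rq->deadline, 1UL);

	return !(old & 1UL);
}

/* Clears bit 0 so the request can be grabbed again. */
static void sketch_release(struct sketch_rq *rq)
{
	atomic_fetch_and(&rq->deadline, ~1UL);
}
```

If the timeout handler and the completion path both call `sketch_grab()` on the same request, exactly one of them sees true, whichever order they run in.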
|
2008-01-29 16:53:40 +03:00
|
|
|
|
2009-04-23 06:05:18 +04:00
|
|
|
/*
|
|
|
|
* Internal elevator interface
|
|
|
|
*/
|
2016-10-20 16:12:13 +03:00
|
|
|
#define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
|
2009-04-23 06:05:18 +04:00
|
|
|
|
2011-01-25 14:43:54 +03:00
|
|
|
void blk_insert_flush(struct request *rq);
|
2010-09-03 13:56:16 +04:00
|
|
|
|
2009-04-23 06:05:18 +04:00
|
|
|
static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
|
|
|
|
{
|
|
|
|
struct elevator_queue *e = q->elevator;
|
|
|
|
|
2016-12-11 01:13:59 +03:00
|
|
|
if (e->type->ops.sq.elevator_activate_req_fn)
|
|
|
|
e->type->ops.sq.elevator_activate_req_fn(q, rq);
|
2009-04-23 06:05:18 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq)
|
|
|
|
{
|
|
|
|
struct elevator_queue *e = q->elevator;
|
|
|
|
|
2016-12-11 01:13:59 +03:00
|
|
|
if (e->type->ops.sq.elevator_deactivate_req_fn)
|
|
|
|
e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
|
2009-04-23 06:05:18 +04:00
|
|
|
}
|
|
|
|
|
2018-05-31 20:11:38 +03:00
|
|
|
int elevator_init(struct request_queue *);
|
2018-05-31 20:11:40 +03:00
|
|
|
int elevator_init_mq(struct request_queue *q);
|
2018-08-21 10:15:03 +03:00
|
|
|
int elevator_switch_mq(struct request_queue *q,
|
|
|
|
struct elevator_type *new_e);
|
2018-05-31 20:11:37 +03:00
|
|
|
void elevator_exit(struct request_queue *, struct elevator_queue *);
|
2018-01-17 22:48:08 +03:00
|
|
|
int elv_register_queue(struct request_queue *q);
|
|
|
|
void elv_unregister_queue(struct request_queue *q);
|
|
|
|
|
2017-08-23 20:10:30 +03:00
|
|
|
struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
|
|
|
|
|
2008-09-14 16:56:33 +04:00
|
|
|
#ifdef CONFIG_FAIL_IO_TIMEOUT
|
|
|
|
int blk_should_fake_timeout(struct request_queue *);
|
|
|
|
ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
|
|
|
|
ssize_t part_timeout_store(struct device *, struct device_attribute *,
|
|
|
|
const char *, size_t);
|
|
|
|
#else
|
|
|
|
static inline int blk_should_fake_timeout(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-01-29 16:04:06 +03:00
|
|
|
int ll_back_merge_fn(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio);
|
|
|
|
int ll_front_merge_fn(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio);
|
2017-02-02 18:54:40 +03:00
|
|
|
struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
|
|
|
|
struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
|
2011-03-21 12:14:27 +03:00
|
|
|
int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
|
|
|
|
struct request *next);
|
2008-01-29 16:04:06 +03:00
|
|
|
void blk_recalc_rq_segments(struct request *rq);
|
2009-07-03 12:48:17 +04:00
|
|
|
void blk_rq_set_mixed_merge(struct request *rq);
|
2012-02-08 12:19:38 +04:00
|
|
|
bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
|
2017-02-08 16:46:48 +03:00
|
|
|
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
|
2008-01-29 16:04:06 +03:00
|
|
|
|
2008-01-29 16:51:59 +03:00
|
|
|
void blk_queue_congestion_threshold(struct request_queue *q);
|
|
|
|
|
2008-03-04 13:23:45 +03:00
|
|
|
int blk_dev_init(void);
|
|
|
|
|
2010-10-25 00:06:02 +04:00
|
|
|
|
2008-01-29 16:51:59 +03:00
|
|
|
/*
|
|
|
|
* Return the threshold (number of used requests) at which the queue is
|
|
|
|
* considered to be congested. It includes a little hysteresis to keep the
|
|
|
|
* context switch rate down.
|
|
|
|
*/
|
|
|
|
static inline int queue_congestion_on_threshold(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->nr_congestion_on;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The threshold at which a queue is considered to be uncongested
|
|
|
|
*/
|
|
|
|
static inline int queue_congestion_off_threshold(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->nr_congestion_off;
|
|
|
|
}
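The point of having separate on and off thresholds is hysteresis: the queue does not flip between congested and uncongested on every request. A minimal sketch of that idea follows; `struct sketch_queue`, `sketch_update` and the comparison policy are assumptions for illustration, and the threshold values used in the example are made up rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_queue {
	int on_threshold;	/* hypothetical: congested at or above this */
	int off_threshold;	/* hypothetical: uncongested below this */
	bool congested;
};

/* Update the congestion state for the current number of used requests. */
static void sketch_update(struct sketch_queue *q, int used)
{
	if (used >= q->on_threshold)
		q->congested = true;
	else if (used < q->off_threshold)
		q->congested = false;
	/* between the two thresholds the previous state is kept */
}
```

With an on threshold of 113 and an off threshold of 103, a queue that reached 113 used requests stays congested at 110 and only clears once usage drops below 103.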
|
|
|
|
|
2014-05-20 21:49:02 +04:00
|
|
|
extern int blk_update_nr_requests(struct request_queue *, unsigned int);
|
|
|
|
|
2009-04-24 10:10:11 +04:00
|
|
|
/*
|
|
|
|
* Contribute to IO statistics IFF:
|
|
|
|
*
|
|
|
|
* a) it's attached to a gendisk, and
|
|
|
|
* b) the queue had IO stats enabled when this request was started, and
|
2012-09-18 20:19:25 +04:00
|
|
|
* c) it's a file system request
|
2009-04-24 10:10:11 +04:00
|
|
|
*/
|
2018-08-16 17:51:40 +03:00
|
|
|
static inline bool blk_do_io_stat(struct request *rq)
|
2009-02-02 10:42:32 +03:00
|
|
|
{
|
2010-08-07 20:17:56 +04:00
|
|
|
return rq->rq_disk &&
|
2016-10-20 16:12:13 +03:00
|
|
|
(rq->rq_flags & RQF_IO_STAT) &&
|
2017-01-31 18:57:29 +03:00
|
|
|
!blk_rq_is_passthrough(rq);
|
2009-02-02 10:42:32 +03:00
|
|
|
}
|
|
|
|
|
2017-02-08 16:46:47 +03:00
|
|
|
static inline void req_set_nomerge(struct request_queue *q, struct request *req)
|
|
|
|
{
|
|
|
|
req->cmd_flags |= REQ_NOMERGE;
|
|
|
|
if (req == q->last_merge)
|
|
|
|
q->last_merge = NULL;
|
|
|
|
}
|
|
|
|
|
2018-01-10 00:23:42 +03:00
|
|
|
/*
|
|
|
|
* Steal a bit from this field for legacy IO path atomic IO marking. Note that
|
|
|
|
* setting the deadline clears the bottom bit, potentially clearing the
|
|
|
|
* completed bit. The user has to be OK with this (current ones are fine).
|
|
|
|
*/
|
|
|
|
static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
|
|
|
|
{
|
|
|
|
rq->__deadline = time & ~0x1UL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long blk_rq_deadline(struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->__deadline & ~0x1UL;
|
|
|
|
}
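A worked example of the masking above, as a plain user-space sketch: `struct sketch_rq` and the `sketch_*` helpers are illustrative copies of the two accessors, not kernel code. It shows the two consequences the comment mentions: an odd deadline is rounded down by one tick, and setting a deadline wipes a previously set completed bit.

```c
#include <assert.h>

struct sketch_rq {
	unsigned long deadline;	/* bit 0 is reserved for the completed flag */
};

/* Store a deadline; the bottom bit is always cleared. */
static void sketch_set_deadline(struct sketch_rq *rq, unsigned long time)
{
	rq->deadline = time & ~0x1UL;
}

/* Read the deadline back without the flag bit. */
static unsigned long sketch_deadline(const struct sketch_rq *rq)
{
	return rq->deadline & ~0x1UL;
}
```

Storing 1001 reads back as 1000. If bit 0 is then set to mark completion, the deadline still reads 1000, but a later `sketch_set_deadline()` call clears the flag again.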
|
|
|
|
|
2011-12-14 03:33:40 +04:00
|
|
|
/*
|
|
|
|
* Internal io_context interface
|
|
|
|
*/
|
|
|
|
void get_io_context(struct io_context *ioc);
|
2011-12-14 03:33:42 +04:00
|
|
|
struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
|
2012-03-06 01:15:24 +04:00
|
|
|
struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
|
|
|
|
gfp_t gfp_mask);
|
2011-12-14 03:33:42 +04:00
|
|
|
void ioc_clear_queue(struct request_queue *q);
|
2011-12-14 03:33:40 +04:00
|
|
|
|
2012-03-06 01:15:24 +04:00
|
|
|
int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
|
2011-12-14 03:33:40 +04:00
|
|
|
|
2016-12-15 00:23:43 +03:00
|
|
|
/**
|
|
|
|
* rq_ioc - determine io_context for request allocation
|
|
|
|
* @bio: request being allocated is for this bio (can be %NULL)
|
|
|
|
*
|
|
|
|
* Determine io_context to use for request allocation for @bio. May return
|
|
|
|
* %NULL if %current->io_context doesn't exist.
|
|
|
|
*/
|
|
|
|
static inline struct io_context *rq_ioc(struct bio *bio)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
|
|
if (bio && bio->bi_ioc)
|
|
|
|
return bio->bi_ioc;
|
|
|
|
#endif
|
|
|
|
return current->io_context;
|
|
|
|
}
|
|
|
|
|
2011-12-14 03:33:40 +04:00
|
|
|
/**
|
|
|
|
* create_io_context - try to create task->io_context
|
|
|
|
* @gfp_mask: allocation mask
|
|
|
|
* @node: allocation node
|
|
|
|
*
|
2012-03-06 01:15:24 +04:00
|
|
|
* If %current->io_context is %NULL, allocate a new io_context and install
|
|
|
|
* it. Returns the current %current->io_context which may be %NULL if
|
|
|
|
* allocation failed.
|
2011-12-14 03:33:40 +04:00
|
|
|
*
|
|
|
|
* Note that this function can't be called with IRQ disabled because
|
2012-03-06 01:15:24 +04:00
|
|
|
* task_lock which protects %current->io_context is IRQ-unsafe.
|
2011-12-14 03:33:40 +04:00
|
|
|
*/
|
2012-03-06 01:15:24 +04:00
|
|
|
static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
|
2011-12-14 03:33:40 +04:00
|
|
|
{
|
|
|
|
WARN_ON_ONCE(irqs_disabled());
|
2012-03-06 01:15:24 +04:00
|
|
|
if (unlikely(!current->io_context))
|
|
|
|
create_task_io_context(current, gfp_mask, node);
|
|
|
|
return current->io_context;
|
2011-12-14 03:33:40 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Internal throttling interface
|
|
|
|
*/
|
2011-10-19 16:31:18 +04:00
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and frees td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 16:42:16 +04:00
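The shutdown versus release distinction in the commit message above can be sketched as a tiny state model. This is an assumption-laden illustration, not the block layer implementation: `struct sketch_queue` and the `sketch_*` functions are hypothetical, there is no locking or draining, and -1 stands in for whatever error a dead queue would return.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_queue {
	int refcount;	/* references held by owner and other users */
	bool dead;	/* set by shutdown */
	bool freed;	/* set when the last reference is dropped */
};

/* Shutdown: mark the queue dead; memory stays valid for ref holders. */
static void sketch_shutdown(struct sketch_queue *q)
{
	q->dead = true;
}

/* Submission after shutdown is still allowed but fails immediately. */
static int sketch_submit(const struct sketch_queue *q)
{
	return q->dead ? -1 : 0;
}

/* Release: the rest of the teardown happens on the final put. */
static void sketch_put(struct sketch_queue *q)
{
	if (--q->refcount == 0)
		q->freed = true;
}
```

A user holding a reference after the owner has shut the queue down gets a clean failure from `sketch_submit()` instead of touching freed state; the queue is only marked freed once every holder has called `sketch_put()`.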
|
|
|
extern void blk_throtl_drain(struct request_queue *q);
|
2011-10-19 16:31:18 +04:00
|
|
|
extern int blk_throtl_init(struct request_queue *q);
|
|
|
|
extern void blk_throtl_exit(struct request_queue *q);
|
2017-03-27 20:51:38 +03:00
|
|
|
extern void blk_throtl_register_queue(struct request_queue *q);
|
2011-10-19 16:31:18 +04:00
|
|
|
#else /* CONFIG_BLK_DEV_THROTTLING */
|
2011-10-19 16:42:16 +04:00
|
|
|
static inline void blk_throtl_drain(struct request_queue *q) { }
|
2011-10-19 16:31:18 +04:00
|
|
|
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
|
|
|
|
static inline void blk_throtl_exit(struct request_queue *q) { }
|
2017-03-27 20:51:38 +03:00
|
|
|
static inline void blk_throtl_register_queue(struct request_queue *q) { }
|
2011-10-19 16:31:18 +04:00
|
|
|
#endif /* CONFIG_BLK_DEV_THROTTLING */
|
2017-03-27 20:51:37 +03:00
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
|
|
|
|
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
|
|
|
|
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
|
|
|
|
const char *page, size_t count);
|
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade as soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truly idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-27 20:51:41 +03:00
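The 'ns >> 10' conversion and the think time check described above can be tried out in isolation. The following is a sketch only: `sketch_ns_to_us` and `sketch_is_idle` are hypothetical names, and the threshold is passed in by the caller rather than read from any blk-throttle structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Approximate ns -> us: divides by 1024 instead of 1000. */
static unsigned long sketch_ns_to_us(uint64_t ns)
{
	return (unsigned long)(ns >> 10);
}

/* A cgroup is treated as idle when its think time exceeds the threshold. */
static bool sketch_is_idle(unsigned long think_time_us,
			   unsigned long threshold_us)
{
	return think_time_us > threshold_us;
}
```

One millisecond (1,000,000 ns) converts to 976 rather than 1000, an error of about 2.4%, which is the precision loss the commit message accepts in exchange for a cheap shift.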
|
|
|
extern void blk_throtl_bio_endio(struct bio *bio);
|
blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For an SSD, the IO latency depends heavily on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 01:19:42 +03:00
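The bucketing and threshold arithmetic described above can be written out as a short sketch. It assumes 512-byte sectors, so 4k is 8 sectors and the 1M cap is 2048 sectors (order 11); `SKETCH_MAX_ORDER`, `sketch_bucket` and `sketch_threshold_us` are hypothetical names, not the blk-throttle implementation.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 11U	/* assumed: 1M / 512-byte sectors = 2^11 */

/*
 * Order base 2 of the IO size in sectors, rounded up, capped so that
 * anything larger than 1M is accounted as 1M.
 */
static unsigned int sketch_bucket(unsigned int sectors)
{
	unsigned int order = 0;

	while (order < SKETCH_MAX_ORDER && (1U << order) < sectors)
		order++;
	return order;
}

/* Latency threshold = sampled base latency + user-configured target. */
static unsigned int sketch_threshold_us(unsigned int base_us,
					unsigned int target_us)
{
	return base_us + target_us;
}
```

A 4k request (8 sectors) lands in bucket 3, a 9-sector request rounds up to bucket 4, and a 2M request (4096 sectors) is capped at bucket 11. With the numbers from the commit message, `sketch_threshold_us(80, 60)` gives 140.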
|
|
|
extern void blk_throtl_stat_add(struct request *rq, u64 time);
|
2017-03-27 20:51:41 +03:00
|
|
|
#else
|
|
|
|
static inline void blk_throtl_bio_endio(struct bio *bio) { }
|
2017-03-28 01:19:42 +03:00
|
|
|
static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
|
2017-03-27 20:51:37 +03:00
|
|
|
#endif
|
2011-10-19 16:31:18 +04:00
|
|
|
|
2017-06-19 10:26:21 +03:00
|
|
|
#ifdef CONFIG_BOUNCE
|
|
|
|
extern int init_emergency_isa_pool(void);
|
|
|
|
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
|
|
|
|
#else
|
|
|
|
static inline int init_emergency_isa_pool(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BOUNCE */
|
|
|
|
|
2017-11-30 02:56:35 +03:00
|
|
|
extern void blk_drain_queue(struct request_queue *q);
|
|
|
|
|
2018-07-03 18:15:01 +03:00
|
|
|
#ifdef CONFIG_BLK_CGROUP_IOLATENCY
|
|
|
|
extern int blk_iolatency_init(struct request_queue *q);
|
|
|
|
#else
|
|
|
|
static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
|
|
|
|
#endif
|
|
|
|
|
2018-10-12 13:08:47 +03:00
|
|
|
struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
|
|
|
|
|
2011-10-19 16:31:18 +04:00
|
|
|
#endif /* BLK_INTERNAL_H */
|