dun_bytes needs to be less than or equal to the IV size of the
encryption mode, not just less than or equal to BLK_CRYPTO_MAX_IV_SIZE.
Currently this doesn't matter since blk_crypto_init_key() is never
actually passed invalid values, but we might as well fix this.
Fixes: a892c8d52c ("block: Inline encryption support for blk-mq")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210825055918.51975-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer may call the I/O scheduler .finish_request() callback
without having called the .insert_requests() callback. Make sure that the
mq-deadline I/O statistics are correct if the block layer inserts an I/O
request that bypasses the I/O scheduler. This patch prevents that lower
priority I/O is delayed longer than necessary for mixed I/O priority
workloads.
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Fixes: 08a9ad8bf6 ("block/mq-deadline: Add cgroup support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Tested-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A user space process should not need the CAP_SYS_ADMIN capability set
in order to perform a BLKREPORTZONE ioctl.
Getting the zone report is required in order to get the write pointer.
Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable
that a user space process that can read/write from/to the device, also
can get the write pointer. (Since e.g. writes have to be at the write
pointer.)
Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-3-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE
and BLKFINISHZONE) should be allowed under the same permissions as write().
(write() does not require CAP_SYS_ADMIN).
Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if
the fd was successfully opened with FMODE_WRITE.
(They do not require CAP_SYS_ADMIN).
Currently, zone management send operations require both CAP_SYS_ADMIN
and that the fd was successfully opened with FMODE_WRITE.
Remove the CAP_SYS_ADMIN requirement, so that zone management send
operations match the access control requirement of write(), BLKSECDISCARD
and BLKZEROOUT.
Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-2-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hidden gendisks will never be marked live.
Fixes: 40b3a52ffc ("block: add a sanity check for a live disk in del_gendisk")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824144310.1487816-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__bio_iov_append_get_pages() doesn't put not appended pages on
bio_add_hw_page() failure, so potentially leaking them, fix it. Also, do
the same for __bio_iov_iter_get_pages(), even though it looks like it
can't be triggered by userspace in this case.
Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1edfa6a2ffd66d55e6345a477df5387d2c1415d0.1626653825.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This might have been a neat debug aid when the extended dev_t was
added, but that time is long gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_alloc_ext_minor already returns just a minor number, so no need to
mask the high bits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're missing a description for the 'nr_vecs' parameter. While in there,
clarify that freeing a bio allocated through this function must be done
from process context.
Fixes: 1cbbd31c4ada ("bio: add allocation cache abstraction")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE,
as non-polled IO may complete through hard/soft IRQ and hence isn't
safe for our polled bio alloc cache.
Provide a helper that does just that, and use it in the merging code as
well if we split a bio and turn off polling.
Fixes: be863b9e43 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio alloc cache relies on the fact that a polled bio will complete
in process context, clear the cacheable flag if we disable polling
for a given bio.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a per-cpu bio_set cache for bio allocations, enabling us to quickly
recycle them instead of going through the slab allocator. This cache
isn't IRQ safe, and hence is only really suitable for polled IO.
Very simple - keeps a count of bio's in the cache, and maintains a max
of 512 with a slack of 64. If we get above max + slack, we drop slack
number of bio's.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The memset() used is measurably slower in targeted benchmarks, wasting
about 1% of the total runtime, or 50% of the (later) hot path cached
bio alloc. Get rid of it and fill in the bio manually.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Properly unwind on errors in device_add_disk. This is the initial work
as drivers are not converted yet, which will follow in separate patches.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: major rebase. All bugs are probably mine]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for proper error handling in add_disk.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for proper error handling in add_disk.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that all the sysfs bits are set up before bdev_add is called,
as that will make the upcomding error handling much easier. However
this means the call to disk_update_readahead has to be split as that
requires a bdi. Also remove various sanity checks that don't make
sense now that blk_register_queue only has a single caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210818144542.19305-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Doing all the sysfs file creation before adding the bdev and thus
allowing it to be opened will simplify the about to be added error
handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once bdev_add is called userspace can open the block device. Ensure
that the struct device, which is used for refcounting of the disk
besides various other things, is fully setup at that point.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no real reason these should be separate. Also simplify the
groups assignment a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210818144542.19305-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a sanity check to del_gendisk to do nothing when the disk wasn't
successfully added. This papers over the complete lack of add_disk
error handling, which is about to get fixed gradually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.
Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acquire the queue ref dropped in disk_release in __blk_alloc_disk so any
allocate gendisk always has a queue reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass in a request_queue and assign disk->queue in __blk_alloc_disk to
ensure struct gendisk always has a valid ->queue pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This was a leftover from the legacy alloc_disk interface. Switch
the scsi ULPs and dasd to set ->minors directly like all other
drivers and remove the argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the lockdep name to the low-level __blk_alloc_disk helper and
hardcode the name for it given that the number of minors or node_id
are not very useful information. While this passes a pointless
argument for non-lockdep builds that is not really an issue as
disk allocation is a probe time only slow path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function bio_trim has offset and size arguments that are declared
as int.
The callers of this function use sector_t type when passing the offset
and size, e.g. drivers/md/raid1.c:narrow_write_error() and
drivers/md/raid1.c:narrow_write_error().
Change offset and size arguments to sector_t type for bio_trim(). Also,
add WARN_ON_ONCE() to catch their overflow.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEgaYcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppcLD/963Ld4YWLi1Chq6e2iqnUmuxPIgQbWCCD+
RKNf3PRLwjVLsBKLjKHhp9fB9XhpmlXkxkLcYech9D2lFpnVf1tc0ziGCIpsNuGG
X2UO8jfE6/XJ+laCyjTkoMlj2zWBJXSwwKx6JyDDOobYLiuVDUHeAcGVvLY/itvx
tEtd+lXmz7cE41Q4cdoeJdmSOE54BAP5uCO66La5bv2r7xQN/nPWi+yg5UTV3GJB
JuL+8RHyV3d4eiBF9Jg0izdp9vaUxUD3VmOjILmaG2wQy+Pbve9mMCZtTFMvSBcR
Vw9B/fbNVon7YqOsrSCdIsfW066MqnIj55nRRETN6LxTGuzx6lQpJPSRXSDGKkR5
SSckLXPKUcRPaX4Lc/SvgQpzvxhY3b9z3BRrIlxy8DWcZT7qq/bb41O9J4z6+jUn
XIjKzvADLGqUqS/5zowyk/3vFHGnyhjYsRqMmpLCbjjxi5fSBbR+yorm5Vlx8auJ
7iWHuNCGyUY/rMB1pibYhvT1dnNR6qOm/jTdHwjsb/QPDuCoU06TFnXbuSoefJlf
ijfuwKQLgxkikICLHQ0uUHSuGhz1A8CwAZjz4rmTBSiFgQeM09v/pf4r+ymxrSgU
n+yb4DAECIsyK8he3ePahFeJgsb0JMmz3ciSJJkK3im69jdLd28Xp4Vs0tgKmz8e
2hMHgkXFSQ==
=DGKg
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three fixes from Ming Lei that should go into 5.14:
- Fix for a kernel panic when iterating over tags for some cases
where a flush request is present, a regression in this cycle.
- Request timeout fix
- Fix flush request checking"
* tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
blk-mq: fix is_flush_rq
blk-mq: fix kernel panic during iterating over flush request
blk-mq: don't grab rq's refcount in blk_mq_check_expired()
This essentially reverts "block: remove the extra kobject reference in
bd_link_disk_holder". That commit dropped the extra reference because
the condition in the comment can't be true. But it turns out that
comment did not actually describe the problematic situation, so add
back the extra reference and document it properly.
Fixes: fbd9a39542 ("block: remove the extra kobject reference in bd_link_disk_holder")
Reported-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The default IO priority is the best effort (BE) class with the
normal priority level IOPRIO_NORM (4). However, get_task_ioprio()
returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and
get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent
with the defined default and have both of these functions return the
default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when
the user did not define another default IO priority for the task.
In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as
an alias to IOPRIO_NORM to clarify that this default level applies to
the BE priotity class. In include/linux/ioprio.h, define the macro
IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
and use this new macro when setting a priority to the default.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com
[axboe: drop unnecessary lightnvm change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.
Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a request that has a priority level equal to or larger than
IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but
defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This
is not consistent with the warning and the allowed values for priority
levels. Fix this by setting the request new_ioprio field to
IOPRIO_BE_NR - 1, the lowest priority level allowed.
Cc: <stable@vger.kernel.org>
Fixes: aee69d78de ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the
following check:
hctx->fq->flush_rq == req
but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:
1) memory re-order in blk_mq_rq_ctx_init():
rq->mq_hctx = data->hctx;
...
refcount_set(&rq->ref, 1);
OR
2) tag re-use and ->rqs[] isn't updated with new request.
Fix the issue by re-writing is_flush_rq() as:
return rq->end_io == flush_end_io;
which turns out simpler to follow and immune to data race since we have
ordered WRITE rq->end_io and refcount_set(&rq->ref, 1).
Fixes: 2e315dc07d ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Cc: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For fixing use-after-free during iterating over requests, we grabbed
request's refcount before calling ->fn in commit 2e315dc07d ("blk-mq:
grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter").
Turns out this way may cause kernel panic when iterating over one flush
request:
1) old flush request's tag is just released, and this tag is reused by
one new request, but ->rqs[] isn't updated yet
2) the flush request can be re-used for submitting one new flush command,
so blk_rq_init() is called at the same time
3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request
is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called,
flush_rq->end_io may not be updated yet, so NULL pointer dereference
is triggered in blk_mq_put_rq_ref().
Fix the issue by calling refcount_set(&flush_rq->ref, 1) after
flush_rq->end_io is set. So far the only other caller of blk_rq_init() is
scsi_ioctl_reset() in which the request doesn't enter block IO stack and
the request reference count isn't used, so the change is safe.
Fixes: 2e315dc07d ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inside blk_mq_queue_tag_busy_iter() we already grabbed request's
refcount before calling ->fn(), so needn't to grab it one more time
in blk_mq_check_expired().
Meantime remove extra request expire check in blk_mq_check_expired().
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
seq_get_buf is a crutch that undoes all the memory safety of the
seq_file interface. Use the normal seq_printf interfaces instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out a helper to deal with a single blkcg_gq to make the code a
little bit easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210810152623.1796144-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_virt helper to clean up the bio integrity processing a
little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@kernel.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210804095634.460779-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
inode_detach_wb references the "main" bdi of the inode. With the
recent change to move the bdi from the request_queue to the gendisk
this causes a guaranteed use after free when using certain cgroup
configurations. The big itself is older through as any non-default
inode reference (e.g. an open file descriptor) could have injected
this use after free even before that.
Fixes: 52ebea749a ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: syzbot <syzbot+1fb38bb7d3ce0fa3e1c4@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The dev_t is used as the inode hash, so we should only released it
once then block device inode is gone from the inode cache. Move it
to bdev_free_inode to ensure that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After patch 54efd50 (block: make generic_make_request handle
arbitrarily sized bios), the IO through io-throttle may be larger,
and these IOs may be further split into more small IOs. However,
IOPS throttle does not seem to be aware of this change, which
makes the calculation of IOPS of large IOs incomplete, resulting
in disk-side IOPS that does not meet expectations. Maybe we should
fix this problem.
We can reproduce it by set max_sectors_kb of disk to 128, set
blkio.write_iops_throttle to 100, run a dd instance inside blkio
and use iostat to watch IOPS:
dd if=/dev/zero of=/dev/sdb bs=1M count=1000 oflag=direct
As a result, without this change the average IOPS is 1995, with
this change the IOPS is 98.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/65869aaad05475797d63b4c3fed4f529febe3c26.1627876014.git.brookxu@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEWwSAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprd0D/90ziZdNQdPtU+bTYX9mRnLb6zEB+qyCvFP
w0JFk17WoLcFwUm3gTaBPhztWjh9v1O9iInpl+QkzffvBASv3ysjOx0ioHsYjpRq
CTupoH8dPyor8cWagTsX9i4ZoeSlo5x49uUZPiEskVq3ioy0HF5SbASzQrTs/SIZ
6uwI/JTkF/xOHPyvpOk+U7QffcC10mcflfwX3vHFL4FM5DWxWhNWy0Y/FSWfQlWz
HviWwGjX1uqsoggFIfgUXy32E3oJM6FNVSNcP60dF+wpPtQ4ufz6hRf/epwKjOKm
B8rQotlEhD9EHY37u8aCkdnDwK+ILRnl4VKw4zWYSsjwkrpfvAd78XaPTYUYoMEb
IulTtSokENK1BNw4XLTyh9KzrxU90Z9lip6Khv4s+cg3Xs3MUC75M8TO/UH1mx4H
Fsg7c86bZSwEcGj9McRhFAg0PRcTWsjsIFmI43WME3w6FkVCxB7lmSR55nmNCTXC
ZkaLXBKtaaJfu8ehMyKDz6xK39GKW1GZIlYD3+MMKep4gJb3UB4Oin8XYoeLQquU
28fQfk2zsmdPaJAnBK4s3Jlq5cQZ7I+oMHlQ8cGkNtJmFHs5mpFVdMa4gLt9YpMX
8UZrrweQOEFWgfxDIOlPbYN2vjXaCRYWgBmKPa7QI9awNJHTsOH09YzBhjWBauIb
8RMC1Ur7Mg==
=LfC0
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes for block that should go into 5.14:
- Revert the mq-deadline cgroup addition. More work is needed on this
front, let's revert it for now and get it right before having it in
a released kernel (Tejun)
- blk-iocost lockdep fix (Ming)
- nbd double completion fix (Xie)
- Fix for non-idling when clearing the shared tag flag (Yu)"
* tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
nbd: Aovid double completion of a request
blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
Revert "block/mq-deadline: Add cgroup support"
blk-iocost: fix lockdep warning on blkcg->lock
We run a test that delete and recover devcies frequently(two devices on
the same host), and we found that 'active_queues' is super big after a
period of time.
If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that are using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.
Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdev_resize_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdev_del_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdev_add_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Partition scanning only happens on the whole device, so pass a
struct gendisk instead of the whole device block_device to the scanners.
This allows to simplify printing the device name in various places as the
disk name is available in disk->name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20210810154512.1809898-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 08a9ad8bf6 ("block/mq-deadline: Add cgroup support")
and a follow-up commit c06bc5a3fb ("block/mq-deadline: Remove a
WARN_ON_ONCE() call"). The added cgroup support has the following issues:
* It breaks cgroup interface file format rule by adding custom elements to a
nested key-value file.
* It registers mq-deadline as a cgroup-aware policy even though all it's
doing is collecting per-cgroup stats. Even if we need these stats, this
isn't the right way to add them.
* It hasn't been reviewed from cgroup side.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.
Replace the test with a static key to avoid the potential cache miss.
[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tanner Love <tannerlove@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com
blkcg->lock depends on q->queue_lock which may depend on another driver
lock required in irq context, one example is dm-thin:
Chain exists of:
&pool->lock#3 --> &q->queue_lock --> &blkcg->lock
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&blkcg->lock);
local_irq_disable();
lock(&pool->lock#3);
lock(&q->queue_lock);
<Interrupt>
lock(&pool->lock#3);
Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write().
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup fix from Tejun Heo:
"One commit to fix a possible A-A deadlock around u64_stats_sync on
32bit machines caused by updating it without disabling IRQ when it may
be read from IRQ context"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.
Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb209 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just retrieve the bdi from the disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
.. and rename the function to disk_update_readahead. This is in
preparation for moving the BDI from the request_queue to the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't leak the detaіls of the timer into the block layer, instead
initialize the timer in bdi_alloc and delete it in bdi_unregister.
Note that this means the timer is initialized (but not armed) for
non-block queues as well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that device mapper has been changed to register the disk once
it is fully ready all this code is unused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
device mapper needs to register holders before it is ready to do I/O.
Currently it does so by registering the disk early, which can leave
the disk and queue in a weird half state where the queue is registered
with the disk, except for sysfs and the elevator. And this state has
been a bit promlematic before, and will get more so when sorting out
the responsibilities between the queue and the disk.
Support registering holders on an initialized but not registered disk
instead by delaying the sysfs registration until the disk is registered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Invert they way the holder relations are tracked. This very
slightly reduces the memory overhead for partitioned devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 0d02129e76 ("block: merge struct block_device and struct
hd_struct") there is no way for the bdev to go away as long as there is
a holder, so remove the extra references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the block holder code into a separate file as it is not in any way
related to the other block_dev.c code, and add a new selectable config
option for it so that we don't have to build it without any remapped
drivers selected.
The Kconfig symbol contains a _DEPRECATED suffix to match the comments
added in commit 49731baa41
("block: restore multiple bd_link_disk_holder() support").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kyber ioscheduler calls trace_block_rq_insert() *after* the request
is added to the queue but the documentation for trace_block_rq_insert()
says that the call should be made *before* the request is added to the
queue. Move the tracepoint for the kyber ioscheduler so that it is
consistent with the documentation.
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Link: https://lore.kernel.org/r/20210804194913.10497-1-vincent.fu@samsung.com
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across
multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.
This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the following kernel-doc warning that appears when building with W=1:
block/partitions/ldm.c:31: warning: expecting prototype for ldm().
Prototype was for ldm_debug() instead
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805173447.3249906-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If queue is dying while iolatency_set_limit() is in progress,
blk_get_queue() won't increment the refcount of the queue. However,
blk_put_queue() will still decrement the refcount later, which will
cause the refcout to be unbalanced.
Thus error out in such case to fix the problem.
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
define any attribute since the "mq" sysfs directory contains only
sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
unused and unnecessary. Remove all this unused code.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor disk_check_events() and move some code into disk_event_uevent().
Then add disk_force_media_change(), a helper which will be used by
devices to force issuing a DISK_EVENT_MEDIA_CHANGE event.
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-6-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.
$ grep . /sys/class/block/*/diskseq
/sys/class/block/loop0/diskseq:13
/sys/class/block/loop1/diskseq:14
/sys/class/block/loop2/diskseq:5
/sys/class/block/loop3/diskseq:6
/sys/class/block/ram0/diskseq:1
/sys/class/block/ram1/diskseq:2
/sys/class/block/vda/diskseq:7
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media change,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cmdline-parser.c is only used by the cmdline faux partition format,
so merge the code into that and avoid an indirect call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210728053756.409654-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the disk_name function now that all users are gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_name for partition 0 just copies out the disk_name field. Replace
the call to disk_name with a %s format specifier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Printk ->disk_name directly for the disk and use the %pg format specifier
for the block device, which is equivalent to a bdevname call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I have compiled the kernel with a cross compiler "hppa-linux-gnu-" v9.3.0
on x86-64 host machine. I got the following warning:
block/genhd.c: In function ‘diskstats_show’:
block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger
than 1280 bytes [-Wframe-larger-than=]
1227 | }
By Reduced the stack footprint by using the %pg printk specifier instead
of disk_name to remove the need for the on-stack buffer.
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we've stopped using inode references for anything meaninful
in the block layer get rid of the helper to put it and just open code
the call to iput on the block_device inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of acquiring an inode reference on open make sure partitions
always hold device model references to the disk while alive, and switch
open to grab only a device model reference to the opened block device.
If that is a partition the disk reference is transitively held by the
partition already.
Link: https://lore.kernel.org/r/20210722075402.983367-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the allocation of bd_meta_info after initializing the struct device
to avoid the special bdput error handling path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unhash the whole device inode early in del_gendisk. This allows to
remove the first GENHD_FL_UP check in the open path as we simply
won't find a just removed inode. The second non-racy check after
taking open_mutex is still kept.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210722075402.983367-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a lockdep assert instead of the outdated locking comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using local kmaps slightly reduces the chances to stray writes, and
the bvec interface cleans up the code a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using local kmaps slightly reduces the chances to stray writes, and
the bvec interface cleans up the code a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rewrite the actual bounce buffering loop in __blk_queue_bounce to that
the memcpy_to_bvec helper can be used to perform the data copies.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use memcpy_from_bvec instead of open coding the logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use memcpy_to_bvec instead of opencoding the logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helpers instead of open coding the copy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use memzero_bvec to zero each segment in the bio instead of manually
mapping and zeroing the data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210727055646.118787-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set ret to 0 after the initial permission checks to avoid leaking -EPERM
for commands without data transfer.
Link: https://lore.kernel.org/r/20210731074027.1185545-3-hch@lst.de
Fixes: 75ca56409e ("scsi: bsg: Move the whole request execution into the SCSI/transport handlers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Remove the amount of indirect calls by making the handler responsible for
the entire execution of the request.
Link: https://lore.kernel.org/r/20210729064845.1044147-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Move the sg_timeout and sg_reserved_size fields into the bsg_device and
scsi_device structures as they have nothing to do with generic block I/O.
Note that these values are now separate for bsg vs. SCSI device node
access, but that just matches how /dev/sg vs the other nodes has always
behaved.
Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Use the per-device cdev_device_interface to store the bsg data in the char
device inode, and thus remove the need to embedd the bsg_class_device
structure in the request_queue.
Link: https://lore.kernel.org/r/20210729064845.1044147-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEEEz8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpglXD/9CGREpOf1W5oqOScpTygjehwrRnAYisQv6
Oca/qGHBa61BTN3taAJc4NMwl+IwFBER2kdTcOyz8hNmyAUPyRmFND0mG2vGTzQA
P9+ekiRKCJ1aRLsnyBL0JBbmvdoPMBHz39P165vMWMrVmnlpcPKoYDS0itHtYYNP
VD5Y3A9ACGMDglipDmL+3tsXQo/AoJqRO8WGMUBY2qJ0lasYuCbPpzq0kHzXi6kE
0X64bg6JOZVd3wdyWywKahW3ntsVNLswRUBzLVrnjwE29UuBGWgF+/vwyW/Ob0yS
ojafKvehCYnV8Q7IatASOtbwGLvLKgpJZXf7VUEsYnSD6SnmoZctjMjRdyLhNWut
lD86Y+eWjQM0pUsOVPykfrV2hd9CrhjyRFskcbI0SJRlMOl0Lstl/X17efDWcDmz
1/V8ub3gKA3HF2Gc/QKhPJDClxM7SaWnsAO3Rk+qJ6bT4EiiRg2GewI1C7YNpmGW
ty1fqcQE36JtSWadH4KL/evmX258ROfn3QT1nut2jpNsd1RQ+hHBcjcfeOx6n1GX
ALxT8LnmlVYbAUwQvXJcqFcft8K3JoB5ZXT74lat/CAbIKhfEUeSUiqnQcQ8kJLW
MTKviuZ9eJHO6/E7vw08ARDR0PmpSFqvc6rK9DiIM/kmVDz8OdLMovTqzX/hIzUT
7IfyHzQbwg==
=5FG2
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- gendisk freeing fix (Christoph)
- blk-iocost wake ordering fix (Tejun)
- tag allocation error handling fix (John)
- loop locking fix. While this isn't the prettiest fix in the world,
nobody has any good alternatives for 5.14. Something to likely
revisit for 5.15. (Tetsuo)
* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
block: delay freeing the gendisk
blk-iocost: fix operation ordering in iocg_wake_fn()
blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
loop: reintroduce global lock for safe loop_validate_file() traversal