When multiple iovecs reference the same page, each get_user_page call
will add a reference to the page. But once we've created the bio that
information gets lost and only a single reference will be dropped after
I/O completion. Use the same_page information returned from
__bio_try_merge_page to drop additional references to pages that were
already present in the bio.
Based on a patch from Ming Lei.
Link: https://lkml.org/lkml/2019/4/23/64
Fixes: 576ed913 ("block: use bio_add_page in bio_iov_iter_get_pages")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently have an input same_page parameter to __bio_try_merge_page
to prohibit merging in the same page. The rationale for that is that
some callers need to account for every page added to a bio. Instead of
letting these callers call twice into the merge code to account for the
new vs existing page cases, just turn the paramter into an output one that
returns if a merge in the same page occured and let them act accordingly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All these files have some form of the usual GPLv2 boilerplate. Switch
them to use SPDX tags instead.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Share the bi_size update by moving the done label up, and duplicate
the bv_len update in the two callers to get rid of the bvec_merge
label.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We are never called with file system pages by defintions for the
passthrough interface, and we also never undo any addition later
these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The same page optimization is a rather odd corner case, which is not
used outside bio.c and which really should not be used outside of bio.c
either - we have better highlevel helpers like the rq/bio mapping
helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The refcount has been increased for pages retrieved from non-bvec iov iter
via __bio_iov_iter_get_pages(), so don't need to do that again.
Otherwise, IO pages are leaked easily.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Fixes: 7321ecbfc7 ("block: change how we get page references in bio_iov_iter_get_pages")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_add_page() and __bio_add_page() are capable of adding pages into
bio, and now we have at least two such usages alreay:
- __bio_iov_bvec_add_pages()
- nvmet_bdev_execute_rw().
So update comments on these two helpers.
The thing is a bit special for __bio_try_merge_page(), given the caller
needs to know if the new added page is same with the last added page,
then it isn't safe to pass multi-page in case that 'same_page' is true,
so adds warning on potential misuse, and updates comment on
__bio_try_merge_page().
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAly8rGYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGmZMH/1IRB0E1Qmzz8yzw
wj79UuRGYPqxDDSWW+wNc8sU4Ic7iYirn9APHAztCdQqsjmzU/OVLfSa3JhdBe5w
THo7pbGKBqEDcWnKfNk/21jXFNLZ1vr9BoQv2DGU2MMhHAyo/NZbalo2YVtpQPmM
OCRth5n+LzvH7rGrX7RYgWu24G9l3NMfgtaDAXBNXesCGFAjVRrdkU5CBAaabvtU
4GWh/nnutndOOLdByL3x+VZ3H3fIBnbNjcIGCglvvqzk7h3hrfGEl4UCULldTxcM
IFsfMUhSw1ENy7F6DHGbKIG90cdCJcrQ8J/ziEzjj/KLGALluutfFhVvr6YCM2J6
2RgU8CY=
=CfY1
-----END PGP SIGNATURE-----
Merge tag 'v5.1-rc6' into for-5.2/block
Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
comment, and is trivial. The other one is a conflict due to a later fix
in the bio multi-page work, and needs a bit more care.
* tag 'v5.1-rc6': (770 commits)
Linux 5.1-rc6
block: make sure that bvec length can't be overflow
block: kill all_q_node in request_queue
x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
mm/kmemleak.c: fix unused-function warning
init: initialize jump labels before command line option parsing
kernel/watchdog_hld.c: hard lockup message should end with a newline
kcov: improve CONFIG_ARCH_HAS_KCOV help text
mm: fix inactive list balancing between NUMA nodes and cgroups
mm/hotplug: treat CMA pages as unmovable
proc: fixup proc-pid-vm test
proc: fix map_files test on F29
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
mm: swapoff: shmem_unuse() stop eviction without igrab()
mm: swapoff: take notice of completion sooner
mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
mm: swapoff: shmem_find_swap_entries() filter out other types
slab: store tagged freelist for off-slab slabmgmt
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently have to call nth_page when iterating over pages inside a
bio_vec. Jens complained a while ago that this is fairly expensive.
To mitigate this we can check that that the actual page structures
are contiguous when adding them to the bio, and just do check pointer
arithmetics later on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of needing a special macro to iterate over all pages in
a bvec just do a second passs over the whole bio. This also matches
what we do on the release side. The release side helper is moved
up to where we need the get helper to clearly express the symmetry.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No caller uses bio_iov_iter_get_pages multiple times on a given bio,
and that funtionality isn't all that useful. Removing it will make
some future changes a little easier and also simplifies the function
a bit.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return early on error, and add an unlikely annotation for that case.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When bio_add_pc_page() fails in bio_copy_user_iov() we should free
the page we just allocated otherwise we are leaking it.
Cc: linux-block@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the introduction of BIO_NO_PAGE_REF we've used up all available bits
in bio::bi_flags.
Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call
to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX
causing crashes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now block IO stack is basically ready for supporting multi-page bvec,
however it isn't enabled on passthrough IO.
One reason is that passthrough IO is dispatched to LLD directly and bio
split is bypassed, so the bio has to be built correctly for dispatch to
LLD from the beginning.
Implement multi-page support for passthrough IO by limitting each bvec
as block device's segment and applying all kinds of queue limit in
blk_add_pc_page(). Then we don't need to calculate segments any more for
passthrough IO any more, turns out code is simplified much.
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the added page is merged to last same page in bio_add_pc_page(),
the user may need to put this page for avoiding page leak.
bio_map_user_iov() needs this kind of handling, and now it deals with
it by itself in hack style.
Moves the handling of put page into __bio_add_pc_page(), so
bio_map_user_iov() may be simplified a bit, and maybe more users
can benefit from this change.
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now the check for deciding if one page is mergeable to current bvec
becomes a bit complicated, and we need to reuse the code before
adding pc page.
So move the check in one dedicated helper.
No function change.
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
REQ_PC is out of date, so replace it with passthrough IO.
Also remove the local variable of 'prev' since we can reuse
the top local variable of 'bvec'.
No function change.
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
XEN has special page merge requirement, see xen_biovec_phys_mergeable().
We can't merge pages into one bvec simply for XEN.
So move XEN's specific check on page merge into __bio_try_merge_page(),
then abvoid to break XEN by multi-page bvec.
Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
with NO_REF, then we don't need to add a page reference for the pages
that we add.
Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
not to drop a reference to these pages.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. For now, we grab a reference to those pages,
and release them normally on IO completion. This isn't really needed
for the normal case of O_DIRECT from/to a file, but some of the more
esoteric use cases (like splice(2)) will unconditionally put the
pipe buffer pages when the buffers are released. Until we can manage
that case properly, ITER_BVEC pages are treated like normal pages
in terms of reference counting.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch pulls the trigger for multi-page bvecs.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwtClAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpicLEACpQalDy7tUp+0/5VUMTiYksBqimtCoOu59
K9BsrRwdXnhAxBdD3a6cn442axKelg5NozdTAXFTNAFkuDUci0eBvVZFNBhuRaqY
Cp/ub8rF81viivDvF5kO8wWC745Zj63/BQjethXrTssVI3ZOV4lK+haiuJeXOegN
LwM64P5lHlhQkOn/FV5oDSWyrlffMrjtcqJ22Em0mxeHgneZXI0wVeJgGbbY8e33
GWGUb+sYpDM41ZDl7vyVSIDNHKzYMSbN7hdLh3fD+EWVFwI/F+lZU/LHbO+lKt1f
LU/mPXLrIkEvzhFwwGLQy10u6lo1US6sMoKfFcpKu8KRY4p7DyvoCLyGH2PtK2sR
+vX1LlWPHJsN9x/5TlfOnXrR0qChzqwMU3tgzQaCF7HOEx+3Xt7HpKRffJmVMNZP
anMqevN2OfjvpcBEs2jktAqiwmBZIPSQJ9OMqkPJIalIW4h3qDKXRKttpmWTWMeV
sDWNGj3hpukWba01vTxYOkz8/V58+ikzM26UAjTmTU9YvQ+TZBmu+azAuisedhqE
b66gXz8YLp6r10psSBnh1IThvhuDyjmofouWGuSWJRcEngzXbL6jDQhgqWWCzKUn
cW8Cs4ymvSwE5Qwwgs4wY8XYwyl5L9QGgfwLx+toMvSKq/G+ONA6FQ1Crp7zx4jq
bnNqy1iWNg==
=KYFj
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
- Dead code removal for loop/sunvdc (Chengguang)
- Mark BIDI support for bsg as deprecated, logging a single dmesg
warning if anyone is actually using it (Christoph)
- blkcg cleanup, killing a dead function and making the tryget_closest
variant easier to read (Dennis)
- Floppy fixes, one fixing a regression in swim3 (Finn)
- lightnvm use-after-free fix (Gustavo)
- gdrom leak fix (Wenwen)
- a set of drbd updates (Lars, Luc, Nathan, Roland)
* tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-block: (28 commits)
block/swim3: Fix regression on PowerBook G3
block/swim3: Fix -EBUSY error when re-opening device after unmount
block/swim3: Remove dead return statement
block/amiflop: Don't log error message on invalid ioctl
gdrom: fix a memory leak bug
lightnvm: pblk: fix use-after-free bug
block: sunvdc: remove redundant code
block: loop: remove redundant code
bsg: deprecate BIDI support in bsg
blkcg: remove unused __blkg_release_rcu()
blkcg: clean up blkg_tryget_closest()
drbd: Change drbd_request_detach_interruptible's return type to int
drbd: Avoid Clang warning about pointless switch statment
drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")
drbd: skip spurious timeout (ping-timeo) when failing promote
drbd: don't retry connection if peers do not agree on "authentication" settings
drbd: fix print_st_err()'s prototype to match the definition
drbd: avoid spurious self-outdating with concurrent disconnect / down
drbd: do not block when adjusting "disk-options" while IO is frozen
drbd: fix comment typos
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
2p4aHs6ktw==
=TRdW
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block/storage for 4.21.
Larger than usual, it was a busy round with lots of goodies queued up.
Most notable is the removal of the old IO stack, which has been a long
time coming. No new features for a while, everything coming in this
week has all been fixes for things that were previously merged.
This contains:
- Use atomic counters instead of semaphores for mtip32xx (Arnd)
- Cleanup of the mtip32xx request setup (Christoph)
- Fix for circular locking dependency in loop (Jan, Tetsuo)
- bcache (Coly, Guoju, Shenghui)
* Optimizations for writeback caching
* Various fixes and improvements
- nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
* host and target support for NVMe over TCP
* Error log page support
* Support for separate read/write/poll queues
* Much improved polling
* discard OOM fallback
* Tracepoint improvements
- lightnvm (Hans, Hua, Igor, Matias, Javier)
* Igor added packed metadata to pblk. Now drives without metadata
per LBA can be used as well.
* Fix from Geert on uninitialized value on chunk metadata reads.
* Fixes from Hans and Javier to pblk recovery and write path.
* Fix from Hua Su to fix a race condition in the pblk recovery
code.
* Scan optimization added to pblk recovery from Zhoujie.
* Small geometry cleanup from me.
- Conversion of the last few drivers that used the legacy path to
blk-mq (me)
- Removal of legacy IO path in SCSI (me, Christoph)
- Removal of legacy IO stack and schedulers (me)
- Support for much better polling, now without interrupts at all.
blk-mq adds support for multiple queue maps, which enables us to
have a map per type. This in turn enables nvme to have separate
completion queues for polling, which can then be interrupt-less.
Also means we're ready for async polled IO, which is hopefully
coming in the next release.
- Killing of (now) unused block exports (Christoph)
- Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
- Support for zoned testing with null_blk (Masato)
- sx8 conversion to per-host tag sets (Christoph)
- IO priority improvements (Damien)
- mq-deadline zoned fix (Damien)
- Ref count blkcg series (Dennis)
- Lots of blk-mq improvements and speedups (me)
- sbitmap scalability improvements (me)
- Make core inflight IO accounting per-cpu (Mikulas)
- Export timeout setting in sysfs (Weiping)
- Cleanup the direct issue path (Jianchao)
- Export blk-wbt internals in block debugfs for easier debugging
(Ming)
- Lots of other fixes and improvements"
* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
kyber: use sbitmap add_wait_queue/list_del wait helpers
sbitmap: add helpers for add/del wait queue handling
block: save irq state in blkg_lookup_create()
dm: don't reuse bio for flushes
nvme-pci: trace SQ status on completions
nvme-rdma: implement polling queue map
nvme-fabrics: allow user to pass in nr_poll_queues
nvme-fabrics: allow nvmf_connect_io_queue to poll
nvme-core: optionally poll sync commands
block: make request_to_qc_t public
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
nvme-tcp: fix endianess annotations
nvmet-tcp: fix endianess annotations
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
nvme-pci: only set nr_maps to 2 if poll queues are supported
nvmet: use a macro for default error location
nvmet: fix comparison of a u16 with -1
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
...
The implementation of blkg_tryget_closest() wasn't super obvious and
became a point of suspicion when debugging [1]. So let's clean it up so
it's obviously not the problem.
Also add missing RCU read locking to bio_clone_blkg_association(), which
got exposed by adding the RCU read lock held check in
blkg_tryget_closest().
[1] https://lore.kernel.org/linux-block/a7e97e4b-0dd8-3a54-23b7-a0f27b17fde8@kernel.dk/
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to zero fill the bio if not using kernel allocated pages.
Fixes: f3587d76da ("block: Clear kernel memory before copying to user") # v4.20-rc2
Reported-by: Todd Aiken <taiken@mvtech.ca>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: stable@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want to convert to per-cpu in_flight counters.
The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.
time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).
io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them. Just call
smp_processor_id() as needed.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget and now returns a bool rather than the blkg or %NULL.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that a bio only holds a blkg reference, so clean up is simply
putting back that reference. Remove bio_disassociate_task() as it just
calls bio_disassociate_blkg() and call the latter directly.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.
Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.
This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg(). In
this patch, wbc_init_bio() now requires a bio to have a device
associated with it.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_issue_init among other things initializes the timestamp for an IO.
Rather than have this logic handled by policies, this consolidates it to
be on the init paths (normal, clone, bounce clone).
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously, blkg association was handled by controller specific code in
blk-throttle and blk-iolatency. However, because a blkg represents a
relationship between a blkcg and a request_queue, it makes sense to keep
the blkg->q and bio->bi_disk->queue consistent.
This patch moves association into the bio_set_dev macro(). This should
cover the majority of cases where the device is set/changed keeping the
two pointers consistent. Fallback code is added to
blkcg_bio_issue_check() to catch any missing paths.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The next patch changes the macro bio_set_dev() to associate a bio with a
blkg based on the device set. However, dm creates a static bio to be
used as the basis for cloning empty flush bios on creation. The
bio_set_dev() call in alloc_dev() will cause problems with the next
patch adding association to bio_set_dev() because the call is before the
bdev is associated with a gendisk (bd_disk is %NULL). To get around
this, set the device on the static bio every time and use that to clone
to the other bios.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are 3 ways blkg association can happen: association with the
current css, with the page css (swap), or from the wbc css (writeback).
This patch handles how association is done for the first case where we
are associating bsaed on the current css. If there is already a blkg
associated, the css will be reused and association will be redone as the
request_queue may have changed.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are several scenarios where blkg_lookup_create() can fail such as
the blkcg dying, request_queue is dying, or simply being OOM. Most
handle this by simply falling back to the q->root_blkg and calling it a
day.
This patch implements the notion of closest blkg. During
blkg_lookup_create(), if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest() is introduced and used
during association so a bio is always attached to a blkg.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio_blkcg() function turns out to be inconsistent and consequently
dangerous to use. The first part returns a blkcg where a reference is
owned by the bio meaning it does not need to be rcu protected. However,
the third case, the last line, is problematic:
return css_to_blkcg(task_css(current, io_cgrp_id));
This can race against task migration and the cgroup dying. It is also
semantically different as it must be called rcu protected and is
susceptible to failure when trying to get a reference to it.
This patch adds association ahead of calling bio_blkcg() rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg(). In blk-iolatency, association is
moved above the bio_blkcg() call to ensure it will not return %NULL.
BFQ uses the old bio_blkcg() function, but I do not want to address it
in this series due to the complexity. I have created a private version
documenting the inconsistency and noting not to use it.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio->bi_ioc is never set so always NULL. Remove references to it in
bio_disassociate_task() and in rq_ioc() and delete this field from
struct bio. With this change, rq_ioc() always returns
current->io_context without the need for a bio argument. Further
simplify the code and make it more readable by also removing this
helper, which also allows to simplify blk_mq_sched_assign_ioc() by
removing its bio argument.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.
Fixes: 43b62ce3ff ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixed up subject, and ordered stores.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvl3xoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsoUEACecROMzIey+QOJ9o0YSaK5tlBWoC1OyK72
ncHCyxw+uW+qWa4Csosib7BRomz3E0nsQzf0b3MJiiZJGXnbfxRP0GWDrriK+635
UeVmWQfU2Ga2QmXGpnlSSF7mHHZWOWG7hUonGUBgCOiN2tLymHrnjmmPf7TuDuav
CliIgjan124MamohXyjcIa0EtySZmkTlJAxOtoOlxXwitLWnd28Vb3kOhG4dHXQs
lKO94R2souY00PX/e/8GtkAf1Em+5yJea/LAcOW8+qHA1WYzDqp1/TG/STLX0MVP
rgqg4E8j1LzCCmunaneIuSeo6yQhZPF2OHMGD8ERn+5c5RY0J4wsBAW2ZO0aqmBP
nfv+ZdZa71NU+pEEkOe79B5l1JiQNwTCu1ay/kfSGQpAnd2ejnY6QqiLnfCEgqWG
mcZi/JKcUst8I+MCV9vSjzl9GvMY7Wn5Fy031T4Qog2F0ZPns3FyNLuOLXXjkH7C
h5uliwuvLV3TpVWgWfVh8R7/j3M0dLndl5G5EYHwkb7It59r4rmNfgm1llZPFhWq
TENDvq5vLJaPXh5NIyxR1TowJNfgVembr5XiSJ7nOt8IFfJvkcfVIbQqRcpBk8vG
ygde6xKwGSxsfRoIhNTFU0mlphFpNgske600oZkm6PZq6DnivS9/4PD77nIH7p5v
6kR8aG43uQ==
=Nry6
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181109' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
- Two fixes for an ubd regression, one for missing locking, and one for
a missing initialization of a field. The latter was an old latent
bug, but it's now visible and triggers (Me, Anton Ivanov)
- Set of NVMe fixes via Christoph, but applied manually due to a git
tree mixup (Christoph, Sagi)
- Fix for a discard split regression, in three patches (Ming)
- Update libata git trees (Geert)
- SPDX identifier for sata_rcar (Kuninori Morimoto)
- Virtual boundary merge fix (Johannes)
- Preemptively clear memory we are going to pass to userspace, in case
the driver does a short read (Keith)
* tag 'for-linus-20181109' of git://git.kernel.dk/linux-block:
block: make sure writesame bio is aligned with logical block size
block: cleanup __blkdev_issue_discard()
block: make sure discard bio is aligned with logical block size
Revert "nvmet-rdma: use a private workqueue for delete"
nvme: make sure ns head inherits underlying device limits
nvmet: don't try to add ns to p2p map unless it actually uses it
sata_rcar: convert to SPDX identifiers
ubd: fix missing initialization of io_req
block: Clear kernel memory before copying to user
MAINTAINERS: Fix remaining pointers to obsolete libata.git
ubd: fix missing lock around request issue
block: respect virtual boundary mask in bvecs
If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.
Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>