WSL2-Linux-Kernel/block
Tejun Heo af9452dfdb block: fix rq-qos breakage from skipping rq_qos_done_bio()
[ Upstream commit aa1b46dcdc ]

a647a524a4 ("block: don't call rq_qos_ops->done_bio if the bio isn't
tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
While this fixed a potential oops, it also broke blk-iocost by skipping the
done_bio callback for merged bios.

Before, whether a bio goes through rq_qos_throttle() or rq_qos_merge(),
rq_qos_done_bio() would be called on the bio on completion with BIO_TRACKED
distinguishing the former from the latter. rq_qos_done_bio() is not called
for bios which wenth through rq_qos_merge(). This royally confuses
blk-iocost as the merged bios never finish and are considered perpetually
in-flight.

One reliably reproducible failure mode is an intermediate cgroup geting
stuck active preventing its children from being activated due to the
leaf-only rule, leading to loss of control. The following is from
resctl-bench protection scenario which emulates isolating a web server like
workload from a memory bomb run on an iocost configuration which should
yield a reasonable level of protection.

  # cat /sys/block/nvme2n1/device/model
  Samsung SSD 970 PRO 512GB
  # cat /sys/fs/cgroup/io.cost.model
  259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
  # cat /sys/fs/cgroup/io.cost.qos
  259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
  # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
  ...
  Memory Hog Summary
  ==================

  IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
              W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m

  Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
  isol%       15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
  lat-imp%        0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6

  Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%

The isolation result of 58.12% is close to what this device would show
without any IO control.

Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and
calling rq_qos_done_bio() on them too. For consistency and clarity, rename
BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
rq_qos_done_bio() so that it's next to the code paths that set the flags.

With the patch applied, the above same benchmark shows:

  # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
  ...
  Memory Hog Summary
  ==================

  IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
              W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m

  Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
  isol%       84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
  lat-imp%        0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68

  Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a647a524a4 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
Cc: stable@vger.kernel.org # v5.15+
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-12 16:34:57 +02:00
..
partitions block: drop unused includes in <linux/genhd.h> 2022-03-16 14:23:46 +01:00
Kconfig SCSI misc on 20210902 2021-09-02 15:09:46 -07:00
Kconfig.iosched Revert "block/mq-deadline: Add cgroup support" 2021-08-11 13:47:26 -06:00
Makefile block-5.15-2021-09-11 2021-09-11 10:19:51 -07:00
badblocks.c
bdev.c block: simplify the block device syncing code 2022-04-27 14:38:50 +02:00
bfq-cgroup.c bfq: Make sure bfqg for which we are queueing requests is online 2022-06-09 10:23:19 +02:00
bfq-iosched.c bfq: Get rid of __bio_blkcg() usage 2022-06-09 10:23:19 +02:00
bfq-iosched.h bfq: Get rid of __bio_blkcg() usage 2022-06-09 10:23:19 +02:00
bfq-wf2q.c block/bfq_wf2q: correct weight to ioprio 2022-04-08 14:23:55 +02:00
bio-integrity.c block: bio-integrity: Advance seed correctly for larger interval sizes 2022-02-08 18:34:05 +01:00
bio.c block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-07-12 16:34:57 +02:00
blk-cgroup-rwstat.c
blk-cgroup-rwstat.h
blk-cgroup.c block: fix bio_clone_blkg_association() to associate with proper blkcg_gq 2022-06-09 10:23:32 +02:00
blk-core.c block: release rq qos structures for queue without disk 2022-03-23 09:16:41 +01:00
blk-crypto-fallback.c
blk-crypto-internal.h
blk-crypto.c blk-crypto: fix check for too-large dun_bytes 2021-08-25 06:45:00 -06:00
blk-exec.c
blk-flush.c block: Fix fsync always failed if once failed 2022-01-27 11:05:25 +01:00
blk-integrity.c block: flush the integrity workqueue in blk_integrity_unregister 2021-09-14 20:03:30 -06:00
blk-ioc.c
blk-iocost.c iocost: don't reset the inuse weight of under-weighted debtors 2022-05-09 09:14:31 +02:00
blk-iolatency.c block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-07-12 16:34:57 +02:00
blk-ioprio.c
blk-ioprio.h
blk-lib.c
blk-map.c block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern 2022-03-08 19:12:31 +01:00
blk-merge.c block: don't merge across cgroup boundaries if blkcg is enabled 2022-04-08 14:22:59 +02:00
blk-mq-cpumap.c
blk-mq-debugfs-zoned.c
blk-mq-debugfs.c block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output 2021-10-04 06:58:39 -06:00
blk-mq-debugfs.h
blk-mq-pci.c
blk-mq-rdma.c
blk-mq-sched.c block: limit request dispatch loop duration 2022-04-08 14:22:59 +02:00
blk-mq-sched.h
blk-mq-sysfs.c block: remove blk-mq-sysfs dead code 2021-08-02 13:37:29 -06:00
blk-mq-tag.c blk-mq: avoid to iterate over stale request 2021-09-12 19:32:43 -06:00
blk-mq-tag.h
blk-mq-virtio.c
blk-mq.c block: Fix handling of offline queues in blk_mq_alloc_request_hctx() 2022-06-22 14:22:02 +02:00
blk-mq.h blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() 2021-12-01 09:04:56 +01:00
blk-pm.c scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume() 2022-01-27 11:04:15 +01:00
blk-pm.h
blk-rq-qos.c
blk-rq-qos.h block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-07-12 16:34:57 +02:00
blk-settings.c block: Fix partition check for host-aware zoned block devices 2021-10-27 06:58:01 -06:00
blk-stat.c
blk-stat.h
blk-sysfs.c block: don't delete queue kobject before its children 2022-04-08 14:23:07 +02:00
blk-throttle.c blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() 2021-09-07 08:36:56 -06:00
blk-timeout.c
blk-wbt.c blk-wbt: prevent NULL pointer dereference in wb_timer_fn 2021-11-18 19:16:34 +01:00
blk-wbt.h
blk-zoned.c block: Hold invalidate_lock in BLKRESETZONE ioctl 2021-11-18 19:17:15 +01:00
blk.h block: bump max plugged deferred size from 16 to 32 2021-11-18 19:16:16 +01:00
bounce.c block: use memcpy_from_bvec in __blk_queue_bounce 2021-08-02 13:37:28 -06:00
bsg-lib.c scsi: bsg-lib: Fix commands without data transfer in bsg_transport_sg_io_fn() 2021-08-01 13:21:40 -04:00
bsg.c scsi: bsg: Fix device unregistration 2021-09-14 00:22:15 -04:00
disk-events.c block: return errors from disk_alloc_events 2021-08-23 12:55:45 -06:00
elevator.c block/wbt: fix negative inflight counter when remove scsi device 2022-02-23 12:03:15 +01:00
fops.c block: hold ->invalidate_lock in blkdev_fallocate 2021-09-24 11:06:58 -06:00
genhd.c block: Fix the maximum minor value is blk_alloc_ext_minor() 2022-04-08 14:24:11 +02:00
holder.c block: drop unused includes in <linux/genhd.h> 2022-03-16 14:23:46 +01:00
ioctl.c block/compat_ioctl: fix range check in BLKGETSIZE 2022-04-27 14:39:02 +02:00
ioprio.c block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2) 2021-12-14 10:57:16 +01:00
keyslot-manager.c
kyber-iosched.c kyber: avoid q->disk dereferences in trace points 2021-10-15 21:02:57 -06:00
mq-deadline.c block: fix async_depth sysfs interface for mq-deadline 2022-01-27 11:05:25 +01:00
opal_proto.h
sed-opal.c
t10-pi.c block: use bvec_kmap_local in t10_pi_type1_{prepare,complete} 2021-08-02 13:37:28 -06:00