WSL2-Linux-Kernel

History

Glauber Costa 3932a86b4b cfq: fix starvation of asynchronous writes

While debugging timeouts happening in my application workload (ScyllaDB), I have observed calls to open() taking a long time, ranging anywhere from 2 seconds - the first ones that are enough to time out my application - to more than 30 seconds.

The problem seems to happen because XFS may block on pending metadata updates under certain circumstances, and that's confirmed with the following backtrace taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very high "Dispatch wait" times, in the range of dozens of seconds and consistent with the open() times I saw in that run.

Still from the blktrace output, we can, after searching a bit, identify the request that wasn't dispatched:

    8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
    8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
    8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
    8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
    <==== 'I' means Inserted (into the IO scheduler) ===================================>
    8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
    <==== Only 15s later the CFQ scheduler dispatches the request ======================>

As we can see above, in this particular example CFQ took 15 seconds to dispatch this request. Going back to the full trace, we can see that the xfsaild queue had plenty of opportunity to run, and it was selected as the active queue many times. It would just always be preempted by something else (example):

    8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
    8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
    8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
    8,0    1        0    81.117914398     0  m   N cfq767A / slice expired t=1
    8,0    1        0    81.117914755     0  m   N cfq767A / resid=40
    8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
    8,0    1        0    81.117915858     0  m   N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests. While it can be argued that preempting ASYNC requests in favor of SYNC is part of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's goal.

Moreover, unless I am misunderstanding something, that breaks the expectation set by the "fifo_expire_async" tunable, which in my system is set to the default.

Looking at the code, it seems to me that the issue is that after we make an async queue active, there is no guarantee that it will execute any request.

When the queue itself checks cfq_may_dispatch(), it can bail if it sees SYNC requests in flight. An incoming request from another queue can also preempt it in such a situation before we have the chance to execute anything (as seen in the trace above).

This patch sets the must_dispatch flag if we notice that we have requests that are already fifo_expired. This flag is always cleared after cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin the queue for subsequent requests (unless they are themselves expired).

Care is taken during preempt to still allow rt requests to preempt us regardless.

Testing my workload with this patch applied produces much better results. From the application side I see no timeouts, and the open() latency histogram generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):

     value |-------------------------------------------------- count
         0 |                                                     11
         1 |@@@@                                                161
         2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1966
         4 |@                                                    54
         8 |                                                     36
        16 |                                                      7
        32 |                                                      0
        64 |                                                      0
           ~
      1024 |                                                      0
      2048 |                                                      0
      4096 |                                                      1
      8192 |                                                      1
     16384 |                                                      2
     32768 |                                                      0
     65536 |                                                      0
    131072 |                                                      1
    262144 |                                                      0
    524288 |                                                      0

Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

2016-09-23 10:01:24 -06:00
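The dispatch rule described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in cfq-iosched.c: every name here (`toy_queue`, `toy_check_fifo`, `toy_may_dispatch`, `toy_should_preempt`, `toy_dispatch_one`) is hypothetical, time is an abstract tick counter, and the real scheduler considers many more conditions. It shows only the three behaviours the message names: an expired FIFO request sets a must-dispatch flag, the flag blocks preemption by non-rt queues, and the flag is cleared once a request has been dispatched.

```c
#include <stdbool.h>

/* Toy model, NOT kernel code: all names below are hypothetical. */
struct toy_queue {
    bool is_sync;          /* queue carries synchronous requests */
    bool is_rt;            /* queue belongs to the real-time class */
    bool must_dispatch;    /* oldest request is past its FIFO deadline */
    long oldest_deadline;  /* FIFO deadline of the oldest request, in ticks */
};

/* Set the flag when the oldest queued request has already expired. */
static void toy_check_fifo(struct toy_queue *q, long now)
{
    if (now >= q->oldest_deadline)
        q->must_dispatch = true;
}

/* An async queue normally yields while sync requests are in flight,
 * unless it has been marked as having to dispatch. */
static bool toy_may_dispatch(const struct toy_queue *q, int sync_in_flight)
{
    if (q->must_dispatch)
        return true;
    if (!q->is_sync && sync_in_flight > 0)
        return false;
    return true;
}

/* Decide whether new_q may preempt the currently active queue. */
static bool toy_should_preempt(const struct toy_queue *active,
                               const struct toy_queue *new_q)
{
    if (new_q->is_rt && !active->is_rt)
        return true;               /* rt may preempt regardless */
    if (active->must_dispatch)
        return false;              /* let the expired request go out first */
    return new_q->is_sync && !active->is_sync;
}

/* Try to dispatch one request; the flag is cleared after a dispatch so
 * the queue is not pinned for later, non-expired requests. */
static bool toy_dispatch_one(struct toy_queue *q, long now, int sync_in_flight)
{
    toy_check_fifo(q, now);
    if (!toy_may_dispatch(q, sync_in_flight))
        return false;
    q->must_dispatch = false;
    return true;
}
```

In this model, an async queue whose oldest request has a deadline of 250 ticks is refused dispatch and can be preempted by a sync queue at tick 100. At tick 300 it dispatches even with sync requests in flight, and only an rt queue can still preempt it.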
..
partitions	block: atari: Return early for unsupported sector size	2016-07-13 09:31:44 -07:00
Kconfig	blk-mq: abstract tag allocation out into sbitmap library	2016-09-17 08:38:44 -06:00
Kconfig.iosched	…
Makefile	Initial roundup of 4.5 merge window patches	2016-01-23 18:45:06 -08:00
badblocks.c	block, badblocks: introduce devm_init_badblocks	2016-01-09 08:39:04 -08:00
bio-integrity.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
bio.c	block: export bio_free_pages to other modules	2016-09-22 07:48:03 -06:00
blk-cgroup.c	block/blk-cgroup.c: Declare local symbols static	2016-06-14 09:09:33 -06:00
blk-core.c	block: add poll_considered statistic	2016-09-14 08:41:21 -06:00
blk-exec.c	block: Fix spelling in a source code comment	2016-07-20 21:28:22 -06:00
blk-flush.c	block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH	2016-06-07 13:41:38 -06:00
blk-integrity.c	block, libnvdimm, nvme: provide a built-in blk_integrity nop profile	2015-10-21 14:43:45 -06:00
blk-ioc.c	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd	2015-11-06 17:50:42 -08:00
blk-lib.c	Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block	2016-07-26 15:37:51 -07:00
blk-map.c	block: simplify and export blk_rq_append_bio	2016-07-20 17:38:32 -06:00
blk-merge.c	block: make sure a big bio is split into at most 256 bvecs	2016-08-24 08:17:24 -06:00
blk-mq-cpu.c	…
blk-mq-cpumap.c	blk-mq: Avoid memoryless numa node encoded in hctx numa_node	2015-12-03 09:56:27 -07:00
blk-mq-sysfs.c	blk-mq: register device instead of disk	2016-09-21 07:56:16 -06:00
blk-mq-tag.c	sbitmap: randomize initial alloc_hint values	2016-09-17 08:39:14 -06:00
blk-mq-tag.h	sbitmap: randomize initial alloc_hint values	2016-09-17 08:39:14 -06:00
blk-mq.c	blk-mq: add flag for drivers wanting blocking ->queue_rq()	2016-09-22 14:28:38 -06:00
blk-mq.h	sbitmap: push per-cpu last_tag into sbitmap_queue	2016-09-17 08:39:10 -06:00
blk-settings.c	block: kill off q->flush_flags	2016-04-13 13:33:19 -06:00
blk-softirq.c	…
blk-sysfs.c	blk-mq: register device instead of disk	2016-09-21 07:56:16 -06:00
blk-tag.c	…
blk-throttle.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
blk-timeout.c	block: remove REQ_NO_TIMEOUT flag	2015-12-22 09:38:34 -07:00
blk.h	block: simplify and export blk_rq_append_bio	2016-07-20 17:38:32 -06:00
bounce.c	Merge branch 'for-linus' of git://git.kernel.dk/linux-block	2015-09-19 18:57:09 -07:00
bsg-lib.c	…
bsg.c	…
cfq-iosched.c	cfq: fix starvation of asynchronous writes	2016-09-23 10:01:24 -06:00
cmdline-parser.c	…
compat_ioctl.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
deadline-iosched.c	block: do not merge requests without consulting with io scheduler	2016-07-20 21:35:12 -06:00
elevator.c	block: Fix secure erase	2016-08-16 09:16:51 -06:00
genhd.c	block: fix bdi vs gendisk lifetime mismatch	2016-08-04 14:19:16 -06:00
ioctl.c	DAX error handling for 4.7	2016-05-26 19:34:26 -07:00
ioprio.c	block: fix use-after-free in sys_ioprio_get()	2016-07-01 08:39:24 -06:00
noop-iosched.c	elevator: use list_{first,prev,next}_entry	2015-11-16 15:21:48 -07:00
partition-generic.c	block/partition-generic.c: Remove a set-but-not-used variable	2016-06-14 09:09:15 -06:00
scsi_ioctl.c	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM	2015-11-06 17:50:42 -08:00
t10-pi.c	block: Consolidate static integrity profile properties	2015-10-21 14:42:38 -06:00