WSL2-Linux-Kernel

History

Filipe Manana 4a0123bdb0 btrfs: fallback to blocking mode when doing async dio over multiple extents

commit `ca93e44bfb` upstream.

Some users recently reported that MariaDB was getting a read corruption when using io_uring on top of btrfs. This started to happen in 5.16, after commit `51bd9563b6` ("btrfs: fix deadlock due to page faults during direct IO reads and writes"). That changed btrfs to use the new iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling iomap_dio_rw(). This was necessary to fix deadlocks when the iovector corresponds to a memory mapped file region. That type of scenario is exercised by test case generic/647 from fstests.

For this MariaDB scenario, we attempt to read 16K from file offset X using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each with a size of 4K, and what happens is the following:

1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();

2) iomap creates a struct iomap_dio object, its reference count is initialized to 1 and its ->size field is initialized to 0;

3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds the first 4K extent, and sets up an iomap for this extent consisting of a single page;

4) At iomap_dio_bio_iter(), we are able to access the first page of the buffer (struct iov_iter) with bio_iov_iter_get_pages() without triggering a page fault;

5) iomap submits a bio for this 4K extent (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments the refcount on the struct iomap_dio object to 2; The ->size field of the struct iomap_dio object is incremented to 4K;

6) iomap calls btrfs_dio_iomap_begin() again, this time with a file offset of X + 4K. There we set up an iomap for the next extent that also has a size of 4K;

7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(), which tries to access the next page (2nd page) of the buffer. This triggers a page fault and returns -EFAULT;

8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and the struct iomap_dio object has a ->size value of 4K (we submitted a bio for an extent already). The 'wait_for_completion' variable is not set to true, because our iocb has IOCB_NOWAIT set;

9) At the bottom of __iomap_dio_rw(), we decrement the reference count of the struct iomap_dio object from 2 to 1. Because we were not the only ones holding a reference on it and 'wait_for_completion' is set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which just returns it up the callchain, up to io_uring;

10) The bio submitted for the first extent (step 5) completes and its bio endio function, iomap_dio_bio_end_io(), decrements the last reference on the struct iomap_dio object, resulting in calling iomap_dio_complete_work() -> iomap_dio_complete().

11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K and return 4K (the amount of io done) to iomap_dio_complete_work();

12) iomap_dio_complete_work() calls the iocb completion callback, iocb->ki_complete() with a second argument value of 4K (total io done) and the iocb with the adjusted ki_pos of X + 4K. This results in completing the read request for io_uring, leaving it with a result of 4K bytes read, and only the first page of the buffer filled in, while the remaining 3 pages, corresponding to the other 3 extents, were not filled;

13) For the application, the result is unexpected because if we ask to read N bytes, it expects to get N bytes read as long as those N bytes don't cross the EOF (i_size).

MariaDB reports this as an error, as it's not expecting a short read, since it knows it's asking for read operations fully within the i_size boundary. This is typical in many applications, but it may also be questionable if they should react to such short reads by issuing more read calls to get the remaining data. Nevertheless, the short read happened due to a change in btrfs regarding how it deals with page faults while in the middle of a read operation, and there's no reason why btrfs can't have the previous behaviour of returning the whole data that was requested by the application.

The problem can also be triggered with the following simple program:

    /* Get O_DIRECT */
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <string.h>
    #include <liburing.h>

    int main(int argc, char *argv[])
    {
        char *foo_path;
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec iovec;
        int fd;
        long pagesize;
        void *write_buf;
        void *read_buf;
        ssize_t ret;
        int i;

        if (argc != 2) {
            fprintf(stderr, "Use: %s <directory>\n", argv[0]);
            return 1;
        }

        foo_path = malloc(strlen(argv[1]) + 5);
        if (!foo_path) {
            fprintf(stderr, "Failed to allocate memory for file path\n");
            return 1;
        }
        strcpy(foo_path, argv[1]);
        strcat(foo_path, "/foo");

        /*
         * Create file foo with 2 extents, each with a size matching
         * the page size. Then allocate a buffer to read both extents
         * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
         * the read with io_uring, access the first page of the buffer
         * to fault it in, so that during the read we only trigger a
         * page fault when accessing the second page of the buffer.
         */
        fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0666);
        if (fd == -1) {
            fprintf(stderr,
                    "Failed to create file 'foo': %s (errno %d)",
                    strerror(errno), errno);
            return 1;
        }

        pagesize = sysconf(_SC_PAGE_SIZE);
        ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
        if (ret) {
            fprintf(stderr, "Failed to allocate write buffer\n");
            return 1;
        }

        memset(write_buf, 0xab, pagesize);
        memset(write_buf + pagesize, 0xcd, pagesize);

        /* Create 2 extents, each with a size matching page size. */
        for (i = 0; i < 2; i++) {
            ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                         i * pagesize);
            if (ret != pagesize) {
                fprintf(stderr,
                        "Failed to write to file, ret = %ld errno %d (%s)\n",
                        ret, errno, strerror(errno));
                return 1;
            }
            ret = fsync(fd);
            if (ret != 0) {
                fprintf(stderr, "Failed to fsync file\n");
                return 1;
            }
        }

        close(fd);
        fd = open(foo_path, O_RDONLY | O_DIRECT);
        if (fd == -1) {
            fprintf(stderr,
                    "Failed to open file 'foo': %s (errno %d)",
                    strerror(errno), errno);
            return 1;
        }

        ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
        if (ret) {
            fprintf(stderr, "Failed to allocate read buffer\n");
            return 1;
        }

        /*
         * Fault in only the first page of the read buffer.
         * We want to trigger a page fault for the 2nd page of the
         * read buffer during the read operation with io_uring
         * (O_DIRECT and IOCB_NOWAIT).
         */
        memset(read_buf, 0, 1);

        ret = io_uring_queue_init(1, &ring, 0);
        if (ret != 0) {
            fprintf(stderr, "Failed to create io_uring queue\n");
            return 1;
        }

        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
            fprintf(stderr, "Failed to get io_uring sqe\n");
            return 1;
        }

        iovec.iov_base = read_buf;
        iovec.iov_len = 2 * pagesize;
        io_uring_prep_readv(sqe, fd, &iovec, 1, 0);

        ret = io_uring_submit_and_wait(&ring, 1);
        if (ret != 1) {
            fprintf(stderr, "Failed at io_uring_submit_and_wait()\n");
            return 1;
        }

        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0) {
            fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
            return 1;
        }

        printf("io_uring read result for file foo:\n\n");
        printf("  cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
        printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
               memcmp(read_buf, write_buf, 2 * pagesize));

        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);

        return 0;
    }

When running it on an unpatched kernel:

    $ gcc io_uring_test.c -luring
    $ mkfs.btrfs -f /dev/sda
    $ mount /dev/sda /mnt/sda
    $ ./a.out /mnt/sda
    io_uring read result for file foo:

      cqe->res == 4096 (expected 8192)
      memcmp(read_buf, write_buf) == -205 (expected 0)

After this patch, the read always returns 8192 bytes, with the buffer filled with the correct data. Although that reproducer always triggers the bug in my test VMs, it's possible that it will not be so reliable in other environments, as that can happen if the bio for the first extent completes and decrements the reference on the struct iomap_dio object before we do the atomic_dec_and_test() on the reference at __iomap_dio_rw().

Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag set) over a range that spans multiple extents (or a mix of extents and holes). This avoids returning success to the caller when we only did partial IO, which is not optimal for writes and for reads it's actually incorrect, as the caller doesn't expect to get fewer bytes read than it has requested (unless EOF is crossed), as previously mentioned. This is also the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()), even though it doesn't use IOMAP_DIO_PARTIAL.

A test case for fstests will follow soon.

Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2022-05-01 17:22:34 +02:00
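The fix described in the commit message above (return -EAGAIN for a non-blocking request whose range spans more than one extent) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `struct model_extent` and `model_iomap_begin()` are hypothetical names invented here, and the model ignores holes, locking, and everything else btrfs_dio_iomap_begin() actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one file extent covering [start, start + len). */
struct model_extent {
    size_t start;
    size_t len;
};

/*
 * Userspace model of the check described in the commit message: map the
 * extent that contains 'offset' and, for a non-blocking (NOWAIT) request,
 * refuse with -EAGAIN when that single extent does not cover the whole
 * requested range.
 *
 * Returns the number of bytes mapped (> 0), -EAGAIN for a NOWAIT request
 * that spans more than one extent, or -ENOENT if no extent contains 'offset'.
 */
static long model_iomap_begin(const struct model_extent *extents, size_t count,
                              size_t offset, size_t length, bool nowait)
{
    assert(extents != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        size_t end = extents[i].start + extents[i].len;

        if (offset < extents[i].start || offset >= end)
            continue;

        /* Bytes of the request that this one extent can satisfy. */
        size_t mapped = end - offset;
        if (mapped > length)
            mapped = length;

        /* The range continues past this extent: ask for a blocking retry. */
        if (nowait && mapped < length)
            return -EAGAIN;

        return (long)mapped;
    }
    return -ENOENT;
}
```

With four 4K extents at offsets 0, 4K, 8K and 12K, as in the MariaDB scenario, a 16K NOWAIT request at offset 0 yields -EAGAIN in this model, while the same request without NOWAIT maps the first 4K and leaves the rest to later iterations. A 4K NOWAIT request that fits inside one extent is mapped as usual.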
..
9p	Revert "fs/9p: search open fids first"	2022-02-08 18:34:04 +01:00
adfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
affs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
afs	afs: Fix mmap	2021-12-22 09:32:45 +01:00
autofs	autofs: fix wait name hash calculation in autofs_wait()	2021-10-20 21:09:02 -04:00
befs	isystem: ship and use stdarg.h	2021-08-19 09:02:55 +09:00
bfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
btrfs	btrfs: fallback to blocking mode when doing async dio over multiple extents	2022-05-01 17:22:34 +02:00
cachefiles	cachefiles: Change %p in format strings to something else	2021-08-27 13:34:02 +01:00
ceph	ceph: fix memory leak in ceph_readdir when note_last_dentry returns error	2022-04-13 20:59:10 +02:00
cifs	cifs: Check the IOCB_DIRECT flag, not O_DIRECT	2022-04-27 14:38:56 +02:00
coda	…
configfs	configfs: fix a race in configfs_{,un}register_subsystem()	2022-03-02 11:48:02 +01:00
cramfs	…
crypto	fscrypt: allow 256-bit master keys with AES-256-XTS	2021-11-18 19:16:11 +01:00
debugfs	debugfs: lockdown: Allow reading debugfs files that are not world readable	2022-01-27 11:03:55 +01:00
devpts	fsnotify: fix fsnotify hooks in pseudo filesystems	2022-02-01 17:27:01 +01:00
dlm	fs: dlm: filter user dlm messages for kernel locks	2022-01-27 11:04:23 +01:00
ecryptfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
efivarfs	…
efs	…
erofs	iomap: Add done_before argument to iomap_dio_rw	2022-05-01 17:22:32 +02:00
exfat	exfat: fix i_blocks for files truncated over 4 GiB	2022-03-08 19:12:32 +01:00
exportfs	…
ext2	ext2: correct max file size computing	2022-04-08 14:23:35 +02:00
ext4	iomap: Add done_before argument to iomap_dio_rw	2022-05-01 17:22:32 +02:00
f2fs	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable	2022-05-01 17:22:28 +02:00
fat	linux-kselftest-kunit-5.15-rc1	2021-09-02 12:32:12 -07:00
freevxfs	…
fscache	fscache: Remove an unused static variable	2021-10-04 22:13:12 +01:00
fuse	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable	2022-05-01 17:22:28 +02:00
gfs2	gfs2: Fix mmap + page fault deadlocks for direct I/O	2022-05-01 17:22:33 +02:00
hfs	hfs: add lock nesting notation to hfs_find_init	2021-07-15 10:13:49 -07:00
hfsplus	hfsplus: report create_date to kstat.btime	2021-07-01 11:06:06 -07:00
hostfs	hostfs: support splice_write	2021-08-26 22:28:02 +02:00
hpfs	hpfs: use iomap_fiemap to implement ->fiemap	2021-07-27 11:00:36 +02:00
hugetlbfs	mm, hugetlb: allow for "high" userspace addresses	2022-04-27 14:38:57 +02:00
iomap	iomap: Add done_before argument to iomap_dio_rw	2022-05-01 17:22:32 +02:00
isofs	isofs: Fix out of bound access for corrupted isofs image	2021-11-12 15:05:50 +01:00
jbd2	jbd2: fix a potential race while discarding reserved buffers after an abort	2022-04-27 14:39:02 +02:00
jffs2	jffs2: fix memory leak in jffs2_scan_medium	2022-04-08 14:22:53 +02:00
jfs	jfs: prevent NULL deref in diFree	2022-04-13 20:59:13 +02:00
kernfs	kernfs: don't create a negative dentry if inactive node exists	2021-10-04 10:27:18 +02:00
ksmbd	ksmbd: don't align last entry offset in smb2 query directory	2022-02-23 12:03:18 +01:00
lockd	lockd: fix failure to cleanup client locks	2022-02-05 12:38:57 +01:00
minix	minix: fix bug when opening a file with O_DIRECT	2022-04-13 20:59:10 +02:00
netfs	netfs: fix parameter of cleanup()	2021-12-29 12:28:59 +01:00
nfs	NFSv4: fix open failure with O_ACCMODE flag	2022-04-13 20:59:15 +02:00
nfs_common	nfs: Fix kerneldoc warning shown up by W=1	2021-10-04 22:02:17 +01:00
nfsd	NFSD: Fix nfsd_breaker_owns_lease() return values	2022-04-08 14:23:57 +02:00
nilfs2	Merge branch 'akpm' (patches from Andrew)	2021-09-08 12:55:35 -07:00
nls	…
notify	fanotify: Fix stale file descriptor in copy_event_to_user()	2022-02-05 12:38:59 +01:00
ntfs	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable	2022-05-01 17:22:28 +02:00
ntfs3	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable	2022-05-01 17:22:28 +02:00
ocfs2	ocfs2: fix crash when mount with quota enabled	2022-04-08 14:22:56 +02:00
omfs	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
openpromfs	…
orangefs	orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()	2022-01-20 09:13:13 +01:00
overlayfs	ovl: fix NULL pointer dereference in copy up warning	2022-02-05 12:38:59 +01:00
proc	proc: bootconfig: Add null pointer check	2022-04-08 14:24:12 +02:00
pstore	pstore: Don't use semaphores in always-atomic-context code	2022-04-08 14:23:01 +02:00
qnx4	qnx4: work around gcc false positive warning bug	2021-09-21 08:36:48 -07:00
qnx6	…
quota	quota: make dquot_quota_sync return errors from ->sync_fs	2022-02-23 12:03:06 +01:00
ramfs	fs: move ramfs_aops to libfs	2021-06-29 10:53:48 -07:00
reiserfs	Kbuild updates for v5.15	2021-09-03 15:33:47 -07:00
romfs	…
smbfs_common	cifs: Fix crash on unload of cifs_arc4.ko	2021-12-14 10:57:12 +01:00
squashfs	squashfs: use bvec_virt	2021-08-16 10:50:32 -06:00
sysfs	sysfs: Allow deferred execution of iomem_get_mapping()	2021-08-06 13:05:28 +02:00
sysv	mm: require ->set_page_dirty to be explicitly wired up	2021-06-29 10:53:48 -07:00
tracefs	tracefs: Set the group ownership in apply_options() not parse_options()	2022-03-02 11:48:05 +01:00
ubifs	ubifs: rename_whiteout: correct old_dir size computing	2022-04-08 14:24:08 +02:00
udf	udf: Fix NULL ptr deref when converting from inline format	2022-02-01 17:27:00 +01:00
ufs	isystem: ship and use stdarg.h	2021-08-19 09:02:55 +09:00
unicode	.gitignore: prefix local generated files with a slash	2021-05-02 00:43:35 +09:00
vboxsf	vboxfs: fix broken legacy mount signature checking	2021-09-27 11:26:21 -07:00
verity	fs-verity: fix signed integer overflow with i_size near S64_MAX	2021-09-22 10:56:34 -07:00
xfs	iomap: Add done_before argument to iomap_dio_rw	2022-05-01 17:22:32 +02:00
zonefs	iomap: Add done_before argument to iomap_dio_rw	2022-05-01 17:22:32 +02:00
Kconfig	4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate	2021-09-12 10:10:21 -07:00
Kconfig.binfmt	binfmt: remove support for em86 (alpha only)	2021-07-25 22:33:03 -07:00
Makefile	4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate	2021-09-12 10:10:21 -07:00
aio.c	aio: Fix incorrect usage of eventfd_signal_allowed()	2021-12-14 10:57:22 +01:00
anon_inodes.c	…
attr.c	fs: handle circular mappings correctly	2021-11-25 09:48:46 +01:00
bad_inode.c	vfs: add rcu argument to ->get_acl() callback	2021-08-18 22:08:24 +02:00
binfmt_aout.c	binfmt: a.out: Fix bogus semicolon	2021-09-05 10:15:05 -07:00
binfmt_elf.c	coredump: Use the vma snapshot in fill_files_note	2022-04-08 14:24:18 +02:00
binfmt_elf_fdpic.c	coredump: Snapshot the vmas in do_coredump	2022-04-08 14:24:17 +02:00
binfmt_flat.c	binfmt: remove in-tree usage of MAP_EXECUTABLE	2021-06-29 10:53:50 -07:00
binfmt_misc.c	…
binfmt_script.c	…
buffer.c	mm: fs: fix lru_cache_disabled race in bh_lru	2022-04-08 14:22:54 +02:00
char_dev.c	…
compat_binfmt_elf.c	…
coredump.c	coredump: Use the vma snapshot in fill_files_note	2022-04-08 14:24:18 +02:00
d_path.c	d_path: make 'prepend()' fill up the buffer exactly on overflow	2021-09-02 10:07:29 -07:00
dax.c	New code for 5.15:	2021-08-31 11:13:35 -07:00
dcache.c	…
direct-io.c	…
drop_caches.c	fs: drop_caches: fix skipping over shadow cache inodes	2021-09-03 09:58:10 -07:00
eventfd.c	eventfd: Export eventfd_wake_count to modules	2021-09-06 07:20:56 -04:00
eventpoll.c	ARM development updates for 5.15:	2021-09-09 13:25:49 -07:00
exec.c	exec: Force single empty string when argv is empty	2022-04-08 14:23:01 +02:00
fcntl.c	Merge branch 'akpm' (patches from Andrew)	2021-09-03 10:08:28 -07:00
fhandle.c	…
file.c	fs: fix fd table size alignment properly	2022-04-08 14:23:54 +02:00
file_table.c	…
filesystems.c	fs: simplify get_filesystem_list / get_all_fs_names	2021-08-23 01:25:40 -04:00
fs-writeback.c	Merge branch 'akpm' (patches from Andrew)	2021-09-03 10:08:28 -07:00
fs_context.c	vfs: fs_context: fix up param length parsing in legacy_parse_param	2022-01-20 09:13:14 +01:00
fs_parser.c	namei: Standardize callers of filename_lookup()	2021-09-07 16:07:47 -04:00
fs_pin.c	…
fs_struct.c	…
fs_types.c	…
fsopen.c	…
init.c	…
inode.c	fs: export an inode_update_time helper	2021-11-25 09:49:08 +01:00
internal.h	block: simplify the block device syncing code	2022-04-27 14:38:50 +02:00
io-wq.c	io-wq: drop wqe lock before creating new worker	2021-12-22 09:32:51 +01:00
io-wq.h	io-wq: provide a way to limit max number of workers	2021-08-29 07:55:55 -06:00
io_uring.c	io_uring: use nospec annotation for more indexes	2022-04-20 09:34:17 +02:00
ioctl.c	New code for 5.15:	2021-08-31 11:06:32 -07:00
kernel_read_file.c	vfs: check fd has read access in kernel_read_file_from_fd()	2021-10-18 20:22:03 -10:00
libfs.c	fs: remove noop_set_page_dirty()	2021-06-29 10:53:48 -07:00
locks.c	Revert "memcg: enable accounting for file lock caches"	2021-09-07 11:21:48 -07:00
mbcache.c	…
mount.h	…
mpage.c	…
namei.c	VFS: filename_create(): fix incorrect intent.	2022-04-27 14:38:57 +02:00
namespace.c	fs/mount_setattr: always cleanup mount_kattr	2022-01-05 12:42:39 +01:00
no-block.c	…
nsfs.c	…
open.c	mm, thp: fix incorrect unmap behavior for private pages	2021-11-18 19:17:17 +01:00
pipe.c	watch_queue: Fix lack of barrier/sync/lock between post and read	2022-03-16 14:23:44 +01:00
pnode.c	…
pnode.h	…
posix_acl.c	ovl: enable RCU'd ->get_acl()	2021-08-18 22:08:24 +02:00
proc_namespace.c	…
read_write.c	fs: clean up after mandatory file locking support removal	2021-08-24 07:52:45 -04:00
readdir.c	…
remap_range.c	fs: remove mandatory file locking support	2021-08-23 06:15:36 -04:00
select.c	select: Fix indefinitely sleeping task in poll_schedule_timeout()	2022-01-29 10:58:25 +01:00
seq_file.c	seq_file: disallow extremely large seq buffer allocations	2021-07-19 17:18:48 -07:00
signalfd.c	signalfd: use wake_up_pollfree()	2021-12-14 10:57:15 +01:00
splice.c	…
stack.c	…
stat.c	stat: fix inconsistency between struct stat and struct compat_stat	2022-04-27 14:38:57 +02:00
statfs.c	…
super.c	vfs: make freeze_super abort when sync_filesystem returns error	2022-02-23 12:03:05 +01:00
sync.c	vfs: make sync_filesystem return errors from ->sync_fs	2022-04-27 14:38:50 +02:00
timerfd.c	timerfd: Provide timerfd_resume()	2021-08-10 17:57:22 +02:00
userfaultfd.c	userfaultfd: fix a race between writeprotect and exit_mmap()	2021-10-18 20:22:02 -10:00
utimes.c	…
xattr.c	…