WSL2-Linux-Kernel

History

Filipe Manana d116a0b0e0 btrfs: fix race between direct IO write and fsync when using same fd	2024-09-12 11:07:53 +02:00

commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failure when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
   points to memory allocated on stack by the direct IO task and it may be
   used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
           count -= ret;
           buf += ret;
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
           perror("Failed to allocate buffer");
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create writer thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
..
9p	fs/9p: drop inodes immediately on non-.L too	2024-05-17 11:50:55 +02:00
adfs	…
affs	…
afs	afs: fix __afs_break_callback() / afs_drop_open_mmap() race	2024-09-04 13:23:23 +02:00
autofs	autofs: fix memory leak of waitqueues in autofs_catatonic_mode	2023-09-23 11:09:54 +02:00
befs	…
bfs	…
btrfs	btrfs: fix race between direct IO write and fsync when using same fd	2024-09-12 11:07:53 +02:00
cachefiles	cachefiles: fix memory leak in cachefiles_add_cache()	2024-03-06 14:38:50 +00:00
ceph	ceph: fix incorrect kmalloc size of pagevec mempool	2024-08-19 05:45:26 +02:00
cifs	cifs: Check the lease context if we actually got a lease	2024-09-12 11:07:50 +02:00
coda	…
configfs	…
cramfs	…
crypto	…
debugfs	debugfs: fix automount d_fsdata usage	2024-01-25 14:52:27 -08:00
devpts	…
dlm	dlm: fix plock lookup when using multiple lockspaces	2023-09-19 12:22:52 +02:00
ecryptfs	ecryptfs: Fix buffer size for tag 66 packet	2024-06-16 13:39:16 +02:00
efivarfs	efivarfs: force RO when remounting if SetVariable is not supported	2024-01-25 14:52:33 -08:00
efs	…
erofs	erofs: apply proper VMA alignment for memory mapped files on THP	2024-03-15 10:48:15 -04:00
exfat	exfat: support dynamic allocate bh for exfat_entry_set_cache	2024-03-01 13:21:56 +01:00
exportfs	exportfs: use pr_debug for unreachable debug statements	2024-04-10 16:19:21 +02:00
ext2	ext2: Verify bitmap and itable block numbers before using them	2024-08-19 05:45:12 +02:00
ext4	ext4: fix possible tid_t sequence overflows	2024-09-12 11:07:48 +02:00
f2fs	f2fs: fix to do sanity check in update_sit_entry	2024-09-04 13:23:26 +02:00
fat	fat: fix uninitialized field in nostale filehandles	2024-04-10 16:18:35 +02:00
freevxfs	…
fscache	…
fuse	fuse: use unsigned type for getxattr/listxattr size truncation	2024-09-12 11:07:44 +02:00
gfs2	gfs2: setattr_chown: Add missing initialization	2024-09-04 13:23:22 +02:00
hfs	hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()	2024-08-19 05:45:12 +02:00
hfsplus	hfsplus: fix to avoid false alarm of circular locking	2024-08-19 05:44:50 +02:00
hostfs	…
hpfs	…
hugetlbfs	fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super	2024-03-06 14:38:50 +00:00
iomap	iomap: update ki_pos a little later in iomap_dio_complete	2023-12-08 08:48:05 +01:00
isofs	isofs: handle CDs with bad root inode but good Joliet root directory	2024-04-13 13:01:44 +02:00
jbd2	jbd2: avoid memleak in jbd2_journal_write_metadata_buffer	2024-08-19 05:45:39 +02:00
jffs2	jffs2: Fix potential illegal address access in jffs2_free_inode	2024-07-18 13:07:29 +02:00
jfs	jfs: Fix array-index-out-of-bounds in diFree	2024-08-19 05:45:23 +02:00
kernfs	fs/kernfs/dir: obey S_ISGID	2024-02-23 08:54:51 +01:00
ksmbd	ksmbd: Unlock on in ksmbd_tcp_set_interfaces()	2024-09-12 11:07:51 +02:00
lockd	nfsd: stop setting ->pg_stats for unused stats	2024-09-04 13:23:30 +02:00
minix	…
netfs	…
nfs	NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations	2024-09-12 11:07:50 +02:00
nfs_common	…
nfsd	nfsd: make svc_stat per-network namespace instead of global	2024-09-04 13:23:31 +02:00
nilfs2	nilfs2: protect references to superblock parameters exposed in sysfs	2024-09-12 11:07:52 +02:00
nls	fs/nls: make load_nls() take a const parameter	2023-09-19 12:22:27 +02:00
notify	fsnotify: clear PARENT_WATCHED flags lazily	2024-09-12 11:07:41 +02:00
ntfs	…
ntfs3	fs/ntfs3: Check more cases when directory is corrupted	2024-09-12 11:07:49 +02:00
ocfs2	ocfs2: add bounds checking to ocfs2_check_dir_entry()	2024-07-27 10:46:16 +02:00
omfs	…
openpromfs	openpromfs: finish conversion to the new mount API	2024-06-16 13:39:16 +02:00
orangefs	orangefs: fix out-of-bounds fsid access	2024-07-18 13:07:29 +02:00
overlayfs	ima: detect changes to the backing overlay file	2023-11-28 16:56:29 +00:00
proc	sysctl: always initialize i_uid/i_gid	2024-08-19 05:45:28 +02:00
pstore	pstore/zone: Add a null pointer check to the psz_kmsg_read	2024-04-13 13:01:43 +02:00
qnx4	…
qnx6	…
quota	quota: Remove BUG_ON from dqget()	2024-09-04 13:23:24 +02:00
ramfs	shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs	2023-07-23 13:47:33 +02:00
reiserfs	reiserfs: Check the return value from __getblk()	2023-09-19 12:22:30 +02:00
romfs	…
smbfs_common	…
squashfs	Squashfs: sanity check symbolic link size	2024-09-12 11:07:50 +02:00
sysfs	fs: sysfs: Fix reference leak in sysfs_break_active_protection()	2024-04-27 17:05:28 +02:00
sysv	sysv: don't call sb_bread() with pointers_lock held	2024-04-13 13:01:44 +02:00
tracefs	tracefs: Add missing lockdown check to tracefs_create_dir()	2023-09-23 11:10:02 +02:00
ubifs	ubifs: Set page uptodate in the correct place	2024-04-10 16:18:35 +02:00
udf	udf: Avoid excessive partition lengths	2024-09-12 11:07:46 +02:00
ufs	…
unicode	…
vboxsf	vboxsf: Avoid an spurious warning if load_nls_xxx() fails	2024-04-10 16:19:38 +02:00
verity	fsverity: skip PKCS#7 parser when keyring is empty	2023-09-19 12:22:52 +02:00
xfs	xfs: fix log recovery buffer allocation for the legacy h_size fixup	2024-08-19 05:45:49 +02:00
zonefs	zonefs: Improve error handling	2024-03-01 13:21:43 +01:00
Kconfig	NFSD: Remove CONFIG_NFSD_V3	2024-04-10 16:19:01 +02:00
Kconfig.binfmt	…
Makefile	…
aio.c	fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion	2024-04-10 16:18:46 +02:00
anon_inodes.c	…
attr.c	attr: block mode changes of symlinks	2023-09-23 11:10:01 +02:00
bad_inode.c	…
binfmt_aout.c	…
binfmt_elf.c	…
binfmt_elf_fdpic.c	fs: binfmt_elf_efpic: don't use missing interpreter's properties	2024-09-04 13:23:24 +02:00
binfmt_flat.c	binfmt_flat: Fix corruption when not offsetting data start	2024-08-19 05:45:52 +02:00
binfmt_misc.c	binfmt_misc: cleanup on filesystem umount	2024-09-04 13:23:22 +02:00
binfmt_script.c	…
buffer.c	…
char_dev.c	…
compat_binfmt_elf.c	…
coredump.c	…
d_path.c	…
dax.c	…
dcache.c	fs: better handle deep ancestor chains in is_subdir()	2024-07-27 10:46:13 +02:00
direct-io.c	…
drop_caches.c	…
eventfd.c	eventfd: prevent underflow for eventfd semaphores	2023-09-19 12:22:30 +02:00
eventpoll.c	epoll: be better about file lifetimes	2024-06-16 13:39:15 +02:00
exec.c	exec: Fix ToCToU between perm check and set-uid/gid usage	2024-08-19 05:45:51 +02:00
fcntl.c	…
fhandle.c	do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak	2024-03-26 18:21:14 -04:00
file.c	fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE	2024-09-04 13:23:17 +02:00
file_table.c	…
filesystems.c	…
fs-writeback.c	writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs	2023-11-20 11:08:13 +01:00
fs_context.c	fs: avoid empty option when generating legacy mount string	2023-07-23 13:47:34 +02:00
fs_parser.c	…
fs_pin.c	…
fs_struct.c	…
fs_types.c	…
fsopen.c	…
init.c	…
inode.c	vfs: Don't evict inode under the inode lru traversing context	2024-09-04 13:23:16 +02:00
internal.h	nfs: use vfs setgid helper	2023-08-30 16:18:19 +02:00
ioctl.c	lsm: new security_file_ioctl_compat() hook	2024-02-23 08:54:25 +01:00
kernel_read_file.c	…
libfs.c	…
locks.c	filelock: Fix fcntl/close race recovery compat path	2024-07-27 10:46:17 +02:00
mbcache.c	…
mount.h	…
mpage.c	…
namei.c	rename(): fix the locking of subdirectories	2024-02-23 08:54:26 +01:00
namespace.c	fs: indicate request originates from old mount API	2024-01-25 14:52:35 -08:00
no-block.c	…
nsfs.c	…
open.c	ftruncate: pass a signed offset	2024-07-05 09:14:50 +02:00
pipe.c	fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()	2024-04-10 16:19:42 +02:00
pnode.c	…
pnode.h	…
posix_acl.c	…
proc_namespace.c	…
read_write.c	…
readdir.c	…
remap_range.c	…
select.c	fs/select: rework stack allocation hack for clang	2024-03-26 18:21:15 -04:00
seq_file.c	…
signalfd.c	…
splice.c	…
stack.c	…
stat.c	…
statfs.c	statfs: enforce statfs[64] structure initialization	2023-05-24 17:36:54 +01:00
super.c	fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT	2024-08-19 05:45:27 +02:00
sync.c	…
timerfd.c	…
userfaultfd.c	Fix userfaultfd_api to return EINVAL as expected	2024-07-18 13:07:42 +02:00
utimes.c	…
xattr.c	…