WSL2-Linux-Kernel/fs
Filipe Manana d116a0b0e0 btrfs: fix race between direct IO write and fsync when using same fd
commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failure when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file's
   private points to memory allocated on the stack of the direct IO task,
   and the fsync task may use it after that stack frame is gone.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.
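
As a rough illustration only, steps 5 to 8 can be replayed
deterministically in user space. The sketch below is not btrfs code: the
replay_* names and both structs are hypothetical stand-ins, and the
interleaving is forced by calling the steps in order from a single thread
instead of racing two tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, not the btrfs types. */
struct replay_private {
    bool fsync_skip_inode_lock;
};

struct replay_file {
    struct replay_private *private_data; /* shared by every user of the fd */
    bool inode_locked;                   /* models the inode's VFS lock */
};

/* Step 5: task A locks the inode and publishes its private on the file. */
static void replay_a_begin_write(struct replay_file *f,
                                 struct replay_private *priv)
{
    f->inode_locked = true;
    priv->fsync_skip_inode_lock = true;
    f->private_data = priv;
}

/* Step 6: task B reads the shared pointer. Returns true if it skips locking. */
static bool replay_b_enter_fsync(struct replay_file *f)
{
    if (f->private_data && f->private_data->fsync_skip_inode_lock)
        return true;
    f->inode_locked = true;
    return false;
}

/* Step 7: task A resets the file's private and unlocks the inode. */
static void replay_a_end_write(struct replay_file *f)
{
    f->private_data = NULL;
    f->inode_locked = false;
}

/* Step 8: the condition that the failing assertion checks. */
static bool replay_b_lock_is_held(const struct replay_file *f)
{
    return f->inode_locked;
}

/* Replays steps 5 to 8 in order; returns whether the lock is held at step 8. */
static bool replay_race(void)
{
    struct replay_private priv = { false }; /* on task A's stack in the kernel */
    struct replay_file f = { NULL, false };
    bool b_skipped;

    replay_a_begin_write(&f, &priv);
    b_skipped = replay_b_enter_fsync(&f);
    replay_a_end_write(&f);

    assert(b_skipped); /* B trusted a flag that was meant only for task A */
    return replay_b_lock_is_held(&f);
}
```

In this model replay_race() returns false: task B reaches the point of the
lock check without the lock being held by anyone, which corresponds to the
assertion failure described in step 8.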

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem is that task B may grab the private pointer and use it
after task A has finished. Since the private was allocated on task A's
stack, this results in an invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
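
The idea behind the fix can be sketched in user space under some loud
assumptions: a thread-local variable stands in for current->journal_info,
TOY_SKIP_LOCK is a made-up marker value (the real value is not given in
this message), and the write path is assumed to call the sync routine
itself while it holds the lock. All toy_* names are hypothetical, and the
"lock" is a plain counter that only models ownership.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up marker value, an assumption for this sketch only. */
#define TOY_SKIP_LOCK ((void *)1)

/* Stand-in for current->journal_info: every thread has its own copy. */
static _Thread_local void *toy_journal_info;

struct toy_inode {
    int lock_depth;        /* 1 while the modelled lock is held, else 0 */
    int lock_acquisitions; /* how many times the lock was taken */
};

/* Not a real lock: it only tracks state so the sketch can be checked. */
static void toy_lock(struct toy_inode *inode)
{
    assert(inode->lock_depth == 0);
    inode->lock_depth = 1;
    inode->lock_acquisitions++;
}

static void toy_unlock(struct toy_inode *inode)
{
    assert(inode->lock_depth == 1);
    inode->lock_depth = 0;
}

/*
 * fsync analog. It skips locking only when the calling thread itself set
 * the marker, so another thread using the same fd can never observe it.
 * Returns true if locking was skipped.
 */
static bool toy_sync_file(struct toy_inode *inode)
{
    bool skip_lock = (toy_journal_info == TOY_SKIP_LOCK);

    if (!skip_lock)
        toy_lock(inode);
    assert(inode->lock_depth == 1); /* the lock must be held past this point */
    /* ... the work that needs the lock would go here ... */
    if (!skip_lock)
        toy_unlock(inode);
    return skip_lock;
}

/* Direct IO write analog: holds the lock and marks only its own thread. */
static bool toy_direct_write(struct toy_inode *inode)
{
    bool skipped;

    toy_lock(inode);
    assert(toy_journal_info == NULL); /* no other use of the slot here */
    toy_journal_info = TOY_SKIP_LOCK;
    skipped = toy_sync_file(inode);
    toy_journal_info = NULL;
    toy_unlock(inode);
    return skipped;
}
```

Calling toy_sync_file() directly takes and releases the lock, while the
call nested inside toy_direct_write() skips it. Because the marker lives
in per-thread storage, a concurrent fsync from a different thread would
always see NULL and lock normally, and no pointer into another task's
stack is ever shared.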

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
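           count -= ret;
           /* Advance both the buffer and the file offset after a partial write. */
           buf = (const char *)buf + ret;
           offset += ret;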
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
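           /* posix_memalign() returns the error code and does not set errno. */
           fprintf(stderr, "Failed to allocate buffer: %s\n", strerror(ret));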
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create fsync thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12 11:07:53 +02:00
9p fs/9p: drop inodes immediately on non-.L too 2024-05-17 11:50:55 +02:00
adfs
affs
afs afs: fix __afs_break_callback() / afs_drop_open_mmap() race 2024-09-04 13:23:23 +02:00
autofs autofs: fix memory leak of waitqueues in autofs_catatonic_mode 2023-09-23 11:09:54 +02:00
befs
bfs
btrfs btrfs: fix race between direct IO write and fsync when using same fd 2024-09-12 11:07:53 +02:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-03-06 14:38:50 +00:00
ceph ceph: fix incorrect kmalloc size of pagevec mempool 2024-08-19 05:45:26 +02:00
cifs cifs: Check the lease context if we actually got a lease 2024-09-12 11:07:50 +02:00
coda
configfs
cramfs
crypto
debugfs debugfs: fix automount d_fsdata usage 2024-01-25 14:52:27 -08:00
devpts
dlm dlm: fix plock lookup when using multiple lockspaces 2023-09-19 12:22:52 +02:00
ecryptfs ecryptfs: Fix buffer size for tag 66 packet 2024-06-16 13:39:16 +02:00
efivarfs efivarfs: force RO when remounting if SetVariable is not supported 2024-01-25 14:52:33 -08:00
efs
erofs erofs: apply proper VMA alignment for memory mapped files on THP 2024-03-15 10:48:15 -04:00
exfat exfat: support dynamic allocate bh for exfat_entry_set_cache 2024-03-01 13:21:56 +01:00
exportfs exportfs: use pr_debug for unreachable debug statements 2024-04-10 16:19:21 +02:00
ext2 ext2: Verify bitmap and itable block numbers before using them 2024-08-19 05:45:12 +02:00
ext4 ext4: fix possible tid_t sequence overflows 2024-09-12 11:07:48 +02:00
f2fs f2fs: fix to do sanity check in update_sit_entry 2024-09-04 13:23:26 +02:00
fat fat: fix uninitialized field in nostale filehandles 2024-04-10 16:18:35 +02:00
freevxfs
fscache
fuse fuse: use unsigned type for getxattr/listxattr size truncation 2024-09-12 11:07:44 +02:00
gfs2 gfs2: setattr_chown: Add missing initialization 2024-09-04 13:23:22 +02:00
hfs hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode() 2024-08-19 05:45:12 +02:00
hfsplus hfsplus: fix to avoid false alarm of circular locking 2024-08-19 05:44:50 +02:00
hostfs
hpfs
hugetlbfs fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super 2024-03-06 14:38:50 +00:00
iomap iomap: update ki_pos a little later in iomap_dio_complete 2023-12-08 08:48:05 +01:00
isofs isofs: handle CDs with bad root inode but good Joliet root directory 2024-04-13 13:01:44 +02:00
jbd2 jbd2: avoid memleak in jbd2_journal_write_metadata_buffer 2024-08-19 05:45:39 +02:00
jffs2 jffs2: Fix potential illegal address access in jffs2_free_inode 2024-07-18 13:07:29 +02:00
jfs jfs: Fix array-index-out-of-bounds in diFree 2024-08-19 05:45:23 +02:00
kernfs fs/kernfs/dir: obey S_ISGID 2024-02-23 08:54:51 +01:00
ksmbd ksmbd: Unlock on in ksmbd_tcp_set_interfaces() 2024-09-12 11:07:51 +02:00
lockd nfsd: stop setting ->pg_stats for unused stats 2024-09-04 13:23:30 +02:00
minix
netfs
nfs NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations 2024-09-12 11:07:50 +02:00
nfs_common
nfsd nfsd: make svc_stat per-network namespace instead of global 2024-09-04 13:23:31 +02:00
nilfs2 nilfs2: protect references to superblock parameters exposed in sysfs 2024-09-12 11:07:52 +02:00
nls fs/nls: make load_nls() take a const parameter 2023-09-19 12:22:27 +02:00
notify fsnotify: clear PARENT_WATCHED flags lazily 2024-09-12 11:07:41 +02:00
ntfs
ntfs3 fs/ntfs3: Check more cases when directory is corrupted 2024-09-12 11:07:49 +02:00
ocfs2 ocfs2: add bounds checking to ocfs2_check_dir_entry() 2024-07-27 10:46:16 +02:00
omfs
openpromfs openpromfs: finish conversion to the new mount API 2024-06-16 13:39:16 +02:00
orangefs orangefs: fix out-of-bounds fsid access 2024-07-18 13:07:29 +02:00
overlayfs ima: detect changes to the backing overlay file 2023-11-28 16:56:29 +00:00
proc sysctl: always initialize i_uid/i_gid 2024-08-19 05:45:28 +02:00
pstore pstore/zone: Add a null pointer check to the psz_kmsg_read 2024-04-13 13:01:43 +02:00
qnx4
qnx6
quota quota: Remove BUG_ON from dqget() 2024-09-04 13:23:24 +02:00
ramfs shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs 2023-07-23 13:47:33 +02:00
reiserfs reiserfs: Check the return value from __getblk() 2023-09-19 12:22:30 +02:00
romfs
smbfs_common
squashfs Squashfs: sanity check symbolic link size 2024-09-12 11:07:50 +02:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-27 17:05:28 +02:00
sysv sysv: don't call sb_bread() with pointers_lock held 2024-04-13 13:01:44 +02:00
tracefs tracefs: Add missing lockdown check to tracefs_create_dir() 2023-09-23 11:10:02 +02:00
ubifs ubifs: Set page uptodate in the correct place 2024-04-10 16:18:35 +02:00
udf udf: Avoid excessive partition lengths 2024-09-12 11:07:46 +02:00
ufs
unicode
vboxsf vboxsf: Avoid an spurious warning if load_nls_xxx() fails 2024-04-10 16:19:38 +02:00
verity fsverity: skip PKCS#7 parser when keyring is empty 2023-09-19 12:22:52 +02:00
xfs xfs: fix log recovery buffer allocation for the legacy h_size fixup 2024-08-19 05:45:49 +02:00
zonefs zonefs: Improve error handling 2024-03-01 13:21:43 +01:00
Kconfig NFSD: Remove CONFIG_NFSD_V3 2024-04-10 16:19:01 +02:00
Kconfig.binfmt
Makefile
aio.c fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion 2024-04-10 16:18:46 +02:00
anon_inodes.c
attr.c attr: block mode changes of symlinks 2023-09-23 11:10:01 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c
binfmt_elf_fdpic.c fs: binfmt_elf_efpic: don't use missing interpreter's properties 2024-09-04 13:23:24 +02:00
binfmt_flat.c binfmt_flat: Fix corruption when not offsetting data start 2024-08-19 05:45:52 +02:00
binfmt_misc.c binfmt_misc: cleanup on filesystem umount 2024-09-04 13:23:22 +02:00
binfmt_script.c
buffer.c
char_dev.c
compat_binfmt_elf.c
coredump.c
d_path.c
dax.c
dcache.c fs: better handle deep ancestor chains in is_subdir() 2024-07-27 10:46:13 +02:00
direct-io.c
drop_caches.c
eventfd.c eventfd: prevent underflow for eventfd semaphores 2023-09-19 12:22:30 +02:00
eventpoll.c epoll: be better about file lifetimes 2024-06-16 13:39:15 +02:00
exec.c exec: Fix ToCToU between perm check and set-uid/gid usage 2024-08-19 05:45:51 +02:00
fcntl.c
fhandle.c do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak 2024-03-26 18:21:14 -04:00
file.c fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE 2024-09-04 13:23:17 +02:00
file_table.c
filesystems.c
fs-writeback.c writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs 2023-11-20 11:08:13 +01:00
fs_context.c fs: avoid empty option when generating legacy mount string 2023-07-23 13:47:34 +02:00
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c vfs: Don't evict inode under the inode lru traversing context 2024-09-04 13:23:16 +02:00
internal.h nfs: use vfs setgid helper 2023-08-30 16:18:19 +02:00
ioctl.c lsm: new security_file_ioctl_compat() hook 2024-02-23 08:54:25 +01:00
kernel_read_file.c
libfs.c
locks.c filelock: Fix fcntl/close race recovery compat path 2024-07-27 10:46:17 +02:00
mbcache.c
mount.h
mpage.c
namei.c rename(): fix the locking of subdirectories 2024-02-23 08:54:26 +01:00
namespace.c fs: indicate request originates from old mount API 2024-01-25 14:52:35 -08:00
no-block.c
nsfs.c
open.c ftruncate: pass a signed offset 2024-07-05 09:14:50 +02:00
pipe.c fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() 2024-04-10 16:19:42 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
remap_range.c
select.c fs/select: rework stack allocation hack for clang 2024-03-26 18:21:15 -04:00
seq_file.c
signalfd.c
splice.c
stack.c
stat.c
statfs.c statfs: enforce statfs[64] structure initialization 2023-05-24 17:36:54 +01:00
super.c fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT 2024-08-19 05:45:27 +02:00
sync.c
timerfd.c
userfaultfd.c Fix userfaultfd_api to return EINVAL as expected 2024-07-18 13:07:42 +02:00
utimes.c
xattr.c