WSL2-Linux-Kernel/fs
Dave Chinner 2c6e24ce1a xfs: prevent deadlock trying to cover an active log
Recent analysis of a deadlocked XFS filesystem from a kernel
crash dump indicated that the filesystem was stuck waiting for log
space. The short story of the hang on the RHEL6 kernel is this:

	- the tail of the log is pinned by an inode
	- the inode has been pushed by the xfsaild
	- the inode has been flushed to it's backing buffer and is
	  currently flush locked and hence waiting for backing
	  buffer IO to complete and remove it from the AIL
	- the backing buffer is marked for write - it is on the
	  delayed write queue
	- the inode buffer has been modified directly and logged
	  recently due to unlinked inode list modification
	- the backing buffer is pinned in memory as it is in the
	  active CIL context.
	- the xfsbufd won't start buffer writeback because it is
	  pinned
	- xfssyncd won't force the log because it sees the log as
	  needing to be covered and hence wants to issue a dummy
	  transaction to move the log covering state machine along.

Hence there is no trigger to force the CIL to the log and hence
unpin the inode buffer and therefore complete the inode IO, remove
it from the AIL and hence move the tail of the log along, allowing
transactions to start again.

Mainline kernels also have the same deadlock, though the signature
is slightly different - the inode buffer never reaches the delayed
write lists because xfs_buf_item_push() sees that it is pinned and
hence never adds it to the delayed write list that the xfsaild
flushes.

There are two possible solutions here. The first is to simply force
the log before trying to cover the log and so ensure that the CIL is
emptied before we try to reserve space for the dummy transaction in
the xfs_log_worker(). While this might work most of the time, it is
still racy and is no guarantee that we don't get stuck in
xfs_trans_reserve waiting for log space to come free. Hence it's not
the best way to solve the problem.

The second solution is to modify xfs_log_need_covered() to be aware
of the CIL. We only should be attempting to cover the log if there
is no current activity in the log - covering the log is the process
of ensuring that the head and tail in the log on disk are identical
(i.e. the log is clean and at idle). Hence, by definition, if there
are items in the CIL then the log is not at idle and so we don't
need to attempt to cover it.

When we don't need to cover the log because it is active or idle, we
issue a log force from xfs_log_worker() - if the log is idle, then
this does nothing.  However, if the log is active due to there being
items in the CIL, it will force the items in the CIL to the log and
unpin them.

In the case of the above deadlock scenario, instead of
xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to
cover the log, it will instead force the log, thereby unpinning the
inode buffer, allowing IO to be issued and complete and hence
removing the inode that was pinning the tail of the log from the
AIL. At that point, everything will start moving along again. i.e.
the xfs_log_worker turns back into a watchdog that can alleviate
deadlocks based around pinned items that prevent the tail of the log
from being moved...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17 10:56:17 -05:00
..
9p fs/9p: avoid accessing utsname after namespace has been torn down 2013-08-26 10:28:46 -05:00
adfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
affs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
afs afs: get rid of redundant ->d_name.len checks 2013-09-07 19:54:55 -04:00
autofs4 autofs4 - fix device ioctl mount lookup 2013-09-08 22:07:47 -04:00
befs [readdir] convert befs 2013-06-29 12:56:55 +04:00
bfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
btrfs Merge branch 'akpm' (patches from Andrew Morton) 2013-09-12 15:44:27 -07:00
cachefiles CacheFiles: Implement interface to check cache consistency 2013-09-06 09:17:30 +01:00
ceph ceph: use d_invalidate() to invalidate aliases 2013-09-06 12:55:29 -07:00
cifs cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache 2013-09-13 16:24:49 -05:00
coda helper for reading ->d_count 2013-07-05 18:59:33 +04:00
configfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-07-14 11:42:26 -07:00
cramfs [readdir] convert f2fs 2013-06-29 12:56:46 +04:00
debugfs debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs) 2013-07-31 12:16:31 -04:00
devpts fs: Limit sys_mount to only request filesystem modules (Part 2). 2013-03-07 01:08:55 -08:00
dlm dlm: remove signal blocking 2013-08-12 15:22:43 -05:00
ecryptfs ecryptfs: avoid ctx initialization race 2013-09-06 16:58:18 -07:00
efivarfs efivarfs: we can use simple_lookup() now 2013-07-14 17:48:35 +04:00
efs efs: iget_locked() doesn't return an ERR_PTR() 2013-08-24 12:10:22 -04:00
exofs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
exportfs exportfs: don't assume that ->iterate() won't feed us too long entries 2013-09-07 19:54:55 -04:00
ext2 truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-09-06 09:06:02 -07:00
ext4 Merge branch 'akpm' (patches from Andrew Morton) 2013-09-12 15:44:27 -07:00
f2fs f2fs: optimize gc for better performance 2013-09-05 13:50:32 +09:00
fat truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
freevxfs [readdir] convert freevxfs 2013-06-29 12:56:53 +04:00
fscache lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt 2013-09-11 15:59:36 -07:00
fuse truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
gfs2 Merge branch 'akpm' (patches from Andrew Morton) 2013-09-12 15:44:27 -07:00
hfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
hfsplus truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
hostfs um: hostfs: Fix writeback 2013-09-07 10:38:29 +02:00
hpfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
hppfs clean up scary strncpy(dst, src, strlen(src)) uses 2013-07-03 16:07:41 -07:00
hugetlbfs cope with potentially long ->d_dname() output for shmem/hugetlb 2013-08-24 12:10:17 -04:00
isofs isofs: Refuse RW mount of the filesystem instead of making it RO 2013-07-31 22:14:50 +02:00
jbd jbd: use a single printk for jbd_debug() 2013-08-09 10:49:00 +02:00
jbd2 jbd2: Fix endian mixing problems in the checksumming code 2013-08-28 14:59:58 -04:00
jffs2 [readdir] convert jffs2 2013-06-29 12:56:47 +04:00
jfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
lockd LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs 2013-08-05 15:03:46 -04:00
logfs Lots of bug fixes, cleanups and optimizations. In the bug fixes 2013-07-02 09:39:34 -07:00
minix truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
ncpfs ncpfs: fix error return code in ncp_parse_options() 2013-07-09 10:33:25 -07:00
nfs Merge git://git.kvack.org/~bcrl/aio-next 2013-09-13 10:55:58 -07:00
nfs_common
nfsd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-12 15:01:38 -07:00
nilfs2 truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
nls
notify fsnotify: update comments concerning locking scheme 2013-07-09 10:33:20 -07:00
ntfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
ocfs2 Merge git://git.kvack.org/~bcrl/aio-next 2013-09-13 10:55:58 -07:00
omfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
openpromfs [readdir] convert openpromfs 2013-06-29 12:56:32 +04:00
proc thp: account anon transparent huge pages into NR_ANON_PAGES 2013-09-12 15:38:03 -07:00
pstore pstore/ram: (really) fix undefined usage of rounddown_pow_of_two 2013-08-30 15:57:01 -07:00
qnx4 [readdir] convert qnx4 2013-06-29 12:56:38 +04:00
qnx6 [readdir] convert qnx6 2013-06-29 12:56:39 +04:00
quota fs: convert fs shrinkers to new scan/count API 2013-09-10 18:56:31 -04:00
ramfs initmpfs: move rootfs code from fs/ramfs/ to init/ 2013-09-11 15:59:37 -07:00
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-09-06 09:06:02 -07:00
romfs [readdir] convert romfs 2013-06-29 12:56:29 +04:00
squashfs Squashfs: add corruption check for type in squashfs_readdir() 2013-09-06 04:57:54 +01:00
sysfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-07 14:36:57 -07:00
sysv truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
ubifs Just one patch which fixes the power-cut recovery testing mode. 2013-09-16 15:36:55 -04:00
udf Merge git://git.kvack.org/~bcrl/aio-next 2013-09-13 10:55:58 -07:00
ufs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
xfs xfs: prevent deadlock trying to cover an active log 2013-10-17 10:56:17 -05:00
Kconfig efivarfs: Move to fs/efivarfs 2013-04-17 13:25:09 +01:00
Kconfig.binfmt fs: make binfmt support for #! scripts modular and removable 2013-04-30 17:04:04 -07:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
aio.c aio: rcu_read_lock protection for new rcu_dereference calls 2013-09-09 12:29:35 -04:00
anon_inodes.c fs/anon_inode: Introduce a new lib function anon_inode_getfile_private() 2013-07-16 09:32:17 -04:00
attr.c
bad_inode.c [readdir] ->readdir() is gone 2013-06-29 12:57:04 +04:00
binfmt_aout.c mm: remove free_area_cache 2013-07-10 18:11:34 -07:00
binfmt_elf.c mm: remove free_area_cache 2013-07-10 18:11:34 -07:00
binfmt_elf_fdpic.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_em86.c
binfmt_flat.c new helper: read_code() 2013-04-29 15:40:23 -04:00
binfmt_misc.c binfmt_misc: reuse string_unescape_inplace() 2013-04-30 17:04:03 -07:00
binfmt_script.c
binfmt_som.c
bio-integrity.c fs/bio-integrity: fix a potential mem leak 2013-09-11 15:58:21 -07:00
bio.c Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-09-03 18:25:03 -07:00
block_dev.c a trivial writeback fix 2013-09-13 23:06:40 -04:00
buffer.c mm: vmscan: take page buffers dirty and locked state into account 2013-07-03 16:07:29 -07:00
char_dev.c
compat.c [readdir] constify ->actor 2013-06-29 12:57:05 +04:00
compat_binfmt_elf.c
compat_ioctl.c compat.c: LOOP_CLR_FD is taken care of in loop.c itself... 2013-06-29 12:46:44 +04:00
coredump.c coredump: add new %P variable in core_pattern 2013-09-11 15:59:01 -07:00
coredump.h
dcache.c vfs: fix typo in comment in recent dentry work 2013-09-15 07:11:01 -04:00
dcookies.c consolidate compat lookup_dcookie() 2013-03-03 23:00:23 -05:00
direct-io.c direct-io: Use return from cmpxchg to decide of assignment happened 2013-09-09 10:47:42 -07:00
drop_caches.c shrinker: add node awareness 2013-09-10 18:56:31 -04:00
eventfd.c
eventpoll.c epoll: add a reschedule point in ep_free() 2013-09-11 15:58:50 -07:00
exec.c exec: cleanup the error handling in search_binary_handler() 2013-09-11 15:59:09 -07:00
fcntl.c vfs: add missing check for __O_TMPFILE in fcntl_init() 2013-08-05 18:25:32 +04:00
fhandle.c
file.c don't bother with deferred freeing of fdtables 2013-05-01 17:31:42 -04:00
file_table.c fs/file_table.c:fput(): make comment more truthful 2013-09-11 15:59:01 -07:00
filesystems.c fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
fs-writeback.c a trivial writeback fix 2013-09-13 23:06:40 -04:00
fs_struct.c constify path_get/path_put and fs_struct.c stuff 2013-03-01 23:51:07 -05:00
generic_acl.c
inode.c fs: convert inode and dentry shrinking to be node aware 2013-09-10 18:56:31 -04:00
internal.h fs: convert inode and dentry shrinking to be node aware 2013-09-10 18:56:31 -04:00
ioctl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
ioprio.c
libfs.c make simple_lookup() usable for filesystems that set ->s_d_op 2013-07-14 17:43:25 +04:00
locks.c locks: move file_lock_list to a set of percpu hlist_heads and convert file_lock_lock to an lglock 2013-07-08 13:36:42 +04:00
mbcache.c fs: convert fs shrinkers to new scan/count API 2013-09-10 18:56:31 -04:00
mount.h get rid of full-hash scan on detaching vfsmounts 2013-04-09 14:12:52 -04:00
mpage.c
namei.c Add missing unlocks to error paths of mountpoint_last. 2013-09-10 18:56:29 -04:00
namespace.c initmpfs: move rootfs code from fs/ramfs/ to init/ 2013-09-11 15:59:37 -07:00
no-block.c
open.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2013-09-07 14:35:32 -07:00
pipe.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
pnode.c vfs: Fix invalid ida_remove() call 2013-05-31 15:16:33 -04:00
pnode.h vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces 2013-08-26 18:42:15 -07:00
posix_acl.c
proc_namespace.c
read_write.c aio: Kill aio_rw_vect_retry() 2013-07-30 11:53:12 -04:00
readdir.c [readdir] constify ->actor 2013-06-29 12:57:05 +04:00
select.c net: rename include/net/ll_poll.h to include/net/busy_poll.h 2013-07-10 17:08:27 -07:00
seq_file.c seq_file: add seq_list_*_percpu helpers 2013-07-08 13:36:41 +04:00
signalfd.c switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE 2013-03-03 22:58:46 -05:00
splice.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-07-03 09:10:19 -07:00
stack.c
stat.c quota: provide interface for readding allocated space into reserved space 2013-08-17 09:32:32 -04:00
statfs.c
super.c super: fix for destroy lrus 2013-09-10 18:56:32 -04:00
sync.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
timerfd.c timerfd: Add alarm timers 2013-05-29 12:57:34 -07:00
utimes.c
xattr.c
xattr_acl.c