WSL2-Linux-Kernel/kernel
Andrea Arcangeli 1c2fb7a4c2 ksm: fix deadlock with munlock in exit_mmap
Rawhide users have reported hang at startup when cryptsetup is run: the
same problem can be simply reproduced by running a program int main() {
mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }

The problem is that exit_mmap() applies munlock_vma_pages_all() to
clean up VM_LOCKED areas, and its current implementation (stupidly)
tries to fault in absent pages, for example where PROT_NONE prevented
them being faulted in when mlocking.  Whereas the "ksm: fix oom
deadlock" patch, knowing there's a race by which KSM might try to fault
in pages after exit_mmap() had finally zapped the range, backs out of
such faults doing nothing when its ksm_test_exit() notices mm_users 0.

So revert that part of "ksm: fix oom deadlock" which moved the
ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
and remove those ksm_test_exit() checks from the page fault paths, so
allowing the munlocking to proceed without interference.

ksm_exit, if there are rmap_items still chained on this mm slot, takes
mmap_sem write side: so preventing KSM from working on an mm while
exit_mmap runs.  And KSM will bail out as soon as it notices that
mm_users is already zero, thanks to its internal ksm_test_exit checks.
So that when a task is killed by OOM killer or the user, KSM will not
indefinitely prevent it from running exit_mmap to release its memory.

This does break a part of what "ksm: fix oom deadlock" was trying to
achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
when ksmd itself has to cancel a KSM page, it is possible that the
first OOM-kill victim would be the KSM process being faulted: then its
memory won't be freed until a second victim has been selected (freeing
memory for the unmerging fault to complete).

But the OOM killer is already liable to kill a second victim once the
intended victim's p->mm goes to NULL: so there's not much point in
rejecting this KSM patch before fixing that OOM behaviour.  It is very
much more important to allow KSM users to boot up, than to haggle over
an unlikely and poorly supported OOM case.

We also intend to fix munlocking to not fault pages: at which point
this patch _could_ be reverted; though that would be controversial, so
we hope to find a better solution.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Justin M. Forbes <jforbes@redhat.com>
Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:32 -07:00
..
gcov
irq Merge branch 'irq-threaded-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:21:31 -07:00
power vt: remove power stuff from kernel/power 2009-09-19 13:13:25 -07:00
time Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-18 09:15:24 -07:00
trace Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-21 09:15:07 -07:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.preempt
Makefile Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-21 09:15:07 -07:00
acct.c
async.c
audit.c
audit.h
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
cgroup.c const: mark remaining inode_operations as const 2009-09-22 07:17:24 -07:00
cgroup_debug.c
cgroup_freezer.c
compat.c
configs.c
cpu.c Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-15 09:19:38 -07:00
cpuset.c
cred-internals.h
cred.c CRED: Allow put_cred() to cope with a NULL groups list 2009-09-15 09:10:57 +10:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
extable.c
fork.c ksm: fix deadlock with munlock in exit_mmap 2009-09-22 07:17:32 -07:00
freezer.c
futex.c Merge branch 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:16:22 -07:00
futex_compat.c
groups.c
hrtimer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-18 09:15:24 -07:00
hung_task.c
itimer.c
kallsyms.c
kexec.c
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c
kmod.c Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:24:03 -07:00
kprobes.c
ksysfs.c
kthread.c
latencytop.c
lockdep.c
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
module.c tracing: Remove markers 2009-09-18 21:22:08 +02:00
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c
ns_cgroup.c
nsproxy.c
panic.c
params.c
perf_event.c perf: Tidy up after the big rename 2009-09-21 14:34:11 +02:00
pid.c
pid_namespace.c
pm_qos_params.c
posix-cpu-timers.c
posix-timers.c
printk.c cleanup console_print() 2009-09-14 17:41:42 -07:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c
rcupdate.c rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
rcutorture.c rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
rcutree.c rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
rcutree.h rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
rcutree_plugin.h rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
rcutree_trace.c rcu: Fix whitespace inconsistencies 2009-09-19 08:53:22 +02:00
relay.c
res_counter.c
resource.c
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-21 09:15:07 -07:00
sched_clock.c sched_clock: Make it NMI safe 2009-09-18 20:47:30 +02:00
sched_cpupri.c
sched_cpupri.h
sched_debug.c sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_fair.c Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-21 09:06:17 -07:00
sched_features.h sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_idletask.c sched: Simplify sys_sched_rr_get_interval() system call 2009-09-21 09:53:55 +02:00
sched_rt.c sched: Simplify sys_sched_rr_get_interval() system call 2009-09-21 09:53:55 +02:00
sched_stats.h
seccomp.c
semaphore.c
signal.c
slow-work.c
smp.c
softirq.c softirq: add BLOCK_IOPOLL to softirq_to_name 2009-09-17 15:53:44 -04:00
softlockup.c
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
sys_ni.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
sysctl.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
sysctl_check.c
taskstats.c
test_kprobes.c
time.c time: Prevent 32 bit overflow with set_normalized_timespec() 2009-09-15 10:17:30 +02:00
timeconst.pl
timer.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
tracepoint.c
tsacct.c
uid16.c
up.c
user.c
user_namespace.c
utsname.c
utsname_sysctl.c
wait.c
workqueue.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:23:18 -07:00