WSL2-Linux-Kernel/fs
Mel Gorman 04f2cbe356 hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork().  At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.

We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork().  Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after.  A failure here would be very undesirable.

Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad.  The following situation is allowed to occur today.

1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
   had taken care to ensure the pages existed

This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap().  When the parent performs COW, it
will try to satisfy the allocation without using reserves.  If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure.  If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.

To summarise the new behaviour:

1. If the original mapper performs COW on a private mapping with multiple
   references, it will attempt to allocate a hugepage from the pool or
   the buddy allocator without using the existing reserves. On fail, VMAs
   mapping the same area are traversed and the page being COW'd is unmapped
   where found. It will then steal the original page as the last mapper in
   the normal way.

2. The VMAs the pages were unmapped from are flagged to note that pages
   with data no longer exist. Future no-page faults on those VMAs will
   terminate the process as otherwise it would appear that data was corrupted.
   A warning is printed to the console that this situation occured.

2. If the child performs COW first, it will attempt to satisfy the COW
   from the pool if there are enough pages or via the buddy allocator if
   overcommit is allowed and the buddy allocator can satisfy the request. If
   it fails, the child will be killed.

If the pool is large enough, existing applications will not notice that
the reserves were a factor.  Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().

[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:16 -07:00
..
9p
adfs
affs
afs
autofs
autofs4
befs
bfs
cifs
coda device create: coda: convert device_create to device_create_drvdata 2008-07-21 21:54:41 -07:00
configfs configfs: Allow ->make_item() and ->make_group() to return detailed errors. 2008-07-17 15:21:29 -07:00
cramfs
debugfs debugfs: Implement debugfs_remove_recursive() 2008-07-21 21:54:59 -07:00
devpts
dlm configfs: Allow ->make_item() and ->make_group() to return detailed errors. 2008-07-17 15:21:29 -07:00
ecryptfs
efs
exportfs
ext2
ext3
ext4
fat
freevxfs
fuse
gfs2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw 2008-07-15 10:38:46 -07:00
hfs
hfsplus
hostfs
hpfs
hppfs
hugetlbfs hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-07-24 10:47:16 -07:00
isofs
jbd
jbd2
jffs2
jfs
lockd Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux 2008-07-20 21:21:46 -07:00
minix
msdos
ncpfs
nfs Merge branch 'bkl-removal' into next 2008-07-15 18:34:58 -04:00
nfs_common
nfsd Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux 2008-07-20 21:21:46 -07:00
nls
ntfs
ocfs2 configfs: Allow ->make_item() and ->make_group() to return detailed errors. 2008-07-17 15:21:29 -07:00
openpromfs
partitions driver core: remove KOBJ_NAME_LEN define 2008-07-21 21:54:52 -07:00
proc mm/vmstat.c: proper externs 2008-07-24 10:47:14 -07:00
qnx4
ramfs
reiserfs
romfs
smbfs
sysfs driver core: Suppress sysfs warnings for device_rename(). 2008-07-21 21:55:01 -07:00
sysv
ubifs UBIFS: include to compilation 2008-07-15 17:35:24 +03:00
udf
ufs
vfat
xfs
Kconfig Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 2008-07-17 10:55:51 -07:00
Kconfig.binfmt
Makefile Merge branch 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6 2008-07-16 15:02:57 -07:00
aio.c
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf.c execve filename: document and export via auxiliary vector 2008-07-22 09:59:40 -07:00
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c
bio.c
block_dev.c
buffer.c Merge branch 'generic-ipi' into generic-ipi-for-linus 2008-07-15 21:55:59 +02:00
char_dev.c
compat.c
compat_binfmt_elf.c
compat_ioctl.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6 2008-07-19 00:30:39 -07:00
dcache.c fix soft lock up at NFS mount via per-SB LRU-list of unused dentries 2008-07-24 10:47:15 -07:00
dcookies.c
direct-io.c
dnotify.c
dquot.c
drop_caches.c
eventfd.c
eventpoll.c
exec.c mm: remove double indirection on tlb parameter to free_pgd_range() & Co 2008-07-24 10:47:15 -07:00
fcntl.c
fifo.c
file.c
file_table.c
filesystems.c
fs-writeback.c
generic_acl.c
inode.c
inotify.c
inotify_user.c
internal.h
ioctl.c
ioprio.c
libfs.c
locks.c
mbcache.c
mpage.c
namei.c
namespace.c
nfsctl.c
no-block.c
open.c
pipe.c
pnode.c
pnode.h
posix_acl.c
quota.c
quota_v1.c
quota_v2.c
read_write.c
read_write.h
readdir.c
select.c
seq_file.c
signalfd.c
splice.c
stack.c
stat.c
super.c fix soft lock up at NFS mount via per-SB LRU-list of unused dentries 2008-07-24 10:47:15 -07:00
sync.c
timerfd.c
utimes.c
xattr.c
xattr_acl.c