WSL2-Linux-Kernel

History

Robert Ho 855af072b6 mm, proc: fix region lost in /proc/self/smaps		2016-10-07 18:46:30 -07:00

Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:

        https://bugzilla.redhat.com/show_bug.cgi?id=1365721

Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems. This simple test case abstracted from nvml can easily
reproduce this bug in common environment:

-------------------------- testcase.c -----------------------------

        int
        is_pmem_proc(const void *addr, size_t len)
        {
                const char *caddr = addr;

                FILE *fp;
                if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
                        printf("!/proc/self/smaps");
                        return 0;
                }

                int retval = 0;         /* assume false until proven otherwise */
                char line[PROCMAXLEN];  /* for fgets() */
                char *lo = NULL;        /* beginning of current range in smaps file */
                char *hi = NULL;        /* end of current range in smaps file */
                int needmm = 0;         /* looking for mm flag for current range */
                while (fgets(line, PROCMAXLEN, fp) != NULL) {
                        static const char vmflags[] = "VmFlags:";
                        static const char mm[] = " wr";

                        /* check for range line */
                        if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                                if (needmm) {
                                        /* last range matched, but no mm flag found */
                                        printf("never found mm flag.\n");
                                        break;
                                } else if (caddr < lo) {
                                        /* never found the range for caddr */
                                        printf("#######no match for addr %p.\n", caddr);
                                        break;
                                } else if (caddr < hi) {
                                        /* start address is in this range */
                                        size_t rangelen = (size_t)(hi - caddr);

                                        /* remember that matching has started */
                                        needmm = 1;

                                        /* calculate remaining range to search for */
                                        if (len > rangelen) {
                                                len -= rangelen;
                                                caddr += rangelen;
                                                printf("matched %zu bytes in range "
                                                        "%p-%p, %zu left over.\n",
                                                        rangelen, lo, hi, len);
                                        } else {
                                                len = 0;
                                                printf("matched all bytes in range "
                                                        "%p-%p.\n", lo, hi);
                                        }
                                }
                        } else if (needmm && strncmp(line, vmflags,
                                                sizeof(vmflags) - 1) == 0) {
                                if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                                        printf("mm flag found.\n");
                                        if (len == 0) {
                                                /* entire range matched */
                                                retval = 1;
                                                break;
                                        }
                                        needmm = 0;     /* saw what was needed */
                                } else {
                                        /* mm flag not set for some or all of range */
                                        printf("range has no mm flag.\n");
                                        break;
                                }
                        }
                }

                fclose(fp);

                printf("returning %d.\n", retval);
                return retval;
        }

        void *Addr;
        size_t Size;

        /*
         * worker -- the work each thread performs
         */
        static void *
        worker(void *arg)
        {
                int *ret = (int *)arg;
                *ret = is_pmem_proc(Addr, Size);
                return NULL;
        }

        int main(int argc, char *argv[])
        {
                if (argc < 2 || argc > 3) {
                        printf("usage: %s file [env].\n", argv[0]);
                        return -1;
                }

                int fd = open(argv[1], O_RDWR);
                struct stat stbuf;
                fstat(fd, &stbuf);

                Size = stbuf.st_size;
                Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

                close(fd);

                pthread_t threads[NTHREAD];
                int ret[NTHREAD];

                /* kick off NTHREAD threads */
                for (int i = 0; i < NTHREAD; i++)
                        pthread_create(&threads[i], NULL, worker, &ret[i]);

                /* wait for all the threads to complete */
                for (int i = 0; i < NTHREAD; i++)
                        pthread_join(threads[i], NULL);

                /* verify that all the threads return the same value */
                for (int i = 1; i < NTHREAD; i++) {
                        if (ret[0] != ret[i]) {
                                printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                                        ret[0], ret[i]);
                        }
                }

                printf("%d", ret[0]);
                return 0;
        }

It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process

It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:

        if (last_addr) {
                vma = find_vma(mm, last_addr);
                if (vma && (vma = m_next_vma(priv, vma)))
                        return vma;
        }

However, VMA will be lost if the last VMA is gone, e.g:

The process VMA list is A->B->C->D

CPU 0                                  CPU 1
read() system call
   handle VMA B
   version = B
return to userspace

                                       unmap VMA B

issue read() again to continue to get
the region info
   find_vma(version) will get VMA C
   m_next_vma(C) will get VMA D
   handle D
   !!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA. m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma. This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk. This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.

[mhocko@suse.com: changelog fixes]
Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Robert Hu <robert.hu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
..
9p	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
adfs	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
affs	get rid of 'parent' argument of ->d_compare()	2016-07-31 16:37:25 -04:00
afs	rxrpc: Rewrite the data and ack handling code	2016-09-08 11:10:12 +01:00
autofs4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2016-10-06 09:52:23 -07:00
befs	fs/befs/io.c:befs_bread(): remove unneeded initialization to NULL	2016-05-23 17:04:14 -07:00
bfs	…
btrfs	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs	2016-09-23 13:39:37 -07:00
cachefiles	cachefiles: Fix race between inactivating and culling a cache object	2016-08-03 13:33:26 -04:00
ceph	ceph: do not modify fi->frag in need_reset_readdir()	2016-09-05 14:30:35 +02:00
cifs	Move check for prefix path to within cifs_get_root()	2016-09-09 23:58:07 -05:00
coda	drop redundant ->owner initializations	2016-05-29 19:08:00 -04:00
configfs	configfs: Return -EFBIG from configfs_write_bin_file.	2016-09-16 12:58:28 +02:00
cramfs	…
crypto	fscrypto: require write access to mount to set encryption policy	2016-09-10 01:18:57 -04:00
debugfs	debugfs: propagate release() call result	2016-09-27 12:45:57 +02:00
devpts	devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts	2016-09-23 11:31:31 +02:00
dlm	dlm: fix malfunction of dlm_tool caused by debugfs changes	2016-08-26 13:22:14 -05:00
ecryptfs	ecryptfs: don't allow mmap when the lower fs doesn't support it	2016-07-08 10:35:28 -05:00
efivarfs	fs/efivarfs: Fix double kfree() in error path	2016-09-09 16:08:48 +01:00
efs	fs/efs/super.c: fix return value	2016-05-20 17:58:30 -07:00
exofs	block, fs, mm, drivers: use bio set/get op accessors	2016-06-07 13:41:38 -06:00
exportfs	…
ext2	ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings	2016-10-07 18:46:28 -07:00
ext4	ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings	2016-10-07 18:46:28 -07:00
f2fs	In this round, we've investigated how f2fs deals with errors given by our fault	2016-10-06 15:30:40 -07:00
fat	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
freevxfs	freevxfs: update Kconfig information	2016-06-13 10:20:39 +02:00
fscache	Merge branch 'd_real' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into work.misc	2016-06-30 23:34:49 -04:00
fuse	fuse: limit xattr returned size	2016-10-03 11:06:05 +02:00
gfs2	We've only got six GFS2 patches for this merge window. In patch order:	2016-10-04 13:42:13 -07:00
hfs	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
hfsplus	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
hostfs	hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common()	2016-08-04 00:18:10 +02:00
hpfs	get rid of 'parent' argument of ->d_compare()	2016-07-31 16:37:25 -04:00
hugetlbfs	mm: remove unnecessary condition in remove_inode_hugepages	2016-10-07 18:46:29 -07:00
isofs	get rid of 'parent' argument of ->d_compare()	2016-07-31 16:37:25 -04:00
jbd2	The major change this cycle is deleting ext4's copy of the file system	2016-07-26 18:35:55 -07:00
jffs2	vfs: make the string hashes salt the hash	2016-06-10 20:21:46 -07:00
jfs	jfs: Simplify code	2016-09-06 12:17:24 -05:00
kernfs	kernfs: don't depend on d_find_any_alias() when generating notifications	2016-08-31 14:48:52 +02:00
lockd	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-07-28 12:59:05 -07:00
logfs	Merge branch 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-06 09:49:02 -04:00
minix	…
ncpfs	get rid of 'parent' argument of ->d_compare()	2016-07-31 16:37:25 -04:00
nfs	mm: remove page_file_index	2016-10-07 18:46:28 -07:00
nfs_common	…
nfsd	nfsd: don't return an unhashed lock stateid after taking mutex	2016-08-12 16:10:25 -04:00
nilfs2	nilfs2: move ioctl interface and disk layout to uapi separately	2016-08-02 19:35:21 -04:00
nls	…
notify	fsnotify: clean up spinlock assertions	2016-10-07 18:46:26 -07:00
ntfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-07-28 12:59:05 -07:00
ocfs2	ocfs2: fix undefined struct variable in inode.h	2016-10-07 18:46:26 -07:00
omfs	…
openpromfs	…
orangefs	Revert "orangefs: bump minimum userspace version"	2016-10-03 15:07:36 -04:00
overlayfs	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	2016-10-04 14:48:27 -07:00
proc	mm, proc: fix region lost in /proc/self/smaps	2016-10-07 18:46:30 -07:00
pstore	ramoops: move spin_lock_init after kmalloc error checking	2016-09-08 15:01:13 -07:00
qnx4	…
qnx6	…
quota	quota: fill in Q_XGETQSTAT inode information for inactive quotas	2016-08-15 17:43:31 +02:00
ramfs	ipc/shm: fix crash if CONFIG_SHMEM is not set	2016-09-19 15:36:17 -07:00
reiserfs	reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()	2016-09-16 17:20:59 +02:00
romfs	…
squashfs	fs: have ll_rw_block users pass in op and flags separately	2016-06-07 13:41:38 -06:00
sysfs	sysfs print name of undiscoverable attribute group	2016-09-27 12:24:29 +02:00
sysv	vfs: make the string hashes salt the hash	2016-06-10 20:21:46 -07:00
tracefs	tracefs: ->d_parent is never NULL or negative...	2016-05-29 16:22:07 -04:00
ubifs	ubifs: Fix xattr generic handler usage	2016-08-23 23:02:52 +02:00
udf	udf: don't bother with full-page write optimisations in adinicb case	2016-09-19 10:47:01 +02:00
ufs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-07-28 12:59:05 -07:00
xfs	ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings	2016-10-07 18:46:28 -07:00
Kconfig	mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE	2016-10-07 18:46:29 -07:00
Kconfig.binfmt	ARM: 8594/1: enable binfmt_flat on systems with an MMU	2016-08-12 16:47:05 +01:00
Makefile	fs: introduce iomap infrastructure	2016-06-21 09:23:11 +10:00
aio.c	aio: mark AIO pseudo-fs noexec	2016-09-15 15:49:28 -07:00
anon_inodes.c	…
attr.c	vfs: Don't modify inodes with a uid or gid unknown to the vfs	2016-07-05 15:06:46 -05:00
bad_inode.c	switch ->setxattr() to passing dentry and inode separately	2016-05-27 20:09:16 -04:00
binfmt_aout.c	fs: fix binfmt_aout.c build error	2016-05-28 16:34:59 -07:00
binfmt_elf.c	x86/coredump: Use pr_reg size, rather that TIF_IA32 flag	2016-09-14 21:28:10 +02:00
binfmt_elf_fdpic.c	elf_fdpic_transfer_args_to_stack(): make it generic	2016-07-25 16:51:49 +10:00
binfmt_em86.c	fs/binfmt_em86.c: fix incompatible pointer type	2016-08-02 19:35:15 -04:00
binfmt_flat.c	binfmt_flat: allow compressed flat binary format to work on MMU systems	2016-07-28 13:29:12 +10:00
binfmt_misc.c	binfmt_misc for-linus on 20160727	2016-08-07 10:13:14 -04:00
binfmt_script.c	…
block_dev.c	fs/block_dev: fix potential NULL ptr deref in freeze_bdev()	2016-08-25 08:38:26 -06:00
buffer.c	xfs: update for 4.8-rc1	2016-07-27 09:53:35 -07:00
char_dev.c	chardev: add missing line break in pr_warn	2016-07-14 16:21:53 +09:00
compat.c	Fix a number of bugs, most notably a potential stale data exposure	2016-05-24 12:55:26 -07:00
compat_binfmt_elf.c	…
compat_ioctl.c	[media] cec: add compat32 ioctl support	2016-06-28 10:00:13 -03:00
coredump.c	coredump: fix dumping through pipes	2016-06-07 22:07:09 -04:00
dax.c	thp: reduce usage of huge zero page's atomic counter	2016-10-07 18:46:28 -07:00
dcache.c	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
dcookies.c	…
direct-io.c	direct-io: use bio set/get op accessors	2016-06-07 13:41:38 -06:00
drop_caches.c	…
eventfd.c	…
eventpoll.c	fs: poll/select/recvmmsg: use timespec64 for timeout events	2016-05-19 19:12:14 -07:00
exec.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu	2016-08-04 18:04:44 -04:00
fcntl.c	…
fhandle.c	…
file.c	…
file_table.c	…
filesystems.c	…
fs-writeback.c	mm, writeback: flush plugged IO in wakeup_flusher_threads()	2016-08-09 19:58:06 -06:00
fs_pin.c	…
fs_struct.c	…
inode.c	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-08-07 10:01:14 -04:00
internal.h	iomap: expose iomap_apply outside iomap.c	2016-09-19 11:24:49 +10:00
ioctl.c	vfs: cap dedupe request structure size at PAGE_SIZE	2016-09-15 13:29:52 -07:00
iomap.c	Merge branch 'iomap-4.9-dax' into for-next	2016-10-03 09:53:59 +11:00
libfs.c	lockless next_positive()	2016-06-20 17:11:29 -04:00
locks.c	File locking related changes for v4.9	2016-10-04 13:36:19 -07:00
mbcache.c	…
mount.h	mnt: Add a per mount namespace limit on the number of mounts	2016-09-30 12:46:48 -05:00
mpage.c	block/mm: make bdev_ops->rw_page() take a bool for read/write	2016-08-07 14:41:02 -06:00
namei.c	fs: return EPERM on immutable inode	2016-08-07 10:03:31 -04:00
namespace.c	mnt: Add a per mount namespace limit on the number of mounts	2016-09-30 12:46:48 -05:00
no-block.c	…
nsfs.c	nsfs: Simplify __ns_get_path	2016-09-22 20:06:20 -05:00
open.c	binfmt_misc for-linus on 20160727	2016-08-07 10:13:14 -04:00
pipe.c	mm: memcontrol: only mark charged pages with PageKmemcg	2016-08-09 10:14:10 -07:00
pnode.c	mnt: Add a per mount namespace limit on the number of mounts	2016-09-30 12:46:48 -05:00
pnode.h	mnt: Add a per mount namespace limit on the number of mounts	2016-09-30 12:46:48 -05:00
posix_acl.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2016-07-29 15:54:19 -07:00
proc_namespace.c	…
read_write.c	x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2	2016-07-15 10:30:26 +02:00
readdir.c	restore killability of old mutex_lock_killable(&inode->i_mutex) users	2016-05-26 00:13:25 -04:00
select.c	fs: poll/select/recvmmsg: use timespec64 for timeout events	2016-05-19 19:12:14 -07:00
seq_file.c	seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char	2016-10-07 18:46:30 -07:00
signalfd.c	…
splice.c	…
stack.c	…
stat.c	…
statfs.c	…
super.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2016-07-29 15:54:19 -07:00
sync.c	…
timerfd.c	timerfd: Reject ALARM timerfds without CAP_WAKE_ALARM	2016-06-09 23:42:38 +02:00
userfaultfd.c	mm: introduce fault_env	2016-07-26 16:19:19 -07:00
utimes.c	fs: return EPERM on immutable inode	2016-08-07 10:03:31 -04:00
xattr.c	vfs: Don't modify inodes with a uid or gid unknown to the vfs	2016-07-05 15:06:46 -05:00