From 23aaed6659df9adfabe9c583e67a36b54e21df46 Mon Sep 17 00:00:00 2001 From: Shiraz Hashim Date: Thu, 5 Feb 2015 12:25:06 -0800 Subject: [PATCH 1/7] mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). Userspace applications get the wrong data, so the effect is like just confusing users (if the applications just display the data) or sometimes killing the processes (if the applications do something with misunderstanding virtual addresses due to the wrong data.) For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual address range at wrong index. Eventually userspace may get wrong pagemap data for a task. Corresponding to a VM_PFNMAP marked vma region, kernel may report mappings from subsequent vma regions. User space in turn may account more pages (than really are) to the task. In my case I was using procmem, procrack (Android utility) which uses pagemap interface to account RSS pages of a task. Due to this bug it was giving a wrong picture for vmas (with VM_PFNMAP set). Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas") Signed-off-by: Shiraz Hashim Acked-by: Naoya Horiguchi Cc: [3.10+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/pagewalk.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index ad83195521f2..b264bda46e1b 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -199,7 +199,10 @@ int walk_page_range(unsigned long addr, unsigned long end, */ if ((vma->vm_start <= addr) && (vma->vm_flags & VM_PFNMAP)) { - next = vma->vm_end; + if (walk->pte_hole) + err = walk->pte_hole(addr, next, walk); + if (err) + break; pgd = pgd_offset(walk->mm, next); continue; } From 21d543f225a8c828afd92988b51393b85e9e967c Mon Sep 17 00:00:00 2001 From: Kim Phillips Date: Thu, 5 Feb 2015 12:25:09 -0800 Subject: [PATCH 2/7] .mailmap: update Konstantin Khlebnikov's email address get_maintainer.pl returns k.khlebnikov@samsung.com via git history, for which emails get rejected: RCPT TO: 550 5.1.1 Recipient address rejected: User unknown Use his other address that passes vger's mxverify: RCPT TO: 250 2.1.5 OK ir10si13843754pbc.62 - gsmtp and add his old email address in the wrong email address field. Signed-off-by: Kim Phillips Acked-by: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index d357e1bd2a43..0d971cfb0772 100644 --- a/.mailmap +++ b/.mailmap @@ -73,6 +73,7 @@ Juha Yrjola Juha Yrjola Kay Sievers Kenneth W Chen +Konstantin Khlebnikov Koushik Kuninori Morimoto Leonid I Ananiev From 944b687491322f5395f47e58707c13abd2cfa3fd Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 5 Feb 2015 12:25:12 -0800 Subject: [PATCH 3/7] mm: export "high_memory" symbol on !MMU The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but only one of them is exported, which leads to module build errors in drivers that work fine built-in: ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined! ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined! ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined! ERROR: "high_memory" [crypto/tcrypt.ko] undefined! ERROR: "high_memory" [crypto/cts.ko] undefined! This exports the symbol to get these to work on NOMMU as well. Signed-off-by: Arnd Bergmann Cc: Kirill A. Shutemov Acked-by: Greg Ungerer Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/nommu.c b/mm/nommu.c index b51eadf6d952..28bd8c4dff6f 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -59,6 +59,7 @@ #endif void *high_memory; +EXPORT_SYMBOL(high_memory); struct page *mem_map; unsigned long max_mapnr; unsigned long highest_memmap_pfn; From f5e03a4989e80a86f8b514659dca8539132e6e09 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 5 Feb 2015 12:25:14 -0800 Subject: [PATCH 4/7] memcg, shmem: fix shmem migration to use lrucare It has been reported that 965GM might trigger VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage) in mem_cgroup_migrate when shmem wants to replace a swap cache page because of shmem_should_replace_page (the page is allocated from an inappropriate zone). shmem_replace_page expects that the oldpage is not on LRU list and calls mem_cgroup_migrate without lrucare. This is obviously incorrect because swapcache pages might be on the LRU list (e.g. swapin readahead page). Fix this by enabling lrucare for the migration in shmem_replace_page. Also clarify that lrucare should be used even if one of the pages might be on LRU list. The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even without that the migration code might leave the old page on an inappropriate memcg' LRU which is not that critical because the page would get removed with its last reference but it is still confusing. Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") Signed-off-by: Michal Hocko Reported-by: Chris Wilson Reported-by: Dave Airlie Acked-by: Hugh Dickins Acked-by: Johannes Weiner Cc: [3.17+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 +- mm/shmem.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 683b4782019b..2f6893c2f01b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5773,7 +5773,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list) * mem_cgroup_migrate - migrate a charge to another page * @oldpage: currently charged page * @newpage: page to transfer the charge to - * @lrucare: both pages might be on the LRU already + * @lrucare: either or both pages might be on the LRU already * * Migrate the charge from @oldpage to @newpage. * diff --git a/mm/shmem.c b/mm/shmem.c index 73ba1df7c8ba..993e6ba689cc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1013,7 +1013,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp, */ oldpage = newpage; } else { - mem_cgroup_migrate(oldpage, newpage, false); + mem_cgroup_migrate(oldpage, newpage, true); lru_cache_add_anon(newpage); *pagep = newpage; } From 81cca6fb1271b91f06b008976ea132bec2f14d86 Mon Sep 17 00:00:00 2001 From: Sudip Mukherjee Date: Thu, 5 Feb 2015 12:25:17 -0800 Subject: [PATCH 5/7] MAINTAINERS: remove SUPERH website The mentioned website only displays information about buying and selling domains. Signed-off-by: Sudip Mukherjee Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index aaa039dee999..8d37ce11cc56 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9241,7 +9241,6 @@ F: drivers/net/ethernet/dlink/sundance.c SUPERH L: linux-sh@vger.kernel.org -W: http://www.linux-sh.org Q: http://patchwork.kernel.org/project/linux-sh/list/ S: Orphan F: Documentation/sh/ From 7ef3ff2fea8bf5e4a21cef47ad87710a3d0fdb52 Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Thu, 5 Feb 2015 12:25:20 -0800 Subject: [PATCH 6/7] nilfs2: fix deadlock of segment constructor over I_SYNC flag Nilfs2 eventually hangs in a stress test with fsstress program. This issue was caused by the following deadlock over I_SYNC flag between nilfs_segctor_thread() and writeback_sb_inodes(): nilfs_segctor_thread() nilfs_segctor_thread_construct() nilfs_segctor_unlock() nilfs_dispose_list() iput() iput_final() evict() inode_wait_for_writeback() * wait for I_SYNC flag writeback_sb_inodes() * set I_SYNC flag on inode->i_state __writeback_single_inode() do_writepages() nilfs_writepages() nilfs_construct_dsync_segment() nilfs_segctor_sync() * wait for completion of segment constructor inode_sync_complete() * clear I_SYNC flag after __writeback_single_inode() completed writeback_sb_inodes() calls do_writepages() for dirty inodes after setting I_SYNC flag on inode->i_state. do_writepages() in turn calls nilfs_writepages(), which can run segment constructor and wait for its completion. On the other hand, segment constructor calls iput(), which can call evict() and wait for the I_SYNC flag on inode_wait_for_writeback(). Since segment constructor doesn't know when I_SYNC will be set, it cannot know whether iput() will block or not unless inode->i_nlink has a non-zero count. We can prevent evict() from being called in iput() by implementing sop->drop_inode(), but it's not preferable to leave inodes with i_nlink == 0 for long periods because it even defers file truncation and inode deallocation. So, this instead resolves the deadlock by calling iput() asynchronously with a workqueue for inodes with i_nlink == 0. Signed-off-by: Ryusuke Konishi Cc: Al Viro Tested-by: Ryusuke Konishi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/nilfs2/nilfs.h | 2 -- fs/nilfs2/segment.c | 44 +++++++++++++++++++++++++++++++++++++++----- fs/nilfs2/segment.h | 5 +++++ 3 files changed, 44 insertions(+), 7 deletions(-) diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h index 91093cd74f0d..385704027575 100644 --- a/fs/nilfs2/nilfs.h +++ b/fs/nilfs2/nilfs.h @@ -141,7 +141,6 @@ enum { * @ti_save: Backup of journal_info field of task_struct * @ti_flags: Flags * @ti_count: Nest level - * @ti_garbage: List of inode to be put when releasing semaphore */ struct nilfs_transaction_info { u32 ti_magic; @@ -150,7 +149,6 @@ struct nilfs_transaction_info { one of other filesystems has a bug. */ unsigned short ti_flags; unsigned short ti_count; - struct list_head ti_garbage; }; /* ti_magic */ diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 7ef18fc656c2..469086b9f99b 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -305,7 +305,6 @@ static void nilfs_transaction_lock(struct super_block *sb, ti->ti_count = 0; ti->ti_save = cur_ti; ti->ti_magic = NILFS_TI_MAGIC; - INIT_LIST_HEAD(&ti->ti_garbage); current->journal_info = ti; for (;;) { @@ -332,8 +331,6 @@ static void nilfs_transaction_unlock(struct super_block *sb) up_write(&nilfs->ns_segctor_sem); current->journal_info = ti->ti_save; - if (!list_empty(&ti->ti_garbage)) - nilfs_dispose_list(nilfs, &ti->ti_garbage, 0); } static void *nilfs_segctor_map_segsum_entry(struct nilfs_sc_info *sci, @@ -746,6 +743,15 @@ static void nilfs_dispose_list(struct the_nilfs *nilfs, } } +static void nilfs_iput_work_func(struct work_struct *work) +{ + struct nilfs_sc_info *sci = container_of(work, struct nilfs_sc_info, + sc_iput_work); + struct the_nilfs *nilfs = sci->sc_super->s_fs_info; + + nilfs_dispose_list(nilfs, &sci->sc_iput_queue, 0); +} + static int nilfs_test_metadata_dirty(struct the_nilfs *nilfs, struct nilfs_root *root) { @@ -1900,8 +1906,8 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, static void nilfs_segctor_drop_written_files(struct nilfs_sc_info *sci, struct the_nilfs *nilfs) { - struct nilfs_transaction_info *ti = current->journal_info; struct nilfs_inode_info *ii, *n; + int defer_iput = false; spin_lock(&nilfs->ns_inode_lock); list_for_each_entry_safe(ii, n, &sci->sc_dirty_files, i_dirty) { @@ -1912,9 +1918,24 @@ static void nilfs_segctor_drop_written_files(struct nilfs_sc_info *sci, clear_bit(NILFS_I_BUSY, &ii->i_state); brelse(ii->i_bh); ii->i_bh = NULL; - list_move_tail(&ii->i_dirty, &ti->ti_garbage); + list_del_init(&ii->i_dirty); + if (!ii->vfs_inode.i_nlink) { + /* + * Defer calling iput() to avoid a deadlock + * over I_SYNC flag for inodes with i_nlink == 0 + */ + list_add_tail(&ii->i_dirty, &sci->sc_iput_queue); + defer_iput = true; + } else { + spin_unlock(&nilfs->ns_inode_lock); + iput(&ii->vfs_inode); + spin_lock(&nilfs->ns_inode_lock); + } } spin_unlock(&nilfs->ns_inode_lock); + + if (defer_iput) + schedule_work(&sci->sc_iput_work); } /* @@ -2583,6 +2604,8 @@ static struct nilfs_sc_info *nilfs_segctor_new(struct super_block *sb, INIT_LIST_HEAD(&sci->sc_segbufs); INIT_LIST_HEAD(&sci->sc_write_logs); INIT_LIST_HEAD(&sci->sc_gc_inodes); + INIT_LIST_HEAD(&sci->sc_iput_queue); + INIT_WORK(&sci->sc_iput_work, nilfs_iput_work_func); init_timer(&sci->sc_timer); sci->sc_interval = HZ * NILFS_SC_DEFAULT_TIMEOUT; @@ -2609,6 +2632,8 @@ static void nilfs_segctor_write_out(struct nilfs_sc_info *sci) ret = nilfs_segctor_construct(sci, SC_LSEG_SR); nilfs_transaction_unlock(sci->sc_super); + flush_work(&sci->sc_iput_work); + } while (ret && retrycount-- > 0); } @@ -2633,6 +2658,9 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci) || sci->sc_seq_request != sci->sc_seq_done); spin_unlock(&sci->sc_state_lock); + if (flush_work(&sci->sc_iput_work)) + flag = true; + if (flag || !nilfs_segctor_confirm(sci)) nilfs_segctor_write_out(sci); @@ -2642,6 +2670,12 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci) nilfs_dispose_list(nilfs, &sci->sc_dirty_files, 1); } + if (!list_empty(&sci->sc_iput_queue)) { + nilfs_warning(sci->sc_super, __func__, + "iput queue is not empty\n"); + nilfs_dispose_list(nilfs, &sci->sc_iput_queue, 1); + } + WARN_ON(!list_empty(&sci->sc_segbufs)); WARN_ON(!list_empty(&sci->sc_write_logs)); diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h index 38a1d0013314..a48d6de1e02c 100644 --- a/fs/nilfs2/segment.h +++ b/fs/nilfs2/segment.h @@ -26,6 +26,7 @@ #include #include #include +#include #include #include "nilfs.h" @@ -92,6 +93,8 @@ struct nilfs_segsum_pointer { * @sc_nblk_inc: Block count of current generation * @sc_dirty_files: List of files to be written * @sc_gc_inodes: List of GC inodes having blocks to be written + * @sc_iput_queue: list of inodes for which iput should be done + * @sc_iput_work: work struct to defer iput call * @sc_freesegs: array of segment numbers to be freed * @sc_nfreesegs: number of segments on @sc_freesegs * @sc_dsync_inode: inode whose data pages are written for a sync operation @@ -135,6 +138,8 @@ struct nilfs_sc_info { struct list_head sc_dirty_files; struct list_head sc_gc_inodes; + struct list_head sc_iput_queue; + struct work_struct sc_iput_work; __u64 *sc_freesegs; size_t sc_nfreesegs; From 7b02190c27091f7b4614d9a655981ae20520d22c Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 5 Feb 2015 12:25:23 -0800 Subject: [PATCH 7/7] mm/debug_pagealloc: fix build failure on ppc and some other archs Kim Phillips reported following build failure. LD init/built-in.o mm/built-in.o: In function `free_pages_prepare': mm/page_alloc.c:770: undefined reference to `.kernel_map_pages' mm/built-in.o: In function `prep_new_page': mm/page_alloc.c:933: undefined reference to `.kernel_map_pages' mm/built-in.o: In function `map_pages': mm/compaction.c:61: undefined reference to `.kernel_map_pages' make: *** [vmlinux] Error 1 Reason for this problem is that commit 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") forgot to remove the old declaration of kernel_map_pages() for some architectures. This patch removes them to fix build failure. Reported-by: Kim Phillips Signed-off-by: Joonsoo Kim Cc: Heiko Carstens Cc: Martin Schwidefsky Cc: David Miller Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/mn10300/include/asm/cacheflush.h | 7 ------- arch/powerpc/include/asm/cacheflush.h | 7 ------- arch/s390/include/asm/cacheflush.h | 4 ---- arch/sparc/include/asm/cacheflush_64.h | 5 ----- 4 files changed, 23 deletions(-) diff --git a/arch/mn10300/include/asm/cacheflush.h b/arch/mn10300/include/asm/cacheflush.h index faed90240ded..6d6df839948f 100644 --- a/arch/mn10300/include/asm/cacheflush.h +++ b/arch/mn10300/include/asm/cacheflush.h @@ -159,13 +159,6 @@ extern void flush_icache_range(unsigned long start, unsigned long end); #define copy_from_user_page(vma, page, vaddr, dst, src, len) \ memcpy(dst, src, len) -/* - * Internal debugging function - */ -#ifdef CONFIG_DEBUG_PAGEALLOC -extern void kernel_map_pages(struct page *page, int numpages, int enable); -#endif - #endif /* __ASSEMBLY__ */ #endif /* _ASM_CACHEFLUSH_H */ diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h index 5b9312220e84..30b35fff2dea 100644 --- a/arch/powerpc/include/asm/cacheflush.h +++ b/arch/powerpc/include/asm/cacheflush.h @@ -60,13 +60,6 @@ extern void flush_dcache_phys_range(unsigned long start, unsigned long stop); #define copy_from_user_page(vma, page, vaddr, dst, src, len) \ memcpy(dst, src, len) - - -#ifdef CONFIG_DEBUG_PAGEALLOC -/* internal debugging function */ -void kernel_map_pages(struct page *page, int numpages, int enable); -#endif - #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_CACHEFLUSH_H */ diff --git a/arch/s390/include/asm/cacheflush.h b/arch/s390/include/asm/cacheflush.h index 3e20383d0921..58fae7d098cf 100644 --- a/arch/s390/include/asm/cacheflush.h +++ b/arch/s390/include/asm/cacheflush.h @@ -4,10 +4,6 @@ /* Caches aren't brain-dead on the s390. */ #include -#ifdef CONFIG_DEBUG_PAGEALLOC -void kernel_map_pages(struct page *page, int numpages, int enable); -#endif - int set_memory_ro(unsigned long addr, int numpages); int set_memory_rw(unsigned long addr, int numpages); int set_memory_nx(unsigned long addr, int numpages); diff --git a/arch/sparc/include/asm/cacheflush_64.h b/arch/sparc/include/asm/cacheflush_64.h index 38965379e350..68513c41e10d 100644 --- a/arch/sparc/include/asm/cacheflush_64.h +++ b/arch/sparc/include/asm/cacheflush_64.h @@ -74,11 +74,6 @@ void flush_ptrace_access(struct vm_area_struct *, struct page *, #define flush_cache_vmap(start, end) do { } while (0) #define flush_cache_vunmap(start, end) do { } while (0) -#ifdef CONFIG_DEBUG_PAGEALLOC -/* internal debugging function */ -void kernel_map_pages(struct page *page, int numpages, int enable); -#endif - #endif /* !__ASSEMBLY__ */ #endif /* _SPARC64_CACHEFLUSH_H */