Document the new function added to sync_file.c
v2: Adapt to fence_array
v3: Take in Chris Wilson suggestions
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Creates a function that given an sync file descriptor returns a
fence containing all fences in the sync_file.
v2: Comments by Daniel Vetter
- Adapt to new version of fence_collection_init()
- Hold a reference for the fence we return
v3:
- Adapt to use fput() directly
- rename to sync_file_get_fence() as we always return one fence
v4: Adapt to use fence_array
v5: set fence through fence_get()
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Create sync_file->fence to abstract the type of fence we are using for
each sync_file. If only one fence is present we use a normal struct fence
but if there is more fences to be added to the sync_file a fence_array
is created.
This change cleans up sync_file a bit. We don't need to have sync_file_cb
array anymore. Instead, as we always have one fence, only one fence
callback is registered per sync_file.
v2: Comments from Chris Wilson and Christian König
- Not using fence_ops anymore
- fence_is_array() was created to differentiate fence from fence_array
- fence_array_teardown() is now exported and used under fence_is_array()
- struct sync_file lost num_fences member
v3: Comments from Chris Wilson and Christian König
- struct sync_file lost status member in favor of fence_is_signaled()
- drop use of fence_array_teardown()
- use sizeof(*fence) to allocate only an array on fence pointers
v4: Comments from Chris Wilson
- use sizeof(*fence) to reallocate array
- fix typo in comments
- protect num_fences sum against overflows
- use array->base instead of casting the to struct fence
v5: fixes checkpatch warnings
v6: fix case where all fences are signaled.
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Add helper to check if fence is array.
v2: Comments from Chris Wilson
- remove ternary if from ops comparison
- add EXPORT_SYMBOL(fence_array_ops)
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Add support to attach a drm_bridge to imx-ldb in addition to
existing support to attach a LVDS panel.
This patch does a simple code refactoring by moving code
from for_each_child_of_node iterator to a new function named
imx_ldb_panel_ddc(). This was necessary to allow the panel ddc
code to run only when the imx_ldb is not attached to a bridge.
Cc: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Fabio Estevam <fabio.estevam@nxp.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Peter Senna Tschudin <peter.senna@collabora.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
If reserve_real_mode() fails, panicing immediately means we're
doomed. Make it safe to try more than once to allocate the
trampoline:
- Degrade a failure from panic() to pr_info(). (If we make it to
setup_real_mode() without reserving the trampoline, we'll panic
them.)
- Factor out helpers so that platform code can supply a specific
address to try.
- Warn if reserve_real_mode() is called after we're done with the
memblock allocator. If that were to happen, we would behave
unpredictably.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/876e383038f3e9971aa72fd20a4f5da05f9d193d.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no need to run setup_real_mode() as early as we run it.
Defer it to the same early_initcall that sets up the page
permissions for the real mode code.
This should be a code size reduction. More importantly, it give us
a longer window in which we can allocate the real mode trampoline.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The initialization process for trampoline_cr4_features and
mmu_cr4_features was confusing. The intent is for mmu_cr4_features
and *trampoline_cr4_features to stay in sync, but
trampoline_cr4_features is NULL until setup_real_mode() runs. The
old code synchronized *trampoline_cr4_features *twice*, once in
setup_real_mode() and once in setup_arch(). It also initialized
mmu_cr4_features in setup_real_mode(), which causes the actual value
of mmu_cr4_features to potentially depend on when setup_real_mode()
is called.
With this patch, mmu_cr4_features is initialized directly in
setup_arch(), and *trampoline_cr4_features is synchronized to
mmu_cr4_features when the trampoline is set up.
After this patch, it should be safe to defer setup_real_mode().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
reserve_bios_regions() is a quirk that reserves memory that we might
otherwise think is available. There's no need to run it so early,
and running it before we have the memory map initialized with its
non-quirky inputs makes it hard to make reserve_bios_regions() more
intelligent.
Move it right after we populate the memblock state.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
52aec3308d ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
the TLB remote shootdown is done through call function vector. That
commit didn't take care of irq_tlb_count, which a later commit:
fd0f586972 ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
... tried to fix.
The fix assumes every increase of irq_tlb_count has a corresponding
increase of irq_call_count. So the irq_call_count is always bigger than
irq_tlb_count and we could substract irq_tlb_count from irq_call_count.
Unfortunately this is not true for the smp_call_function_single() case.
The IPI is only sent if the target CPU's call_single_queue is empty when
adding a csd into it in generic_exec_single. That means if two threads
are both adding flush tlb csds to the same CPU's call_single_queue, only
one IPI is sent. In other words, the irq_call_count is incremented by 1
but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
bigger than irq_call_count and the substract will produce a very large
irq_call_count value due to overflow.
Considering that:
1) it's not worth to send more IPIs for the sake of accurate counting of
irq_call_count in generic_exec_single();
2) it's not easy to tell if the call function interrupt is for TLB
shootdown in __smp_call_function_single_interrupt().
Not to exclude TLB shootdown from call function count seems to be the
simplest fix and this patch just does that.
This bug was found by LKP's cyclic performance regression tracking recently
with the vm-scalability test suite. I have bisected to commit:
3dec0ba0be ("mm/rmap: share the i_mmap_rwsem")
This commit didn't do anything wrong but revealed the irq_call_count
problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
concurrent with multiple threads. When remap_one is try_to_unmap_one(),
then multiple threads could queue flush TLB to the same CPU but only
one IPI will be sent.
Since the commit was added in Linux v3.19, the counting problem only
shows up from v3.19 onwards.
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A recent patch changed the format of a swap PTE.
The comment explaining the format of the swap PTE is wrong about
the bits used for the swap type field. Amusingly, the ASCII art
and the patch description are correct, but the comment itself
is wrong.
As I was looking at this, I also noticed that the
SWP_OFFSET_FIRST_BIT has an off-by-one error. This does not
really hurt anything. It just wasted a bit of space in the PTE,
giving us 2^59 bytes of addressable space in our swapfiles
instead of 2^60. But, it doesn't match with the comments, and it
wastes a bit of space, so fix it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Fixes: 00839ee3b2 ("x86/mm: Move swap offset/type up in PTE to work around erratum")
Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
5743021831 ("sched/cputime: Count actually elapsed irq & softirq time")
... didn't take steal time into consideration with passing the noirqtime
kernel parameter.
As Paolo pointed out before:
| Why not? If idle=poll, for example, any time the guest is suspended (and
| thus cannot poll) does count as stolen time.
This patch fixes it by reducing steal time from idle time accounting when
the noirqtime parameter is true. The average idle time drops from 56.8%
to 54.75% for nohz idle kvm guest(noirqtime, idle=poll, four vCPUs running
on one pCPU).
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
debug_putstr() is used to output strings without using printf-like
formatting but debug_putstr(v) is defined as early_printk(v) in
arch/x86/lib/kaslr.c.
This makes clang reports the following warning when building
with -Wformat-security:
arch/x86/lib/kaslr.c:57:15: warning: format string is not a string
literal (potentially insecure) [-Wformat-security]
debug_putstr(purpose);
^~~~~~~
Fix this by using "%s" in early_printk().
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160806102039.27221-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some panels only accept bpc (bit per color) 6-bit.
But, the default bpc in mt8173 display data path is 8-bit.
If we didn't enable dithering function to convert bpc,
display cannot show the smooth grayscale image.
In mt8173, the dithering function in OD (OverDrive) and
GAMMA module, we have to config them with
connector->display_mode.bpc when CRTC initial.
1. Clear the default value at *_DITHER_5 and *_DITHER_7 register.
2. Calculate the LSB_ERR_SHIFT bits and ADD_LSHIFT bits two values.
i.e. Input bpc of OD is 10 bits, we assume the bpc of panel is 6-bit,
so, we need to set 4-bit to LSB_ERR_SHIFT and ADD_LSHIFT bits respectively.
3. Then, set the OD or GAMMA to dithering mode depends on path-1 or path-2.
Signed-off-by: Bibby Hsieh <bibby.hsieh@mediatek.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Add gamma set function to correct brightness values.
It applies arbitrary mapping curve to compensate the
incorrect transfer function of the panel.
Signed-off-by: Bibby Hsieh <bibby.hsieh@mediatek.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
This is the pxa changes for v4.8 cycle.
This is a tiny fix couple to enable changes in includes in
gpio API without breaking pxa boards.
* tag 'pxa-fixes-v4.8' of https://github.com/rjarzmik/linux:
ARM: pxa: add module.h for corgi symbol_get/symbol_put usage
ARM: pxa: add module.h for spitz symbol_get/symbol_put usage
In order to correct brightness values, we have
to support gamma funciton on MT8173. In MT8173,
we have two engines for supporting gamma function:
AAL and GAMMA. This patch add some GAMMA engine
basic function, include config, start and stop
function.
Signed-off-by: Bibby Hsieh <bibby.hsieh@mediatek.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
In order to correct brightness values, we have
to support gamma funciton on MT8173. In MT8173,
we have two engines for supporting gamma function:
AAL and GAMMA. This patch add some AAL engine
basic function, include config, start and stop
function.
Signed-off-by: Bibby Hsieh <bibby.hsieh@mediatek.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Add CK Hu and Philipp Zabel as maintainers for Mediatek DRM drivers.
Signed-off-by: CK Hu <ck.hu@mediatek.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Merge misc fixes from Andrew Morton:
"8 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock
rmap: fix compound check logic in page_remove_file_rmap
mm, rmap: fix false positive VM_BUG() in page_add_file_rmap()
mm/page_alloc.c: recalculate some of node threshold when on/offline memory
mm/page_alloc.c: fix wrong initialization when sysctl_min_unmapped_ratio changes
thp: move shmem_huge_enabled() outside of SYSFS ifdef
revert "ARM: keystone: dts: add psci command definition"
rapidio: dereferencing an error pointer
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).
Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:
=============================================
[ INFO: possible recursive locking detected ]
4.8.0-rc1-gfxbench+ #1 Tainted: G U
---------------------------------------------
rmmod/8895 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by rmmod/8895:
#0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
#1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
#2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
#3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
#4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
stack backtrace:
CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
Call Trace:
__lock_acquire+0x1646/0x1ad0
lock_acquire+0xb2/0x200
_raw_spin_lock+0x36/0x50
get_partial_node.isra.63+0x47/0x430
___slab_alloc.constprop.67+0x1a7/0x3b0
__slab_alloc.isra.64.constprop.66+0x43/0x80
kmem_cache_alloc+0x236/0x2d0
__debug_object_init+0x2de/0x400
debug_object_activate+0x109/0x1e0
__call_rcu.constprop.63+0x32/0x2f0
call_rcu+0x12/0x20
discard_slab+0x3d/0x40
__kmem_cache_shutdown+0xdb/0x320
shutdown_cache+0x19/0x60
kmem_cache_destroy+0x1ae/0x220
i915_gem_load_cleanup+0x14/0x40 [i915]
i915_driver_unload+0x151/0x180 [i915]
i915_pci_remove+0x14/0x20 [i915]
pci_device_remove+0x34/0xb0
__device_release_driver+0x95/0x140
driver_detach+0xb6/0xc0
bus_remove_driver+0x53/0xd0
driver_unregister+0x27/0x50
pci_unregister_driver+0x25/0x70
i915_exit+0x1a/0x1e2 [i915]
SyS_delete_module+0x193/0x1f0
entry_SYSCALL_64_fastpath+0x1c/0xac
Fixes: 52b4b950b5 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In page_remove_file_rmap(.) we have the following check:
VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
This is meant to check for either HugeTLB pages or THP when a compound
page is passed in.
Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
PageTransHuge(.) will always return false, provoking BUGs when one runs
the libhugetlbfs test suite.
This patch replaces PageTransHuge(), with PageHead() which will work for
both HugeTLB and THP.
Fixes: dd78fedde4 ("rmap: support file thp")
Link: http://lkml.kernel.org/r/1470838217-5889-1-git-send-email-steve.capper@arm.com
Signed-off-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Shijie <shijie.huang@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PageTransCompound() doesn't distinguish THP from from any other type of
compound pages. This can lead to false-positive VM_BUG_ON() in
page_add_file_rmap() if called on compound page from a driver[1].
I think we can exclude such cases by checking if the page belong to a
mapping.
The VM_BUG_ON_PAGE() is downgraded to VM_WARN_ON_ONCE(). This path
should not cause any harm to non-THP page, but good to know if we step
on anything else.
[1] http://lkml.kernel.org/r/c711e067-0bff-a6cb-3c37-04dfe77d2db1@redhat.com
Link: http://lkml.kernel.org/r/20160810161345.GA67522@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Laura Abbott <labbott@redhat.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of node threshold depends on number of managed pages in the node.
When memory is going on/offline, it can be changed and we need to adjust
them.
Add recalculation to appropriate places and clean-up related functions
for better maintenance.
Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before resetting min_unmapped_pages, we need to initialize
min_unmapped_pages rather than min_slab_pages.
Fixes: a5f5f91da6 (mm: convert zone_reclaim to node_reclaim)
Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The newly introduced shmem_huge_enabled() function has two definitions,
but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
build error:
mm/khugepaged.o: In function `khugepaged':
khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'
This changes the #ifdef guards around the definition to match those that
are used in the header file.
Fixes: e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 51d5d12b8f ("ARM: keystone: dts: add psci command
definition"), which was inadvertently added twice.
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Vitaly Andrianov <vitalya@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Original patch: https://lkml.org/lkml/2016/8/4/32
If riocm_ch_alloc() fails then we end up dereferencing the error
pointer.
The problem is that we're not unwinding in the reverse order from how we
allocate things so it gets confusing. I've changed this around so now
"ch" is NULL when we are done with it after we call riocm_put_channel().
That way we can check if it's NULL and avoid calling riocm_put_channel()
on it twice.
I renamed err_nodev to err_put_new_ch so that it better reflects what
the goto does.
Then because we had flipping things around, it means we don't neeed to
initialize the pointers to NULL and we can remove an if statement and
pull things in an indent level.
Link: http://lkml.kernel.org/r/20160805152406.20713-1-alexandre.bounine@idt.com
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Barry Wood <barry.wood@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unit tests crash when hotplug races the previous probe. This race
requires that the loading of the nfit_test module be terminated with
SIGTERM, and the module to be unloaded while the ars scan is still
running.
In contrast to the normal nfit driver, the unit test calls
acpi_nfit_init() twice to simulate hotplug, whereas the nominal case
goes through the acpi_nfit_notify() event handler. The
acpi_nfit_notify() path is careful to flush the previous region
registration before servicing the hotplug event. The unit test was
missing this guarantee.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810cdce7>] pwq_activate_delayed_work+0x47/0x170
[..]
Call Trace:
[<ffffffff810ce186>] pwq_dec_nr_in_flight+0x66/0xa0
[<ffffffff810ce490>] process_one_work+0x2d0/0x680
[<ffffffff810ce331>] ? process_one_work+0x171/0x680
[<ffffffff810ce88e>] worker_thread+0x4e/0x480
[<ffffffff810ce840>] ? process_one_work+0x680/0x680
[<ffffffff810ce840>] ? process_one_work+0x680/0x680
[<ffffffff810d5343>] kthread+0xf3/0x110
[<ffffffff8199846f>] ret_from_fork+0x1f/0x40
[<ffffffff810d5250>] ? kthread_create_on_node+0x230/0x230
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Even when PCI is disabled, ARCH_HISI selects HISILICON_IRQ_MBIGEN
triggerring the following config warning:
warning: (ARM64 && HISILICON_IRQ_MBIGEN) selects ARM_GIC_V3_ITS which
has unmet direct dependencies (PCI && PCI_MSI)
This patch makes selection of HISILICON_IRQ_MBIGEN conditional on PCI.
Cc: Ma Jun <majun258@huawei.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Even when PCI is disabled, ARCH_ALPINE selects ALPINE_MSI triggerring
the following config warning:
warning: (ARCH_ALPINE) selects ALPINE_MSI which has unmet direct
dependencies (PCI)
This patch makes selection of ALPINE_MSI conditional on PCI.
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Clearly QEMU is very permissive in how its PL310 model may be set up,
but the real hardware turns out to be far more particular about things
actually being correct. Fix up the DT description so that the real
thing actually boots:
- The arm,data-latency and arm,tag-latency properties need 3 cells to
be valid, otherwise we end up retaining the default 8-cycle latencies
which leads pretty quickly to lockup.
- The arm,dirty-latency property is only relevant to L210/L220, so get
rid of it.
- The cache geometry override also leads to lockup and/or general
misbehaviour. Irritatingly, the manual doesn't state the actual PL310
configuration, but based on the boardfile code and poking registers
from the Boot Monitor, it would seem to be 8 sets of 16KB ways.
With that, we can successfully boot to enjoy the fun of mismatched FPUs...
Cc: stable@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
c90bb7b enabled the high speed UARTs of the Jetson TK1. Due to a merge
quirk, wrong addresses were introduced. Fix it and use the correct
addresses.
Thierry let me know, that there is another patch (b5896f67ab in
linux-next) in preparation which removes all the '0,' prefixes of unit
addresses on Tegra124 and is planned to go upstream in 4.8, so
this patch will get reverted then.
But for the moment, this patch is necessary to fix current misbehaviour.
Fixes: c90bb7b9b9 ("ARM: tegra: Add high speed UARTs to Jetson TK1 device tree")
Signed-off-by: Ralf Ramsauer <ralf@ramses-pyramidenbau.de>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Cc: stable@vger.kernel.org # v4.7
Cc: linux-tegra@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This syscon needs to be looked up by clocks, flash protection
and other consumers.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This syscon needs to be looked up by flash protection, CLCD
display output settings and other consumers.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
For unknown reasons, we have to enable three symbols for a platform
to use a reset controller driver, otherwise we get a Kconfig
warning:
warning: (MACH_OX810SE) selects RESET_OXNAS which has unmet direct dependencies (RESET_CONTROLLER)
This selects the other two symbols for oxnas.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
The machine specific header files are exported for traditional
platforms, but not for the ones that use ARCH_MULTIPLATFORM, as
they could conflict with one another.
In case of ARM_SINGLE_ARMV7M, we end up also exporting them,
but that appears to be a mistake, and we should treat it the
same way as ARCH_MULTIPLATFORM here.
'make W=1' warns about this because it passes -Wmissing-includes
to gcc and the directories are not actually present.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Three platforms used to have header files in include/mach that
are now all gone, but the removed directories are still being
included, which leads to -Wmissing-include-dirs warnings.
This removes the extra -I flags.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add access checks to sys_oabi_epoll_wait() and sys_oabi_semtimedop().
This fixes CVE-2016-3857, a local privilege escalation under
CONFIG_OABI_COMPAT.
Cc: stable@vger.kernel.org
Reported-by: Chiachih Wu <wuchiachih@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Dave Weinstein <olorin@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"Some fixes for btrfs send/recv and fsync from Filipe and Robbie Ko.
Bonus points to Filipe for already having xfstests in place for many
of these"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
Btrfs: improve performance on fsync against new inode after rename/unlink
Btrfs: be more precise on errors when getting an inode from disk
Btrfs: send, don't bug on inconsistent snapshots
Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
Btrfs: incremental send, fix premature rmdir operations
Btrfs: incremental send, fix invalid paths for rename operations
Btrfs: send, add missing error check for calls to path_loop()
Btrfs: send, fix failure to move directories with the same name around
Btrfs: add missing check for writeback errors on fsync
A single fix for a boot crash since a commit in the merge window. Metag
was unusual in calling show_mem() early, before setup_per_cpu_pageset(),
which is no longer safe. It doesn't add much value to the log, so the
fix just drops the call.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXqyghAAoJEGwLaZPeOHZ6SgYP/0H3aDDDvM4Wq2SRKD9FvVwT
wFvyvGYf2cGCL+yVvxwD9NAcAicSngH08QnNuzoRt7K2tYdE0hV9aRVZFWECAPGN
9KttDGUZ2LsZv5ASbxfzo+feG5wm065t5+ihdmMoEIiVmizAxHzqaeJ1xXwxkUgm
r8q4nB2M2BKKu7CWcEE8a/ohYoCul3cb5G6fVZIZqxk832AcnaBlGlDSRDiCA45g
bwQ9EH11DgqHD155C+kFd7hqb2mvU+plvE0MLgWD3cOkFkIEdZW+cA13BiPKFiGy
X/yUq4B5B1Cc7Pz6fvzaHjte1lrC2FeRKCJ4rNHts4oHVA8/f0JUEgIyYdLbqvMH
QaHzaqEpzC5soatpyHAylcAvkeM6hVyWQEiJoupdHnrh16kols8DgRRVIEWK8MpD
njMR6U+fdiAPUwFijUypp4MsmUVEyyLcZ9XiDVgoCgTId6O6saCj8z7OIEHDBF0Z
9C4o0LPca4Nod5TV8R43NYeqnLGobdq9333odbdNi3jhFpg/IE62uVLe4uFhR93t
gxdn5Re6tUgcw/DolnMowXU9JOP6A0vXthXFN7sm/KreCvIijQ9ZOYDUh8YBwn5u
pr/UEhmv6SYUQWI+oN0+Srtr2wLH+7PEs21tpXZSAGpDvStJE3Sf5Z84oTnxJ4MO
0R9sB1z1oQeWZIEW71rZ
=k+4S
-----END PGP SIGNATURE-----
Merge tag 'metag-for-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag
Pull metag architecture fix from James Hogan:
"A single fix for a boot crash since a commit in the merge window.
Metag was unusual in calling show_mem() early, before setup_per_cpu_pageset(),
which is no longer safe. It doesn't add much value to the log, so the
fix just drops the call"
* tag 'metag-for-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
metag: Drop show_mem() from mem_init()
If get_maintainer is not given any filename arguments on the command line,
the standard input is read for a patch.
But checking if a VCS has a file named &STDIN is not a good idea and fails.
Verify the nominal input file is not &STDIN before checking the VCS.
Fixes: 4cad35a7ca ("get_maintainer.pl: reduce need for command-line option -f")
Reported-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Memory Protection Keys "rights register" (PKRU) is
XSAVE-managed, and is saved/restored along with the FPU state.
When kernel code accesses FPU regsisters, it does a delicate
dance with preempt. Otherwise, the context switching code can
get confused as to whether the most up-to-date state is in the
registers themselves or in the XSAVE buffer.
But, PKRU is not a normal FPU register. Using it does not
generate the normal device-not-available (#NM) exceptions which
means we can not manage it lazily, and the kernel completley
disallows using lazy mode when it is enabled.
The dance with preempt *only* occurs when managing the FPU
lazily. Since we never manage PKRU lazily, we do not have to do
the dance with preempt; we can access it directly. Doing it
this way saves a ton of complicated code (and is faster too).
Further, the XSAVES reenabling failed to patch a bit of code
in fpu__xfeature_set_state() the checked for compacted buffers.
That check caused fpu__xfeature_set_state() to silently refuse to
work when the kernel is using compacted XSAVE buffers. This
broke execute-only and future pkey_mprotect() support when using
compact XSAVE buffers.
But, removing fpu__xfeature_set_state() gets rid of this issue,
in addition to the nice cleanup and speedup.
This fixes the same thing as a fix that Sai posted:
https://lkml.org/lkml/2016/7/25/637
The fix that he posted is a much more obviously correct, but I
think we should just do this instead.
Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-Cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Building an X86_64 kernel with W=1 throws a total of 9,948 lines of warnings of
this form for both 32-bit and 64-bit syscall tables. Given that the entire rest
of the build for my config only generates 8,375 lines of output, this is a big
reduction in the warnings generated.
The warnings follow this pattern:
./arch/x86/include/generated/asm/syscalls_32.h:885:21: warning: initialized field overwritten [-Woverride-init]
__SYSCALL_I386(379, compat_sys_pwritev2, )
^
arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386'
#define __SYSCALL_I386(nr, sym, qual) [nr] = sym,
^~~
./arch/x86/include/generated/asm/syscalls_32.h:885:21: note: (near initialization for 'ia32_sys_call_table[379]')
__SYSCALL_I386(379, compat_sys_pwritev2, )
^
arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386'
#define __SYSCALL_I386(nr, sym, qual) [nr] = sym,
Since we intentionally build the syscall tables this way, ignore that one
warning in the two files.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/7464.1470021890@turing-police.cc.vt.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The latest UV kernel support panics when RHEL7 kexec's the kdump kernel
to make a dumpfile. This patch fixes the problem by turning off all UV
support if NUMA is off.
Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.577755634@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are some circumstances where the UV4 BIOS cannot provide the
correct Proximity Node values to associate with specific Sockets and
Physical Nodes. The decision was made to remove these values from BIOS
and for the kernel to get these values from the standard ACPI tables.
Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Save the uv_systab::size field before doing the iounmap()
of the struct pointer, to avoid a NULL dereference crash.
Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.250424783@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The UV4 Socket IDs are not guaranteed to equate to Node values which
can cause the GAM (Global Addressable Memory) table lookups to fail.
Fix this by using an independent index into the GAM table instead of
the Socket ID to reference the base address.
Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a subtle preemption race on UP kernels:
Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.
But then, there is this scenario on x86-UP:
TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:
-> mmput()
-> exit_mmap()
-> tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_tlbonly()
-> tlb_flush()
-> flush_tlb_mm_range()
-> __flush_tlb_up()
-> __flush_tlb()
-> __native_flush_tlb()
At this point current->mm is NULL but current->active_mm still points to
the "old" mm.
Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.
Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.
Now the fun part:
Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.
The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…
This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.
Fix this by disabling preemption across the critical code section.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>