Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by
Johannes in commit f371763a79 ("mm: memcontrol: fix false-positive
VM_BUG_ON() on -rt"). The comment before that patch was a tiny bit better
than it is now. While the patch claimed to fix a false-postive on -RT
this was not the case. None of the -RT folks ACKed it and it was not a
false positive report. That was a *real* problem.
This patch updates the comment that is improper because it refers to
"disabled preemption" as a consequence of that lock being taken. A
spin_lock() disables preemption, true, but in this case the code relies on
the fact that the lock _also_ disables interrupts once it is acquired.
And this is the important detail (which was checked the VM_BUG_ON()) which
needs to be pointed out. This is the hint one needs while looking at the
code. It was explained by Johannes on the list that the per-CPU variables
are protected by local_irq_save(). The BUG_ON() was helpful. This code
has been workarounded in -RT in the meantime. I wouldn't mind running
into more of those if the code in question uses *special* kind of locking
since now there is no verification (in terms of lockdep or BUG_ON()) and
therefore I bring the VM_BUG_ON() check back in.
The two functions after the comment could also have a "local_irq_save()"
dance around them in order to serialize access to the per-CPU variables.
This has been avoided because the interrupts should be off.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change fills devm_gen_pool_create()/gen_pool_get() "name" argument
stub with contents and extends of_gen_pool_get() functionality on this
basis.
If there is no associated platform device with a device node passed to
of_gen_pool_get(), the function attempts to get a label property or device
node name (= repeats MTD OF partition standard) and seeks for a named
gen_pool registered by device of the parent device node.
The main idea of the change is to allow registration of independent
gen_pools under the same umbrella device, say "partitions" on "storage
device", the original functionality of one "partition" per "storage
device" is untouched.
[akpm@linux-foundation.org: fix constness in devres_find()]
[dan.carpenter@oracle.com: freeing const data pointers]
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change modifies gen_pool_get() and devm_gen_pool_create() client
interfaces adding one more argument "name" of a gen_pool object.
Due to implementation gen_pool_get() is capable to retrieve only one
gen_pool associated with a device even if multiple gen_pools are created,
fortunately right at the moment it is sufficient for the clients, hence
provide NULL as a valid argument on both producer devm_gen_pool_create()
and consumer gen_pool_get() sides.
Because only one created gen_pool per device is addressable, explicitly
add a restriction to devm_gen_pool_create() to create only one gen_pool
per device, this implies two possible error codes returned by the
function, account it on client side (only misc/sram). This completes
client side changes related to genalloc updates.
[akpm@linux-foundation.org: gen_pool_get() cleanup]
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each memblock_region has nid to indicates the Node ID of this range. For
the overlap case, memblock_add_range() inserts the lower part and leave
the upper part as indicated in the overlapped region.
If the nid of the new range differs from the overlapped region, the
information recorded is not correct.
This patch adds a WARN_ON when the nid of the new range differs from the
overlapped region.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a PTE is unmapped and it's dirty then it was writable recently. Due to
deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. This
patch defers flushing of potentially writable TLBs as long as possible.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accesssed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate
CPUs.
On small machines, this is not a significant problem but as machine gets
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that
potentially have TLB entries for pages being unmapped. When the unmapping
is complete, the full TLB is flushed on the assumption that a refill cost
is lower than flushing individual entries.
Architectures wishing to do this must give the following guarantee.
If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.
This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting. The
architecture should consider whether the cost of the full TLB flush is
higher than sending an IPI to flush each individual entry. An additional
architecture helper called flush_tlb_local is required. It's a trivial
wrapper with some accounting in the x86 case.
The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU
readers of mapped files that consume 10*RAM.
Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%)
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 581.00 611.43
System 5804.93 4111.76
Elapsed 161.03 122.12
This is showing that the readers completed 24.40% faster with 29% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test and the patched kernel was interrupts 180K times per second.
The impact is lower on a single socket machine.
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%)
4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 58.09 57.64
System 111.82 76.56
Elapsed 27.29 22.55
It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.
The patch will have no impact on workloads with no memory pressure or have
relatively few mapped pages. It will have an unpredictable impact on the
workload running on the CPU being flushed as it'll depend on how many TLB
entries need to be refilled and how long that takes. Worst case, the TLB
will be completely cleared of active entries when the target PFNs were not
resident at all.
[sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
second.
There already is a window between when a page is unmapped and when it is
TLB flushed. This series increases the window so multiple pages can be
flushed using a single IPI. This should be safe or the kernel is hosed
already.
Patch 1 simply made the rest of the series easier to write as ftrace
could identify all the senders of TLB flush IPIS.
Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
to flush the entire TLB.
Patch 3 tracks when there potentially are writable TLB entries that
need to be batched differently
Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes
The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.
This patch (of 4):
It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the
source of IPIs being transmitted for TLB flushes on x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This test allocates two virtual areas and bounces the physical memory
across the two virtual areas using only userfaultfd.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the refile in userfaultfd_read both waitqueues could look empty to
the lockless wake_userfault(). Use a seqcount to prevent this false
negative that could leave an userfault blocked.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The THP faults were not propagating the original fault address. The
latest version of the API with uffd.arg.pagefault.address is supposed to
propagate the full address through THP faults.
This was not a kernel crashing bug and it wouldn't risk to corrupt user
memory, but it would cause a SIGBUS failure because the wrong page was
being copied.
For various reasons this wasn't easily reproducible in the qemu workload,
but the strestest exposed the problem immediately.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
despite we temporarily released the mmap_sem. The fault would just be
retried by userland then. This is safe at least on x86 and powerpc (the
two archs with the syscall implemented so far).
Hint to verify for which archs this is safe: after handle_mm_fault
returns, no access to data structures protected by the mmap_sem must be
done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
is called.
This has two main benefits: signals can run with lower latency in
production (signals aren't blocked by userfaults and userfaults are
immediately repeated after signal processing) and gdb can then trivially
debug the threads blocked in this kind of userfaults coming directly from
userland.
On a side note: while gdb has a need to get signal processed, coredumps
always worked perfectly with userfaults, no matter if the userfault is
triggered by GUP a kernel copy_user or directly from userland.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UFFDIO_API was already forced before read/poll could work. This makes the
code more strict to force it also for all other ioctls.
All users would already have been required to call UFFDIO_API before
invoking other ioctls but this makes it more explicit.
This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.
There's no actual plan or need to change the API or the ioctl, the current
API already should cover fine even the non cooperative usage, but this is
just for the longer term future just in case.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two ioctl allows to either atomically copy or to map zeropages
into the virtual address space. This is used by the thread that opened
the userfaultfd to resolve the userfaults.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the rwsem starves writers it wasn't strictly a bug but lockdep
doesn't like it and this avoids depending on lowlevel implementation
details of the lock.
[akpm@linux-foundation.org: delete weird BUILD_BUG_ON()]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements mcopy_atomic and mfill_zeropage that are the lowlevel
VM methods that are invoked respectively by the UFFDIO_COPY and
UFFDIO_ZEROPAGE userfaultfd commands.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows to select the userfaultfd during configuration to build it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes read O(1) and poll that was already O(1) becomes lockless.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an optimization but it's a userland visible one and it affects
the API.
The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be waken by a different thread, before read(ufd) comes
around. This in short means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.
userfaults won't wait in "pending" state to be read anymore and any
UFFDIO_WAKE or similar operations that has the objective of waking
userfaults after their resolution, will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.
The behavior of poll() becomes not standard, but this obviates the
need of "spurious" UFFDIO_WAKE and it lets the userland threads to
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.
This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE accounts for 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I had requests to return the full address (not the page aligned one) to
userland.
It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and it
actually skip resolving the fault depending on a page offset. There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated). Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).
Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well. Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.
The only downside is that the read() syscall will read 32bytes instead of
8bytes but that's not going to be measurable overhead.
The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure. If more will be needed a bump of UFFD_API will be
required.
[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is (seems to be) the minimal thing that is required to unblock
standard uffd usage from the non-cooperative one. Now more bits can be
added to the features field indicating e.g. UFFD_FEATURE_FORK and others
needed for the latter use-case.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once an userfaultfd has been created and certain region of the process
virtual address space have been registered into it, the thread responsible
for doing the memory externalization can manage the page faults in
userland by talking to the kernel using the userfaultfd protocol.
poll() can be used to know when there are new pending userfaults to be
read (POLLIN).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If userfaultfd is armed on a certain vma we can't "fill" the holes with
zeroes or we'll break the userland on demand paging. The holes if the
userfault is armed, are really missing information (not zeroes) that the
userland has to load from network or elsewhere.
The same issue happens for wrprotected ptes that we can't just convert
into a single writable pmd_trans_huge.
We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved but it'd be much more complex code for a tiny corner
case.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware about so that we can merge vmas back like they were
originally before arming the userfaultfd on some memory range.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (so if the
vma->vm_flags had VM_UFFD_MISSING set).
handle_userfault() then takes care of blocking the page fault and
delivering it to userland.
The fault flags must also be passed as parameter so the "read|write"
kind of fault can be passed to userland.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two flags gets set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults or both). If neither flags is
set it means the userfaultfd is not armed on the vma.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the vm_userfaultfd_ctx to the vm_area_struct.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
of the current hardcoded 1 (that would wake just the first waitqueue in
the head list).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the latest userfaultfd patchset. The postcopy live migration
feature on the qemu side is mostly ready to be merged and it entirely
depends on the userfaultfd syscall to be merged as well. So it'd be great
if this patchset could be reviewed for merging in -mm.
Userfaults allow to implement on demand paging from userland and more
generally they allow userland to more efficiently take control of the
behavior of page faults than what was available before (PROT_NONE +
SIGSEGV trap).
The use cases are:
1) KVM postcopy live migration (one form of cloud memory
externalization).
KVM postcopy live migration is the primary driver of this work:
http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html
2) postcopy live migration of binaries inside linux containers:
http://thread.gmane.org/gmane.linux.kernel.mm/132662
3) KVM postcopy live snapshotting (allowing to limit/throttle the
memory usage, unlike fork would, plus the avoidance of fork
overhead in the first place).
While the wrprotect tracking is not implemented yet, the syscall API is
already contemplating the wrprotect fault tracking and it's generic enough
to allow its later implementation in a backwards compatible fashion.
4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
should be extended to work also on tmpfs and then the
uffdio_register.ioctls will notify userland that UFFDIO_COPY is
available even when the registered virtual memory range is tmpfs
backed.
5) alternate mechanism to notify web browsers or apps on embedded
devices that volatile pages have been reclaimed. This basically
avoids the need to run a syscall before the app can access with the
CPU the virtual regions marked volatile. This depends on point 4)
to be fulfilled first, as volatile pages happily apply to tmpfs.
Even though there wasn't a real use case requesting it yet, it also
allows to implement distributed shared memory in a way that readonly
shared mappings can exist simultaneously in different hosts and they
can be become exclusive at the first wrprotect fault.
This patch (of 22):
Add documentation.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While debugging a networking issue, I hit a condition that triggered an
object to be freed into the wrong kmem cache, and thus triggered the
warning in cache_from_obj().
The arguments in the error message are in wrong order: the location
of the object's kmem cache is in cachep, not s.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Description is almost copied from commit fb05e7a89f ("net: don't wait
for order-3 page allocation").
I saw excessive direct memory reclaim/compaction triggered by slub. This
causes performance issues and add latency. Slub uses high-order
allocation to reduce internal fragmentation and management overhead. But,
direct memory reclaim/compaction has high overhead and the benefit of
high-order allocation can't compensate the overhead of both work.
This patch makes auxiliary high-order allocation atomic. If there is no
memory pressure and memory isn't fragmented, the alloction will still
success, so we don't sacrifice high-order allocation's benefit here. If
the atomic allocation fails, direct memory reclaim/compaction will not be
triggered, allocation fallback to low-order immediately, hence the direct
memory reclaim/compaction overhead is avoided. In the allocation failure
case, kswapd is waken up and trying to make high-order freepages, so
allocation could success next time.
Following is the test to measure effect of this patch.
System: QEMU, CPU 8, 512 MB
Mem: 25% memory is allocated at random position to make fragmentation.
Memory-hogger occupies 150 MB memory.
Workload: hackbench -g 20 -l 1000
Average result by 10 runs (Base va Patched)
elapsed_time(s): 4.3468 vs 2.9838
compact_stall: 461.7 vs 73.6
pgmigrate_success: 28315.9 vs 7256.1
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysfs_slab_add() shouldn't call kobject_put at error path: this puts last
reference of kmem-cache kobject and frees it. Kmem cache will be freed
second time at error path in kmem_cache_create().
For example this happens when slub debug was enabled in runtime and
somebody creates new kmem cache:
# echo 1 | tee /sys/kernel/slab/*/sanity_checks
# modprobe configfs
"configfs_dir_cache" cannot be merged because existing slab have debug and
cannot create new slab because unique name ":t-0000096" already taken.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Initializing a new slab can introduce rather large latencies because most
of the initialization runs always with interrupts disabled.
There is no point in doing so. The newly allocated slab is not visible
yet, so there is no reason to protect it against concurrent alloc/free.
Move the expensive parts of the initialization into allocate_slab(), so
for all allocations with GFP_WAIT set, interrupts are enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
First piece: acceleration of retrieval of per cpu objects
If we are allocating lots of objects then it is advantageous to disable
interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
faster.
Note that we cannot do the fast operation if debugging is enabled, because
we would have to add extra code to do all the debugging checks. And it
would not be fast anyway.
Note also that the requirement of having interrupts disabled avoids having
to do processor flag operations.
Allocate as many objects as possible in the fast way and then fall back to
the generic implementation for the rest of the objects.
Measurements on CPU CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns
Bulk- fallback - this-patch
1 - 57 cycles(tsc) 14.432 ns - 48 cycles(tsc) 12.155 ns improved 15.8%
2 - 50 cycles(tsc) 12.746 ns - 37 cycles(tsc) 9.390 ns improved 26.0%
3 - 48 cycles(tsc) 12.180 ns - 33 cycles(tsc) 8.417 ns improved 31.2%
4 - 48 cycles(tsc) 12.015 ns - 32 cycles(tsc) 8.045 ns improved 33.3%
8 - 46 cycles(tsc) 11.526 ns - 30 cycles(tsc) 7.699 ns improved 34.8%
16 - 45 cycles(tsc) 11.418 ns - 32 cycles(tsc) 8.205 ns improved 28.9%
30 - 80 cycles(tsc) 20.246 ns - 73 cycles(tsc) 18.328 ns improved 8.8%
32 - 79 cycles(tsc) 19.946 ns - 72 cycles(tsc) 18.208 ns improved 8.9%
34 - 78 cycles(tsc) 19.659 ns - 71 cycles(tsc) 17.987 ns improved 9.0%
48 - 86 cycles(tsc) 21.516 ns - 82 cycles(tsc) 20.566 ns improved 4.7%
64 - 93 cycles(tsc) 23.423 ns - 89 cycles(tsc) 22.480 ns improved 4.3%
128 - 100 cycles(tsc) 25.170 ns - 99 cycles(tsc) 24.871 ns improved 1.0%
158 - 102 cycles(tsc) 25.549 ns - 101 cycles(tsc) 25.375 ns improved 1.0%
250 - 101 cycles(tsc) 25.344 ns - 100 cycles(tsc) 25.182 ns improved 1.0%
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the basic infrastructure for alloc/free operations on pointer arrays.
It includes a generic function in the common slab code that is used in
this infrastructure patch to create the unoptimized functionality for slab
bulk operations.
Allocators can then provide optimized allocation functions for situations
in which large numbers of objects are needed. These optimization may
avoid taking locks repeatedly and bypass metadata creation if all objects
in slab pages can be used to provide the objects required.
Allocators can extend the skeletons provided and add their own code to the
bulk alloc and free functions. They can keep the generic allocation and
freeing and just fall back to those if optimizations would not work (like
for example when debugging is on).
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patchset the SLUB allocator now has both bulk alloc and free
implemented.
This patchset mostly optimizes the "fastpath" where objects are available
on the per CPU fastpath page. This mostly amortize the less-heavy
none-locked cmpxchg_double used on fastpath.
The "fallback" bulking (e.g __kmem_cache_free_bulk) provides a good basis
for comparison. Measurements[1] of the fallback functions
__kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and
forced "noinline" to force a function call like slab_common.c.
Measurements on CPU CPU i7-4790K @ 4.00GHz
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
Measurements last-patch with disabled debugging:
Bulk- fallback - this-patch
1 - 57 cycles(tsc) 14.448 ns - 44 cycles(tsc) 11.236 ns improved 22.8%
2 - 51 cycles(tsc) 12.768 ns - 28 cycles(tsc) 7.019 ns improved 45.1%
3 - 48 cycles(tsc) 12.232 ns - 22 cycles(tsc) 5.526 ns improved 54.2%
4 - 48 cycles(tsc) 12.025 ns - 19 cycles(tsc) 4.786 ns improved 60.4%
8 - 46 cycles(tsc) 11.558 ns - 18 cycles(tsc) 4.572 ns improved 60.9%
16 - 45 cycles(tsc) 11.458 ns - 18 cycles(tsc) 4.658 ns improved 60.0%
30 - 45 cycles(tsc) 11.499 ns - 18 cycles(tsc) 4.568 ns improved 60.0%
32 - 79 cycles(tsc) 19.917 ns - 65 cycles(tsc) 16.454 ns improved 17.7%
34 - 78 cycles(tsc) 19.655 ns - 63 cycles(tsc) 15.932 ns improved 19.2%
48 - 68 cycles(tsc) 17.049 ns - 50 cycles(tsc) 12.506 ns improved 26.5%
64 - 80 cycles(tsc) 20.009 ns - 63 cycles(tsc) 15.929 ns improved 21.3%
128 - 94 cycles(tsc) 23.749 ns - 86 cycles(tsc) 21.583 ns improved 8.5%
158 - 97 cycles(tsc) 24.299 ns - 90 cycles(tsc) 22.552 ns improved 7.2%
250 - 102 cycles(tsc) 25.681 ns - 98 cycles(tsc) 24.589 ns improved 3.9%
Benchmarking shows impressive improvements in the "fastpath" with a small
number of objects in the working set. Once the working set increases,
resulting in activating the "slowpath" (that contains the heavier locked
cmpxchg_double) the improvement decreases.
I'm currently working on also optimizing the "slowpath" (as network stack
use-case hits this), but this patchset should provide a good foundation
for further improvements. Rest of my patch queue in this area needs some
more work, but preliminary results are good. I'm attending Netfilter
Workshop[2] next week, and I'll hopefully return working on further
improvements in this area.
This patch (of 6):
s/succedd/succeed/
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename watchdog_suspend() to lockup_detector_suspend() and
watchdog_resume() to lockup_detector_resume() to avoid confusion with the
watchdog subsystem and to be consistent with the existing name
lockup_detector_init().
Also provide comment blocks to explain the watchdog_running and
watchdog_suspended variables and their relationship.
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephane Eranian <eranian@google.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
these functions are no longer needed. If a subsystem has a need to
deactivate the watchdog temporarily, it should utilize the
watchdog_suspend() and watchdog_resume() functions.
[akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephane Eranian <eranian@google.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove update_watchdog() and restart_watchdog_hrtimer() since these
functions are no longer needed. Changes of parameters such as the sample
period are honored at the time when the watchdog threads are being
unparked.
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephane Eranian <eranian@google.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This interface can be utilized to deactivate the hard and soft lockup
detector temporarily. Callers are expected to minimize the duration of
deactivation. Multiple deactivations are allowed to occur in parallel but
should be rare in practice.
[akpm@linux-foundation.org: remove unneeded static initialization]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephane Eranian <eranian@google.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally watchdog_nmi_enable(cpu) and watchdog_nmi_disable(cpu) were
only called in watchdog thread context. However, the following commits
utilize these functions outside of watchdog thread context too.
commit 9809b18fcf
Author: Michal Hocko <mhocko@suse.cz>
Date: Tue Sep 24 15:27:30 2013 -0700
watchdog: update watchdog_thresh properly
commit b3738d2932
Author: Stephane Eranian <eranian@google.com>
Date: Mon Nov 17 20:07:03 2014 +0100
watchdog: Add watchdog enable/disable all functions
Hence, it is now possible that these functions execute concurrently with
the same 'cpu' argument. This concurrency is problematic because per-cpu
'watchdog_ev' can be accessed/modified without adequate synchronization.
The patch series aims to address the above problem. However, instead of
introducing locks to protect per-cpu 'watchdog_ev' a different approach is
taken: Invoke these functions by parking and unparking the watchdog
threads (to ensure they are always called in watchdog thread context).
static struct smp_hotplug_thread watchdog_threads = {
...
.park = watchdog_disable, // calls watchdog_nmi_disable()
.unpark = watchdog_enable, // calls watchdog_nmi_enable()
};
Both previously mentioned commits call these functions in a similar way
and thus in principle contain some duplicate code. The patch series also
avoids this duplication by providing a commonly usable mechanism.
- Patch 1/4 introduces the watchdog_{park|unpark}_threads functions that
park/unpark all watchdog threads specified in 'watchdog_cpumask'. They
are intended to be called inside of kernel/watchdog.c only.
- Patch 2/4 introduces the watchdog_{suspend|resume} functions which can
be utilized by external callers to deactivate the hard and soft lockup
detector temporarily.
- Patch 3/4 utilizes watchdog_{park|unpark}_threads to replace some code
that was introduced by commit 9809b18fcf.
- Patch 4/4 utilizes watchdog_{suspend|resume} to replace some code that
was introduced by commit b3738d2932.
A few corner cases should be mentioned here for completeness.
- kthread_park() of watchdog/N could hang if cpu N is already locked up.
However, if watchdog is enabled the lockup will be detected anyway.
- kthread_unpark() of watchdog/N could hang if cpu N got locked up after
kthread_park(). The occurrence of this scenario should be _very_ rare
in practice, in particular because it is not expected that temporary
deactivation will happen frequently, and if it happens at all it is
expected that the duration of deactivation will be short.
This patch (of 4): introduce watchdog_park_threads() and watchdog_unpark_threads()
These functions are intended to be used only from inside kernel/watchdog.c
to park/unpark all watchdog threads that are specified in
watchdog_cpumask.
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephane Eranian <eranian@google.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>