Blindly writing to is_lazy for no reason, when the written value is
identical to the old value, makes the cacheline dirty for no reason.
Avoid making such writes to prevent cache coherency traffic for no
reason.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-7-namit@vmware.com
cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.
Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).
While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.
Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows to use a
preallocated cpumask to compute the CPUs that should be.
This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.
Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-4-namit@vmware.com
The unification of these two functions allows to use them in the updated
SMP infrastrucutre.
To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct tlb_flush_info that says which CPU initiated the
flush and act accordingly. Optimize the size of flush_tlb_info while we
are at it.
Unfortunately, this prevents us from using a constant tlb_flush_info for
arch_tlbbatch_flush(), but in a later stage we may be able to inline
tlb_flush_info into the IPI data, so it should not have an impact
eventually.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-3-namit@vmware.com
The memtype seq_file iterator allocates a buffer in the ->start and ->next
functions and frees it in the ->show function. The preferred handling for
such resources is to free them in the subsequent ->next or ->stop function
call.
Since Commit 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration
code and interface") there is no guarantee that ->show will be called
after ->next, so this function can now leak memory.
So move the freeing of the buffer to ->next and ->stop.
Link: https://lkml.kernel.org/r/161248539022.21478.13874455485854739066.stgit@noble1
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.
[elver@google.com: fix typo in test]
Link: https://lkml.kernel.org/r/X9lHQExmHGvETxY4@elver.google.com
[elver@google.com: show access type in report]
Link: https://lkml.kernel.org/r/20210111091544.3287013-2-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-9-elver@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of removing the fault handling portion of the stack trace based on
the fault handler's name, just use struct pt_regs directly.
Change kfence_handle_page_fault() to take a struct pt_regs, and plumb it
through to kfence_report_error() for out-of-bounds, use-after-free, or
invalid access errors, where pt_regs is used to generate the stack trace.
If the kernel is a DEBUG_KERNEL, also show registers for more information.
Link: https://lkml.kernel.org/r/20201105092133.2075331-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add architecture specific implementation details for KFENCE and enable
KFENCE for the x86 architecture. In particular, this implements the
required interface in <asm/kfence.h> for setting up the pool and
providing helper functions for protecting and unprotecting pages.
For x86, we need to ensure that the pool uses 4K pages, which is done
using the set_memory_4k() helper function.
[elver@google.com: add missing copyright and description header]
Link: https://lkml.kernel.org/r/20210118092159.145934-2-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-3-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Co-developed-by: Marco Elver <elver@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Another initial cleanup - more to follow - to the fault handling code.
- Other minor cleanups and corrections.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAqU0oACgkQEsHwGGHe
VUruWw//VA+/K7Ykd8tjZdmJPWdfsdqBtOrolh4hiajM6iYckTip/FdwHpeEQwM9
ff0iNMrxICG3gbQxCX6WNzPeJatYsnjtF67whfat2SEzNHSDtZDb1Bm20s2/1fbY
OurRBTEBzuYMolpEJ2XABpu7LQ+6TV3LJ6yUBungILMOjP7KvrCK0SUrWj253VDU
XljK5XBZnmYlEjPU6dlhn64Wsl/GD7AWCAeZGq47EgjH2cR6gxNmu9kYAArGbdiJ
WjF8MWE7qVwCPUTiCBv+P1CjsQawvlcUY54wtG65dBYAZvpjmN82T2ypguzAt8KT
12A38vFlBuEUAWC0rUymNouh8Q20AElpdw/odLElHkpNxbHhf/7RyZ1E00LjsFtn
MF9Gp9aSIQbfYWK+Hin9oRvqXckV08u3KtzUNeyMbdCmpyqHh6prj8JEZaxKZZUp
zCaX8Qasn+Q9zL0DO51WI9EPOwpvSpifUYHmd5RHGbQDW9DjYK4mkBCHhjVfYXd/
NcxRO5rrMLmMG+XuNPg9vuHMi2HJnClJ6odD6b80xGvBodTZxZnqnYO9tUImbYnW
pdmt73YDvakei8XY7cAdNWcsTi0kQYZGfInna6z43Ri2l+I1TZaoKGDqn7TbzNbb
9RB0lrD0tfW0PvvDbVwco0Q+8/ykIbvPkHPvjQGWioxHi6yI49s=
=uVEk
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm cleanups from Borislav Petkov:
- PTRACE_GETREGS/PTRACE_PUTREGS regset selection cleanup
- Another initial cleanup - more to follow - to the fault handling
code.
- Other minor cleanups and corrections.
* tag 'x86_mm_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/{fault,efi}: Fix and rename efi_recover_from_page_fault()
x86/fault: Don't run fixups for SMAP violations
x86/fault: Don't look for extable entries for SMEP violations
x86/fault: Rename no_context() to kernelmode_fixup_or_oops()
x86/fault: Bypass no_context() for implicit kernel faults from usermode
x86/fault: Split the OOPS code out from no_context()
x86/fault: Improve kernel-executing-user-memory handling
x86/fault: Correct a few user vs kernel checks wrt WRUSS
x86/fault: Document the locking in the fault_signal_pending() path
x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()
x86/fault: Fold mm_fault_error() into do_user_addr_fault()
x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs
x86/fault: Fix AMD erratum #91 errata fixup for user code
x86/Kconfig: Remove HPET_EMULATE_RTC depends on RTC
x86/asm: Fixup TASK_SIZE_MAX comment
x86/ptrace: Clean up PTRACE_GETREGS/PTRACE_PUTREGS regset selection
x86/vm86/32: Remove VM86_SCREEN_BITMAP support
x86: Remove definition of DEBUG
x86/entry: Remove now unused do_IRQ() declaration
x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAqT+UACgkQEsHwGGHe
VUr1hQ//d4lgbpkXAq17f/43kMZK5GJKMocdBZw+6QmUr4TYWS2PPTfXEr4FCjM1
NqxR2wo97bjbg0oHZOd0JzjxzhC6fHXwiyyFGRv3sS7WNfVPlJRZ47dNa1qPxuZX
QjFwjtWtR2YzKue5Xq7O87gf2YEWFgojWRd5cCyqNU2a0SAHSOhNij8vSswXaBZA
tD2ApIlce1de4HJxXXFBN9xAows7EqKH4OTqRld66aUqDsUfLQCdPB0aKqnTpItA
pzW6TRM392K97j2jvMjEJsHcRUVdsYvE8ICz36vk2vmpP1V3stPxOW8lfC4JV5dL
NXmhuol+CzW/3FeyW5wTtmhJRj3I5MvPl3ddf30Ex9tXSx1RT/7VU7p/zDcv81WZ
32bQFYnPeNI0SHXH+CW7p13GWDE+yTBCUy838Olmkx69K+2OBgt5B6W2BcOkiKPN
RH1NAkwxhjP53eR7CIjirvKK77/p/48/Jg2TyLVWKHcdr7QPZdFG0+FpYbQ2nKoj
q2Gy7l/uxTciu7/RTRHM3E/STVqEmrWSjIuy+dhTOJaHqjMXPyiWXWdYgH2QymUX
pkTi72uaHy5HF0DyOo5pONkJrODUBasr1iJxUKuWkR++wLJihxN5+lbTxd9Vvcop
KN5HNhLoyyQ8jYuc8OLwoHIDEskJgMqLDfNx9sLqCXfR380FfAc=
=anHE
-----END PGP SIGNATURE-----
Merge tag 'x86_seves_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV-ES fix from Borislav Petkov:
"Do not unroll string I/O for SEV-ES guests because they support it"
* tag 'x86_seves_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Do not unroll string I/O for SEV-ES guests
efi_recover_from_page_fault() doesn't recover -- it does a special EFI
mini-oops. Rename it to make it clear that it crashes.
While renaming it, I noticed a blatant bug: a page fault oops in a
different thread happening concurrently with an EFI runtime service call
would be misinterpreted as an EFI page fault. Fix that.
This isn't quite exact. The situation could be improved by using a
special CS for calls into EFI.
[ bp: Massage commit message and simplify in interrupt check. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/f43b1e80830dc78ed60ed8b0826f4f189254570c.1612924255.git.luto@kernel.org
A SMAP-violating kernel access is not a recoverable condition. Imagine
kernel code that, outside of a uaccess region, dereferences a pointer to
the user range by accident. If SMAP is on, this will reliably generate
as an intentional user access. This makes it easy for bugs to be
overlooked if code is inadequately tested both with and without SMAP.
This was discovered because BPF can generate invalid accesses to user
memory, but those warnings only got printed if SMAP was off. Make it so
that this type of error will be discovered with SMAP on as well.
[ bp: Massage commit message. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/66a02343624b1ff46f02a838c497fc05c1a871b3.1612924255.git.luto@kernel.org
Right now, the case of the kernel trying to execute from user memory
is treated more or less just like the kernel getting a page fault on a
user access. In the failure path, it checks for erratum #93, tries to
otherwise fix up the error, and then oopses.
If it manages to jump to the user address space, with or without SMEP,
it should not try to resolve the page fault. This is an error, pure and
simple. Rearrange the code so that this case is caught early, check for
erratum #93, and bail out.
[ bp: Massage commit message. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/ab8719c7afb8bd501c4eee0e36493150fbbe5f6a.1612924255.git.luto@kernel.org
In general, page fault errors for WRUSS should be just like get_user(),
etc. Fix three bugs in this area:
There is a comment that says that, if the kernel can't handle a page fault
on a user address due to OOM, the OOM-kill-and-retry logic would be
skipped. The code checked kernel *privilege*, not kernel mode, so it
missed WRUSS. This means that the kernel would malfunction if it got OOM
on a WRUSS fault -- this would be a kernel-mode, user-privilege fault, and
the OOM killer would be invoked and the handler would retry the faulting
instruction.
A failed user access from kernel while a fatal signal is pending should
fail even if the instruction in question was WRUSS.
do_sigbus() should not send SIGBUS for WRUSS -- it should handle it like
any other kernel mode failure.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/a7b7bcea730bd4069e6b7e629236bb2cf526c2fb.1612924255.git.luto@kernel.org
bad_area() and its relatives are called from many places in fault.c, and
exactly one of them wants the F00F workaround.
__bad_area_nosemaphore() no longer contains any kernel fault code, which
prepares for further cleanups.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/e9668729a48ce6754022b0a4415631e8ebdd00e7.1612924255.git.luto@kernel.org
mm_fault_error() is logically just the end of do_user_addr_fault().
Combine the functions. This makes the code easier to read.
Most of the churn here is from renaming hw_error_code to error_code in
do_user_addr_fault().
This makes no difference at all to the generated code (objdump -dr) as
compared to changing noinline to __always_inline in the definition of
mm_fault_error().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/dedc4d9c9b047e51ce38b991bd23971a28af4e7b.1612924255.git.luto@kernel.org
According to the Revision Guide for AMD Athlon™ 64 and AMD Opteron™
Processors, only early revisions of family 0xF are affected. This will
avoid unnecessarily fetching instruction bytes before sending SIGSEGV to
user programs.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/477173b7784bc28afb3e53d76ae5ef143917e8dd.1612924255.git.luto@kernel.org
The recent rework of probe_kernel_address() and its conversion to
get_kernel_nofault() inadvertently broke is_prefetch(). Before this
change, probe_kernel_address() was used as a sloppy "read user or
kernel memory" helper, but it doesn't do that any more. The new
get_kernel_nofault() reads *kernel* memory only, which completely broke
is_prefetch() for user access.
Adjust the code to the correct accessor based on access mode. The
manual address bounds check is no longer necessary, since the accessor
helpers (get_user() / get_kernel_nofault()) do the right thing all by
themselves. As a bonus, by using the correct accessor, the open-coded
address bounds check is not needed anymore.
[ bp: Massage commit message. ]
Fixes: eab0c6089b ("maccess: unify the probe kernel arch hooks")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/b91f7f92f3367d2d3a88eec3b09c6aab1b2dc8ef.1612924255.git.luto@kernel.org
Don't let KVM load when running as an SEV guest, regardless of what
CPUID says. Memory is encrypted with a key that is not accessible to
the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll
see garbage when reading the VMCB.
Technically, KVM could decrypt all memory that needs to be accessible to
the L0 and use shadow paging so that L0 does not need to shadow NPT, but
exposing such information to L0 largely defeats the purpose of running as
an SEV guest. This can always be revisited if someone comes up with a
use case for running VMs inside SEV guests.
Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM
is doomed even if the SEV guest is debuggable and the hypervisor is willing
to decrypt the VMCB. This may or may not be fixed on CPUs that have the
SVME_ADDR_CHK fix.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202212017.2486595-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Under the GHCB specification, SEV-ES guests can support string I/O.
The current #VC handler contains this support, so remove the need to
unroll kernel string I/O operations. This will reduce the number of #VC
exceptions generated as well as the number VM exits for the guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/3de04b5b638546ac75d42ba52307fe1a922173d3.1612203987.git.thomas.lendacky@amd.com
The implementation was rather buggy. It unconditionally marked PTEs
read-only, even for VM_SHARED mappings. I'm not sure whether this is
actually a problem, but it certainly seems unwise. More importantly, it
released the mmap lock before flushing the TLB, which could allow a racing
CoW operation to falsely believe that the underlying memory was not
writable.
I can't find any users at all of this mechanism, so just remove it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Stas Sergeev <stsp2@yandex.ru>
Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org
Defining DEBUG should only be done in development. So remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20210114212827.47584-1-trix@redhat.com
Commit
28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")
introduced a new location where a pmd was released, but neglected to
run the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:
c283610e44 ("x86, mm: do not leak page->ptl for pmd page tables").
This issue was hidden until recently because the failure mode is silent,
but commit:
b2b29d6d01 ("mm: account PMD tables like PTE tables")
turns the failure mode into this signature:
BUG: Bad page state in process lt-pmem-ns pfn:15943d
page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
flags: 0xaffff800000000()
raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
page dumped because: nonzero mapcount
[..]
dump_stack+0x8b/0xb0
bad_page.cold+0x63/0x94
free_pcp_prepare+0x224/0x270
free_unref_page+0x18/0xd0
pud_free_pmd_page+0x146/0x160
ioremap_pud_range+0xe3/0x350
ioremap_page_range+0x108/0x160
__ioremap_caller.constprop.0+0x174/0x2b0
? memremap+0x7a/0x110
memremap+0x7a/0x110
devm_memremap+0x53/0xa0
pmem_attach_disk+0x4ed/0x530 [nd_pmem]
? __devm_release_region+0x52/0x80
nvdimm_bus_probe+0x85/0x210 [libnvdimm]
Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure.
As for a better helper pmd_free() is close, but it is a messy fit
due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
paravirt_tlb_remove_table() instead of free_page(), so open-coded
pgtable_pmd_page_dtor() seems the best way forward for now.
Debugged together with Matthew Wilcox <willy@infradead.org>.
Fixes: 28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com
pgt_buf is used to allocate page tables on initial direct page mapping
which bootstraps the kernel into being able to allocate these before the
direct mapping makes further pages available.
INIT_PGD_PAGE_COUNT is set to 6 pages (doubled for KASLR) - 3 (PUD, PMD,
PTE) for the 1 MiB ISA mapping and 3 more for the first direct mapping
assignment in each case providing 2 MiB of address space.
This has not been updated for 5-level page tables which has an
additional P4D page table level above PUD.
In most instances, this will not have a material impact as the first
4 page levels allocated for the ISA mapping will provide sufficient
address space to encompass all further address mappings.
If the first direct mapping is within 512 GiB of the ISA mapping, only
a PMD and PTE needs to be added in the instance the kernel is using 4
KiB page tables (e.g. CONFIG_DEBUG_PAGEALLOC is enabled) and only a PMD
if the kernel can use 2 MiB pages (the first allocation is limited to
PMD_SIZE so a GiB page cannot be used there).
However, if the machine has more than 512 GiB of RAM and the kernel is
allocating 4 KiB page size, 3 further page tables are required.
If the machine has more than 256 TiB of RAM at 4 KiB or 2 MiB page size,
further 3 or 4 page tables are required respectively.
Update INIT_PGD_PAGE_COUNT to reflect this.
[ bp: Sanitize text into passive voice without ambiguous personal pronouns. ]
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/20201215205641.34096-1-lstoakes@gmail.com
Pull swiotlb update from Konrad Rzeszutek Wilk:
"A generic (but for right now engaged only with AMD SEV) mechanism to
adjust a larger size SWIOTLB based on the total memory of the SEV
guests which right now require the bounce buffer for interacting with
the outside world.
Normal knobs (swiotlb=XYZ) still work"
* 'stable/for-linus-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
For architectures that enable ARCH_HAS_SET_MEMORY having the ability to
verify that a page is mapped in the kernel direct map can be useful
regardless of hibernation.
Add RISC-V implementation of kernel_page_present(), update its forward
declarations and stubs to be a part of set_memory API and remove ugly
ifdefery in inlcude/linux/mm.h around current declarations of
kernel_page_present().
Link: https://lkml.kernel.org/r/20201109192128.960-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
never fail. With this assumption is wouldn't be safe to allow general
usage of this function.
Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.
As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.
Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Consolidate all kmap_atomic() internals into a generic implementation
which builds the base for the kmap_local() API and make the
kmap_atomic() interface wrappers which handle the disabling/enabling of
preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a mapping
is established. It has to disable migration instead to guarantee that
the virtual address of the mapped slot is the same accross preemption.
- Provide better debug facilities: guard pages and enforced utilization
of the mapping mechanics on 64bit systems when the architecture allows
it.
- Provide the new kmap_local() API which can now be used to cleanup the
kmap_atomic() usage sites all over the place. Most of the usage sites
do not require the implicit disabling of preemption and pagefaults so
the penalty on 64bit and 32bit non-highmem systems is removed and quite
some of the code can be simplified. A wholesale conversion is not
possible because some usage depends on the implicit side effects and
some need to be cleaned up because they work around these side effects.
The migrate disable side effect is only effective on highmem systems
and when enforced debugging is enabled. On 64bit and 32bit non-highmem
systems the overhead is completely avoided.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
+x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
+jlfoJhMbtx5Gg==
=n71I
-----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
(Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
=lLiH
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Another branch with a nicely negative diffstat, just the way I
like 'em:
- Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
the end (Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree"
* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/ia32_signal: Propagate __user annotation properly
x86/alternative: Update text_poke_bp() kernel-doc comment
x86/PCI: Make a kernel-doc comment a normal one
x86/asm: Drop unused RDPID macro
x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
x86/head64: Remove duplicate include
x86/mm: Declare 'start' variable where it is used
x86/head/64: Remove unused GET_CR2_INTO() macro
x86/boot: Remove unused finalize_identity_maps()
x86/uaccess: Document copy_from_user_nmi()
x86/dumpstack: Make show_trace_log_lvl() static
x86/mtrr: Fix a kernel-doc markup
x86/setup: Remove unused MCA variables
x86, libnvdimm/test: Remove COPY_MC_TEST
x86: Reclaim TIF_IA32 and TIF_X32
x86/mm: Convert mmu context ia32_compat into a proper flags field
x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
elf: Expose ELF header on arch_setup_additional_pages()
x86/elf: Use e_machine to select start_thread for x32
elf: Expose ELF header in compat_start_thread()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XUC8ACgkQEsHwGGHe
VUoqig//YbNptsXveS0Kir4FUuWpLINbMGGfCekxlUItBzYFH3khqOcE4aiOfRB8
TOx2QIFySiqJuTF4tnmwSs0QidHVchjVdD7+7f5jU2G8b/qzvU94+FdVQJJkTQa9
pfOCxRr63ZjUdobPzSuEvZx9Wqr5kA1UpOsITxxx+rGR1M8HEA5dUYhSYgeoqFVI
h7GgLSAfLkyeguTt2uHF8WOw0jgzkabw5CdiLAc/HwUh8Sde3ZdNCJizH5U636BF
N+sNkQNjuo1t0lFAh9bjwTYq/0wy9ojqBYcXDDd2s1s62TuAOZ10u8dAW8+CeZuJ
iC+7/j3QMulbrUeeOv48SkUkv+cMn+gcjcxFoHSEehVue9IciaoHasDACrB+0zBy
pEZrFlyRGZyAjaXoiPQpsQ63/10W1zs+nFfP+IjM98O/I+ZawFfrjOI6x6qyWY3r
DhskIaZ/Ha5nqzka1f502Xml54vBaeETQ7D8tQwco9wo3KyUPKXRLgHLiSiaUk/x
FW/1hVC13mimn1n4eCB2sqwOKEFi8CfpLAbATrxmU8pRjIpcYK2B3mRWZBYUutEM
uW1PdnnFI2680vKFrkWfWnqxSb4yLv/v5K4th5+8HzUyODWEq4hgSOdwZgxwEH9Y
dQcfkiDbZRKEIoTiXDz4FFAlHeGMkldAT8YGYCBwQxX1DoCUvCg=
=1LFY
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm update from Borislav Petkov:
"A single improvement to check ident_pud_init()'s return value (Arvind
Sankar)"
* tag 'x86_mm_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/ident_map: Check for errors from ident_pud_init()
applications to populate protected regions of user code and data called
enclaves. Once activated, the new hardware protects enclave code and
data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to
run in enclaves, but there are also projects that try to run limited
unmodified software in enclaves."
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used by
SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XTtMACgkQEsHwGGHe
VUqxFw/+NZGf2b3CWPcrvwXCpkvSpIrqh1jQwyvkZyJ1gen7Vy8dkvf99h8+zQPI
4wSArEyjhYJKAAmBNefLKi/Cs/bdkGzLlZyDGqtM641XRjf0xXIpQkOBb6UBa+Pv
to8veQmVH2bBTM49qnd+H1wM6FzYvhTYCD8xr4HlLXtIfpP2CK2GvCb8s/4LifgD
fTucZX9TFwLgVkWOHWHN0n8XMR2Fjb2YCrwjFMKyr/M2W+pPoOCTIt4PWDuXiOeG
rFP7R4DT9jDg8ht5j2dHQT/Bo8TvTCB4Oj98MrX1TTgkSjLJySSMfyQg5EwNfSIa
HC0lg/6qwAxnhWX7cCCBETNZ4aYDmz/dxcCSsLbomGP9nMaUgUy7qn5nNuNbJilb
oCBsr8LDMzu1LJzmkduM8Uw6OINh+J8ICoVXaR5pS7gSZz/+vqIP/rK691AiqhJL
QeMkI9gQ83jEXpr/AV7ABCjGCAeqELOkgravUyTDev24eEc0LyU0qENpgxqWSTca
OvwSWSwNuhCKd2IyKZBnOmjXGwvncwX0gp1KxL9WuLkR6O8XldLAYmVCwVAOrIh7
snRot8+3qNjELa65Nh5DapwLJrU24TRoKLHLgfWK8dlqrMejNtXKucQ574Np0feR
p2hrNisOrtCwxAt7OAgWygw8agN6cJiY18onIsr4wSBm5H7Syb0=
=k7tj
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGC support from Borislav Petkov:
"Intel Software Guard eXtensions enablement. This has been long in the
making, we were one revision number short of 42. :)
Intel SGX is new hardware functionality that can be used by
applications to populate protected regions of user code and data
called enclaves. Once activated, the new hardware protects enclave
code and data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to run
in enclaves, but there are also projects that try to run limited
unmodified software in enclaves.
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used
by SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others"
* tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
x86/sgx: Fix a typo in kernel-doc markup
x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
selftests/sgx: Use a statically generated 3072-bit RSA key
x86/sgx: Clarify 'laundry_list' locking
x86/sgx: Update MAINTAINERS
Documentation/x86: Document SGX kernel architecture
x86/sgx: Add ptrace() support for the SGX driver
x86/sgx: Add a page reclaimer
selftests/x86: Add a selftest for SGX
x86/vdso: Implement a vDSO for Intel SGX enclave call
x86/traps: Attempt to fixup exceptions in vDSO before signaling
x86/fault: Add a helper function to sanitize error code
x86/vdso: Add support for exception fixup in vDSO functions
x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
x86/sgx: Add SGX_IOC_ENCLAVE_INIT
x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
x86/sgx: Add an SGX misc driver interface
...
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
SEV uses SWIOTLB to make this happen without requiring changes to device
drivers. However, depending on the workload being run, the default 64MB
of it might not be enough and it may run out of buffers to use for DMA,
resulting in I/O errors and/or performance degradation for high
I/O workloads.
Adjust the default size of SWIOTLB for SEV guests using a
percentage of the total memory available to guest for the SWIOTLB buffers.
Adds a new sev_setup_arch() function which is invoked from setup_arch()
and it calls into a new swiotlb generic code function swiotlb_adjust_size()
to do the SWIOTLB buffer adjustment.
v5 fixed build errors and warnings as
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The PAT bit is in different locations for 4k and 2M/1G page table
entries.
Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.
Fixes: 6ebcb06071 ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
sync_core_before_usermode() had an incorrect optimization. If the kernel
returns from an interrupt, it can get to usermode without IRET. It just has
to schedule to a different task in the same mm and do SYSRET. Fortunately,
there were no callers of sync_core_before_usermode() that could have had
in_irq() or in_nmi() equal to true, because it's only ever called from the
scheduler.
While at it, clarify a related comment.
Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: a035b6bf86 ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not required to initialize the local variable start in
memory_map_top_down(), as the variable will be initialized in any path
before it is used.
make clang-analyzer on x86_64 tinyconfig reports:
arch/x86/mm/init.c:612:15: warning: Although the value stored to 'start' \
is used in the enclosing expression, the value is never actually read \
from 'start' [clang-analyzer-deadcode.DeadStores]
Move the variable declaration into the loop, where it is used.
No code changed:
# arch/x86/mm/init.o:
text data bss dec hex filename
7105 1424 26768 35297 89e1 init.o.before
7105 1424 26768 35297 89e1 init.o.after
md5:
a8d76c1bb5fce9cae251780a7ee7730f init.o.before.asm
a8d76c1bb5fce9cae251780a7ee7730f init.o.after.asm
[ bp: Massage. ]
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200928100004.25674-1-lukas.bulwahn@gmail.com
vDSO functions can now leverage an exception fixup mechanism similar to
kernel exception fixup. For vDSO exception fixup, the initial user is
Intel's Software Guard Extensions (SGX), which will wrap the low-level
transitions to/from the enclave, i.e. EENTER and ERESUME instructions,
in a vDSO function and leverage fixup to intercept exceptions that would
otherwise generate a signal. This allows the vDSO wrapper to return the
fault information directly to its caller, obviating the need for SGX
applications and libraries to juggle signal handlers.
Attempt to fixup vDSO exceptions immediately prior to populating and
sending signal information. Except for the delivery mechanism, an
exception in a vDSO function should be treated like any other exception
in userspace, e.g. any fault that is successfully handled by the kernel
should not be directly visible to userspace.
Although it's debatable whether or not all exceptions are of interest to
enclaves, defer to the vDSO fixup to decide whether to do fixup or
generate a signal. Future users of vDSO fixup, if there ever are any,
will undoubtedly have different requirements than SGX enclaves, e.g. the
fixup vs. signal logic can be made function specific if/when necessary.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-19-jarkko@kernel.org
vDSO exception fixup is a replacement for signals in limited situations.
Signals and vDSO exception fixup need to provide similar information to
userspace, including the hardware error code.
That hardware error code needs to be sanitized. For instance, if userspace
accesses a kernel address, the error code could indicate to userspace
whether the address had a Present=1 PTE. That can leak information about
the kernel layout to userspace, which is bad.
The existing signal code does this sanitization, but fairly late in the
signal process. The vDSO exception code runs before the sanitization
happens.
Move error code sanitization out of the signal code and into a helper.
Call the helper in the signal code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-18-jarkko@kernel.org
The x86 architecture has a set of page fault error codes. These indicate
things like whether the fault occurred from a write, or whether it
originated in userspace.
The SGX hardware architecture has its own per-page memory management
metadata (EPCM) [*] and hardware which is separate from the normal x86 MMU.
The architecture has a new page fault error code: PF_SGX. This new error
code bit is set whenever a page fault occurs as the result of the SGX MMU.
These faults occur for a variety of reasons. For instance, an access
attempt to enclave memory from outside the enclave causes a PF_SGX fault.
PF_SGX would also be set for permission conflicts, such as if a write to an
enclave page occurs and the page is marked read-write in the x86 page
tables but is read-only in the EPCM.
These faults do not always indicate errors, though. SGX pages are
encrypted with a key that is destroyed at hardware reset, including
suspend. Throwing a SIGSEGV allows user space software to react and recover
when these events occur.
Include PF_SGX in the PF error codes list and throw SIGSEGV when it is
encountered.
[*] Intel SDM: 36.5.1 Enclave Page Cache Map (EPCM)
[ bp: Add bit 15 to the comment above enum x86_pf_error_code too. ]
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-7-jarkko@kernel.org
Switch the atomic iomap implementation over to kmap_local and stick the
preempt/pagefault mechanics into the generic code similar to the
kmap_atomic variants.
Rename the x86 map function in preparation for a non-atomic variant.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20201103095858.625310005@linutronix.de
When SEV is enabled, the kernel requests the C-bit position again from
the hypervisor to build its own page-table. Since the hypervisor is an
untrusted source, the C-bit position needs to be verified before the
kernel page-table is used.
Call sev_verify_cbit() before writing the CR3.
[ bp: Massage. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org