Currently, kprobes decodes the opcode right after single-stepping in
resume_execution(). But the opcode was already decoded while preparing
arch_specific_insn in arch_copy_kprobe().
Decode the opcode in arch_copy_kprobe() instead of in resume_execution()
and set some flags which classify the opcode for the resuming process.
[ bp: Massage commit message. ]
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/160830072561.349576.3014979564448023213.stgit@devnote2
On this system the M.2 PCIe WiFi card isn't detected after reboot, only
after cold boot. reboot=pci fixes this behavior. In [0] the same issue
is described, although on another system and with another Intel WiFi
card. In case it's relevant, both systems have Celeron CPUs.
Add a PCI reboot quirk on affected systems until a more generic fix is
available.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=202399
[ bp: Massage commit message. ]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1524eafd-f89c-cfa4-ed70-0bde9e45eec9@gmail.com
accesses, inefficient and disfunctional code. The goal is to remove the
export of irq_to_desc() to prevent these things from creeping up again.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/ifgsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYm6EACAo8sObkuY3oWLagtGj1KHxon53oGZ
VfDw2LYKM+rgJjDWdiyocyxQU5gtm6loWCrIHjH2adRQ4EisB5r8hfI8NZHxNMyq
8khUi822NRBfFN6SCpO8eW9o95euscNQwCzqi7gV9/U/BAKoDoSEYzS4y0YmJlup
mhoikkrFiBuFXplWI0gbP4ihb8S/to2+kTL6o7eBoJY9+fSXIFR3erZ6f3fLjYZG
CQUUysTywdDhLeDkC9vaesXwgdl2XnaPRwcQqmK8Ez0QYNYpawyILUHLD75cIHDu
bHdK2ZoDv/wtad/3BoGTK3+wChz20a/4/IAnBIUVgmnSLsPtW8zNEOPWNNc0aGg+
rtafi5bvJ1lMoSZhkjLWQDOGU6vFaXl9NkC2fpF+dg1skFMT2CyLC8LD/ekmocon
zHAPBva9j3m2A80hI3dUH9azo/IOl1GHG8ccM6SCxY3S/9vWSQChNhQDLe25xBEO
VtKZS7DYFCRiL8mIy9GgwZWof8Vy2iMua2ML+W9a3mC9u3CqSLbCFmLMT/dDoXl1
oHnMdAHk1DRatA8pJAz83C75RxbAS2riGEqtqLEQ6OaNXn6h0oXCanJX9jdKYDBh
z6ijWayPSRMVktN6FDINsVNFe95N4GwYcGPfagIMqyMMhmJDic6apEzEo7iA76lk
cko28MDqTIK4UQ==
=BXv+
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"This is the second attempt after the first one failed miserably and
got zapped to unblock the rest of the interrupt related patches.
A treewide cleanup of interrupt descriptor (ab)use with all sorts of
racy accesses, inefficient and disfunctional code. The goal is to
remove the export of irq_to_desc() to prevent these things from
creeping up again"
* tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
genirq: Restrict export of irq_to_desc()
xen/events: Implement irq distribution
xen/events: Reduce irq_info:: Spurious_cnt storage size
xen/events: Only force affinity mask for percpu interrupts
xen/events: Use immediate affinity setting
xen/events: Remove disfunct affinity spreading
xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
net/mlx5: Use effective interrupt affinity
net/mlx5: Replace irq_to_desc() abuse
net/mlx4: Use effective interrupt affinity
net/mlx4: Replace irq_to_desc() abuse
PCI: mobiveil: Use irq_data_get_irq_chip_data()
PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
NTB/msi: Use irq_has_action()
mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
pinctrl: nomadik: Use irq_has_action()
drm/i915/pmu: Replace open coded kstat_irqs() copy
drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
...
- Don't move BSS section around pointlessly in the x86 decompressor
- Refactor helper for discovering the EFI secure boot mode
- Wire up EFI secure boot to IMA for arm64
- Some fixes for the capsule loader
- Expose the RT_PROP table via the EFI test module
- Relax DT and kernel placement restrictions on ARM
+ followup fixes:
- fix the build breakage on IA64 caused by recent capsule loader changes
- suppress a type mismatch build warning in the expansion of
EFI_PHYS_ALIGN on ARM
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/kWCMACgkQEsHwGGHe
VUqVlxAAg3jSS5w5fuaXON2xYZmgKdlRB0fjbklo1ZrWS6sEHrP+gmVmrJSWGZP+
qFleQ6AxaYK57UiBXxS6Xfn7hHRToqdOAGnaSYzIg1aQIofRoLxvm3YHBMKllb+g
x73IBS/Hu9/kiH8EVDrJSkBpVdbPwDnw+FeW4ZWUMF9GVmV8oA6Zx23BVSVsbFda
jat/cEsJQS3GfECJ/Fg5ae+c/2zn5NgbaVtLxVnMnJfAwEpoPz3ogKoANSskdZg3
z6pA1aMFoHr+lnlzcsM5zdboQlwZRKPHvFpsXPexESBy5dPkYhxFnHqgK4hSZglC
c3QoO9Gn+KOJl4KAKJWNzCrd3G9kKY5RXkoei4bH9wGMjW2c68WrbFyXgNsO3vYR
v5CKpq3+jlwGo03GiLJgWQFdgqX0EgTVHHPTcwFpt8qAMi9/JIPSIeTE41p2+AjZ
cW5F0IlikaR+N8vxc2TDvQTuSsroMiLcocvRWR61oV/48pFlEjqiUjV31myDsASg
gGkOxZOOz2iBJfK8lCrKp5p9JwGp0M0/GSHTxlYQFy+p4SrcOiPX4wYYdLsWxioK
AbVhvOClgB3kN7y7TpLvdjND00ciy4nKEC0QZ5p5G59jSLnpSBM/g6av24LsSQwo
S1HJKhQPbzcI1lhaPjo91HQoOOMZHWLes0SqK4FGNIH+0imHliA=
=n7Gc
-----END PGP SIGNATURE-----
Merge tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Borislav Petkov:
"These got delayed due to a last minute ia64 build issue which got
fixed in the meantime.
EFI updates collected by Ard Biesheuvel:
- Don't move BSS section around pointlessly in the x86 decompressor
- Refactor helper for discovering the EFI secure boot mode
- Wire up EFI secure boot to IMA for arm64
- Some fixes for the capsule loader
- Expose the RT_PROP table via the EFI test module
- Relax DT and kernel placement restrictions on ARM
with a few followup fixes:
- fix the build breakage on IA64 caused by recent capsule loader
changes
- suppress a type mismatch build warning in the expansion of
EFI_PHYS_ALIGN on ARM"
* tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: arm: force use of unsigned type for EFI_PHYS_ALIGN
efi: ia64: disable the capsule loader
efi: stub: get rid of efi_get_max_fdt_addr()
efi/efi_test: read RuntimeServicesSupported
efi: arm: reduce minimum alignment of uncompressed kernel
efi: capsule: clean scatter-gather entries from the D-cache
efi: capsule: use atomic kmap for transient sglist mappings
efi: x86/xen: switch to efi_get_secureboot_mode helper
arm64/ima: add ima_arch support
ima: generalize x86/EFI arch glue for other EFI architectures
efi: generalize efi_get_secureboot
efi/libstub: EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER should not default to yes
efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()
efi/libstub/x86: simplify efi_is_native()
Merge KASAN updates from Andrew Morton.
This adds a new hardware tag-based mode to KASAN. The new mode is
similar to the existing software tag-based KASAN, but relies on arm64
Memory Tagging Extension (MTE) to perform memory and pointer tagging
(instead of shadow memory and compiler instrumentation).
By Andrey Konovalov and Vincenzo Frascino.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (60 commits)
kasan: update documentation
kasan, mm: allow cache merging with no metadata
kasan: sanitize objects when metadata doesn't fit
kasan: clarify comment in __kasan_kfree_large
kasan: simplify assign_tag and set_tag calls
kasan: don't round_up too much
kasan, mm: rename kasan_poison_kfree
kasan, mm: check kasan_enabled in annotations
kasan: add and integrate kasan boot parameters
kasan: inline (un)poison_range and check_invalid_free
kasan: open-code kasan_unpoison_slab
kasan: inline random_tag for HW_TAGS
kasan: inline kasan_reset_tag for tag-based modes
kasan: remove __kasan_unpoison_stack
kasan: allow VMAP_STACK for HW_TAGS mode
kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK
kasan: introduce set_alloc_info
kasan: rename get_alloc/free_info
kasan: simplify quarantine_put call site
kselftest/arm64: check GCR_EL1 after context switch
...
When a split lock is detected always make sure to disable interrupts
before returning from the trap handler.
The kernel exit code assumes that all exits run with interrupts
disabled, otherwise the SWAPGS sequence can race against interrupts and
cause recursing page faults and later panics.
The problem will only happen on CPUs with split lock disable
functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville.
Fixes: ca4c6a9858 ("x86/traps: Make interrupt enable/disable symmetric in C code")
Fixes: bce9b042ec ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a config option CONFIG_KASAN_STACK that has to be enabled for
KASAN to use stack instrumentation and perform validity checks for
stack variables.
There's no need to unpoison stack when CONFIG_KASAN_STACK is not enabled.
Only call kasan_unpoison_task_stack[_below]() when CONFIG_KASAN_STACK is
enabled.
Note, that CONFIG_KASAN_STACK is an option that is currently always
defined when CONFIG_KASAN is enabled, and therefore has to be tested
with #if instead of #ifdef.
Link: https://lkml.kernel.org/r/d09dd3f8abb388da397fd11598c5edeaa83fe559.1606162397.git.andreyknvl@google.com
Link: https://linux-review.googlesource.com/id/If8a891e9fe01ea543e00b576852685afec0887e3
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PSCI relay at EL2 when "protected KVM" is enabled
* New exception injection code
* Simplification of AArch32 system register handling
* Fix PMU accesses when no PMU is enabled
* Expose CSV3 on non-Meltdown hosts
* Cache hierarchy discovery fixes
* PV steal-time cleanups
* Allow function pointers at EL2
* Various host EL2 entry cleanups
* Simplification of the EL2 vector allocation
s390:
* memcg accouting for s390 specific parts of kvm and gmap
* selftest for diag318
* new kvm_stat for when async_pf falls back to sync
x86:
* Tracepoints for the new pagetable code from 5.10
* Catch VFIO and KVM irqfd events before userspace
* Reporting dirty pages to userspace with a ring buffer
* SEV-ES host support
* Nested VMX support for wait-for-SIPI activity state
* New feature flag (AVX512 FP16)
* New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
* Selftest improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
=ISud
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Much x86 work was pushed out to 5.12, but ARM more than made up for it.
ARM:
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
s390:
- memcg accouting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync
x86:
- Tracepoints for the new pagetable code from 5.10
- Catch VFIO and KVM irqfd events before userspace
- Reporting dirty pages to userspace with a ring buffer
- SEV-ES host support
- Nested VMX support for wait-for-SIPI activity state
- New feature flag (AVX512 FP16)
- New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
- Selftest improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: SVM: fix 32-bit compilation
KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
KVM: SVM: Provide support to launch and run an SEV-ES guest
KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
KVM: SVM: Provide support for SEV-ES vCPU loading
KVM: SVM: Provide support for SEV-ES vCPU creation/loading
KVM: SVM: Update ASID allocation to support SEV-ES guests
KVM: SVM: Set the encryption mask for the SVM host save area
KVM: SVM: Add NMI support for an SEV-ES guest
KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
KVM: SVM: Do not report support for SMM for an SEV-ES guest
KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
KVM: SVM: Add support for EFER write traps for an SEV-ES guest
KVM: SVM: Support string IO operations for an SEV-ES guest
KVM: SVM: Support MMIO for an SEV-ES guest
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
KVM: SVM: Create trace events for VMGEXIT processing
...
The major update to this release is that there's a new arch config option called:
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it.
All the ftrace callbacks now take a struct ftrace_regs instead of a struct
pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then
the ftrace_regs will have enough information to read the arguments of the
function being traced, as well as access to the stack pointer. This way, if
a user (like live kernel patching) only cares about the arguments, then it
can avoid using the heavier weight "regs" callback, that puts in enough
information in the struct ftrace_regs to simulate a breakpoint exception
(needed for kprobes).
New config option that audits the timestamps of the ftrace ring buffer at
most every event recorded. The "check_buffer()" calls will conflict with
mainline, because I purposely added the check without including the fix that
it caught, which is in mainline. Running a kernel built from the commit of
the added check will trigger it.
Ftrace recursion protection has been cleaned up to move the protection to
the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace to do
it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that lists all
the places that triggered the recursion protection of the function tracer.
This will show where things need to be fixed as recursion slows down the
function tracer.
The eval enum mapping updates done at boot up are now offloaded to a work
queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX9uq8xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtrwAQCHevqWMjKc1Q76bnCgwB0AbFKB6vqy
5b6g/co5+ihv8wD/eJPWlZMAt97zTVW7bdp5qj/GTiCDbAsODMZ597LsxA0=
=rZEz
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major update to this release is that there's a new arch config
option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.
Currently, only x86_64 enables it. All the ftrace callbacks now take a
struct ftrace_regs instead of a struct pt_regs. If the architecture
has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
have enough information to read the arguments of the function being
traced, as well as access to the stack pointer.
This way, if a user (like live kernel patching) only cares about the
arguments, then it can avoid using the heavier weight "regs" callback,
that puts in enough information in the struct ftrace_regs to simulate
a breakpoint exception (needed for kprobes).
A new config option that audits the timestamps of the ftrace ring
buffer at most every event recorded.
Ftrace recursion protection has been cleaned up to move the protection
to the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace
to do it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that
lists all the places that triggered the recursion protection of the
function tracer. This will show where things need to be fixed as
recursion slows down the function tracer.
The eval enum mapping updates done at boot up are now offloaded to a
work queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes"
* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Offload eval map updates to a work queue
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
ring-buffer: Add rb_check_bpage in __rb_allocate_pages
ring-buffer: Fix two typos in comments
tracing: Drop unneeded assignment in ring_buffer_resize()
tracing: Disable ftrace selftests when any tracer is running
seq_buf: Avoid type mismatch for seq_buf_init
ring-buffer: Fix a typo in function description
ring-buffer: Remove obsolete rb_event_is_commit()
ring-buffer: Add test to validate the time stamp deltas
ftrace/documentation: Fix RST C code blocks
tracing: Clean up after filter logic rewriting
tracing: Remove the useless value assignment in test_create_synth_event()
livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
MAINTAINERS: assign ./fs/tracefs to TRACING
tracing: Fix some typos in comments
ftrace: Remove unused varible 'ret'
ring-buffer: Add recording of ring buffer recursion into recursed_functions
...
Pull swiotlb update from Konrad Rzeszutek Wilk:
"A generic (but for right now engaged only with AMD SEV) mechanism to
adjust a larger size SWIOTLB based on the total memory of the SEV
guests which right now require the bounce buffer for interacting with
the outside world.
Normal knobs (swiotlb=XYZ) still work"
* 'stable/for-linus-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
As kernel expect to see only one of such mappings, any further operations
on the VMA-copy may be unexpected by the kernel. Maybe it's being on the
safe side, but there doesn't seem to be any expected use-case for this, so
restrict it now.
Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b38130 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 70e806e4e6 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork. This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.
However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:
CPU 0 CPU 1
get_user_pages_fast()
internal_get_user_pages_fast()
copy_page_range()
pte_alloc_map_lock()
copy_present_page()
atomic_read(has_pinned) == 0
page_maybe_dma_pinned() == false
atomic_set(has_pinned, 1);
gup_pgd_range()
gup_pte_range()
pte_t pte = gup_get_pte(ptep)
pte_access_permitted(pte)
try_grab_compound_head()
pte = pte_wrprotect(pte)
set_pte_at();
pte_unmap_unlock()
// GUP now returns with a write protected page
The first attempt to resolve this by using the write protect caused
problems (and was missing a barrrier), see commit f3c64eda3e ("mm: avoid
early COW write protect games during fork()")
Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range(). If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.
Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Fixes: f3c64eda3e ("mm: avoid early COW write protect games during fork()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl/XoggPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDsRYP/3ZtGWsyBc1sKdaTBIwQdnrPQHL+7o1Mmjnl
b+YqRMWcJW4g3O81GW6IA+vM0A1UMJxVOjzkZd8KulGv3RCZiqQmWJClWFlYbwLj
e+HHx+Zo/qsmDrwcVoFI8/n+iC/a5fIaCbSWMSPaKHrOMxBiHQk0qlaq4AZ8gb7a
/eHYqI/hISJQb1ZVFHmwlp8FoMnB2M6/FDpCf8oeGKjpF2hjghIPugJ0oRlPLZjB
o3Q6ELEScJV1wBy7d1+5rkm52t9j8gpGhXxja0QwypADNzk5KHEzghXq+rTWUh1S
et9OfqkflMtKMsh0qNwe5ZFbqtsH69qtYMAj4ok7rZOwQcbJ97VSrP5ka7VVzSdC
AgcQU9c9LoyQ7rk0dbs3t0cd8hMgVu50guZ/iHfW88CcdykN9M0nnSPRAYpNbW85
xndBQ5k/a4FoufwoY4e0hS28HIiRfLoEA68mps+yoMiiKh27HO2v4GFRIJoCNxzp
YQ01zOBp9FKYTsxj0h7mMf+5EEyo9E4X/kJOfZpOVVbVKy82wPAGLJpDEnbnoJUe
j1jBmiV/trkn+nTnWmDoXcw2ljuIF9dBm2M8r8yGKdNEHptnN8tMVRlCRImVVWW0
BbZGAzoK0tpKXPIlUh4aXS3mtV9qlohs9rzjVyKfGnaRRbRGANM8qrH5aKuDFinM
RugpMWyk
=hf4L
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.11
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
This function uses irq_to_desc() and is going to be used by modules to
replace the open coded irq_to_desc() (ab)usage. The final goal is to remove
the export of irq_to_desc() so driver cannot fiddle with it anymore.
Move it into the core code and fixup the usage sites to include the proper
header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201210194042.548936472@linutronix.de
- Simplification and distangling of the MSI related functionality
- Let IO/APIC construct the RTE entries from an MSI message instead of
having IO/APIC specific code in the interrupt remapping drivers
- Make the retrieval of the parent interrupt domain (vector or remap
unit) less hardcoded and use the relevant irqdomain callbacks for
selection.
- Allow the handling of more than 255 CPUs without a virtualized IOMMU
when the hypervisor supports it. This has made been possible by the
above modifications and also simplifies the existing workaround in the
HyperV specific virtual IOMMU.
- Cleanup of the historical timer_works() irq flags related
inconsistencies.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln
+hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2
16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb
gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK
R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL
VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av
JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd
QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx
Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l
f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0
xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI
c2KMg5zoesI4bg==
=8Gfs
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Thomas Gleixner:
"Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality
- Let IO/APIC construct the RTE entries from an MSI message instead
of having IO/APIC specific code in the interrupt remapping drivers
- Make the retrieval of the parent interrupt domain (vector or remap
unit) less hardcoded and use the relevant irqdomain callbacks for
selection.
- Allow the handling of more than 255 CPUs without a virtualized
IOMMU when the hypervisor supports it. This has made been possible
by the above modifications and also simplifies the existing
workaround in the HyperV specific virtual IOMMU.
- Cleanup of the historical timer_works() irq flags related
inconsistencies"
* tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
x86/ioapic: Cleanup the timer_works() irqflags mess
iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled
iommu/amd: Fix union of bitfields in intcapxt support
x86/ioapic: Correct the PCI/ISA trigger type selection
x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available
x86/apic: Support 15 bits of APIC ID in MSI where available
x86/ioapic: Handle Extended Destination ID field in RTE
iommu/vt-d: Simplify intel_irq_remapping_select()
x86: Kill all traces of irq_remapping_get_irq_domain()
x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
iommu/hyper-v: Implement select() method on remapping irqdomain
iommu/vt-d: Implement select() method on remapping irqdomain
iommu/amd: Implement select() method on remapping irqdomain
x86/apic: Add select() method on vector irqdomain
...
- Consolidate all kmap_atomic() internals into a generic implementation
which builds the base for the kmap_local() API and make the
kmap_atomic() interface wrappers which handle the disabling/enabling of
preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a mapping
is established. It has to disable migration instead to guarantee that
the virtual address of the mapped slot is the same accross preemption.
- Provide better debug facilities: guard pages and enforced utilization
of the mapping mechanics on 64bit systems when the architecture allows
it.
- Provide the new kmap_local() API which can now be used to cleanup the
kmap_atomic() usage sites all over the place. Most of the usage sites
do not require the implicit disabling of preemption and pagefaults so
the penalty on 64bit and 32bit non-highmem systems is removed and quite
some of the code can be simplified. A wholesale conversion is not
possible because some usage depends on the implicit side effects and
some need to be cleaned up because they work around these side effects.
The migrate disable side effect is only effective on highmem systems
and when enforced debugging is enabled. On 64bit and 32bit non-highmem
systems the overhead is completely avoided.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
+x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
+jlfoJhMbtx5Gg==
=n71I
-----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
- migrate_disable/enable() support which originates from the RT tree and
is now a prerequisite for the new preemptible kmap_local() API which aims
to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
/QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
Gsq0rJW5iiu/OQ==
=Oz1V
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
Core:
- Better handling of page table leaves on archictectures which have
architectures have non-pagetable aligned huge/large pages. For such
architectures a leaf can actually be part of a larger entry.
- Prevent a deadlock vs. exec_update_mutex
Architectures:
- The related updates for page size calculation of leaf entries
- The usual churn to support new CPUs
- Small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XvgATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUrdEACatdr93wv75vnm5tCZM4EsFvB2PzVJ
ck4K4+hHiMVV4802qf+kW5plF+rckAU4TAai/L7wkTntKHvjD/0/o1epoIStb+dS
SCpVkQMCLT/8xT242iHPOfgsQpVpJnIiBwVRjn8HXu82nXdgMJhKnBjTe634UfxW
o2OCFiyJzpRi5l86gVp67ueqgvl34NPI2JaSLc0g80QfZ8akzdePPpED35CzYjZh
41k+7ssvt6qch3vMUySHAhkX4gQl0nc80YAaF/XZbCfvdyY7D03PtfBjfvphTSK0
l54z9aWh0ciK9P1aPfvkHDXBJUR2VtUAx2GiURK+XU3jNk3KMrz9CcBl1D/exIAg
07IsiYVoB38YAUOZoR9K8p+p+5EuwYRRUMAgfQfBALCuaLQV477Cne82b2KmNCus
1izUQvcDDf0s74OyYTHWFXRGla95COJvNLzkrZ1oU3mX4HgdKdOAUbf/2XTLWeKO
3HOIS+jsg5cp82tRe4X5r51h73pONYlo9lLo/CjQXz25vMcXKtE/MZGq2gkRff4p
N4k88eQ5LOsRqUaU46GcHozXRCfcpW7SPI9AaN5I/fKGIZvHP7uMdMb+g5DV8yHI
dNZ8u5uLPHwdg80C3fJ3Pnp7VsVNHliPXMwv0vib7BCp7aUVZWeFnOntw3PdYFRk
XKEbfl36IuAadg==
=rZ99
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
"Core:
- Better handling of page table leaves on archictectures which have
architectures have non-pagetable aligned huge/large pages. For such
architectures a leaf can actually be part of a larger entry.
- Prevent a deadlock vs exec_update_mutex
Architectures:
- The related updates for page size calculation of leaf entries
- The usual churn to support new CPUs
- Small fixes and improvements all over the place"
* tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
perf/x86/intel: Add Tremont Topdown support
uprobes/x86: Fix fall-through warnings for Clang
perf/x86: Fix fall-through warnings for Clang
kprobes/x86: Fix fall-through warnings for Clang
perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
x86/kprobes: Restore BTF if the single-stepping is cancelled
perf: Break deadlock involving exec_update_mutex
sparc64/mm: Implement pXX_leaf_size() support
powerpc/8xx: Implement pXX_leaf_size() support
arm64/mm: Implement pXX_leaf_size() support
perf/core: Fix arch_perf_get_page_size()
mm: Introduce pXX_leaf_size()
mm/gup: Provide gup_get_pte() more generic
perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
perf/x86/intel/uncore: Add Rocket Lake support
perf/x86/msr: Add Rocket Lake CPU support
perf/x86/cstate: Add Rocket Lake CPU support
perf/x86/intel: Add Rocket Lake CPU support
perf,mm: Handle non-page-table-aligned hugetlbfs
...
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs.
- Lockdep-RCU updates reducing the need for __maybe_unused.
- Tasks-RCU updates.
- Miscellaneous fixes.
- Documentation updates.
- Torture-test updates.
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xon4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobXUD/92LJTI/TMgK6Z6EEQBiJZO/2mNKjK8
FEKc6AqTNMlZNsWCfQ5UgqtHpn+MkBZsX1x4u22gehE1qaCB8gnQ5wXgbXon8tQm
exxVk6vvQZjseeqCMqrsUYQlD7dNgHnf1qAmWXJvji4sA/1Opo6n2M74tqfE2ueV
S5hpQwSuK/6Zu2Hrr62HD8+Fx0in6ZuKRZxHGp1392l++DGbniJM3dzntRXB+JbZ
w3PDHFCQuGzTytyeKuQV48ot9IK+2YzmjIp/+4tHL6mvU38xeSu6gcYtqKPcfYWw
D6HXvDa965h5IrFdSA2JWSzjJ+VYgZVElk2HyXDNIae0fM/8GidgoIDQipT1WAur
sxW/Ke4U6Jm5MMqXqV8iMNduktkGD1/h6G/iB1Yis29xFdthorNpbHVAP+8cKXgf
1cR6RorOuBYv6XpyzygHtE7qfLY5ST352pJ4+UqNzboujOcuEnGaygttt0F/F8sA
ZH8NT5dyUfbGeqepdZWkbj116Hjeg3fyV3CZeyBhDeqpjf1Nn3nbJ1xRksPLfa3i
IKvN7HSzEg+vKnsJNnQeFlAmQ/W3n2bedzRqfaCg77pNhKI6jPuavY5f2YGFUj0y
yx0UzOYoI1Cln0keBMmynbyUKgJ7zstLkrt/JenjhtD3B+0df5BmYjkL+nqkP6ax
+XTCu7Xg+B061g==
=N/iO
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Thomas Gleixner:
"RCU, LKMM and KCSAN updates collected by Paul McKenney.
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs
- Lockdep-RCU updates reducing the need for __maybe_unused
- Tasks-RCU updates
- Miscellaneous fixes
- Documentation updates
- Torture-test updates
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests"
* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
srcu: Take early exit on memory-allocation failure
rcu/tree: Defer kvfree_rcu() allocation to a clean context
rcu: Do not report strict GPs for outgoing CPUs
rcu: Fix a typo in rcu_blocking_is_gp() header comment
rcu: Prevent lockdep-RCU splats on lock acquisition/release
rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
rcu,ftrace: Fix ftrace recursion
rcu/tree: Make struct kernel_param_ops definitions const
rcu/tree: Add a warning if CPU being onlined did not report QS already
rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
rcu: Fix single-CPU check in rcu_blocking_is_gp()
rcu: Implement rcu_segcblist_is_offloaded() config dependent
list.h: Update comment to explicitly note circular lists
rcu: Panic after fixed number of stalls
x86/smpboot: Move rcu_cpu_starting() earlier
rcu: Allow rcu_irq_enter_check_tick() from NMI
tools/memory-model: Label MP tests' producers and consumers
tools/memory-model: Use "buf" and "flag" for message-passing tests
tools/memory-model: Add types to litmus tests
tools/memory-model: Add a glossary of LKMM terms
...
- More generalization of entry/exit functionality
- The consolidation work to reclaim TIF flags on x86 and also for non-x86
specific TIF flags which are solely relevant for syscall related work
and have been moved into their own storage space. The x86 specific part
had to be merged in to avoid a major conflict.
- The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
delivery mode of task work and results in an impressive performance
improvement for io_uring. The non-x86 consolidation of this is going to
come seperate via Jens.
- The selective syscall redirection facility which provides a clean and
efficient way to support the non-Linux syscalls of WINE by catching them
at syscall entry and redirecting them to the user space emulation. This
can be utilized for other purposes as well and has been designed
carefully to avoid overhead for the regular fastpath. This includes the
core changes and the x86 support code.
- Simplification of the context tracking entry/exit handling for the users
of the generic entry code which guarantee the proper ordering and
protection.
- Preparatory changes to make the generic entry code accomodate S390
specific requirements which are mostly related to their syscall restart
mechanism.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XoPoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoe0tD/4jSKHIogVM9kVpiYfwjDGS1NluaBXn
71ZoASbX9GZebyGandMyF2QP1iJ24ZO0RztBwHEVH6fyomKB2iFNedssCpO9yfWV
3eFRpOvMpbszY2W2bd0QG3GrqaTttjVfB4ahkGLzqeSbchdob6hZpNDYtBZnujA6
GSnrrurfJkCGoQny+yJQYdQJXQU+BIX90B2a2Q+jW123Luy/iHXC1f/krZSA1m14
fC9xYLSUjPphTzh2ZOW+C3DgdjOL5PfAm/6F+DArt4GtLgrEGD7R74aLSFhvetky
dn5QtG+yAsz1i0cc5Wu/JBcT9tOkY92rPYSyLI9bYQUSQ/bMyuprz6oYKj3dubsu
ZSsKPdkNFPIniL4fLdCMWZcIXX5xgnrxKjdgXZXW3gtrcxSns8w8uED3Sh7dgE08
pgIeq67E5g/OB8kJXH1VxdewmeQb9cOmnzzHwNO7TrrGbBKjDTYHNdYOKf1dUTTK
ZX1UjLfGwxTkMYAbQD1k0JGZ2OLRshzSaH5BW/ZKa3bvJW6yYOq+/YT8B8hbJ8U3
vThlO75/55IJxS5r5Y3vZd/IHdsYbPuETD+TA8tNYtPqNZasW8nnk4TYctWqzDuO
/Ka1wvWYid3c6ySznQn4zSyRjr968AfHeZ9YTUMhWufy5waXVmdBMG41u3IKfsVt
osyzNc4EK19/Mg==
=hsjV
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry/exit updates from Thomas Gleixner:
"A set of updates for entry/exit handling:
- More generalization of entry/exit functionality
- The consolidation work to reclaim TIF flags on x86 and also for
non-x86 specific TIF flags which are solely relevant for syscall
related work and have been moved into their own storage space. The
x86 specific part had to be merged in to avoid a major conflict.
- The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
delivery mode of task work and results in an impressive performance
improvement for io_uring. The non-x86 consolidation of this is
going to come seperate via Jens.
- The selective syscall redirection facility which provides a clean
and efficient way to support the non-Linux syscalls of WINE by
catching them at syscall entry and redirecting them to the user
space emulation. This can be utilized for other purposes as well
and has been designed carefully to avoid overhead for the regular
fastpath. This includes the core changes and the x86 support code.
- Simplification of the context tracking entry/exit handling for the
users of the generic entry code which guarantee the proper ordering
and protection.
- Preparatory changes to make the generic entry code accomodate S390
specific requirements which are mostly related to their syscall
restart mechanism"
* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
entry: Add syscall_exit_to_user_mode_work()
entry: Add exit_to_user_mode() wrapper
entry_Add_enter_from_user_mode_wrapper
entry: Rename exit_to_user_mode()
entry: Rename enter_from_user_mode()
docs: Document Syscall User Dispatch
selftests: Add benchmark for syscall user dispatch
selftests: Add kselftest for syscall user dispatch
entry: Support Syscall User Dispatch on common syscall entry
kernel: Implement selective syscall userspace redirection
signal: Expose SYS_USER_DISPATCH si_code type
x86: vdso: Expose sigreturn address on vdso to the kernel
MAINTAINERS: Add entry for common entry code
entry: Fix boot for !CONFIG_GENERIC_ENTRY
x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
sched: Detect call to schedule from critical entry code
context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
x86: Reclaim unused x86 TI flags
...
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags (like
SPARC ADI, subject to wiring up the arch code). The user will have to
opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra bits, if
available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans the
Device Tree dma-ranges and the ACPI IORT information before deciding
on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override (UAO)
ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl/XcSgACgkQa9axLQDI
XvGkwg//SLknimELD/cphf2UzZm5RFuCU0x1UnIXs9XYo5BrOpgVLLA//+XkCrKN
0GLAdtBDfw1axWJudzgMBiHrv6wSGh4p3YWjLIW06u/PJu3m3U8oiiolvvF8d7Yq
UKDseKGQnQkrl97J0SyA+Da/u8D11GEzp52SWL5iRxzt6vInEC27iTOp9n1yoaoP
f3y7qdp9kv831ryUM3rXFYpc8YuMWXk+JpBSNaxqmjlvjMzipA5PhzBLmNzfc657
XcrRX5qsgjEeJW8UUnWUVNB42j7tVzN77yraoUpoVVCzZZeWOQxqq5EscKPfIhRt
AjtSIQNOs95ZVE0SFCTjXnUUb823coUs4dMCdftqlE62JNRwdR+3bkfa+QjPTg1F
O9ohW1AzX0/JB19QBxMaOgbheB8GFXh3DVJ6pizTgxJgyPvQQtFuEhT1kq8Cst0U
Pe+pEWsg9t41bUXNz+/l9tUWKWpeCfFNMTrBXLmXrNlTLeOvDh/0UiF0+2lYJYgf
YAboibQ5eOv2wGCcSDEbNMJ6B2/6GtubDJxH4du680F6Emb6pCSw0ntPwB7mSGLG
5dXz+9FJxDLjmxw7BXxQgc5MoYIrt5JQtaOQ6UxU8dPy53/+py4Ck6tXNkz0+Ap7
gPPaGGy1GqobQFu3qlHtOK1VleQi/sWcrpmPHrpiiFUf6N7EmcY=
=zXFk
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags
(like SPARC ADI, subject to wiring up the arch code). The user will
have to opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra
bits, if available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans
the Device Tree dma-ranges and the ACPI IORT information before
deciding on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override
(UAO) ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
perf/imx_ddr: Add system PMU identifier for userspace
bindings: perf: imx-ddr: add compatible string
arm64: Fix build failure when HARDLOCKUP_DETECTOR_PERF is enabled
arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
arm64: mark __system_matches_cap as __maybe_unused
arm64: uaccess: remove vestigal UAO support
arm64: uaccess: remove redundant PAN toggling
arm64: uaccess: remove addr_limit_user_check()
arm64: uaccess: remove set_fs()
arm64: uaccess cleanup macro naming
arm64: uaccess: split user/kernel routines
arm64: uaccess: refactor __{get,put}_user
arm64: uaccess: simplify __copy_user_flushcache()
arm64: uaccess: rename privileged uaccess routines
arm64: sdei: explicitly simulate PAN/UAO entry
arm64: sdei: move uaccess logic to arch/arm64/
arm64: head.S: always initialize PSTATE
arm64: head.S: cleanup SCTLR_ELx initialization
arm64: head.S: rename el2_setup -> init_kernel_el
arm64: add C wrappers for SET_PSTATE_*()
...
(Arvind Sankar)
- Remove -m16 workaround now that the GCC versions that need it are unsupported
(Nick Desaulniers)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhwYACgkQEsHwGGHe
VUqYPQ/9ECDO6Ms8rkvZf4f7+LfKO8XsRYf0h411h5zuVssgPybIbZCvuiQ5Jj9r
IzTe0pO0AoVoX5/JE+mxbvroBb5hRokNehJayPmdVjZSrzsYWZR3RVW6iSRo/TQ0
bjzY1cHC4EPmtyYA4FDr+b0JAEEbyPuPOAcqSzBBadsx5FURvwH7s1hUILCZ1nmy
i8DADZDo7Puk1W8VQkGkjUeo3QVY8KG2VzZegU0iUR/M7Hoek3WBPObYFg0bsKN4
6hvRWPMt2lFXOSNZOFhG+LjtWXTJ22g7ENU4SHXhHBdxmM8H/OZvh/3UZTes7rdo
u/LZJ580GfOa3TYHOToGQq9iBxoi9Se3vLXUtt2I9Z7sbBVUGTKx2Llezvt9/QRh
kJkpvMc2r2b9n7Cu52gHKQ89OrqFzs2scNyjQ2KRU61bPwLjWoGiobugnl8T9q8Y
Li8LKS2tCD6yo7L1eYt0NDriWVF3/Tv1rebWMNIf3sGSHFkxgHxFviTx3L3SWEBS
8JMuKLPh20WC9ombXP/tDN2rtTmmL9hEoqZ9uSTG1tRNLOLLNv790ucUh+lp6RFp
7y8dzl+L3hXsFdcCYooOIGO1phmdThwSLMqZHmbpRWpKNr+8P0wCqZP16twnoXNC
kO7OiNd6XHVpJg4WnN3C3X70n7TPXSZ7dkpDNFBJaH2QilNEHa8=
=Qk/S
-----END PGP SIGNATURE-----
Merge tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Borislav Petkov:
"Two x86 build fixes:
- Fix the vmlinux size check on 64-bit along with adding useful
clarifications on the topic (Arvind Sankar)
- Remove -m16 workaround now that the GCC versions that need it are
unsupported (Nick Desaulniers)"
* tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Remove -m16 workaround for unsupported versions of GCC
x86/build: Fix vmlinux size check on 64-bit
(Fenghua Yu)
- Cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhTUACgkQEsHwGGHe
VUpTkA/9FNOaJohBa9XO2bv3RjsqE4/2PqZS434HJL6yrbbRDFyscSp2sNCDlfh/
6zj9R72HpRf6xLW8CTeszrfUQ0z3CFnfwz8EAolhv5DOiJnM3wSS5inmLhtyTMgw
mJ8qVXXELdAFm1R8rAQLVmA3FE9aV6u19POstKeXMUykCyaxKDhNLJgn3CiXpegU
AvG3QWmrR/1x4DjQguMNwXMQiYVDkRdnfJa5SOzCTwyUj0D0an+kPmSHEMvhaOL6
tUYfjLkZYdB5THG8eUM833EJmgRe7X1VtTQIydcVyt/6tbL50FmVYUpWk+SXuSnJ
/uBdzx0IXmfvYKgGSFw+FGOyGH8u02nR4621rECNLNf1h8YknzpN7Ri1hB83GjQe
PMiW/gCwxM6gfRsBpJ17xnAbmR5Az2dOz71uj5k13L1GNL8VZD2DZz//a4TDarzE
4R8i3PQ/oUMgmL+ARpVWwycc/dhB8o+1glmAjOwWHO/kM1k/hx9Ou5bbLlhffbL5
zkk6ORrNfKzAMvE1flkjim5XgNHY8n/L4CwBKXJFQ3IYbXBGg0CKxBHWvTpIz1KO
O4h52OnYUpM15sY22NSvpdRiHcsgEqNOB4lb6K2dby9iE5b5RHpNgfMNO6+eOtag
dgCWopamLK2+MFeNe48+1v/3bDveskhvJon91GUZYmfkFzhZTN8=
=9Lsd
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
- add logic to correct MBM total and local values fixing errata SKX99
and BDF102 (Fenghua Yu)
- cleanups
* tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Clean up unused function parameter in rmdir path
x86/resctrl: Constify kernfs_ops
x86/resctrl: Correct MBM total and local values
Documentation/x86: Rename resctrl_ui.rst and add two errata to the file
(Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
=lLiH
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Another branch with a nicely negative diffstat, just the way I
like 'em:
- Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
the end (Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree"
* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/ia32_signal: Propagate __user annotation properly
x86/alternative: Update text_poke_bp() kernel-doc comment
x86/PCI: Make a kernel-doc comment a normal one
x86/asm: Drop unused RDPID macro
x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
x86/head64: Remove duplicate include
x86/mm: Declare 'start' variable where it is used
x86/head/64: Remove unused GET_CR2_INTO() macro
x86/boot: Remove unused finalize_identity_maps()
x86/uaccess: Document copy_from_user_nmi()
x86/dumpstack: Make show_trace_log_lvl() static
x86/mtrr: Fix a kernel-doc markup
x86/setup: Remove unused MCA variables
x86, libnvdimm/test: Remove COPY_MC_TEST
x86: Reclaim TIF_IA32 and TIF_X32
x86/mm: Convert mmu context ia32_compat into a proper flags field
x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
elf: Expose ELF header on arch_setup_additional_pages()
x86/elf: Use e_machine to select start_thread for x32
elf: Expose ELF header in compat_start_thread()
...
an attempt to have userspace tools not poke at naked MSRs. This round
deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct poking into it
by our in-tree tools in favor of the proper "energy_perf_bias" sysfs
interface which we already have.
In addition, the msr.ko write filtering's error message points to a new
summary page which contains the info we collected from helpful reporters
about which userspace tools write MSRs:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about
along with the current status of their conversion.
Rest is the usual small fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVKYACgkQEsHwGGHe
VUondg//fv3aQM3KtWE7sxv6BjpiUNozPBELRuKo+EskHSxHudRhBxzdSMM7WgKq
2uojb2CQtzRzYhHuiXjXKfbB7Ci/Jo4EDCJW2otpiqit7/UgXu15Q5ypCUMIteiV
u9A2w3oN3GPR5TuofLWCffaotVMpFok3u7jX7RxEQPWmZqJItTwZpqYLeyniHaKM
c6taAxZVyV13iejRhxim2zkl/hMXpjA8I+8CqWIL25J7GYlYeWLWxWYmHIQTs0NM
zSIyr47RD8RRXVeRdeJMxnQblKE1zrObIV1fUXXu1dSW47DkrrcOQwEMorNjPtPA
FR5Xhi+TX8JrBasMpwCnV/CTj6Ua8UsMfwQcPOFnXALPj87HfFSypa5BpnBH5xTW
PaiatRmiNJm3g79ncaTvXCksMbb4WANqOYK+gsGYvtKbfLR+caWT6vytjZA6sC6x
laynstV9PFUyewdwjjAjilhArzV+y+5RsRudBK8xSjcawbyV4ZEorNKYS9qrhm+y
7CAM9A8fCQiO6POr6W7HcfmkUOHC9PLhtyjdJH89tAmaf+sfvaczzx3awwSuKx7P
0rJlDiJP1v7yEpOMWHbpGIqjMBaWK4y3mb4g3UwFpHpo8cTl+WXZQppOPIBn9GA9
ASLYT/ze7zk1Ua2V88qoXiC5AEvqBnSq4fp2pmf06ROZgBnYT6o=
=ISyk
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
"The main part of this branch is the ongoing fight against windmills in
an attempt to have userspace tools not poke at naked MSRs.
This round deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct
poking into it by our in-tree tools in favor of the proper
"energy_perf_bias" sysfs interface which we already have.
In addition, the msr.ko write filtering's error message points to a
new summary page which contains the info we collected from helpful
reporters about which userspace tools write MSRs:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about
along with the current status of their conversion.
The rest is the usual small fixes and improvements"
* tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/msr: Add a pointer to an URL which contains further details
x86/pci: Fix the function type for check_reserved_t
selftests/x86: Add missing .note.GNU-stack sections
selftests/x86/fsgsbase: Fix GS == 1, 2, and 3 tests
x86/msr: Downgrade unrecognized MSR message
x86/msr: Do not allow writes to MSR_IA32_ENERGY_PERF_BIAS
tools/power/x86_energy_perf_policy: Read energy_perf_bias from sysfs
tools/power/turbostat: Read energy_perf_bias from sysfs
tools/power/cpupower: Read energy_perf_bias from sysfs
MAINTAINERS: Cleanup SGI-related entries
(Justin Ernst and Mike Travis)
- The usual set of small fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XWLkACgkQEsHwGGHe
VUpgoQ//WnhLoyByrxyKScOLW2aCCX4sIQ2R0Mjqt4z6wHHL6OB1GIswtIWsFlpw
UtFOhTmKV0A5q/yXIGYY1EOZPhA4wrrTesYHsOoVcJ3xdwrAarc3fzw02uKT/uQp
sW0JxTtJ5iK0XZp+K81344sQy26iuTiLh0Lwg9lJ5mpUotW+1iSOp7LAgcN462r9
3l8To6qWyWnnMWgJlV7jdcsswk0V2TkT/iipQITNqEGtwqRm/uFGRgK/669IbqP/
P/qROhU1XseHxVFfPMXfFVeFjNTpkcgD11C8YeqWiHXrmBesCHIC9cd2pYfJepLJ
F4sIh6zqt6v6z5UOD+yo8IjdmL0v9nPLJiCeiu9ES59pVIx7DsR+p005fBLTAXp8
QpbiAhJn4ajpu78QAkkhieAvldXP2+1h42wK8HUTWKt2wPsnyGozvY71kp8UkMSd
1Kh3mahrAN57TfvlewrMr/ynhGMMuWQHIr3ScYei0L6vF2c6f98HSiNzzZ1PiiY7
6OwR+qISucV8ZeXMv6va5POI/T67UZjjuT5Zw0ls1LpqJ6CkGmcrCeq6qYbxK7Fy
Vd8T/v51FknrTF0J42GMLb9ICIFmC0yAvzflTytFvWqgGEZpEEpRzmKfeC+a0G7B
m3XEm/xE3+3TKnID+2tPSFOEU4mdwKC0+lguD7DV8aTYpdjsxXs=
=VZBo
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- add a new uv_sysfs driver and expose read-only information from UV
BIOS (Justin Ernst and Mike Travis)
- the usual set of small fixes
* tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Update sysfs documentation
x86/platform/uv: Add deprecated messages to /proc info leaves
x86/platform/uv: Add sysfs hubless leaves
x86/platform/uv: Add sysfs leaves to replace those in procfs
x86/platform/uv: Add kernel interfaces for obtaining system info
x86/platform/uv: Make uv_pcibus_kset and uv_hubs_kset static
x86/platform/uv: Fix an error code in uv_hubs_init()
x86/platform/uv: Update MAINTAINERS for uv_sysfs driver
x86/platform/uv: Update ABI documentation of /sys/firmware/sgi_uv/
x86/platform/uv: Add new uv_sysfs platform driver
x86/platform/uv: Add and export uv_bios_* functions
x86/platform/uv: Remove existing /sys/firmware/sgi_uv/ interface
code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVlYACgkQEsHwGGHe
VUpxTA/9F0KsgSyTh66uX+aX5qkQ3WTBVgxbXGFrn5qPvwcALXabU8qObDWTSdwS
1YbiWDjKNBJX+dggWe/fcQgUZxu5DFkM4IKEW1V7MLJEcdfylcqCyc1YNpEI4ySn
ebw2Sy4/5iXGAvhz802/WoUU/o3A2uZwe0RFyodHGxof5027HkZhRHeYB27Htw+l
z0IsmiYOoPl/4mNuVgr/qieIFSw1SUE9kwjU8RvM6xVWmXWXpM68JHa9s+/51pFt
6BaOz485OyzWUCtSx3/++GEkU2d53bWYOuQ1zTLEiuaBfYC5n5T/kAcT4WJNK6Tf
tX7yrzmWm9ecykIxfkgMrhG57G38y2GMJcEg+dFQHeXC062fdHDg+oY6Ql2EkAm5
t5RIQ/cyOmQCLns31rHI/kwQ3RMKc/lfnL/z8lrlfWsC5o755yFJKttbfLJugbTo
3BO1fbs4xgQcgi0KoqXOUETrQtsOLtr9FJwvcArB94XXqcIPClE8Ir7n8T7FCuLr
9litSXIdn46EHwD6hD5QIk7y+Rxwk/jxZFys3eh90jcWDDZTaG2lz3if33RbZ1go
XBrS5X3HsMODGZlaMeUjrbFIz3e0Zyoo+RO/TX48w8nzivC6xSNxSNFgIZ1XTF5E
SLMGa6lEQ9mLiqRfgFjynNwSYOSlGv3euMkZaVPS3hnNmn+vZbI=
=RsCs
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
"Only AMD-specific changes this time:
- Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and
convert all code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD
(Arvind Sankar)"
* tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove dead code for TSEG region remapping
x86/topology: Set cpu_die_id only if DIE_TYPE found
EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
x86/CPU/AMD: Remove amd_get_nb_id()
x86/CPU/AMD: Save AMD NodeId as cpu_die_id
applications to populate protected regions of user code and data called
enclaves. Once activated, the new hardware protects enclave code and
data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to
run in enclaves, but there are also projects that try to run limited
unmodified software in enclaves."
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used by
SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XTtMACgkQEsHwGGHe
VUqxFw/+NZGf2b3CWPcrvwXCpkvSpIrqh1jQwyvkZyJ1gen7Vy8dkvf99h8+zQPI
4wSArEyjhYJKAAmBNefLKi/Cs/bdkGzLlZyDGqtM641XRjf0xXIpQkOBb6UBa+Pv
to8veQmVH2bBTM49qnd+H1wM6FzYvhTYCD8xr4HlLXtIfpP2CK2GvCb8s/4LifgD
fTucZX9TFwLgVkWOHWHN0n8XMR2Fjb2YCrwjFMKyr/M2W+pPoOCTIt4PWDuXiOeG
rFP7R4DT9jDg8ht5j2dHQT/Bo8TvTCB4Oj98MrX1TTgkSjLJySSMfyQg5EwNfSIa
HC0lg/6qwAxnhWX7cCCBETNZ4aYDmz/dxcCSsLbomGP9nMaUgUy7qn5nNuNbJilb
oCBsr8LDMzu1LJzmkduM8Uw6OINh+J8ICoVXaR5pS7gSZz/+vqIP/rK691AiqhJL
QeMkI9gQ83jEXpr/AV7ABCjGCAeqELOkgravUyTDev24eEc0LyU0qENpgxqWSTca
OvwSWSwNuhCKd2IyKZBnOmjXGwvncwX0gp1KxL9WuLkR6O8XldLAYmVCwVAOrIh7
snRot8+3qNjELa65Nh5DapwLJrU24TRoKLHLgfWK8dlqrMejNtXKucQ574Np0feR
p2hrNisOrtCwxAt7OAgWygw8agN6cJiY18onIsr4wSBm5H7Syb0=
=k7tj
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGC support from Borislav Petkov:
"Intel Software Guard eXtensions enablement. This has been long in the
making, we were one revision number short of 42. :)
Intel SGX is new hardware functionality that can be used by
applications to populate protected regions of user code and data
called enclaves. Once activated, the new hardware protects enclave
code and data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to run
in enclaves, but there are also projects that try to run limited
unmodified software in enclaves.
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used
by SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others"
* tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
x86/sgx: Fix a typo in kernel-doc markup
x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
selftests/sgx: Use a statically generated 3072-bit RSA key
x86/sgx: Clarify 'laundry_list' locking
x86/sgx: Update MAINTAINERS
Documentation/x86: Document SGX kernel architecture
x86/sgx: Add ptrace() support for the SGX driver
x86/sgx: Add a page reclaimer
selftests/x86: Add a selftest for SGX
x86/vdso: Implement a vDSO for Intel SGX enclave call
x86/traps: Attempt to fixup exceptions in vDSO before signaling
x86/fault: Add a helper function to sanitize error code
x86/vdso: Add support for exception fixup in vDSO functions
x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
x86/sgx: Add SGX_IOC_ENCLAVE_INIT
x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
x86/sgx: Add an SGX misc driver interface
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSkwACgkQEsHwGGHe
VUoYZBAAwGeZCV6KKQERAhMsTKAI+UvGS8rAOBOOTheDHyS/X4Rf63OuSLO/wC3c
NJ9Z4M8GeMmW8K/oeEeUyEI6980FC2I0tQE20J29SPjK37jTowIHjJ26ofKGUP1+
TiqSL2Eo90KqUSVg4GGdDKEuO6by5gS8tRoNE9Of0eGz6ahVOn9PFHt6QVk9cFug
KOByzLjwmBd+XNE/gm56Tl4khOUBYy8+2JGZfewfJOsktln9t+s0WnuB1P7PU+2B
9uI2AAs9JmaO5RzIVHi3VYFsPEzrL3GXBpBCR2XJucIk4h6Km94yNDZW/th5XP6T
vKDNrhg3ErmeI0MgfIKHq+QXxGLobsra9/ZQazg8ZuPvIg47jzo/TaCiLtJGPips
Kr6MLMK6lfQOnOsaDjnGUJPMpOvAmMxOU9o7MVn1XxpDiFr8Fzt1njBMJ/kZikXt
YNOUj+Pws/WfvdFW/QkBBao6iOidEFxgTLJnADeocL0T0y5feVqvrOrWKFXGiL+H
i8ehg4040DZA0PxqxxUYe+EVQgMs41LAHr0zvUxI+hxQBKPFFg6xkpUCOaHd/rSZ
XTx0AsYuaE2RPY/ldDWkWrzGja9FfzTSKwzkDkhECrHXGYaIyiNSCpFZQ4Mqua3c
Z0mT4pLx6blLFqJ+O+Q+juuMSixSVKWe5NwEoG45sWZfGAHIl0M=
=IsGj
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader update from Borislav Petkov:
"This one wins the award for most boring pull request ever. But that's
a good thing - this is how I like 'em and the microcode loader
*should* be boring. :-)
A single cleanup removing "break" after a return statement (Tom Rix)"
* tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/amd: Remove unneeded break
- Pass error records logged by firmware through the MCE decoding chain
to provide human-readable error descriptions instead of raw values
(Smita Koralahalli)
- Some #MC handler fixes (Gabriele Paoloni)
- The usual small fixes and cleanups all over.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSJcACgkQEsHwGGHe
VUob6hAAgSJq1IcftZR4DSk/Mlrt0x4orDNCmoGhxAlOT7ryiidYhXuKV2tvWloA
3v7X5E0r9CroS3PMghQtVOD7qJMjNAWKun5C6zPhLkIeV+CvcLAHhfShlcyWhJ76
PQaHHSxRQGEh5M2Xcp26kdwqrARbhcl66ukBvFNpiNUkLH+robDmIazraI5a2bV/
9azNoVUZBSXQoJYpPz/tBTxu2EToj/xrMVIZg1OPHR5cxtOwUSZCr8V69KGK4onm
avYQY8TSCDxsG1VcywYzBNi2W6lKs2EFlhCVZBLCz0NIkZCYTXErI4OsuExMufLu
t17sAfHjQg+SxEcL5pc+iQkr9i0LLnujKz+Cl0ShtRk6SER0U+9pc/yf0wQSGDhB
AZz87z+a6+r4pxdTSclOkpQCAfRR+pWjNwA5dyi6/72Qbqi6lmwKWDPJnnyq0YS4
UZI01zjs7ir93nS1zwJcekFOJCSTsb6XmhEgMVlpw+YoZHaOki1KJMCU0kIgZt8O
YlEniP/DdXBS0mflOJQnoes7XrcIWVqWEubeRZdoWYnC07hNmdg4XJ0c3Skx8ZW+
gL8kt4pDWlnKHlTlhtgocG3H5BkMazrYEmbograc/Oe8lkr9ESqIS7yS1l8lM7Z6
i0HXATcdvDHV0AqW/uoNczXpck4x8xrahIzyPqAve2G15XEIgq4=
=D9fV
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Enable additional logging mode on older Xeons (Tony Luck)
- Pass error records logged by firmware through the MCE decoding chain
to provide human-readable error descriptions instead of raw values
(Smita Koralahalli)
- Some #MC handler fixes (Gabriele Paoloni)
- The usual small fixes and cleanups all over.
* tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Rename kill_it to kill_current_task
x86/mce: Remove redundant call to irq_work_queue()
x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3
x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places
x86/mce, cper: Pass x86 CPER through the MCA handling chain
x86/mce: Use "safe" MSR functions when enabling additional error logging
x86/mce: Correct the detection of invalid notifier priorities
x86/mce: Assign boolean values to a bool variable
x86/mce: Enable additional error logging on certain Intel CPUs
x86/mce: Remove unneeded break
Update the GHCB accessor functions to add functions for retrieve GHCB
fields by name. Update existing code to use the new accessor functions.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On systems that do not have hardware enforced cache coherency between
encrypted and unencrypted mappings of the same physical page, the
hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
contents of an SEV guest page. When a small number of pages are being
flushed, this can be used in place of issuing a WBINVD across all CPUs.
CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
available. Add a CPUID feature to indicate it is supported and define the
MSR.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit
7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
changed the padding bytes between functions from NOP to INT3. However,
when optprobe decodes a target function it finds INT3 and gives up the
jump optimization.
Instead of giving up any INT3 detection, check whether the rest of the
bytes to the end of the function are INT3. If all of them are INT3,
those come from the linker. In that case, continue the optprobe jump
optimization.
[ bp: Massage commit message. ]
Fixes: 7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
Reported-by: Adam Zabrocki <pi3@pi3.com.pl>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2
Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.
A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
SEV uses SWIOTLB to make this happen without requiring changes to device
drivers. However, depending on the workload being run, the default 64MB
of it might not be enough and it may run out of buffers to use for DMA,
resulting in I/O errors and/or performance degradation for high
I/O workloads.
Adjust the default size of SWIOTLB for SEV guests using a
percentage of the total memory available to guest for the SWIOTLB buffers.
Adds a new sev_setup_arch() function which is invoked from setup_arch()
and it calls into a new swiotlb generic code function swiotlb_adjust_size()
to do the SWIOTLB buffer adjustment.
v5 fixed build errors and warnings as
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.
# dmesg | grep 'Estimated ratio of average max'
[ 14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.
On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.
Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.
Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.
[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]
Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);
which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.
That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/87k0tpju47.fsf@nanos.tec.linutronix.de
Prarit reported that depending on the affinity setting the
' irq $N: Affinity broken due to vector space exhaustion.'
message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.
Shung-Hsi provided traces and analysis which pinpoints the problem:
The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:
1) Try the intersection of affinity mask and node mask
2) Try the node mask
3) Try the full affinity mask
4) Try the full online mask
Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.
In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.
Revert the order of #2 and #3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.
If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.
Fixes: d6ffc6ac83 ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:
rdtgroup_mondata_show()
mon_event_read()
mon_event_count()
__mon_event_count()
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
&& make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Before fix:
local b/w (bytes/s): 11076796416
After fix:
local b/w (bytes/s): 5465014272
Fixes: ba0f26d852 (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
Fix to restore BTF if single-stepping causes a page fault and
it is cancelled.
Usually the BTF flag was restored when the single stepping is done
(in resume_execution()). However, if a page fault happens on the
single stepping instruction, the fault handler is invoked and
the single stepping is cancelled. Thus, the BTF flag is not
restored.
Fixes: 1ecc798c67 ("x86: debugctlmsr kprobes")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
Commit
26bfa5f894 ("x86, amd: Cleanup init_amd")
moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().
However, bsp_init_amd() is executed well before the direct mapping is
actually created:
setup_arch()
-> early_cpu_init()
-> early_identify_cpu()
-> this_cpu->c_bsp_init()
-> bsp_init_amd()
...
-> init_mem_mapping()
So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.
It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.
Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.
Committer notes:
This workaround is incomplete anyway:
1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.
2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.
3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after
66520ebc2d ("x86, mm: Only direct map addresses that are marked as E820_RAM")
which means, there's nothing to be worked around either.
So let's rip it out.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
After having collected the majority of reports about MSRs being written
by userspace tools and what tools those are, and all newer reports
mostly repeating, add an URL where detailed information is gathered and
kept up-to-date.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201205002825.19107-1-bp@alien8.de
Add "deprecated" message to any access to old /proc/sgi_uv/* leaves.
[ bp: Do not have a trailing function opening brace and the arguments
continuing on the next line and align them on the opening brace. ]
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-5-mike.travis@hpe.com
Add kernel interfaces used to obtain info for the uv_sysfs driver
to display.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-2-mike.travis@hpe.com
Update kernel-doc parameter name after
c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")
changed the last parameter from @handler to @emulate.
[ bp: Make commit message more precise. ]
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201203145020.2441-1-hqjagain@gmail.com
- Make the AMD L3 QoS code and data priorization enable/disable mechanism
work correctly. The control bit was only set/cleared on one of the CPUs
in a L3 domain, but it has to be modified on all CPUs in the domain. The
initial documentation was not clear about this, but the updated one from
Oct 2020 spells it out.
- Fix an off by one in the UV platform detection code which causes the UV
hubs to be identified wrongly. The chip revisions start at 1 not at 0.
- Fix a long standing bug in the evaluation of prefixes in the uprobes
code which fails to handle repeated prefixes properly. The aggregate
size of the prefixes can be larger than the bytes array but the code
blindly iterated over the aggregate size beyond the array boundary.
Add a macro to handle this case properly and use it at the affected
places.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M2GoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUGeD/0TxQyVIZTJjY/ExDYvQZeSWgvFVon9
A/QcwkgtkWzYKI3YThgxBtSNKhHPP8eQ9QK8ErcaQKEHU5TdXVEOwHgTm1A6OGat
0+M9/EWEe7tTu+cLpu7esQ+1VbSvEcFXZbljbhCKrShlBlsIjFG4eCPAprSw9yjI
RSgfXKZu4NmHVS6nfJTwmIIaTeLQ6U3b7b1D5s/66slBFScqnLbRNhABVbHbos4F
pl/lxDCFOddy2YbEojHjjGqMA7oxPav7c0nYFOM/zG+wAqfEjbqOxReT31bGQPi2
XT9K4JEqDqILo0KnhV4GsYoWAhes3BtmsJ9IoZ7IijsMriYl80mD9URAORidJ4PX
28Ckk9V/DlE8uDrAnBDcWDSoKlg78mhVV7V9L6v43teg/gJfSZNROtNDBmqRmwG4
Op2NJfzJITtaxVQuSZRkSs8rzGv+QUfaM1sBUQ+Oz4KYeIjjA7G2MAOECrzIAWKB
GWc5toYRVS6oGT+RbZhSxZYoh8ASoGJ2MrL8K4OV4RqEqHHcXcih0WmmljtsDIFI
td4FHHH6fghIb9S6iYKiApd6k2qKa33mwJwa/xZOoIrv0w5xT0WDJnxT60gu/Mec
YDkqhmA009CNSD2G4oNRNF5MH7gp34UII+25jOGatbVh+5DDPYs+5Jnh/DR7jssR
PryAG9ER7UUb6w==
=rNBq
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Make the AMD L3 QoS code and data priorization enable/disable
mechanism work correctly.
The control bit was only set/cleared on one of the CPUs in a L3
domain, but it has to be modified on all CPUs in the domain. The
initial documentation was not clear about this, but the updated one
from Oct 2020 spells it out.
- Fix an off by one in the UV platform detection code which causes
the UV hubs to be identified wrongly.
The chip revisions start at 1 not at 0.
- Fix a long standing bug in the evaluation of prefixes in the
uprobes code which fails to handle repeated prefixes properly.
The aggregate size of the prefixes can be larger than the bytes
array but the code blindly iterated over the aggregate size beyond
the array boundary. Add a macro to handle this case properly and
use it at the affected places"
* tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
x86/platform/uv: Fix UV4 hub revision adjustment
x86/resctrl: Fix AMD L3 QOS CDP enable/disable
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be
insn.prefixes.bytes[i] != 0 and i < 4
instead of using insn.prefixes.nbytes.
Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
Kees Cook <keescook@chromium.org>.
[ bp: Massage commit message, sync with the respective header in tools/
and drop "we". ]
Fixes: 2b14449835 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2
The sgx_enclave_add_pages.length field is documented as
* @length: length of the data (multiple of the page size)
Fail with -EINVAL, when the caller gives a zero length buffer of data
to be added as pages to an enclave. Right now 'ret' is returned as
uninitialized in that case.
[ bp: Flesh out commit message. ]
Fixes: c6d26d3707 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-sgx/X8ehQssnslm194ld@mwanda/
Link: https://lkml.kernel.org/r/20201203183527.139317-1-jarkko@kernel.org
SYS_USER_DISPATCH will be triggered when a syscall is sent to userspace
by the Syscall User Dispatch mechanism. This adjusts eventual
BUILD_BUG_ON around the tree.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-3-krisman@collabora.com
Currently, if an MCE happens in user-mode or while the kernel is copying
data from user space, 'kill_it' is used to check if execution of the
interrupted task can be recovered or not; the flag name however is not
very meaningful, hence rename it to match its goal.
[ bp: Massage commit message, rename the queue_task_work() arg too. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
Currently, __mc_scan_banks() in do_machine_check() does the following
callchain:
__mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work).
Hence, the call to irq_work_queue() below after __mc_scan_banks()
seems redundant. Just remove it.
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
Right now for LMCE, if no_way_out is set, mce_panic() is called
regardless of mca_cfg.tolerant. This is not correct as, if
mca_cfg.tolerant = 3, the code should never panic.
Add that check.
[ bp: use local ptr 'cfg'. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
Right now, for local MCEs the machine calls panic(), if needed, right
after lmce is set. For MCE broadcasting, mce_reign() takes care of
calling mce_panic().
Hence:
- improve readability by moving the conditional evaluation of
tolerant up to when kill_it is set first;
- move the mce_panic() call up into the statement where mce_end()
fails.
[ bp: Massage, remove comment in the mce_end() failure case because it
is superfluous; use local ptr 'cfg' in both tests. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
Commit
fd8d9db355 ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
removed superfluous kernfs_get() calls in rdtgroup_ctrl_remove() and
rdtgroup_rmdir_ctrl(). That change resulted in an unused function
parameter to these two functions.
Clean up the unused function parameter in rdtgroup_ctrl_remove(),
rdtgroup_rmdir_mon() and their callers rdtgroup_rmdir_ctrl() and
rdtgroup_rmdir().
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1606759618-13181-1-git-send-email-xiaochen.shen@intel.com
When the AMD QoS feature CDP (code and data prioritization) is enabled
or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs
in an L3 domain (core complex). That is not correct - the CDP bit needs
to be updated on all the logical CPUs in the domain.
This was not spelled out clearly in the spec earlier. The specification
has been updated and the updated document, "AMD64 Technology Platform
Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue
Date: October 2020" is available now. Refer the section: Code and Data
Prioritization.
Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache
data structure.
The documentation can be obtained at:
https://developer.amd.com/wp-content/resources/56375.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
[ bp: Massage commit message. ]
Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
idle path. Similar to the entry path the low level idle functions have to
be non-instrumentable.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
cT1B/lA1PHXj6Q==
=raED
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two more places which invoke tracing from RCU disabled regions in the
idle path.
Similar to the entry path the low level idle functions have to be
non-instrumentable"
* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intel_idle: Fix intel_idle() vs tracing
sched/idle: Fix arch_cpu_idle() vs tracing
resctrl fs (Xiaochen Shen)
- Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)
- A fix to not lose already seen MCE severity which determines whether
the machine can recover (Gabriele Paoloni)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/DeEcACgkQEsHwGGHe
VUrAlg/+KO2GMYiz5mNslI3tg0tUf0tDEw7bsfsibGL9XkkRRre0zn2pTN3GoutG
ZeM9d+epFNfv1msyTYXjhhMeO5XLGnuBzyeCv4wQkIH0zdNJSencPlr6tKnaMrAo
Alk/BxmvBmOOWHayBOhgNBVYWXBoACkIzIUFi0d3lRbgiGDhOqowYbYkLWIu/tvm
pI8anovuAfkluhh+tD96TyBrSa+GfWWiOYItNbJbmKAbGjGbJU5IEVmlXeMEziGO
7zmGNwl4Di2V3lU5oDJNcYvF2Ngok/L81QqOMKeVu7EYDHLBjHxp0rxwuE8QqlX8
bk40EEHyI1Aiwx/7WxB/ChnDOfpIpXljWEvropZeNOCY1LqDMnMnAJFrygQAmYeo
t9GxbhBYvtXSxTkkfTWzowt20Vmm2+r5j09kpq3tfMrZXwk7VrSF8TjQ/3iM/5iM
eV/wPaigqX0xZwdzF1o3fuXeM+RXHhMrnX+cKlFemUX9uFZi0/ZXk3TlJwCG/jKL
QWbpTBuxcDwXR3K1RNPjBJG1OA1cupZQ9ePK8h95H1844rEwglreypCOf3LwDJzU
9XEZo94V/J+gxNzg5iakidtdBMSst+SYY+vd/SUlLHXzkwcykOzwPgvZwVA7U2HM
74Eh7eQhIht3Lin35i9h1Ez/bxE/Ss6Fi9wnObp8ij5ImOaS5eU=
=JrT9
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A couple of urgent fixes which accumulated this last week:
- Two resctrl fixes to prevent refcount leaks when manipulating the
resctrl fs (Xiaochen Shen)
- Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)
- A fix to not lose already seen MCE severity which determines
whether the machine can recover (Gabriele Paoloni)"
* tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Do not overwrite no_way_out if mce_end() fails
x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
- Fix intel iommu driver when running on devices without VCCAP_REG
- Fix swiotlb and "iommu=pt" interaction under TXT (tboot)
- Fix missing return value check during device probe()
- Fix probe ordering for Qualcomm SMMU implementation
- Ensure page-sized mappings are used for AMD IOMMU buffers with SNP RMP
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl/A3msQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNAI6B/9/MLjurPmQrSusq9Y7dhnGR7aahtICLAwR
UDZHebGTeVxNqtTklQVl/1qgcY7DxU40LAO1777MM7eHOW1FnlCAWrkTo6BBQsGx
U1FKegOJ/0eVHFPtFDfM7IA2skeLwlZW+hywNLAksme5mtd6iZG9yQLlDFqAjxL7
v8uXgfHFn6Z2MvMc2O+IeKTtflIwPek/6rYuaEf7UknA6ZYPAD3hnu9i1RTEuUAA
h2PoVrrJ/KefEsCMUIq2jwMTsSvxohDH8ClGK9b6h74J2CLKKuhALSgABRAmsEL9
7w5TVdtMQ85n3ccnXyT4RBQ+O/eVtmsKSdfeAbFI4nG9e9j7YE1L
=BI1u
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull iommu fixes from Will Deacon:
"Here's another round of IOMMU fixes for -rc6 consisting mainly of a
bunch of independent driver fixes. Thomas agreed for me to take the
x86 'tboot' fix here, as it fixes a regression introduced by a vt-d
change.
- Fix intel iommu driver when running on devices without VCCAP_REG
- Fix swiotlb and "iommu=pt" interaction under TXT (tboot)
- Fix missing return value check during device probe()
- Fix probe ordering for Qualcomm SMMU implementation
- Ensure page-sized mappings are used for AMD IOMMU buffers with SNP
RMP"
* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
iommu/vt-d: Don't read VCCAP register unless it exists
x86/tboot: Don't disable swiotlb when iommu is forced on
iommu: Check return of __iommu_attach_device()
arm-smmu-qcom: Ensure the qcom_scm driver has finished probing
iommu/amd: Enforce 4k mapping for certain IOMMU data structures
Currently, if mce_end() fails, no_way_out - the variable denoting
whether the machine can recover from this MCE - is determined by whether
the worst severity that was found across the MCA banks associated with
the current CPU, is of panic severity.
However, at this point no_way_out could have been already set by
mca_start() after looking at all severities of all CPUs that entered the
MCE handler. If mce_end() fails, check first if no_way_out is already
set and, if so, stick to it, otherwise use the local worst value.
[ bp: Massage. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command
line, IBPB is force-enabled and STIPB is conditionally-enabled (or not
available).
However, since
21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP}
instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour.
Because the issuing of IBPB relies on the switch_mm_*_ibpb static
branches, the mitigations behave as expected.
Since
1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
this discrepency caused the misreporting of IB speculation via prctl().
On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb,
prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL |
PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are
always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB
speculation mode, even though the flag is ignored.
Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should
also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not
available.
[ bp: Massage commit message. ]
Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Fixes: 1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
After commit 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA
domain"), swiotlb could also be used for direct memory access if IOMMU
is enabled but a device is configured to pass through the DMA translation.
Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work
if "iommu=pt" kernel parameter is used.
Fixes: 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA domain")
Reported-and-tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201125014124.4070776-1-baolu.lu@linux.intel.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210237
Signed-off-by: Will Deacon <will@kernel.org>
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.
(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
Replace kmap_atomic_pfn() with kmap_local_pfn() which is preemptible and
can take page faults.
Remove the indirection of the dump page and the related cruft which is not
longer required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.670851839@linutronix.de
On resource group creation via a mkdir an extra kernfs_node reference is
obtained by kernfs_get() to ensure that the rdtgroup structure remains
accessible for the rdtgroup_kn_unlock() calls where it is removed on
deletion. Currently the extra kernfs_node reference count is only
dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
structure is removed in a few other locations that lack the matching
reference drop.
In call paths of rmdir and umount, when a control group is removed,
kernfs_remove() is called to remove the whole kernfs nodes tree of the
control group (including the kernfs nodes trees of all child monitoring
groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
structures of all child monitoring groups under the control group are
freed by kfree() in free_all_child_rdtgrp().
Before calling kfree() to free the rdtgroup structures, the kernfs node
of the control group itself as well as the kernfs nodes of all child
monitoring groups still take the extra references which will never be
dropped to 0 and the kernfs nodes will never be freed. It leads to
reference count leak and kernfs_node_cache memory leak.
For example, reference count leak is observed in these two cases:
(1) mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/c1
mkdir /sys/fs/resctrl/c1/mon_groups/m1
umount /sys/fs/resctrl
(2) mkdir /sys/fs/resctrl/c1
mkdir /sys/fs/resctrl/c1/mon_groups/m1
rmdir /sys/fs/resctrl/c1
The same reference count leak issue also exists in the error exit paths
of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().
Fix this issue by following changes to make sure the extra kernfs_node
reference on rdtgroup is dropped before freeing the rdtgroup structure.
(1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
kernfs_put() and kfree().
(2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
structure is about to be freed by kfree().
(3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
exit paths of mkdir where an extra reference is taken by kernfs_get().
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
Willem reported growing of kernfs_node_cache entries in slabtop when
repeatedly creating and removing resctrl subdirectories as well as when
repeatedly mounting and unmounting the resctrl filesystem.
On resource group (control as well as monitoring) creation via a mkdir
an extra kernfs_node reference is obtained to ensure that the rdtgroup
structure remains accessible for the rdtgroup_kn_unlock() calls where it
is removed on deletion. The kernfs_node reference count is dropped by
kernfs_put() in rdtgroup_kn_unlock().
With the above explaining the need for one kernfs_get()/kernfs_put()
pair in resctrl there are more places where a kernfs_node reference is
obtained without a corresponding release. The excessive amount of
reference count on kernfs nodes will never be dropped to 0 and the
kernfs nodes will never be freed in the call paths of rmdir and umount.
It leads to reference count leak and kernfs_node_cache memory leak.
Remove the superfluous kernfs_get() calls and expand the existing
comments surrounding the remaining kernfs_get()/kernfs_put() pair that
remains in use.
Superfluous kernfs_get() calls are removed from two areas:
(1) In call paths of mount and mkdir, when kernfs nodes for "info",
"mon_groups" and "mon_data" directories and sub-directories are
created, the reference count of newly created kernfs node is set to 1.
But after kernfs_create_dir() returns, superfluous kernfs_get() are
called to take an additional reference.
(2) kernfs_get() calls in rmdir call paths.
Fixes: 17eafd0762 ("x86/intel_rdt: Split resource group removal in two")
Fixes: 4af4a88e0c ("x86/intel_rdt/cqm: Add mount,umount support")
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Fixes: 5dc1d5c6ba ("x86/intel_rdt: Simplify info and base file lists")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Fixes: 4e978d06de ("x86/intel_rdt: Add "info" files to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
Fix
./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Function parameter or member \
'encl' not described in 'sgx_ioc_enclave_provision'
./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Excess function parameter \
'enclave' description in 'sgx_ioc_enclave_provision'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201123181922.0c009406@canb.auug.org.au
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.
This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.
This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.
Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
same because the proper one is going through the IOMMU tree. (Thomas Gleixner)
* An Intel microcode loader fix to save the correct microcode patch to
apply during resume. (Chen Yu)
* A fix to not access user memory of other processes when dumping opcode
bytes. (Thomas Gleixner)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+6QnYACgkQEsHwGGHe
VUpMgg//cE4GvK7dFihq5iNYS16yKKgyVBLwHiHoCxQkOu4CPSf0EejTMLKPSKpA
hS7pJm6dMYLJ7PhIKRwlQD1NYet3SHCOi1h0Y2/3rVay82Vz/As2nZRd7paEkP9k
kiNd/Ib2uAQ9ZwCKhmLanOWDoXXFo5Fwmpw+O/zhn31Y3kBx05mc2ojXnEwCERK/
j6GwOBkfOq+q6D7HzsfEehlX0iy2rlPSq6OUN1H/Z4w+W9zfpubLX9bhhr9RvDmS
6Hsmwos8fM/cQXZC2oxoKUP6zeYk5lDF/Qf7hOFt3WgdCpTZP2z6uLwUL6+NXu/C
Ms+77Cx+Loe8UaDkyxXXruiv7mZIZCnl0phZulE4ibOjzL7W7w5cEf0LHdDS7uDL
JzZwTJlbK9RjrSp4pXR+rn+U/PdluZJYpjh/Gj8+V23QszvbDsVd55SV5+CLSTM/
qwrNp5ffLEnIFHmW98/OQKMvNTTzBFdlltnJm7mMW2JucgfCZMx4zwY0+dZ+ocDq
/Kt/HNCzez1jHWiS5jdZCM15KuIhPTNuTpq0ipU+UAXxZOaNtTo1st1lhKoPvTz/
k4y7S/EibX6PK4lxBQMM3fBG/T1KK7ZMgNEN0jbhWegqEmdWtgwc57p1gVE70XUU
f6elyUupD65TMwgio2I8sQuiNBCKfF2Wj144dxXGJnGnOoZKey8=
=z/Gq
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
same because the proper one is going through the IOMMU tree (Thomas
Gleixner)
- An Intel microcode loader fix to save the correct microcode patch to
apply during resume (Chen Yu)
- A fix to not access user memory of other processes when dumping
opcode bytes (Thomas Gleixner)
* tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
x86/dumpstack: Do not try to access user space code of other tasks
x86/microcode/intel: Check patch signature before saving microcode for early loading
iommu/vt-d: Take CONFIG_PCI_ATS into account
The kernel uses ACPI Boot Error Record Table (BERT) to report fatal
errors that occurred in a previous boot. The MCA errors in the BERT are
reported using the x86 Processor Error Common Platform Error Record
(CPER) format. Currently, the record prints out the raw MSR values and
AMD relies on the raw record to provide MCA information.
Extract the raw MSR values of MCA registers from the BERT and feed them
into mce_log() to decode them properly.
The implementation is SMCA-specific as the raw MCA register values are
given in the register offset order of the SMCA address space.
[ bp: Massage. ]
[ Fix a build breakage in patch v1. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20201119182938.151155-1-Smita.KoralahalliChannabasappa@amd.com
- Fix boot when intel iommu initialisation fails under TXT (tboot)
- Fix intel iommu compilation error when DMAR is enabled without ATS
- Temporarily update IOMMU MAINTAINERs entry
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+3p6gQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNPowCADBKq5PBRwqIM1tmntRu5rePf/WuB1BotlJ
cUUR8dIxFDvKMeGvN4mDbn/5jjL1+TlD7sQqFh+gcWJ9tImpAMzp1ENbg71isl5o
Ej21X9YhjfiIsKpTrNGxcetCkYaR2tp8Z4/WzSQaO+IGB578dQ9VwiwSnDyFdWUb
1Qcj712ainYvKjbq4NyL80lPm9v43OV8QZ73WeelzBjE2UcDFAV0Xl4BIX8uuGtv
I1x03JfgzmzhVBmpj5HUGPsSopMDdo5K4vlcqscqSqvBY90iHD5fgRF4ZatTjR1t
PVM/1+c9vdJIWSxSt1hsB8DPtqXVOvGjzrLW/h63rxyWsISwkAq0
=DNX/
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull iommu fixes from Will Deacon:
"Two straightforward vt-d fixes:
- Fix boot when intel iommu initialisation fails under TXT (tboot)
- Fix intel iommu compilation error when DMAR is enabled without ATS
and temporarily update IOMMU MAINTAINERs entry"
* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
MAINTAINERS: Temporarily add myself to the IOMMU entry
iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set
iommu/vt-d: Avoid panic if iommu init fails in tboot system
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:
=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x77/0x97
__is_insn_slot_addr+0x15d/0x170
kernel_text_address+0xba/0xe0
? get_stack_info+0x22/0xa0
__kernel_text_address+0x9/0x30
show_trace_log_lvl+0x17d/0x380
? dump_stack+0x77/0x97
dump_stack+0x77/0x97
__lock_acquire+0xdf7/0x1bf0
lock_acquire+0x258/0x3d0
? vprintk_emit+0x6d/0x2c0
_raw_spin_lock+0x27/0x40
? vprintk_emit+0x6d/0x2c0
vprintk_emit+0x6d/0x2c0
printk+0x4d/0x69
start_secondary+0x1c/0x100
secondary_startup_64_no_verify+0xb8/0xbb
This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.
Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The only usage of the kf_ops field in the rftype struct is to pass
it as argument to __kernfs_create_file(), which accepts a pointer to
const. Make it a pointer to const. This makes it possible to make
rdtgroup_kf_single_ops and kf_mondata_ops const, which allows the
compiler to put them in read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201110230228.801785-1-rikard.falkeborn@gmail.com
CPUID Leaf 0x1F defines a DIE_TYPE level (nb: ECX[8:15] level type == 0x5),
but CPUID Leaf 0xB does not. However, detect_extended_topology() will
set struct cpuinfo_x86.cpu_die_id regardless of whether a valid Die ID
was found.
Only set cpu_die_id if a DIE_TYPE level is found. CPU topology code may
use another value for cpu_die_id, e.g. the AMD NodeId on AMD-based
systems. Code ordering should be maintained so that the CPUID Leaf 0x1F
Die ID value will take precedence on systems that may use another value.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-5-Yazen.Ghannam@amd.com
The Last Level Cache ID is returned by amd_get_nb_id(). In practice,
this value is the same as the AMD NodeId for callers of this function.
The NodeId is saved in struct cpuinfo_x86.cpu_die_id.
Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and
remove the function.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
AMD systems provide a "NodeId" value that represents a global ID
indicating to which "Node" a logical CPU belongs. The "Node" is a
physical structure equivalent to a Die, and it should not be confused
with logical structures like NUMA nodes. Logical nodes can be adjusted
based on firmware or other settings whereas the physical nodes/dies are
fixed based on hardware topology.
The NodeId value can be used when a physical ID is needed by software.
Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value
from CPUID or MSR as appropriate. Default to phys_proc_id otherwise.
Do so for both AMD and Hygon systems.
Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no
longer needed.
Update the x86 topology documentation.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com
Return -ERESTARTSYS instead of -EINTR in sgx_ioc_enclave_add_pages()
when interrupted before any pages have been processed. At this point
ioctl can be obviously safely restarted.
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201118213932.63341-1-jarkko@kernel.org
- Cure the fallout from the MSI irqdomain overhaul which missed that the
Intel IOMMU does not register virtual function devices and therefore
never reaches the point where the MSI interrupt domain is assigned. This
makes the VF devices use the non-remapped MSI domain which is trapped by
the IOMMU/remap unit.
- Remove an extra space in the SGI_UV architecture type procfs output for
UV5.
- Remove a unused function which was missed when removing the UV BAU TLB
shootdown handler.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
ZqkZbqCZm3vfHA==
=X/P+
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/iommu/fixes
Pull in x86 fixes from Thomas, as they include a change to the Intel DMAR
code on which we depend:
* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iommu/vt-d: Cure VF irqdomain hickup
x86/platform/uv: Fix copied UV5 output archtype
x86/platform/uv: Drop last traces of uv_flush_tlb_others
Short Version:
The SGX section->laundry_list structure is effectively thread-local, but
declared next to some shared structures. Its semantics are clear as mud.
Fix that. No functional changes. Compile tested only.
Long Version:
The SGX hardware keeps per-page metadata. This can provide things like
permissions, integrity and replay protection. It also prevents things
like having an enclave page mapped multiple times or shared between
enclaves.
But, that presents a problem for kexec()'d kernels (or any other kernel
that does not run immediately after a hardware reset). This is because
the last kernel may have been rude and forgotten to reset pages, which
would trigger the "shared page" sanity check.
To fix this, the SGX code "launders" the pages by running the EREMOVE
instruction on all pages at boot. This is slow and can take a long
time, so it is performed off in the SGX-specific ksgxd instead of being
synchronous at boot. The init code hands the list of pages to launder in
a per-SGX-section list: ->laundry_list. The only code to touch this list
is the init code and ksgxd. This means that no locking is necessary for
->laundry_list.
However, a lock is required for section->page_list, which is accessed
while creating enclaves and by ksgxd. This lock (section->lock) is
acquired by ksgxd while also processing ->laundry_list. It is easy to
confuse the purpose of the locking as being for ->laundry_list and
->page_list.
Rename ->laundry_list to ->init_laundry_list to make it clear that this
is not normally used at runtime. Also add some comments clarifying the
locking, and reorganize 'sgx_epc_section' to put 'lock' near the things
it protects.
Note: init_laundry_list is 128 bytes of wasted space at runtime. It
could theoretically be dynamically allocated and then freed after
the laundering process. But it would take nearly 128 bytes of extra
instructions to do that.
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201116222531.4834-1-dave.hansen@intel.com
Commit
4b47cdbda6 ("x86/head/64: Move early exception dispatch to C code")
removed the usage of GET_CR2_INTO().
Drop the definition as well, and related definitions in paravirt.h and
asm-offsets.h
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151208.2212886-3-nivedita@alum.mit.edu
Enclave memory is normally inaccessible from outside the enclave. This
makes enclaves hard to debug. However, enclaves can be put in a debug
mode when they are being built. In that mode, enclave data *can* be read
and/or written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions.
This is obviously only for debugging and destroys all the protections
present with normal enclaves. But, enclaves know their own debug status
and can adjust their behavior appropriately.
Add a vm_ops->access() implementation which can be used to read and write
memory inside debug enclaves. This is typically used via ptrace() APIs.
[ bp: Massage. ]
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-23-jarkko@kernel.org
Just like normal RAM, there is a limited amount of enclave memory available
and overcommitting it is a very valuable tool to reduce resource use.
Introduce a simple reclaim mechanism for enclave pages.
In contrast to normal page reclaim, the kernel cannot directly access
enclave memory. To get around this, the SGX architecture provides a set of
functions to help. Among other things, these functions copy enclave memory
to and from normal memory, encrypting it and protecting its integrity in
the process.
Implement a page reclaimer by using these functions. Picks victim pages in
LRU fashion from all the enclaves running in the system. A new kernel
thread (ksgxswapd) reclaims pages in the background based on watermarks,
similar to normal kswapd.
All enclave pages can be reclaimed, architecturally. But, there are some
limits to this, such as the special SECS metadata page which must be
reclaimed last. The page version array (used to mitigate replaying old
reclaimed pages) is also architecturally reclaimable, but not yet
implemented. The end result is that the vast majority of enclave pages are
currently reclaimable.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-22-jarkko@kernel.org
vDSO functions can now leverage an exception fixup mechanism similar to
kernel exception fixup. For vDSO exception fixup, the initial user is
Intel's Software Guard Extensions (SGX), which will wrap the low-level
transitions to/from the enclave, i.e. EENTER and ERESUME instructions,
in a vDSO function and leverage fixup to intercept exceptions that would
otherwise generate a signal. This allows the vDSO wrapper to return the
fault information directly to its caller, obviating the need for SGX
applications and libraries to juggle signal handlers.
Attempt to fixup vDSO exceptions immediately prior to populating and
sending signal information. Except for the delivery mechanism, an
exception in a vDSO function should be treated like any other exception
in userspace, e.g. any fault that is successfully handled by the kernel
should not be directly visible to userspace.
Although it's debatable whether or not all exceptions are of interest to
enclaves, defer to the vDSO fixup to decide whether to do fixup or
generate a signal. Future users of vDSO fixup, if there ever are any,
will undoubtedly have different requirements than SGX enclaves, e.g. the
fixup vs. signal logic can be made function specific if/when necessary.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-19-jarkko@kernel.org
The whole point of SGX is to create a hardware protected place to do
“stuff”. But, before someone is willing to hand over the keys to
the castle , an enclave must often prove that it is running on an
SGX-protected processor. Provisioning enclaves play a key role in
providing proof.
There are actually three different enclaves in play in order to make this
happen:
1. The application enclave. The familiar one we know and love that runs
the actual code that’s doing real work. There can be many of these on
a single system, or even in a single application.
2. The quoting enclave (QE). The QE is mentioned in lots of silly
whitepapers, but, for the purposes of kernel enabling, just pretend they
do not exist.
3. The provisioning enclave. There is typically only one of these
enclaves per system. Provisioning enclaves have access to a special
hardware key.
They can use this key to help to generate certificates which serve as
proof that enclaves are running on trusted SGX hardware. These
certificates can be passed around without revealing the special key.
Any user who can create a provisioning enclave can access the
processor-unique Provisioning Certificate Key which has privacy and
fingerprinting implications. Even if a user is permitted to create
normal application enclaves (via /dev/sgx_enclave), they should not be
able to create provisioning enclaves. That means a separate permissions
scheme is needed to control provisioning enclave privileges.
Implement a separate device file (/dev/sgx_provision) which allows
creating provisioning enclaves. This device will typically have more
strict permissions than the plain enclave device.
The actual device “driver” is an empty stub. Open file descriptors for
this device will represent a token which allows provisioning enclave duty.
This file descriptor can be passed around and ultimately given as an
argument to the /dev/sgx_enclave driver ioctl().
[ bp: Touchups. ]
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-security-module@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112220135.165028-16-jarkko@kernel.org
Enclaves have two basic states. They are either being built and are
malleable and can be modified by doing things like adding pages. Or,
they are locked down and not accepting changes. They can only be run
after they have been locked down. The ENCLS[EINIT] function induces the
transition from being malleable to locked-down.
Add an ioctl() that performs ENCLS[EINIT]. After this, new pages can
no longer be added with ENCLS[EADD]. This is also the time where the
enclave can be measured to verify its integrity.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-15-jarkko@kernel.org
SGX enclave pages are inaccessible to normal software. They must be
populated with data by copying from normal memory with the help of the
EADD and EEXTEND functions of the ENCLS instruction.
Add an ioctl() which performs EADD that adds new data to an enclave, and
optionally EEXTEND functions that hash the page contents and use the
hash as part of enclave “measurement” to ensure enclave integrity.
The enclave author gets to decide which pages will be included in the
enclave measurement with EEXTEND. Measurement is very slow and has
sometimes has very little value. For instance, an enclave _could_
measure every page of data and code, but would be slow to initialize.
Or, it might just measure its code and then trust that code to
initialize the bulk of its data after it starts running.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-14-jarkko@kernel.org
Add an ioctl() that performs the ECREATE function of the ENCLS
instruction, which creates an SGX Enclave Control Structure (SECS).
Although the SECS is an in-memory data structure, it is present in
enclave memory and is not directly accessible by software.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-13-jarkko@kernel.org
Intel(R) SGX is a new hardware functionality that can be used by
applications to set aside private regions of code and data called
enclaves. New hardware protects enclave code and data from outside
access and modification.
Add a driver that presents a device file and ioctl API to build and
manage enclaves.
[ bp: Small touchups, remove unused encl variable in sgx_encl_find() as
Reported-by: kernel test robot <lkp@intel.com> ]
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-12-jarkko@kernel.org
"intel_iommu=off" command line is used to disable iommu but iommu is force
enabled in a tboot system for security reason.
However for better performance on high speed network device, a new option
"intel_iommu=tboot_noforce" is introduced to disable the force on.
By default kernel should panic if iommu init fail in tboot for security
reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off".
Fix the code setting force_on and move intel_iommu_tboot_noforce
from tboot code to intel iommu code.
Fixes: 7304e8f28b ("iommu/vt-d: Correctly disable Intel IOMMU force on")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Tested-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
sysrq-t ends up invoking show_opcodes() for each task which tries to access
the user space code of other processes, which is obviously bogus.
It either manages to dump where the foreign task's regs->ip points to in a
valid mapping of the current task or triggers a pagefault and prints "Code:
Bad RIP value.". Both is just wrong.
Add a safeguard in copy_code() and check whether the @regs pointer matches
currents pt_regs. If not, do not even try to access it.
While at it, add commentary why using copy_from_user_nmi() is safe in
copy_code() even if the function name suggests otherwise.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201117202753.667274723@linutronix.de
show_trace_log_lvl() is not used by other compilation units so make it
static and remove the declaration from the header file.
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201113133943.GA136221@rlk
Add functions for runtime allocation and free.
This allocator and its algorithms are as simple as it gets. They do a
linear search across all EPC sections and find the first free page. They
are not NUMA-aware and only hand out individual pages. The SGX hardware
does not support large pages, so something more complicated like a buddy
allocator is unwarranted.
The free function (sgx_free_epc_page()) implicitly calls ENCLS[EREMOVE],
which returns the page to the uninitialized state. This ensures that the
page is ready for use at the next allocation.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-10-jarkko@kernel.org
Add a kernel parameter to disable SGX kernel support and document it.
[ bp: Massage. ]
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-9-jarkko@kernel.org
Kernel support for SGX is ultimately decided by the state of the launch
control bits in the feature control MSR (MSR_IA32_FEAT_CTL). If the
hardware supports SGX, but neglects to support flexible launch control, the
kernel will not enable SGX.
Enable SGX at feature control MSR initialization and update the associated
X86_FEATURE flags accordingly. Disable X86_FEATURE_SGX (and all
derivatives) if the kernel is not able to establish itself as the authority
over SGX Launch Control.
All checks are performed for each logical CPU (not just boot CPU) in order
to verify that MSR_IA32_FEATURE_CONTROL is correctly configured on all
CPUs. All SGX code in this series expects the same configuration from all
CPUs.
This differs from VMX where X86_FEATURE_VMX is intentionally cleared only
for the current CPU so that KVM can provide additional information if KVM
fails to load like which CPU doesn't support VMX. There’s not much the
kernel or an administrator can do to fix the situation, so SGX neglects to
convey additional details about these kinds of failures if they occur.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-8-jarkko@kernel.org
Although carved out of normal DRAM, enclave memory is marked in the
system memory map as reserved and is not managed by the core mm. There
may be several regions spread across the system. Each contiguous region
is called an Enclave Page Cache (EPC) section. EPC sections are
enumerated via CPUID
Enclave pages can only be accessed when they are mapped as part of an
enclave, by a hardware thread running inside the enclave.
Parse CPUID data, create metadata for EPC pages and populate a simple
EPC page allocator. Although much smaller, ‘struct sgx_epc_page’
metadata is the SGX analog of the core mm ‘struct page’.
Similar to how the core mm’s page->flags encode zone and NUMA
information, embed the EPC section index to the first eight bits of
sgx_epc_page->desc. This allows a quick reverse lookup from EPC page to
EPC section. Existing client hardware supports only a single section,
while upcoming server hardware will support at most eight sections.
Thus, eight bits should be enough for long term needs.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-6-jarkko@kernel.org
ENCLS is the userspace instruction which wraps virtually all
unprivileged SGX functionality for managing enclaves. It is essentially
the ioctl() of instructions with each function implementing different
SGX-related functionality.
Add macros to wrap the ENCLS functionality. There are two main groups,
one for functions which do not return error codes and a “ret_” set for
those that do.
ENCLS functions are documented in Intel SDM section 36.6.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-3-jarkko@kernel.org
Define the SGX architectural data structures used by various SGX
functions. This is not an exhaustive representation of all SGX data
structures but only those needed by the kernel.
The goal is to sequester hardware structures in "sgx/arch.h" and keep
them separate from kernel-internal or uapi structures.
The data structures are described in Intel SDM section 37.6.
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-2-jarkko@kernel.org
Currently, scan_microcode() leverages microcode_matches() to check
if the microcode matches the CPU by comparing the family and model.
However, the processor stepping and flags of the microcode signature
should also be considered when saving a microcode patch for early
update.
Use find_matching_signature() in scan_microcode() and get rid of the
now-unused microcode_matches() which is a good cleanup in itself.
Complete the verification of the patch being saved for early loading in
save_microcode_patch() directly. This needs to be done there too because
save_mc_for_early() will call save_microcode_patch() too.
The second reason why this needs to be done is because the loader still
tries to support, at least hypothetically, mixed-steppings systems and
thus adds all patches to the cache that belong to the same CPU model
albeit with different steppings.
For example:
microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6
microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23
microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23
microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23
The patch which is being saved for early loading, however, can only be
the one which fits the CPU this runs on so do the signature verification
before saving.
[ bp: Do signature verification in save_microcode_patch()
and rewrite commit message. ]
Fixes: ec400ddeff ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535
Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.chen@intel.com
This macro is useless, and could cause gcc warning:
arch/x86/kernel/kvmclock.c:47:0: warning: macro "HV_CLOCK_SIZE" is not
used [-Wunused-macros]
Let's remove it.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <1604651963-10067-1-git-send-email-alex.shi@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that all in-kernel-tree users are converted to using the sysfs file,
remove the MSR from the "allowlist".
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20201029190259.3476-5-bp@alien8.de
Booting as a guest under KVM results in error messages about
unchecked MSR access:
unchecked MSR access error: RDMSR from 0x17f at rIP: 0xffffffff84483f16 (mce_intel_feature_init+0x156/0x270)
because KVM doesn't provide emulation for random model specific
registers.
Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message.
Fixes: 68299a42f8 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201111003954.GA11878@agluck-desk2.amr.corp.intel.com
- Cure the fallout from the MSI irqdomain overhaul which missed that the
Intel IOMMU does not register virtual function devices and therefore
never reaches the point where the MSI interrupt domain is assigned. This
makes the VF devices use the non-remapped MSI domain which is trapped by
the IOMMU/remap unit.
- Remove an extra space in the SGI_UV architecture type procfs output for
UV5.
- Remove a unused function which was missed when removing the UV BAU TLB
shootdown handler.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
ZqkZbqCZm3vfHA==
=X/P+
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of fixes for x86:
- Cure the fallout from the MSI irqdomain overhaul which missed that
the Intel IOMMU does not register virtual function devices and
therefore never reaches the point where the MSI interrupt domain is
assigned. This made the VF devices use the non-remapped MSI domain
which is trapped by the IOMMU/remap unit
- Remove an extra space in the SGI_UV architecture type procfs output
for UV5
- Remove a unused function which was missed when removing the UV BAU
TLB shootdown handler"
* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iommu/vt-d: Cure VF irqdomain hickup
x86/platform/uv: Fix copied UV5 output archtype
x86/platform/uv: Drop last traces of uv_flush_tlb_others
- A set of commits which reduce the stack usage of various perf event
handling functions which allocated large data structs on stack causing
stack overflows in the worst case.
- Use the proper mechanism for detecting soft interrupts in the recursion
protection.
- Make the resursion protection simpler and more robust.
- Simplify the scheduling of event groups to make the code more robust and
prepare for fixing the issues vs. scheduling of exclusive event groups.
- Prevent event multiplexing and rotation for exclusive event groups
- Correct the perf event attribute exclusive semantics to take pinned
events, e.g. the PMU watchdog, into account
- Make the anythread filtering conditional for Intel's generic PMU
counters as it is not longer guaranteed to be supported on newer
CPUs. Check the corresponding CPUID leaf to make sure.
- Fixup a duplicate initialization in an array which was probably cause by
the usual copy & paste - forgot to edit mishap.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xIi0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofixD/4+4gc8DhOmAkMrN0Z9tiW8ebgMKmb9
wZRkMr5Osi0GzLJOPZ6SdY6jd0A3rMN/sW6P1DT6pDtcty4bKFoW5VZBuUDIAhel
BC4C93L3y1En/GEZu1GTy3LvsBwLBQTOoY4goDjbdAbk60S/0RTHOGyQsRsOQFe6
fVs3iXozAFuaR6I6N3dlxuJAE51zvr8MyBWaUoByNDB//1+lLNW+JfClaAOG1oXx
qZIg/niatBVGzSGgKNRUyh3g8G1HJtabsA/NZ4PH8ZHuYABfmj4lmmUPR77ICLfV
wMITEBG7eaktB8EqM9hvaoOZLA5kpXHO2JbCFSs4c4x11mlC8g7QMV3poCw33YoN
a5TmT1A3muri1riy1/Ee9lXACOq7/tf2+Xfn9o6dvDdBwd6s5pzlhLGR8gILp2lF
2bcg3IwYvHT/Kiurb/WGNpbCqQIPJpcUcfs3tNBCCtKegahUQNnGjxN3NVo9RCit
zfL6xIJ8eZiYnsxXx4NKm744AukWiql3aRNgRkOdBP5WC68xt6VLcxG1YZKUoDhy
jRSOCD/DuPSMSvAAgN7S8OWlPsKWBxVxxWYV+K8FpwhgzbQ3WbS3UDiYkhgjeOxu
OlM692oWpllKvQWlvYthr2Be6oPCRRi1vvADNNbTKzgHk5i61bwympsGl1EZx3Pz
2ROp7NJFRESnqw==
=FzCf
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of fixes for perf:
- A set of commits which reduce the stack usage of various perf
event handling functions which allocated large data structs on
stack causing stack overflows in the worst case
- Use the proper mechanism for detecting soft interrupts in the
recursion protection
- Make the resursion protection simpler and more robust
- Simplify the scheduling of event groups to make the code more
robust and prepare for fixing the issues vs. scheduling of
exclusive event groups
- Prevent event multiplexing and rotation for exclusive event groups
- Correct the perf event attribute exclusive semantics to take
pinned events, e.g. the PMU watchdog, into account
- Make the anythread filtering conditional for Intel's generic PMU
counters as it is not longer guaranteed to be supported on newer
CPUs. Check the corresponding CPUID leaf to make sure
- Fixup a duplicate initialization in an array which was probably
caused by the usual 'copy & paste - forgot to edit' mishap"
* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix Add BW copypasta
perf/x86/intel: Make anythread filter support conditional
perf: Tweak perf_event_attr::exclusive semantics
perf: Fix event multiplexing for exclusive groups
perf: Simplify group_sched_in()
perf: Simplify group_sched_out()
perf/x86: Make dummy_iregs static
perf/arch: Remove perf_sample_data::regs_user_copy
perf: Optimize get_recursion_context()
perf: Fix get_recursion_context()
perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
perf: Reduce stack usage of perf_output_begin()
When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call
will be able to set the ip of the calling function. This will improve the
performance of live kernel patching where it does not need all the regs to
be stored just to change the instruction pointer.
If all archs that support live kernel patching also support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture specific function
klp_arch_set_pc() could be made generic.
It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but
not HAVE_DYNAMIC_FTRACE_WITH_REGS and then have access to live patching.
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: live-patching@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the only way to get access to the registers of a function via a
ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this
saves all regs as if a breakpoint were to trigger (for use with kprobes), it
is expensive.
The regs are already saved on the stack for the default ftrace callbacks, as
that is required otherwise a function being traced will get the wrong
arguments and possibly crash. And on x86, the arguments are already stored
where they would be on a pt_regs structure to use that code for both the
regs version of a callback, it makes sense to pass that information always
to all functions.
If an architecture does this (as x86_64 now does), it is to set
HAVE_DYNAMIC_FTRACE_WITH_ARGS, and this will let the generic code that it
could have access to arguments without having to set the flags.
This also includes having the stack pointer being saved, which could be used
for accessing arguments on the stack, as well as having the function graph
tracer not require its own trampoline!
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.
For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.
This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A test shows that the output contains a space:
# cat /proc/sgi_uv/archtype
NSGI4 U/UVX
Remove that embedded space by copying the "trimmed" buffer instead of the
untrimmed input character list. Use sizeof to remove size dependency on
copy out length. Increase output buffer size by one character just in case
BIOS sends an 8 character string for archtype.
Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20201111010418.82133-1-mike.travis@hpe.com
PCI's default trigger type is level and ISA's is edge. The recent
refactoring made it the other way round, which went unnoticed as it seems
only to cause havoc on some AMD systems.
Make the comment and code do the right thing again.
Fixes: a27dca645d ("x86/io_apic: Cleanup trigger/polarity helpers")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/87d00lgu13.fsf@nanos.tec.linutronix.de
struct perf_sample_data lives on-stack, we should be careful about it's
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck, solve it differently.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org
- Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
integrated assembler upset.
- Correct the mitigation selection logic which prevented the related prctl
to work correctly.
- Make the UV5 hubless system work correctly by fixing up the malformed
table entries and adding the missing ones.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oDNYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaN0EACPWY15k1YuAEIjiQxRBhq22J8Y6wNX
Ui/rF2AZcAnNEJDTIyvjP6COnT9mjX/tuuluMaI6i/XY/9Xp5LpKvivkL2PXNN3X
onW01ouIc1iYxXwQEVZvhYHsOyhkR9Z8yNG/q9I7xYAXNSZcAHwXVar4VlPBT7Ay
iP75i8pGmb/NCc4oHNXuBp/dV/0/dCoLTndb5p5pX8oS60AAt9ZuK3IRc3ucayhI
M4rTTEya1oY+ZNbtP4A4Jp7Qc/NGYDo6q04za+jcxZ5Gqacs+fk/PNuWgL1fZZtW
sn1D+SMWEb55Xcsdy976b29FFU/DcOcf7TRASzyKgyPW5jg1dP6BZ6U0wpVV3KZw
S2h5/pt48JZI7olrDsLQ0tzjALlk2CcFNrnRtOMDduHdw9wyz+Sg58lZYuvH3sXK
5ZblWRJ3JiBNsNO0sA3kd4sp7xWQB3ey6mkYD8Vqb7zRIt8aXT9jqBxhDrP+Vqs/
/UKv+BJfD6WxC0nQ4x6MS3g4sDvI+1SLfHSZ/UjWJ6NfYJW5/w429pFCaF73xCTd
cqxja1dZYixn7ioFZjolMUdvuDiC5B2+5+RzEV87kaDzO9QZQyvsl7G74MSfwx6G
DAydvuyJoxP2qVASobOBcVOzLQO7DsLzFZzJTttZcnkK2iprcz4qrsFLMxF9SxTD
Amb8qck60dLfqA==
=JdPk
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes:
- Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
integrated assembler upset
- Correct the mitigation selection logic which prevented the related
prctl to work correctly
- Make the UV5 hubless system work correctly by fixing up the
malformed table entries and adding the missing ones"
* tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Recognize UV5 hubless system identifier
x86/platform/uv: Remove spaces from OEM IDs
x86/platform/uv: Fix missing OEM_TABLE_ID
x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
Testing shows a problem in that UV5 hubless systems were not being
recognized. Add them to the list of OEM IDs checked.
Fixes: 6c7794423a ("Add UV5 direct references")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-4-mike.travis@hpe.com
Testing shows that trailing spaces caused problems with the OEM_ID and
the OEM_TABLE_ID. One being that the OEM_ID would not string compare
correctly. Another the OEM_ID and OEM_TABLE_ID would be concatenated
in the printout. Remove any trailing spaces.
Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-3-mike.travis@hpe.com
Testing shows a problem in that the OEM_TABLE_ID was missing for
hubless systems. This is used to determine the APIC type (legacy or
extended). Add the OEM_TABLE_ID to the early hubless processing.
Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-2-mike.travis@hpe.com
Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency. Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.
[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
The aperfmperf_snapshot_cpu() function is invoked upon access to
/proc/cpuinfo, and it does do an early exit if the specified CPU has
recently done a snapshot. Unfortunately, the indication that a snapshot
has been completed is set in an IPI handler, and the execution of this
handler can be delayed by any number of unfortunate events. This means
that a system that starts a number of applications, each of which
parses /proc/cpuinfo, can suffer from an smp_call_function_single()
storm, especially given that each access to /proc/cpuinfo invokes
smp_call_function_single() for all CPUs. Please note that this is not
theoretical speculation. Note also that one CPU's pending IPI serves
all requests, so there is no point in ever having more than one IPI
pending to a given CPU.
This commit therefore suppresses duplicate IPIs to a given CPU via a
new ->scfpending field in the aperfmperf_sample structure. This field
is set to the value one if an IPI is pending to the corresponding CPU
and to zero otherwise.
The aperfmperf_snapshot_cpu() function uses atomic_xchg() to set this
field to the value one and sample the old value. If this function's
"wait" parameter is zero, smp_call_function_single() is called only if
the old value of the ->scfpending field was zero. The IPI handler uses
atomic_set_release() to set this new field to zero just before returning,
so that the prior stores into the aperfmperf_sample structure are seen
by future requests that get to the atomic_xchg(). Future requests that
pass the elapsed-time check are ordered by the fact that on x86 loads act
as acquire loads, just as was the case prior to this change. The return
value is based off of the age of the prior snapshot, just as before.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
[ paulmck: Allow /proc/cpuinfo to take advantage of arch_freq_get_on_cpu(). ]
[ paulmck: Add comment on memory barrier. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Commit
c9c6d216ed ("x86/mce: Rename "first" function as "early"")
changed the enumeration of MCE notifier priorities. Correct the check
for notifier priorities to cover the new range.
[ bp: Rewrite commit message, remove superfluous brackets in
conditional. ]
Fixes: c9c6d216ed ("x86/mce: Rename "first" function as "early"")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com
This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.
Link: https://lkml.kernel.org/r/20201106023548.102375687@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.
The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.
Link: https://lkml.kernel.org/r/20201028115613.140212174@goodmis.org
Link: https://lkml.kernel.org/r/20201106023546.944907560@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move the x86 IMA arch code into security/integrity/ima/ima_efi.c,
so that we will be able to wire it up for arm64 in a future patch.
Co-developed-by: Chester Lin <clin@suse.com>
Signed-off-by: Chester Lin <clin@suse.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
On AMD CPUs which have the feature X86_FEATURE_AMD_STIBP_ALWAYS_ON,
STIBP is set to on and
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED
At the same time, IBPB can be set to conditional.
However, this leads to the case where it's impossible to turn on IBPB
for a process because in the PR_SPEC_DISABLE case in ib_prctl_set() the
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED
condition leads to a return before the task flag is set. Similarly,
ib_prctl_get() will return PR_SPEC_DISABLE even though IBPB is set to
conditional.
More generally, the following cases are possible:
1. STIBP = conditional && IBPB = on for spectre_v2_user=seccomp,ibpb
2. STIBP = on && IBPB = conditional for AMD CPUs with
X86_FEATURE_AMD_STIBP_ALWAYS_ON
The first case functions correctly today, but only because
spectre_v2_user_ibpb isn't updated to reflect the IBPB mode.
At a high level, this change does one thing. If either STIBP or IBPB
is set to conditional, allow the prctl to change the task flag.
Also, reflect that capability when querying the state. This isn't
perfect since it doesn't take into account if only STIBP or IBPB is
unconditionally on. But it allows the conditional feature to work as
expected, without affecting the unconditional one.
[ bp: Massage commit message and comment; space out statements for
better readability. ]
Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201105163246.v2.1.Ifd7243cd3e2c2206a893ad0a5b9a4f19549e22c6@changeid
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.
Move it to common code and extend irqentry_state_t to carry lockdep state.
[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
In commit b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to
find remapping irqdomain") the I/O-APIC code was changed to find its
parent irqdomain using irq_find_matching_fwspec(), but the key used
for the lookup was wrong. It shouldn't use 'ioapic' which is the index
into its own ioapics[] array. It should use the actual arbitration
ID of the I/O-APIC in question, which is mpc_ioapic_id(ioapic).
Fixes: b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain")
Reported-by: lkp <oliver.sang@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/57adf2c305cd0c5e9d860b2f3007a7e676fd0f9f.camel@infradead.org
When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel run on a CPU
with a >255 APIC ID.
The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.
On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
> 255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.
[ tglx: Updated comment as suggested by David ]
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20201103011136.59108-1-decui@microsoft.com
hypervisor checks before enabling encryption. (Joerg Roedel)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+hKN8ACgkQEsHwGGHe
VUrkZQ/+LWjbDrbkLCQpWuzLagAocZMKKvr4+2ujU+krj0QU5FFJbfuzhkktQD+H
cbfOW7+E8lqTDoj/dwoJPj2Xs8HvW4Ua6sbxF5lCPhlEr3NIetRfQ7SPj3qFvQG+
FKP/55RSnjKIx7aZXKN9YAw2FF3EC1BisjszCBKid5S8HbGqjLMb2Ue0i/nssksY
CvLwaxtDOGuSzJ8FwL+vmI70NkeLZ0ulTxbuxXAqfMTvJX3e1QA9dgeZMgfU1hng
eA1Pjlm0X7FOsnwihYP2EZ6NzRrTkYeGl1Iagz1apqlDlQ+bcaxvs2btIyb7MKt5
6PPDGg0P0WVMNfOEUYTZob31QcLnakA/p8kG8sYE6h2PlqO9Tf5cpmOJ6pv+DYFz
hfcjAZfamStUbWdWpr33RVCXN5pwZRu+UytD3JYykzgwmKXQxLHqrbjHXLO3zJ7k
+L0JE+N2vmi/7M5Ghsv3yKwy5fR5rMT5V6qEHSd1qrr9VpKBceNMJgPA8wh4882F
SD5sD2b6L/Cf9L4FAFqICHb/p4rxPRf5VnUoybo70U7EiwfbZQik5g3X5cO4KO2N
0z8nMk7dIZncQF0LYJNElIvKonrU8sIa+TbHjYyBWdQlOPgK4IlCvZeyjVUvUG24
kYx2WbANhCxGFd86rsl5P7xNXvBiSALf1afbQPvU0VTbZ43vSnQ=
=Pvgr
-----END PGP SIGNATURE-----
Merge tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV-ES fixes from Borislav Petkov:
"A couple of changes to the SEV-ES code to perform more stringent
hypervisor checks before enabling encryption (Joerg Roedel)"
* tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Do not support MMIO to/from encrypted memory
x86/head/64: Check SEV encryption before switching to kernel page-table
x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path
x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handler
x86/boot/compressed/64: Introduce sev_status
The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
optional additional error logging mode which is enabled by an MSR.
Previously, this mode was enabled from the mcelog(8) tool via /dev/cpu,
but userspace should not be poking at MSRs. So move the enabling into
the kernel.
[ bp: Correct the explanation why this is done. ]
Suggested-by: Boris Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201030190807.GA13884@agluck-desk2.amr.corp.intel.com
Commit
b4e0409a36 ("x86: check vmlinux limits, 64-bit")
added a check that the size of the 64-bit kernel is less than
KERNEL_IMAGE_SIZE.
The check uses (_end - _text), but this is not enough. The initial
PMD used in startup_64() (level2_kernel_pgt) can only map upto
KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text, and the
modules area (MODULES_VADDR) starts at KERNEL_IMAGE_SIZE.
The correct check is what is currently done for 32-bit, since
LOAD_OFFSET is defined appropriately for the two architectures. Just
check (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.
Note that on 32-bit, the limit is not strict: KERNEL_IMAGE_SIZE is not
really used by the main kernel. The higher the kernel is located, the
less the space available for the vmalloc area. However, it is used by
KASLR in the compressed stub to limit the maximum address of the kernel
to a safe value.
Clean up various comments to clarify that despite the name,
KERNEL_IMAGE_SIZE is not a limit on the size of the kernel image, but a
limit on the maximum virtual address that the image can occupy.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201029161903.2553528-1-nivedita@alum.mit.edu
MMIO memory is usually not mapped encrypted, so there is no reason to
support emulated MMIO when it is mapped encrypted.
Prevent a possible hypervisor attack where a RAM page is mapped as
an MMIO page in the nested page-table, so that any guest access to it
will trigger a #VC exception and leak the data on that page to the
hypervisor via the GHCB (like with valid MMIO). On the read side this
attack would allow the HV to inject data into the guest.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-6-joro@8bytes.org
When SEV is enabled, the kernel requests the C-bit position again from
the hypervisor to build its own page-table. Since the hypervisor is an
untrusted source, the C-bit position needs to be verified before the
kernel page-table is used.
Call sev_verify_cbit() before writing the CR3.
[ bp: Massage. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org
Check whether the hypervisor reported the correct C-bit when running as
an SEV guest. Using a wrong C-bit position could be used to leak
sensitive data from the guest to the hypervisor.
The check function is in a separate file:
arch/x86/kernel/sev_verify_cbit.S
so that it can be re-used in the running kernel image.
[ bp: Massage. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-4-joro@8bytes.org
The early #VC handler which doesn't have a GHCB can only handle CPUID
exit codes. It is needed by the early boot code to handle #VC exceptions
raised in verify_cpu() and to get the position of the C-bit.
But the CPUID information comes from the hypervisor which is untrusted
and might return results which trick the guest into the no-SEV boot path
with no C-bit set in the page-tables. All data written to memory would
then be unencrypted and could leak sensitive data to the hypervisor.
Add sanity checks to the early #VC handler to make sure the hypervisor
can not pretend that SEV is disabled.
[ bp: Massage a bit. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-3-joro@8bytes.org
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.
In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart(). Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.
For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk