- Proper emulation of the OSLock feature of the debug architecture
- Scalibility improvements for the MMU lock when dirty logging is on
- New VMID allocator, which will eventually help with SVA in VMs
- Better support for PMUs in heterogenous systems
- PSCI 1.1 support, enabling support for SYSTEM_RESET2
- Implement CONFIG_DEBUG_LIST at EL2
- Make CONFIG_ARM64_ERRATUM_2077057 default y
- Reduce the overhead of VM exit when no interrupt is pending
- Remove traces of 32bit ARM host support from the documentation
- Updated vgic selftests
- Various cleanups, doc updates and spelling fixes
RISC-V:
- Prevent KVM_COMPAT from being selected
- Optimize __kvm_riscv_switch_to() implementation
- RISC-V SBI v0.3 support
s390:
- memop selftest
- fix SCK locking
- adapter interruptions virtualization for secure guests
- add Claudio Imbrenda as maintainer
- first step to do proper storage key checking
x86:
- Continue switching kvm_x86_ops to static_call(); introduce
static_call_cond() and __static_call_ret0 when applicable.
- Cleanup unused arguments in several functions
- Synthesize AMD 0x80000021 leaf
- Fixes and optimization for Hyper-V sparse-bank hypercalls
- Implement Hyper-V's enlightened MSR bitmap for nested SVM
- Remove MMU auditing
- Eager splitting of page tables (new aka "TDP" MMU only) when dirty
page tracking is enabled
- Cleanup the implementation of the guest PGD cache
- Preparation for the implementation of Intel IPI virtualization
- Fix some segment descriptor checks in the emulator
- Allow AMD AVIC support on systems with physical APIC ID above 255
- Better API to disable virtualization quirks
- Fixes and optimizations for the zapping of page tables:
- Zap roots in two passes, avoiding RCU read-side critical sections
that last too long for very large guests backed by 4 KiB SPTEs.
- Zap invalid and defunct roots asynchronously via concurrency-managed
work queue.
- Allowing yielding when zapping TDP MMU roots in response to the root's
last reference being put.
- Batch more TLB flushes with an RCU trick. Whoever frees the paging
structure now holds RCU as a proxy for all vCPUs running in the guest,
i.e. to prolongs the grace period on their behalf. It then kicks the
the vCPUs out of guest mode before doing rcu_read_unlock().
Generic:
- Introduce __vcalloc and use it for very large allocations that
need memcg accounting
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmI4fdwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMq8gf/WoeVHtw2QlL5Mmz6McvRRmPAYPLV
wLUIFNrRqRvd8Tw4kivzZoh/xTpwmnojv0YdK5SjKAiMjgv094YI1LrNp1JSPvmL
pitocMkA10RSJNWHeEMg9cMSKH0rKiqeYl6S1e2XsdB+UZZ2BINOCVtvglmjTAvJ
dFBdKdBkqjAUZbdXAGIvz4JEEER3N/LkFDKGaUGX+0QIQOzGBPIyLTxynxIDG6mt
RViCCFyXdy5NkVp5hZFm96vQ2qAlWL9B9+iKruQN++82+oqWbeTdSqPhdwF7GyFz
BfOv3gobQ2c4ef/aMLO5LswZ9joI1t/4kQbbAn6dNybpOAz/NXfDnbNefg==
=keox
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Proper emulation of the OSLock feature of the debug architecture
- Scalibility improvements for the MMU lock when dirty logging is on
- New VMID allocator, which will eventually help with SVA in VMs
- Better support for PMUs in heterogenous systems
- PSCI 1.1 support, enabling support for SYSTEM_RESET2
- Implement CONFIG_DEBUG_LIST at EL2
- Make CONFIG_ARM64_ERRATUM_2077057 default y
- Reduce the overhead of VM exit when no interrupt is pending
- Remove traces of 32bit ARM host support from the documentation
- Updated vgic selftests
- Various cleanups, doc updates and spelling fixes
RISC-V:
- Prevent KVM_COMPAT from being selected
- Optimize __kvm_riscv_switch_to() implementation
- RISC-V SBI v0.3 support
s390:
- memop selftest
- fix SCK locking
- adapter interruptions virtualization for secure guests
- add Claudio Imbrenda as maintainer
- first step to do proper storage key checking
x86:
- Continue switching kvm_x86_ops to static_call(); introduce
static_call_cond() and __static_call_ret0 when applicable.
- Cleanup unused arguments in several functions
- Synthesize AMD 0x80000021 leaf
- Fixes and optimization for Hyper-V sparse-bank hypercalls
- Implement Hyper-V's enlightened MSR bitmap for nested SVM
- Remove MMU auditing
- Eager splitting of page tables (new aka "TDP" MMU only) when dirty
page tracking is enabled
- Cleanup the implementation of the guest PGD cache
- Preparation for the implementation of Intel IPI virtualization
- Fix some segment descriptor checks in the emulator
- Allow AMD AVIC support on systems with physical APIC ID above 255
- Better API to disable virtualization quirks
- Fixes and optimizations for the zapping of page tables:
- Zap roots in two passes, avoiding RCU read-side critical
sections that last too long for very large guests backed by 4
KiB SPTEs.
- Zap invalid and defunct roots asynchronously via
concurrency-managed work queue.
- Allowing yielding when zapping TDP MMU roots in response to the
root's last reference being put.
- Batch more TLB flushes with an RCU trick. Whoever frees the
paging structure now holds RCU as a proxy for all vCPUs running
in the guest, i.e. to prolongs the grace period on their behalf.
It then kicks the the vCPUs out of guest mode before doing
rcu_read_unlock().
Generic:
- Introduce __vcalloc and use it for very large allocations that need
memcg accounting"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
KVM: use kvcalloc for array allocations
KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
kvm: x86: Require const tsc for RT
KVM: x86: synthesize CPUID leaf 0x80000021h if useful
KVM: x86: add support for CPUID leaf 0x80000021
KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask
Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU
KVM: arm64: fix typos in comments
KVM: arm64: Generalise VM features into a set of flags
KVM: s390: selftests: Add error memop tests
KVM: s390: selftests: Add more copy memop tests
KVM: s390: selftests: Add named stages for memop test
KVM: s390: selftests: Add macro as abstraction for MEM_OP
KVM: s390: selftests: Split memop tests
KVM: s390x: fix SCK locking
RISC-V: KVM: Implement SBI HSM suspend call
RISC-V: KVM: Add common kvm_riscv_vcpu_wfi() function
RISC-V: Add SBI HSM suspend related defines
RISC-V: KVM: Implement SBI v0.3 SRST extension
...
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI4WdIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jdTA/7BADTYzFCbdwPzHt2mR8osv7k+pDvYxs9
wxNjyi1X7N8cPkhqgIg9CfdhdyDOqo7+J4fG17f2qbwjNK7b2Fb1/U6ZoZaf+f8F
W0e2LX5KZTXUhkA+TEjrXvYD9FmJaCPM/l2RQg8U7okBs2kb0H6QT2Yn21wd1roC
WwI5KFiWSVS1IzpVLaXjDh+FJfJHd75ReMqJeus+QoVQ9NHeuI+t4DglSB1IBi54
d/zeVXE/Y4dFTQOrU06S2HxcOEptvXZsPmVLvKab/veeGGyWiGPxQpvu6bXm6u3x
0sV+dn67zut2m2pQlUZUucgGTSYIZTpOe+rNukTB9hJ4XeN4/1ohOOCrOuYM+63P
lGFbN1v+LD7Wc6C2eEhw8G5GEL0qbwzFNQ06O3EOFi7C7GKn7WS/ET6XuuMOERFk
uxEPb4pFtbBlJ0SriCprFJSd5NL3PORZlLIhv4hGH5hilLR1TFeKDuwZaM4noQxU
dL3rKGLi9H+P46Eni9H28+0gDISbv1xL+WivHOFQNmhBqAZO52ZcF3J+dgBaR1B5
pBxVTycFpZMjxSZnqTE0gMsFaLIpVGc+75Chns1rajR0mEtRtJUQUbYz4tK4zb0E
dZR1p+VF6+DYmSRhiqeaTi9uz9oE8kMa8o/EcbFIg/9BgEnUwJXU20bjnar30xQ7
9OIn7r9hjHI=
=XPuo
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event updates from Ingo Molnar:
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
* tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix the build on !CONFIG_PHYS_ADDR_T_64BIT
perf: Add irq and exception return branch types
perf/x86/intel/uncore: Make uncore_discovery clean for 64 bit addresses
perf/x86/intel/pt: Add a capability and config bit for disabling TNTs
perf/x86/intel/pt: Add a capability and config bit for event tracing
perf/x86/intel: Increase max number of the fixed counters
KVM: x86: use the KVM side max supported fixed counter
perf/x86/intel: Enable PEBS format 5
perf/core: Allow kernel address filter when not filtering the kernel
perf/x86/intel/pt: Fix address filter config for 32-bit kernel
perf/core: Fix address filter parser for multiple filters
x86: Share definition of __is_canonical_address()
perf/x86/intel/pt: Relax address filter validation
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.
Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- A couple of fixes when handling an exception while a SError has been
delivered
- Workaround for Cortex-A510's single-step[ erratum
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmH9LlcPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDLTcP/3Ry8CzvPubZquMyNdRUFvEg2EcfTa6vtIGW
Fw7ap2hwPUaXUgJKDihMFIWj3Wf/wPmXw4t2Sr8R/yq8v9kWe+IG1isnT0yQhY3W
kLXEqc8Mu4Rf8+jvlFHsp5mLENHIswpWAv/EY49ChgZkNmtkKpnPm1qnD89d8bNv
tUwooDWidQ/7nXdM3z6zygSROJS24+OGTYTWzOQ1KgV3FGaXbqYiCleoPOpRR/Tc
DQQWF/tVl8bZCqgkGKZCv3aXT0ZUPrQggARJGai78vP0l2sE/Kyaydgq5I7npZja
2L2U4kDNoPYIVa8A1jvV3Ef3AqNFs6B7+jXWfYIgAcXjCYzDK3cZcxavf/Inq9F1
3udVGJGSzH1KkGaihW3BVhsqGORRHKCdksJzWRgqf6vGyJhJw0u0D2u1rTWcT+jw
Nm4KxShp0CX59HSLnVF5sR0Mct3jNNZ7UCCgH7q10wuBqYRfJT32hCo2ZrT7g9oD
IQ+pa2dVYa3SaKZ4O6T/lSlbLOuuxtvmcEIfxYpPD6m10S5RrxOdsW3MCtiYM5HQ
24oo2mk6NIu/va0XxhcW+NBMcYtLQD9JUGbkUkpcRy2mgilTi9b4YPp+muYM7plQ
/S1gj2kGY8vjMg0H+wysjMJyl2huEwSRsZ/UfxCAgW+MYhHLDxhxAnDWc8EcwGgE
tUzomowB
=Mbx/
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.17, take #2
- A couple of fixes when handling an exception while a SError has been
delivered
- Workaround for Cortex-A510's single-step[ erratum
For consistency and clarity, migrate x86 over to the generic helpers for
guest timing and lockdep/RCU/tracing management, and remove the
x86-specific helpers.
Prior to this patch, the guest timing was entered in
kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
svm_vcpu_enter_exit()), and was exited by the call to
vtime_account_guest_exit() within vcpu_enter_guest().
To minimize duplication and to more clearly balance entry and exit, both
entry and exit of guest timing are placed in vcpu_enter_guest(), using
the new guest_timing_{enter,exit}_irqoff() helpers. When context
tracking is used a small amount of additional time will be accounted
towards guests; tick-based accounting is unnaffected as IRQs are
disabled at this point and not enabled until after the return from the
guest.
This also corrects (benign) mis-balanced context tracking accounting
introduced in commits:
ae95f566b3 ("KVM: X86: TSCDEADLINE MSR emulation fastpath")
26efe2fd92 ("KVM: VMX: Handle preemption timer fastpath")
Where KVM can enter a guest multiple times, calling vtime_guest_enter()
without a corresponding call to vtime_account_guest_exit(), and with
vtime_account_system() called when vtime_account_guest() should be used.
As account_system_time() checks PF_VCPU and calls account_guest_time(),
this doesn't result in any functional problem, but is unnecessarily
confusing.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <20220201132926.3301912-4-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- selftest compilation fix for non-x86
- KVM: avoid warning on s390 in mark_page_dirty
x86:
- fix page write-protection bug and improve comments
- use binary search to lookup the PMU event filter, add test
- enable_pmu module parameter support for Intel CPUs
- switch blocked_vcpu_on_cpu_lock to raw spinlock
- cleanups of blocked vCPU logic
- partially allow KVM_SET_CPUID{,2} after KVM_RUN (5.16 regression)
- various small fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHpmT0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOstggAi1VSpT43oGslQjXNDZacHEARoYQs
b0XpoW7HXicGSGRMWspCmiAPdJyYTsioEACttAmXUMs7brAgHb9n/vzdlcLh1ymL
rQw2YFQlfqqB1Ki1iRhNkWlH9xOECsu28WLng6ylrx51GuT/pzWRt+V3EGUFTxIT
ldW9HgZg2oFJIaLjg2hQVR/8EbBf0QdsAD3KV3tyvhBlXPkyeLOMcGe9onfjZ/NE
JQeW7FtKtP4SsIFt1KrJpDPjtiwFt3bRM0gfgGw7//clvtKIqt1LYXZiq4C3b7f5
tfYiC8lO2vnOoYcfeYEmvybbSsoS/CgSliZB32qkwoVvRMIl82YmxtDD+Q==
=/Mak
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
"Generic:
- selftest compilation fix for non-x86
- KVM: avoid warning on s390 in mark_page_dirty
x86:
- fix page write-protection bug and improve comments
- use binary search to lookup the PMU event filter, add test
- enable_pmu module parameter support for Intel CPUs
- switch blocked_vcpu_on_cpu_lock to raw spinlock
- cleanups of blocked vCPU logic
- partially allow KVM_SET_CPUID{,2} after KVM_RUN (5.16 regression)
- various small fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (46 commits)
docs: kvm: fix WARNINGs from api.rst
selftests: kvm/x86: Fix the warning in lib/x86_64/processor.c
selftests: kvm/x86: Fix the warning in pmu_event_filter_test.c
kvm: selftests: Do not indent with spaces
kvm: selftests: sync uapi/linux/kvm.h with Linux header
selftests: kvm: add amx_test to .gitignore
KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
KVM: VMX: Don't do full kick when handling posted interrupt wakeup
KVM: VMX: Fold fallback path into triggering posted IRQ helper
KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ
KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs
KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
...
The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.
If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.
Change its name from "pmu" to "enable_pmu" as we have temporary
variables with the same name in our code like "struct kvm_pmu *pmu".
Fixes: b1d66dad65 ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by : Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220111073823.21885-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into
KVM's 'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to
a simpler state and less shared data between EL1 and EL2 in
the nVHE case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be
unmapped from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once
the vcpu xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and
page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHhxvsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPZkAf+Nz92UL/5nNGcdHtE4m7AToMmitE9
bYkesf9BMQvAe5wjkABLuoHGi6ay4jabo4fiGzbdkiK7lO5YgfsWiMB3/MT5fl4E
jRPzaVQabp3YZLM8UYCBmfUVuRj524S967SfSRe0AvYjDEH8y7klPf4+7sCsFT0/
Px9Vf2KGuOlf0eM78yKg4rGaF0jS22eLgXm6FfNMY8/e29ZAo/jyUmqBY+Z2xxZG
aWhceDtSheW1jwLHLj3nOlQJvHTn8LVGXBE/R8Gda3ZjrBV2rKaDi4Fh+HD+dz86
2zVXwzQ7uck2CMW73GMoXMTWoKSHMyvlBOs1BdvBm4UsnGcXR+q8IFCeuQ==
=s73m
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"RISCV:
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into KVM's
'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to a
simpler state and less shared data between EL1 and EL2 in the nVHE
case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be unmapped
from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once the vcpu
xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen
emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency
checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (268 commits)
x86/fpu: Fix inline prefix warnings
selftest: kvm: Add amx selftest
selftest: kvm: Move struct kvm_x86_state to header
selftest: kvm: Reorder vcpu_load_state steps for AMX
kvm: x86: Disable interception for IA32_XFD on demand
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
kvm: selftests: Add support for KVM_CAP_XSAVE2
kvm: x86: Add support for getting/setting expanded xstate buffer
x86/fpu: Add uabi_size to guest_fpu
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
x86/fpu: Add guest support to xfd_enable_feature()
...
"Cleanup of the perf/kvm interaction."
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvbkACgkQEsHwGGHe
VUrX7w/9FwKUm0WlGcQIAOSdWk85N2qAVH3brYcQHNpTCVe68TOqTCrxCDrGgyUq
2XnCOim99MUlnsVU6QRZqF4yJ8S1tGrc0COJ/qR4SGntucu0oYuDe2aMVq+mWUD7
/IThA0oMRfhki9WwAyUuyCrXzk4blZdlrXyYIRMJGl9xeGNy3cvUtU8f68Kiy22E
OcmQ/o9Etsr38dueAMU1KYEmgSTvG47rS8nfyRUu3QpJHbyLmRXH32PQrm3tduxS
Bw3gMAH5vqq1UDZJ8ZvsPsO0vFX7dtnKEwEKz4qdtRWk9gi8oLGHIwIXC+VtNqpf
mCmX33Jw8uFz9h3JhE84J0j/CgsWHoU6MOs0MOch4Tb69/BfCjQnw1enImhejG8q
YEIDjJf/vgRNaw9PYshiTHT+EJTe9inT3S4eK/ynLRDUEslAqyWZZm7bUE/XrEDi
yRyGIxry/hNZVvRkXT9QBw32fpgnIH2NAMPLEjJSGCRxT89Tfqz0aRDfacCuHTTh
P8pAeiDuy/6RkDlQckOZJWOFFh2IHsykX2l3IJcHqVRqt4ob9b+SZB5qoH/Mv9qb
MSAqdFUupYZFC+6XuPAeX5/Mo+wSkP+pYYSbWNxjUa0yNiYecOjE7/8T2SB2y6Mx
lk2L0ypsZUYSmpHSfvOdPmf6ucj19/5B4+VCX6PQfcNJTnvvhTE=
=tU5G
-----END PGP SIGNATURE-----
Merge tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Borislav Petkov:
"Cleanup of the perf/kvm interaction."
* tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Drop guest callback (un)register stubs
KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
KVM: arm64: Convert to the generic perf callbacks
KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
KVM: Move x86's perf guest info callbacks to generic KVM
KVM: x86: More precisely identify NMI from guest when handling PMI
KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable
perf/core: Use static_call to optimize perf_guest_info_callbacks
perf: Force architectures to opt-in to guest callbacks
perf: Add wrappers for invoking guest callbacks
perf/core: Rework guest callbacks to prepare for static_call support
perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv
perf: Stop pretending that perf can handle multiple guest callbacks
KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
perf: Protect perf_guest_cbs with RCU
When dirty ring logging is enabled, any dirty logging without an active
vCPU context will cause a kernel oops. But we've already declared that
the shared_info page doesn't get dirty tracking anyway, since it would
be kind of insane to mark it dirty every time we deliver an event channel
interrupt. Userspace is supposed to just assume it's always dirty any
time a vCPU can run or event channels are routed.
So stop using the generic kvm_write_wall_clock() and just write directly
through the gfn_to_pfn_cache that we already have set up.
We can make kvm_write_wall_clock() static in x86.c again now, but let's
not remove the 'sec_hi_ofs' argument even though it's not used yet. At
some point we *will* want to use that for KVM guests too.
Fixes: 629b534884 ("KVM: x86/xen: update wallclock region")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.
Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.
Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When processing a hypercall for a guest with protected state, currently
SEV-ES guests, the guest CS segment register can't be checked to
determine if the guest is in 64-bit mode. For an SEV-ES guest, it is
expected that communication between the guest and the hypervisor is
performed to shared memory using the GHCB. In order to use the GHCB, the
guest must have been in long mode, otherwise writes by the guest to the
GHCB would be encrypted and not be able to be comprehended by the
hypervisor.
Create a new helper function, is_64_bit_hypercall(), that assumes the
guest is in 64-bit mode when the guest has protected state, and returns
true, otherwise invoking is_64_bit_mode() to determine the mode. Update
the hypercall related routines to use is_64_bit_hypercall() instead of
is_64_bit_mode().
Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurences of calls to
this helper function for a guest running with protected state.
Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Differentiate between IRQ and NMI for KVM's PMC overflow callback, which
was originally invoked in response to an NMI that arrived while the guest
was running, but was inadvertantly changed to fire on IRQs as well when
support for perf without PMU/NMI was added to KVM. In practice, this
should be a nop as the PMC overflow callback shouldn't be reached, but
it's a cheap and easy fix that also better documents the situation.
Note, this also doesn't completely prevent false positives if perf
somehow ends up calling into KVM, e.g. an NMI can arrive in host after
KVM sets its flag.
Fixes: dd60d21706 ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-12-seanjc@google.com
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.
Switching to kvm_get_running_vcpu() will allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-11-seanjc@google.com
This variable was renamed to kvm_has_noapic_vcpu in commit
6e4e3b4df4 ("KVM: Stop using deprecated jump label APIs").
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211021185449.3471763-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Get LA57 from the role_regs, which are initialized from the vCPU even
though TDP is enabled, instead of pulling the value directly from the
vCPU when computing the guest's root_level for TDP MMUs. Note, the check
is inside an is_long_mode() statement, so that requirement is not lost.
Use role_regs even though the MMU's role is available and arguably
"better". A future commit will consolidate the guest root level logic,
and it needs access to EFER.LMA, which is not tracked in the role (it
can't be toggled on VM-Exit, unlike LA57).
Drop is_la57_mode() as there are no remaining users, and to discourage
pulling MMU state from the vCPU (in the future).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers. Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210505002735.1684165-9-seanjc@google.com
Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants. I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use. Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_write_tsc() was renamed and made static since commit 0c899c25d7
("KVM: x86: do not attempt TSC synchronization on guest writes"). Remove
its unused declaration.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210326070334.12310-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1. Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1. E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.
Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wallclock on Xen is written in the shared_info page.
To that purpose, export kvm_write_wall_clock() and pass on the GPA of
its location to populate the shared_info wall clock data.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Now that KVM is using static calls, calling vmx_vcpu_run and
vmx_sync_pir_to_irr does not incur anymore the cost of a
retpoline.
Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST
in vendor code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are
covered here except for 'pmu_ops and 'nested ops'.
Here are some numbers running cpuid in a loop of 1 million calls averaged
over 5 runs, measured in the vm (lower is better).
Intel Xeon 3000MHz:
|default |mitigations=off
-------------------------------------
vanilla |.671s |.486s
static call|.573s(-15%)|.458s(-6%)
AMD EPYC 2500MHz:
|default |mitigations=off
-------------------------------------
vanilla |.710s |.609s
static call|.664s(-6%) |.609s(0%)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the instruction decode part out of x86_emulate_instruction() for it
to be used in other places. Also kvm_clear_exception_queue() is moved
inside the if-statement as it doesn't apply when KVM are coming back from
userspace.
Co-developed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210126081831.570253-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
svm_get/set_msr, so the x86 common part can be safely moved to VMX.
This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.
Add vmx_supported_debugctl() to refactor the throwing logic of #GP.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
[Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
to guest". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 7a873e4555 ("KVM: selftests: Verify supported CR4 bits can be set
before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even
when PCID support is missing:
==== Test Assertion Failure ====
x86_64/set_sregs_test.c:41: rc
pid=6956 tid=6956 - Invalid argument
1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41
2 0x00000000004014fc: main at set_sregs_test.c:119
3 0x00007f2d9346d041: ?? ??:0
4 0x000000000040164d: _start at ??:?
KVM allowed unsupported CR4 bit (0x20000)
Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make
kvm_is_valid_cr4() fail.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210201142843.108190-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
hardware will restore certain registers on VMEXIT, but not save them on
VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
following changes:
General vCPU load changes:
- During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
area as these registers will be restored on VMEXIT.
General vCPU put changes:
- Do not attempt to restore registers that SEV-ES hardware has already
restored on VMEXIT.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.
For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
so that both the hypervisor and guest can read or write to it and each
see the contents.
The GHCB specification provides software-defined VMGEXIT exit codes to
indicate a request for an MMIO read or an MMIO write. Add support to
recognize the MMIO requests and invoke SEV-ES specific routines that
can complete the MMIO operation. These routines use common KVM support
to complete the MMIO operation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_machine_check to x86.h to avoid two exact copies
of the same function in kvm.c and svm.c.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).
No functional changed intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recent introduction of the userspace msr filtering added code that uses
negative error codes for cases that result in either #GP delivery to
the guest, or handled by the userspace msr filtering.
This breaks an assumption that a negative error code returned from the
msr emulation code is a semi-fatal error which should be returned
to userspace via KVM_RUN ioctl and usually kill the guest.
Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
and by adding a new KVM_MSR_RET_FILTERED error code for the
userspace filtered msrs.
Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201101115523.115780-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-4-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INVPCID instruction handling is mostly same across both VMX and
SVM. So, move the code to common x86.c.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kick after setting KVM_REQ_PENDING_TIMER is used to handle the timer
fires on a different pCPU which vCPU is running on. This kick costs about
1000 clock cycles and we don't need this when injecting already-expired
timer or when using the VMX preemption timer because
kvm_lapic_expired_hv_timer() is called from the target vCPU.
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds two helper functions that will be used to support virtualizing
MAXPHYADDR in both kvm-intel.ko and kvm.ko.
kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-2-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To avoid complex and in some cases incorrect logic in
kvm_spec_ctrl_test_value, just try the guest's given value on the host
processor instead, and if it doesn't #GP, allow the guest to set it.
One such case is when host CPU supports STIBP mitigation
but doesn't support IBRS (as is the case with some Zen2 AMD cpus),
and in this case we were giving guest #GP when it tried to use STIBP
The reason why can can do the host test is that IA32_SPEC_CTRL msr is
passed to the guest, after the guest sets it to a non zero value
for the first time (due to performance reasons),
and as as result of this, it is pointless to emulate #GP condition on
this first access, in a different way than what the host CPU does.
This is based on a patch from Sean Christopherson, who suggested this idea.
Fixes: 6441fa6178 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Canonicalization and Consistency Checks" in APM vol. 2
the following guest state is illegal:
"Any MBZ bit of CR3 is set."
"Any MBZ bit of CR4 is set."
Suggeted-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <1594168797-29444-3-git-send-email-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR4.VMXE is reserved unless the VMX CPUID bit is set. On Intel,
it is also tested by vmx_set_cr4, but AMD relies on kvm_valid_cr4,
so fix it.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
do it in kvm_update_cpuid() so that it can be reused instead of creating it
each time kvm_valid_cr4() is called.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <1594168797-29444-2-git-send-email-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR accesses can be one of:
(1) KVM internal access,
(2) userspace access (e.g., via KVM_SET_MSRS ioctl),
(3) guest access.
The ignore_msrs was previously handled by kvm_get_msr_common() and
kvm_set_msr_common(), which is the bottom of the msr access stack. It's
working in most cases, however it could dump unwanted warning messages to dmesg
even if kvm get/set the msrs internally when calling __kvm_set_msr() or
__kvm_get_msr() (e.g. kvm_cpuid()). Ideally we only want to trap cases (2)
or (3), but not (1) above.
To achieve this, move the ignore_msrs handling upper until the callers of
__kvm_get_msr() and __kvm_set_msr(). To identify the "msr missing" event, a
new return value (KVM_MSR_RET_INVALID==2) is used for that.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200622220442.21998-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds a fastpath_t typedef since enum lines are a bit long, and replace
EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values.
- EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop,
but it would skip invoking the exit handler.
- EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered
without invoking the exit handler or going
back to vcpu_run
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce kvm_vcpu_exit_request() helper, we need to check some conditions
before enter guest again immediately, we skip invoking the exit handler and
go through full run loop if complete fastpath but there is stuff preventing
we enter guest again immediately.
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's
EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g.
to flush L2's context during nested VM-Enter.
Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where
the flush is directly associated with vCPU-scoped instruction emulation,
i.e. MOV CR3 and INVPCID.
Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to
make it clear that it deliberately requests a flush of all contexts.
Service any pending flush request on nested VM-Exit as it's possible a
nested VM-Exit could occur after requesting a flush for L2. Add the
same logic for nested VM-Enter even though it's _extremely_ unlikely
for flush to be pending on nested VM-Enter, but theoretically possible
(in the future) due to RSM (SMM) emulation.
[*] Intel also has an Address Space Identifier (ASID) concept, e.g.
EPTP+VPID+PCID == ASID, it's just not documented in the SDM because
the rules of invalidation are different based on which piece of the
ASID is being changed, i.e. whether the EPTP, VPID, or PCID context
must be invalidated.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-25-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>