- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into
KVM's 'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to
a simpler state and less shared data between EL1 and EL2 in
the nVHE case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be
unmapped from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once
the vcpu xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and
page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHhxvsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPZkAf+Nz92UL/5nNGcdHtE4m7AToMmitE9
bYkesf9BMQvAe5wjkABLuoHGi6ay4jabo4fiGzbdkiK7lO5YgfsWiMB3/MT5fl4E
jRPzaVQabp3YZLM8UYCBmfUVuRj524S967SfSRe0AvYjDEH8y7klPf4+7sCsFT0/
Px9Vf2KGuOlf0eM78yKg4rGaF0jS22eLgXm6FfNMY8/e29ZAo/jyUmqBY+Z2xxZG
aWhceDtSheW1jwLHLj3nOlQJvHTn8LVGXBE/R8Gda3ZjrBV2rKaDi4Fh+HD+dz86
2zVXwzQ7uck2CMW73GMoXMTWoKSHMyvlBOs1BdvBm4UsnGcXR+q8IFCeuQ==
=s73m
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"RISCV:
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into KVM's
'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to a
simpler state and less shared data between EL1 and EL2 in the nVHE
case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be unmapped
from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once the vcpu
xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen
emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency
checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (268 commits)
x86/fpu: Fix inline prefix warnings
selftest: kvm: Add amx selftest
selftest: kvm: Move struct kvm_x86_state to header
selftest: kvm: Reorder vcpu_load_state steps for AMX
kvm: x86: Disable interception for IA32_XFD on demand
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
kvm: selftests: Add support for KVM_CAP_XSAVE2
kvm: x86: Add support for getting/setting expanded xstate buffer
x86/fpu: Add uabi_size to guest_fpu
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
x86/fpu: Add guest support to xfd_enable_feature()
...
Always intercepting IA32_XFD causes non-negligible overhead when this
register is updated frequently in the guest.
Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
with a non-zero value.
Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync
with the software states in fpstate and the per-cpu xfd cache. This
leads to two additional changes accordingly:
- Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
software states back in-sync with the MSR, before handle_exit_irqoff()
is called.
- Always trap #NM once write interception is disabled for IA32_XFD.
The #NM exception is rare if the guest doesn't use dynamic
features. Otherwise, there is at most one exception per guest
task given a dynamic feature.
p.s. We have confirmed that SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not guaranteed
to be preserved. This clarification avoids adding mess for a creative
guest which sets IA32_XFD[18]=1 before saving active AMX state to
its own storage.
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set
xstate data from/to KVM. This doesn't work when dynamic xfeatures
(e.g. AMX) are exposed to the guest as they require a larger buffer
size.
Introduce a new capability (KVM_CAP_XSAVE2). Userspace VMM gets the
required xstate buffer size via KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
KVM_SET_XSAVE is extended to work with both legacy and new capabilities
by doing properly-sized memdup_user() based on the guest fpu container.
KVM_GET_XSAVE is kept for backward-compatible reason. Instead,
KVM_GET_XSAVE2 is introduced under KVM_CAP_XSAVE2 as the preferred
interface for getting xstate buffer (4KB or larger size) from KVM
(Link: https://lkml.org/lkml/2021/12/15/510)
Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
Signed-off-by: Guang Zeng <guang.zeng@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-19-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend CPUID emulation to support XFD, AMX_TILE, AMX_INT8 and
AMX_BF16. Adding those bits into kvm_cpu_caps finally activates all
previous logics in this series.
Hide XFD on 32bit host kernels. Otherwise it leads to a weird situation
where KVM tells userspace to migrate MSR_IA32_XFD and then rejects
attempts to read/write the MSR.
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-17-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Two XCR0 bits are defined for AMX to support XSAVE mechanism. Bit 17
is for tilecfg and bit 18 is for tiledata.
The value of XCR0[17:18] is always either 00b or 11b. Also, SDM
recommends that only 64-bit operating systems enable Intel AMX by
setting XCR0[18:17]. 32-bit host kernel never sets the tile bits in
vcpu->arch.guest_supported_xcr0.
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-16-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This saves one unnecessary VM-exit in guest #NM handler, given that the
MSR is already restored with the guest value before the guest is resumed.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-15-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulate read/write to IA32_XFD_ERR MSR.
Only the saved value in the guest_fpu container is touched in the
emulation handler. Actual MSR update is handled right before entering
the guest (with preemption disabled)
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-14-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guest IA32_XFD_ERR is generally modified in two places:
- Set by CPU when #NM is triggered;
- Cleared by guest in its #NM handler;
Intercept #NM for the first case when a nonzero value is written
to IA32_XFD. Nonzero indicates that the guest is willing to do
dynamic fpstate expansion for certain xfeatures, thus KVM needs to
manage and virtualize guest XFD_ERR properly. The vcpu exception
bitmap is updated in XFD write emulation according to guest_fpu::xfd.
Save the current XFD_ERR value to the guest_fpu container in the #NM
VM-exit handler. This must be done with interrupt disabled, otherwise
the unsaved MSR value may be clobbered by host activity.
The saving operation is conducted conditionally only when guest_fpu:xfd
includes a non-zero value. Doing so also avoids misread on a platform
which doesn't support XFD but #NM is triggered due to L1 interception.
Queueing #NM to the guest is postponed to handle_exception_nmi(). This
goes through the nested_vmx check so a virtual vmexit is queued instead
when #NM is triggered in L2 but L1 wants to intercept it.
Restore the host value (always ZERO outside of the host #NM
handler) before enabling interrupt.
Restore the guest value from the guest_fpu container right before
entering the guest (with interrupt disabled).
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-13-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel's eXtended Feature Disable (XFD) feature allows the software
to dynamically adjust fpstate buffer size for XSAVE features which
have large state.
Because guest fpstate has been expanded for all possible dynamic
xstates at KVM_SET_CPUID2, emulation of the IA32_XFD MSR is
straightforward. For write just call fpu_update_guest_xfd() to
update the guest fpu container once all the sanity checks are passed.
For read simply return the cached value in the container.
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-11-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM can request fpstate expansion in two approaches:
1) When intercepting guest updates to XCR0 and XFD MSR;
2) Before vcpu runs (e.g. at KVM_SET_CPUID2);
The first option doesn't waste memory for legacy guest if it doesn't
support XFD. However doing so introduces more complexity and also
imposes an order requirement in the restoring path, i.e. XCR0/XFD
must be restored before XSTATE.
Given that the agreement is to do the static approach. This is
considered a better tradeoff though it does waste 8K memory for
legacy guest if its CPUID includes dynamically-enabled xfeatures.
Successful fpstate expansion requires userspace VMM to acquire
guest xstate permissions before calling KVM_SET_CPUID2.
Also take the chance to adjust the indent in kvm_set_cpuid().
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-9-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
LIVEPATCH as the backtrace misses the function which is being fixed up.
- Add Straight Light Speculation mitigation support which uses a new
compiler switch -mharden-sls= which sticks an INT3 after a RET or an
indirect branch in order to block speculation after them. Reportedly,
CPUs do speculate behind such insns.
- The usual set of cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHfKA0ACgkQEsHwGGHe
VUqLJg/2I2X2xXr5filJVaK+sQgmvDzk67DKnbxRBW2xcPF+B5sSW5yhe3G5UPW7
SJVdhQ3gHcTiliGGlBf/VE7KXbqxFN0vO4/VFHZm78r43g7OrXTxz6WXXQRJ1n67
U3YwRH3b6cqXZNFMs+X4bJt6qsGJM1kdTTZ2as4aERnaFr5AOAfQvfKbyhxLe/XA
3SakfYISVKCBQ2RkTfpMpwmqlsatGFhTC5IrvuDQ83dDsM7O+Dx1J6Gu3fwjKmie
iVzPOjCh+xTpZQp/SIZmt7MzoduZvpSym4YVyHvEnMiexQT4AmyaRthWqrhnEXY/
qOvj8/XIqxmix8EaooGqRIK0Y2ZegxkPckNFzaeC3lsWohwMIGIhNXwHNEeuhNyH
yvNGAW9Cq6NeDRgz5MRUXcimYw4P4oQKYLObS1WqFZhNMqm4sNtoEAYpai/lPYfs
zUDckgXF2AoPOsSqy3hFAVaGovAgzfDaJVzkt0Lk4kzzjX2WQiNLhmiior460w+K
0l2Iej58IajSp3MkWmFH368Jo8YfUVmkjbbpsmjsBppA08e1xamJB7RmswI/Ezj6
s5re6UioCD+UYdjWx41kgbvYdvIkkZ2RLrktoZd/hqHrOLWEIiwEbyFO2nRFJIAh
YjvPkB1p7iNuAeYcP1x9Ft9GNYVIsUlJ+hK86wtFCqy+abV+zQ==
=R52z
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Get rid of all the .fixup sections because this generates
misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
LIVEPATCH as the backtrace misses the function which is being fixed
up.
- Add Straight Line Speculation mitigation support which uses a new
compiler switch -mharden-sls= which sticks an INT3 after a RET or an
indirect branch in order to block speculation after them. Reportedly,
CPUs do speculate behind such insns.
- The usual set of cleanups and improvements
* tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
x86/entry_32: Fix segment exceptions
objtool: Remove .fixup handling
x86: Remove .fixup section
x86/word-at-a-time: Remove .fixup usage
x86/usercopy: Remove .fixup usage
x86/usercopy_32: Simplify __copy_user_intel_nocache()
x86/sgx: Remove .fixup usage
x86/checksum_32: Remove .fixup usage
x86/vmx: Remove .fixup usage
x86/kvm: Remove .fixup usage
x86/segment: Remove .fixup usage
x86/fpu: Remove .fixup usage
x86/xen: Remove .fixup usage
x86/uaccess: Remove .fixup usage
x86/futex: Remove .fixup usage
x86/msr: Remove .fixup usage
x86/extable: Extend extable functionality
x86/entry_32: Remove .fixup usage
x86/entry_64: Remove .fixup usage
x86/copy_mc_64: Remove .fixup usage
...
"Cleanup of the perf/kvm interaction."
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvbkACgkQEsHwGGHe
VUrX7w/9FwKUm0WlGcQIAOSdWk85N2qAVH3brYcQHNpTCVe68TOqTCrxCDrGgyUq
2XnCOim99MUlnsVU6QRZqF4yJ8S1tGrc0COJ/qR4SGntucu0oYuDe2aMVq+mWUD7
/IThA0oMRfhki9WwAyUuyCrXzk4blZdlrXyYIRMJGl9xeGNy3cvUtU8f68Kiy22E
OcmQ/o9Etsr38dueAMU1KYEmgSTvG47rS8nfyRUu3QpJHbyLmRXH32PQrm3tduxS
Bw3gMAH5vqq1UDZJ8ZvsPsO0vFX7dtnKEwEKz4qdtRWk9gi8oLGHIwIXC+VtNqpf
mCmX33Jw8uFz9h3JhE84J0j/CgsWHoU6MOs0MOch4Tb69/BfCjQnw1enImhejG8q
YEIDjJf/vgRNaw9PYshiTHT+EJTe9inT3S4eK/ynLRDUEslAqyWZZm7bUE/XrEDi
yRyGIxry/hNZVvRkXT9QBw32fpgnIH2NAMPLEjJSGCRxT89Tfqz0aRDfacCuHTTh
P8pAeiDuy/6RkDlQckOZJWOFFh2IHsykX2l3IJcHqVRqt4ob9b+SZB5qoH/Mv9qb
MSAqdFUupYZFC+6XuPAeX5/Mo+wSkP+pYYSbWNxjUa0yNiYecOjE7/8T2SB2y6Mx
lk2L0ypsZUYSmpHSfvOdPmf6ucj19/5B4+VCX6PQfcNJTnvvhTE=
=tU5G
-----END PGP SIGNATURE-----
Merge tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Borislav Petkov:
"Cleanup of the perf/kvm interaction."
* tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Drop guest callback (un)register stubs
KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
KVM: arm64: Convert to the generic perf callbacks
KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
KVM: Move x86's perf guest info callbacks to generic KVM
KVM: x86: More precisely identify NMI from guest when handling PMI
KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable
perf/core: Use static_call to optimize perf_guest_info_callbacks
perf: Force architectures to opt-in to guest callbacks
perf: Add wrappers for invoking guest callbacks
perf/core: Rework guest callbacks to prepare for static_call support
perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv
perf: Stop pretending that perf can handle multiple guest callbacks
KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
perf: Protect perf_guest_cbs with RCU
to use it in SEV and TDX guests
- An include fix and reorg to allow for removing set_fs in UML later
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcEl4ACgkQEsHwGGHe
VUqFTQ/+K9Kb6X0+r7wBSRTeAIWaYewmgOdf+7rpFVyFqQtNecKbuSAWGgFnEHc8
8HUB/krNa+odtx7mAy73wNALUaPmR0KUg6O+YKrvT6LHt8DLlGl5u0g/hihzFdAB
PW7auuxqt9TvK1i8PkYAI+W7t93o4mw4LzgDCVvoLPQUutRZEV1gHRht8Tn8SjaN
3EmEiazpFDrXNGWl/3rnS0qIyvtiZu7KNtibE6ljbUgse9cgxOt733mykH6eO9RJ
hXOfewKML72UxmgWig01pElgLaXeYI5rpSoG7usm4FwwYh+tmBIA8S/EoeE24gn0
e82lxwRCcHjqUDRp2//gz16sYhs//K6bcViT/4FtnL33e2CjK2/J4MwHPn9zgimO
VvxSdAes7UFiA/gDIomFt3gJij+hfy4TGKg5d3326Nm9rsQLpxg49WkozYJZ8m/f
75VVlC4BAj9SnYLQYhSm9buF7pIXmfwN3yWkYJsebl18C6/4FXLLomiqOgWpo3mG
D0e+CXhLZsEaU5NTiVuaPySzjtpRUzmfWf3S9GifJZex0rX+et7+mqIuC92aHbtD
Dc+nNFX/D77Fq8Uoe8bIEt8QsnjdACov1TI/S8h2rSjt5R/Lyg73qh0CpN0jtQ+S
9dUooJWwE4RXnuVMpFq/Xea/BYj1lQ72kMeyFiCNc0/hnzYhZNM=
=NBcE
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
"The pile which we cannot find the proper topic for so we stick it in
x86/misc:
- Add support for decoding instructions which do MMIO accesses in
order to use it in SEV and TDX guests
- An include fix and reorg to allow for removing set_fs in UML later"
* tag 'x86_misc_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mtrr: Remove the mtrr_bp_init() stub
x86/sev-es: Use insn_decode_mmio() for MMIO implementation
x86/insn-eval: Introduce insn_decode_mmio()
x86/insn-eval: Introduce insn_get_modrm_reg_ptr()
x86/insn-eval: Handle insn_get_opcode() failure
KVM_GET_SUPPORTED_CPUID should not include any dynamic xstates in
CPUID[0xD] if they have not been requested with prctl. Otherwise
a process which directly passes KVM_GET_SUPPORTED_CPUID to
KVM_SET_CPUID2 would now fail even if it doesn't intend to use a
dynamically enabled feature. Userspace must know that prctl is
required and allocate >4K xstate buffer before setting any dynamic
bit.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-5-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CPUID.0xD.1.EBX enumerates the size of the XSAVE area (in compacted
format) required by XSAVES. If CPUID.0xD.i.ECX[1] is set for a state
component (i), this state component should be located on the next
64-bytes boundary following the preceding state component in the
compacted layout.
Fix xstate_required_size() to follow the alignment rule. AMX is the
first state component with 64-bytes alignment to catch this bug.
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-4-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both source and dest vms' kvm->locks are held in sev_lock_two_vms.
Mark one with a different subtype to avoid false positives from lockdep.
Fixes: c9d61dcb0b (KVM: SEV: accept signals in sev_lock_two_vms)
Reported-by: Yiru Xu <xyru1999@gmail.com>
Tested-by: Jinrong Liang <cloudliang@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1641364863-26331-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Normally guests will set up CR3 themselves, but some guests, such as
kselftests, and potentially CONFIG_PVH guests, rely on being booted
with paging enabled and CR3 initialized to a pre-allocated page table.
Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
is too late, since it will have switched over to using the VMSA page
prior to that point, with the VMSA CR3 copied from the VMCB initial
CR3 value: 0.
Address this by sync'ing the CR3 value into the VMCB save area
immediately when KVM_SET_SREGS* is issued so it will find it's way into
the initial VMSA.
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
[Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
new hook. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use asm-goto-output for smaller fast path code.
Message-Id: <YbcbbGW2GcMx6KpD@hirez.programming.kicks-ass.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When dirty ring logging is enabled, any dirty logging without an active
vCPU context will cause a kernel oops. But we've already declared that
the shared_info page doesn't get dirty tracking anyway, since it would
be kind of insane to mark it dirty every time we deliver an event channel
interrupt. Userspace is supposed to just assume it's always dirty any
time a vCPU can run or event channels are routed.
So stop using the generic kvm_write_wall_clock() and just write directly
through the gfn_to_pfn_cache that we already have set up.
We can make kvm_write_wall_clock() static in x86.c again now, but let's
not remove the 'sec_hi_ofs' argument even though it's not used yet. At
some point we *will* want to use that for KVM guests too.
Fixes: 629b534884 ("KVM: x86/xen: update wallclock region")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This adds basic support for delivering 2 level event channels to a guest.
Initially, it only supports delivery via the IRQ routing table, triggered
by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
function which will use the pre-mapped shared_info page if it already
exists and is still valid, while the slow path through the irqfd_inject
workqueue will remap the shared_info page if necessary.
It sets the bits in the shared_info page but not the vcpu_info; that is
deferred to __kvm_xen_has_interrupt() which raises the vector to the
appropriate vCPU.
Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping
of the Xen shared_info page so that it can be accessed in atomic context.
Note that we do not participate in dirty tracking for the shared info
page and we do not explicitly mark it dirty every single tim we deliver
an event channel interrupts. We wouldn't want to do that even if we *did*
have a valid vCPU context with which to do so.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM retires a guest branch instruction through emulation,
increment any vPMCs that are configured to monitor "branch
instructions retired," and update the sample period of those counters
so that they will overflow at the right time.
Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
- Split the code to increment "branch instructions retired" into a
separate commit.
- Moved/consolidated the calls to kvm_pmu_trigger_event() in the
emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
that code.
]
Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211130074221.93635-7-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM retires a guest instruction through emulation, increment any
vPMCs that are configured to monitor "instructions retired," and
update the sample period of those counters so that they will overflow
at the right time.
Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
- Split the code to increment "branch instructions retired" into a
separate commit.
- Added 'static' to kvm_pmu_incr_counter() definition.
- Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
PERF_EVENT_STATE_ACTIVE.
]
Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <jmattson@google.com>
[likexu:
- Drop checks for pmc->perf_event or event state or event type
- Increase a counter once its umask bits and the first 8 select bits are matched
- Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
- Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
- Add counter enable and CPL check for kvm_pmu_trigger_event();
]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-6-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Depending on whether intr should be triggered or not, KVM registers
two different event overflow callbacks in the perf_event context.
The code skeleton of these two functions is very similar, so
the pmc->intr can be stored into pmc from pmc_reprogram_counter()
which provides smaller instructions footprint against the
u-architecture branch predictor.
The __kvm_perf_overflow() can be called in non-nmi contexts
and a flag is needed to distinguish the caller context and thus
avoid a check on kvm_is_in_guest(), otherwise we might get
warnings from suspicious RCU or check_preemption_disabled().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-5-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since we set the same semantic event value for the fixed counter in
pmc->eventsel, returning the perf_hw_id for the fixed counter via
find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
with the help of pmc_is_fixed() check.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-4-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The find_arch_event() returns a "unsigned int" value,
which is used by the pmc_reprogram_counter() to
program a PERF_TYPE_HARDWARE type perf_event.
The returned value is actually the kernel defined generic
perf_hw_id, let's rename it to pmc_perf_hw_id() with simpler
incoming parameters for better self-explanation.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-3-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current pmc->eventsel for fixed counter is underutilised. The
pmc->eventsel can be setup for all known available fixed counters
since we have mapping between fixed pmc index and
the intel_arch_events array.
Either gp or fixed counter, it will simplify the later checks for
consistency between eventsel and perf_hw_id.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.
Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway. Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When !CR0_PG -> CR0_PG, vcpu->arch.cr3 becomes active, but GUEST_CR3 is
still vmx->ept_identity_map_addr if EPT + !URG. So VCPU_EXREG_CR3 is
considered to be dirty and GUEST_CR3 needs to be updated in this case.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-4-jiangshanlai@gmail.com>
Fixes: c62c7bd4f9 ("KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For shadow paging, the page table needs to be reconstructed before the
coming VMENTER if the guest PDPTEs is changed.
But not all paths that call load_pdptrs() will cause the page tables to be
reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
are used to launch later reconstruction.
The commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.
The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
load_pdptrs() when rewriting the CR3 with the same value.
The commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.
Guests like linux would keep the PDPTEs unchanged for every instance of
pagetable, so this missing reconstruction has no problem for linux
guests.
Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The host CR3 in the vcpu thread can only be changed when scheduling,
so commit 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
changed vmx.c to only save it in vmx_prepare_switch_to_guest().
However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
vmx_set_host_fs_gs() is called in both places, so rename it to
vmx_set_vmcs_host_state() and make it update HOST_CR3.
Fixes: 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 24cd19a28c.
Sean Christopherson reports:
"Commit 24cd19a28c ('KVM: X86: Update mmu->pdptrs only when it is
changed') breaks nested VMs with EPT in L0 and PAE shadow paging in L2.
Reproducing is trivial, just disable EPT in L1 and run a VM. I haven't
investigating how it breaks things."
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an IS_ENABLED() check in setup_arch() and call pat_disable()
directly if MTRRs are not supported. This allows to remove the
<asm/memtype.h> include in <asm/mtrr.h>, which pull in lowlevel x86
headers that should not be included for UML builds and will cause build
warnings with a following patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211215165612.554426-2-hch@lst.de
Pick commit fdba608f15 ("KVM: VMX: Wake vCPU when delivering posted
IRQ even if vCPU == this vCPU"). In addition to fixing a bug, it
also aligns the non-nested and nested usage of triggering posted
interrupts, allowing for additional cleanups.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
If a vIRQ is delivered from asynchronous context, the target vCPU can be
the currently running vCPU and can also be blocking, in which case
skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
be a wake event for the vCPU.
The "do nothing" logic when "vcpu == running_vcpu" mostly works only
because the majority of calls to ->deliver_posted_interrupt(), especially
when using posted interrupts, come from synchronous KVM context. But if
a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
use posted interrupts, wake events from the device will be delivered to
KVM from IRQ context, e.g.
vfio_msihandler()
|
|-> eventfd_signal()
|
|-> ...
|
|-> irqfd_wakeup()
|
|->kvm_arch_set_irq_inatomic()
|
|-> kvm_irq_delivery_to_apic_fast()
|
|-> kvm_apic_set_irq()
This also aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.
Fixes: 379a3c8ee4 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthesize a triple fault if L2 guest state is invalid at the time of
VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
guest state via ioctls(), e.g. KVM_SET_SREGS. KVM should never emulate
invalid guest state, since from L1's perspective, it's architecturally
impossible for L2 to have invalid state while L2 is running in hardware.
E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
or #GP.
Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
can trigger this scenario, as nested VM-Enter correctly rejects any
attempt to enter L2 with invalid state.
RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
Following AMD's SMRAM layout is important as AMD's layout saves/restores
the descriptor cache information, including CS.RPL and SS.RPL, and also
defines all the fields relevant to invalid guest state as read-only, i.e.
so long as the vCPU had valid state before the SMI, which is guaranteed
for L2, RSM will generate valid state unless SMRAM was modified. Intel's
layout saves/restores only the selector, which means that scenarios where
the selector and cached RPL don't match, e.g. conforming code segments,
would yield invalid guest state. Intel CPUs fudge around this issued by
stuffing SS.RPL and CS.RPL on RSM. Per Intel's SDM on the "Default
Treatment of RSM", paraphrasing for brevity:
IF internal storage indicates that the [CPU was post-VMXON]
THEN
enter VMX operation (root or non-root);
restore VMX-critical state as defined in Section 34.14.1;
set to their fixed values any bits in CR0 and CR4 whose values must
be fixed in VMX operation [unless coming from an unrestricted guest];
IF RFLAGS.VM = 0 AND (in VMX root operation OR the
“unrestricted guest” VM-execution control is 0)
THEN
CS.RPL := SS.DPL;
SS.RPL := SS.DPL;
FI;
restore current VMCS pointer;
FI;
Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
will sythesize TRIPLE_FAULT in this scenario. KVM's behavior is allowed
as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
only way for CR0 and/or CR4 to have illegal values is if they were
modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
section states "modifying these registers will result in unpredictable
behavior".
KVM's ioctl() behavior is less straightforward. Because KVM allows
ioctls() to be executed in any order, rejecting an ioctl() if it would
result in invalid L2 guest state is not an option as KVM cannot know if
a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE. Ideally, KVM would
reject KVM_RUN if L2 contained invalid guest state, but that carries the
risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
exited to userspace. Setting a flag/request to detect such a scenario is
undesirable because (a) it's extremely unlikely to add value to KVM as a
whole, and (b) KVM would need to consider ioctl() interactions with such
a flag, e.g. if userspace migrated the vCPU while the flag were set.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-3-seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert a relatively recent change that set vmx->fail if the vCPU is in L2
and emulation_required is true, as that behavior is completely bogus.
Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
(a) it's impossible to have both a VM-Fail and VM-Exit
(b) vmcs.EXIT_REASON is not modified on VM-Fail
(c) emulation_required refers to guest state and guest state checks are
always VM-Exits, not VM-Fails.
For KVM specifically, emulation_required is handled before nested exits
in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
firing when tearing down the VM as KVM never expects vmx->fail to be set
when L2 is active, KVM always reflects those errors into L1.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
nested_vmx_vmexit+0x16bd/0x17e0
arch/x86/kvm/vmx/nested.c:4547
Modules linked in:
CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
Call Trace:
vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
__fput+0x3fc/0x870 fs/file_table.c:280
task_work_run+0x146/0x1c0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0x705/0x24f0 kernel/exit.c:832
do_group_exit+0x168/0x2d0 kernel/exit.c:929
get_signal+0x1740/0x2120 kernel/signal.c:2852
arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c8607e4a08 ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_run struct's if_flag is a part of the userspace/kernel API. The
SEV-ES patches failed to set this flag because it's no longer needed by
QEMU (according to the comment in the source code). However, other
hypervisors may make use of this flag. Therefore, set the flag for
guests with encrypted registers (i.e., with guest_state_protected set).
Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Signed-off-by: Marc Orr <marcorr@google.com>
Message-Id: <20211209155257.128747-1-marcorr@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The fixed counter 3 is used for the Topdown metrics, which hasn't been
enabled for KVM guests. Userspace accessing to it will fail as it's not
included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
which have this counter.
To reproduce it on ICX+ machines, ./state_test reports:
==== Test Assertion Failure ====
lib/x86_64/processor.c:1078: r == nmsrs
pid=4564 tid=4564 - Argument list too long
1 0x000000000040b1b9: vcpu_save_state at processor.c:1077
2 0x0000000000402478: main at state_test.c:209 (discriminator 6)
3 0x00007fbe21ed5f92: ?? ??:0
4 0x000000000040264d: _start at ??:?
Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)
With this patch, it works well.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Play nice with a NULL shadow page when checking for an obsolete root in
the page fault handler by flagging the page fault as stale if there's no
shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
Invalidating memslots, which is the only case where _all_ roots need to
be reloaded, requests all vCPUs to reload their MMUs while holding
mmu_lock for lock.
The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
backed by a shadow page. Running with TDP disabled or with nested NPT
explodes spectaculary due to dereferencing a NULL shadow page pointer.
Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
root. Zapping shadow pages in response to guest activity, e.g. when the
guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen
with a completely valid root shadow page. This is a bit of a moot point
as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
be cleaned up in the future.
Fixes: a955cad84c ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211209060552.2956723-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should
not depend on guest visible CPUID entries, even if just to allow
creating/restoring guest MSRs and CPUIDs in any sequence.
Fixes: 27461da310 ("KVM: x86/pmu: Support full width counting")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211216165213.338923-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the vmread exceptin path, use the, thus far, unused output register
to push the @fault argument onto the stack. This, in turn, enables the
exception handler to not do pushes and only modify that register when
an exception does occur.
As noted by Sean the input constraint needs to be changed to "=&r" to
avoid the value and field occupying the same register.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.781308550@infradead.org
KVM instruction emulation has a gnarly hack where the .fixup does a
return, however there's already a ret right after the 10b label, so
mark that as 11 and have the exception clear %esi to remove the
.fixup.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.722157053@infradead.org
Replace a WARN with a comment to call out that userspace can modify RCX
during an exit to userspace to handle string I/O. KVM doesn't actually
support changing the rep count during an exit, i.e. the scenario can be
ignored, but the WARN needs to go as it's trivial to trigger from
userspace.
Cc: stable@vger.kernel.org
Fixes: 3b27de2718 ("KVM: x86: split the two parts of emulator_pio_in")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211025201311.1881846-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the SDM:
If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an
attempt to clear CR0.PG causes a general-protection exception (#GP).
Software should transition to compatibility mode and clear CR4.PCIDE
before attempting to disable paging.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not bail early if there are no bits set in the sparse banks for a
non-sparse, a.k.a. "all CPUs", IPI request. Per the Hyper-V spec, it is
legal to have a variable length of '0', e.g. VP_SET's BankContents in
this case, if the request can be serviced without the extra info.
It is possible that for a given invocation of a hypercall that does
accept variable sized input headers that all the header input fits
entirely within the fixed size header. In such cases the variable sized
input header is zero-sized and the corresponding bits in the hypercall
input should be set to zero.
Bailing early results in KVM failing to send IPIs to all CPUs as expected
by the guest.
Fixes: 214ff83d44 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Splitting kvm_main.c out into smaller and better-organized files is
slightly non-trivial when it involves editing a bunch of per-arch
KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211121125451.9489-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>