- Ensure that the memory occupied by ACPI tables on x86 will always
be reserved to prevent it from being allocated for other purposes
which was possible in some cases (Rafael Wysocki).
- Fix the ACPI device enumeration code to prevent it from attempting
to evaluate the _STA control method for devices with unmet
dependencies which is likely to fail (Hans de Goede).
- Fix the handling of CPU0 wakeup in the ACPI processor driver to
prevent CPU0 online failures from occurring (Vitaly Kuznetsov).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmBnNboSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxM8EP/ijQgQURrTha3167d7o1e5tABBP57qaa
9w8biWSfzDhOY/8KvTfDGV38Hd8jmEoN1s1t6HitXIrzVFnLoI8x/1YrFCRvq9za
rPpnneROfOSNP3KdrYa4T6IF1O/Zp5hRTpp72n3+iBVukSSbN+p8+u7Q26OW2Vgx
OWF480ZZVgrKr1p1zjK5GzxVJV6UhM5L6rH5ZoCYGRbSaQOUgewd75/2IVhUOTKC
Sb4ua1MNa1TXR1YFKr5GYuhrg6B4J78WIXwXgX0HxDOy6fSt7wSUK4u6vLbG8UnU
uyyNlzhm5LYWOlJlJxfJpfzlNfukeKmONaYROmqTR3D090Zb382jkPYjJIw+VPsx
EG5CPvqGYDW75x2kDe9p61YfXDgxWu2Qstx0Pek1oPubUXT5/WmuN10CcHm0TF3O
j3fLwGUGByWRWOChmDVopXHyIcr1lbNm+wTYBts2AcygYfzo85ZuWtQXMUcsO9B5
ORvz/ejFxOm62HrtN2cn5aIJg2he1dL8DgAUO7nPJsgs0k9d3BgXODNt61d+EnqZ
4Fxs32s/6wVZQozpfEae+X3sdRpp5bSHOBOnOLTT8NGbBvrtcbrjQ6PaN3mQlbmw
t6bnaYvO8kPwD/HvAAhmJb01alTtcGCccxReCeZLIVGFS7Cm69Zm9jTLfpaGlffF
pGJoSYTSMxYP
=8KTH
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI tables management issue, an issue related to the
ACPI enumeration of devices and CPU wakeup in the ACPI processor
driver.
Specifics:
- Ensure that the memory occupied by ACPI tables on x86 will always
be reserved to prevent it from being allocated for other purposes
which was possible in some cases (Rafael Wysocki).
- Fix the ACPI device enumeration code to prevent it from attempting
to evaluate the _STA control method for devices with unmet
dependencies which is likely to fail (Hans de Goede).
- Fix the handling of CPU0 wakeup in the ACPI processor driver to
prevent CPU0 online failures from occurring (Vitaly Kuznetsov)"
* tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
ACPI: scan: Fix _STA getting called on devices with unmet dependencies
ACPI: tables: x86: Reserve memory occupied by ACPI tables
Commit 496121c021 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):
# echo 0 > /sys/devices/system/cpu/cpu0/online
# echo 1 > /sys/devices/system/cpu/cpu0/online
<10 seconds delay>
-bash: echo: write error: Input/output error
In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, to wakeup CPU an NMI is being used and
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:
/*
* If NMI wants to wake up CPU0, start CPU0.
*/
if (wakeup_cpu0())
start_cpu0();
cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
- NMI is sent to CPU0
- wakeup_cpu0_nmi() works as expected
- we get back to while (1) loop in acpi_idle_play_dead()
- safe_halt() puts CPU0 to sleep again.
The straightforward/minimal fix is add the special handling for CPU0 on x86
and that's what the patch is doing.
Fixes: 496121c021 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When guest time is reset with KVM_SET_CLOCK(0), it is possible for
'hv_clock->system_time' to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set 'kvm->arch.kvmclock_offset' based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does (masterclock in use case):
hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
And 'master_kernel_ns' represents the last time when masterclock
got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
problem, the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.
Use 'master_kernel_ns' instead of get_kvmclock_ns() when masterclock is in
use and get_kvmclock_base_ns() when it's not to prevent 'system_time' from
going negative.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
pvclock_gtod_sync_lock can be taken with interrupts disabled if the
preempt notifier calls get_kvmclock_ns to update the Xen
runstate information:
spin_lock include/linux/spinlock.h:354 [inline]
get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062
So change the users of the spinlock to spin_lock_irqsave and
spin_unlock_irqrestore.
Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no need to include changes to vcpu->requests into
the pvclock_gtod_sync_lock critical section. The changes to
the shared data structures (in pvclock_update_vm_gtod_copy)
already occur under the lock.
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixing nested_vmcb_check_save to avoid all TOC/TOU races
is a bit harder in released kernels, so do the bare minimum
by avoiding that EFER.SVME is cleared. This is problematic
because svm_set_efer frees the data structures for nested
virtualization if EFER.SVME is cleared.
Also check that EFER.SVME remains set after a nested vmexit;
clearing it could happen if the bit is zero in the save area
that is passed to KVM_SET_NESTED_STATE (the save area of the
nested state corresponds to the nested hypervisor's state
and is restored on the next nested vmexit).
Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid races between check and use of the nested VMCB controls. This
for example ensures that the VMRUN intercept is always reflected to the
nested hypervisor, instead of being processed by the host. Without this
patch, it is possible to end up with svm->nested.hsave pointing to
the MSR permission bitmap for nested guests.
This bug is CVE-2021-29657.
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prevent the TDP MMU from yielding when zapping a gfn range during NX
page recovery. If a flush is pending from a previous invocation of the
zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
has not accumulated a flush for the current invocation, then yielding
will release mmu_lock with stale TLB entries.
That being said, this isn't technically a bug fix in the current code, as
the TDP MMU will never yield in this case. tdp_mmu_iter_cond_resched()
will yield if and only if it has made forward progress, as defined by the
current gfn vs. the last yielded (or starting) gfn. Because zapping a
single shadow page is guaranteed to (a) find that page and (b) step
sideways at the level of the shadow page, the TDP iter will break its loop
before getting a chance to yield.
But that is all very, very subtle, and will break at the slightest sneeze,
e.g. zapping while holding mmu_lock for read would break as the TDP MMU
wouldn't be guaranteed to see the present shadow page, and thus could step
sideways at a lower level.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-4-seanjc@google.com>
[Add lockdep assertion. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
does the flush itself if and only if it yields (which it will never do in
this particular scenario), and otherwise expects the caller to do the
flush. If pages are zapped from the TDP MMU but not the legacy MMU, then
no flush will occur.
Fixes: 29cf0f5007 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-3-seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When flushing a range of GFNs across multiple roots, ensure any pending
flush from a previous root is honored before yielding while walking the
tables of the current root.
Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
"flush" with the result to avoid redundant flushes. zap_gfn_range()
preserves and return the incoming "flush", unless of course the flush was
performed prior to yielding and no new flush was triggered.
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Cc: stable@vger.kernel.org
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Building kvm module out-of-source with,
make -C $SRC O=$BIN M=arch/x86/kvm
fails to find "irq.h" as the include dir passed to cflags-y does not
prefix the source dir. Fix this by prefixing $(srctree) to the include
dir path.
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <20210324124347.18336-1-sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_F15H_PERF_CTL0-5, MSR_F15H_PERF_CTR0-5 MSRs are only available when
X86_FEATURE_PERFCTR_CORE CPUID bit was exposed to the guest. KVM, however,
allows these MSRs unconditionally because kvm_pmu_is_valid_msr() ->
amd_msr_idx_to_pmc() check always passes and because kvm_pmu_set_msr() ->
amd_pmu_set_msr() doesn't fail.
In case of a counter (CTRn), no big harm is done as we only increase
internal PMC's value but in case of an eventsel (CTLn), we go deep into
perf internals with a non-existing counter.
Note, kvm_get_msr_common() just returns '0' when these MSRs don't exist
and this also seems to contradict architectural behavior which is #GP
(I did check one old Opteron host) but changing this status quo is a bit
scarier.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210323084515.1346540-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_write_tsc() was renamed and made static since commit 0c899c25d7
("KVM: x86: do not attempt TSC synchronization on guest writes"). Remove
its unused declaration.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210326070334.12310-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_msr_ignored_check function never uses vcpu argument. Clean up the
function and invokers.
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210313051032.4171-1-lihaiwei.kernel@gmail.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The following problem has been reported by George Kennedy:
Since commit 7fef431be9 ("mm/page_alloc: place pages to tail
in __free_pages_core()") the following use after free occurs
intermittently when ACPI tables are accessed.
BUG: KASAN: use-after-free in ibft_init+0x134/0xc49
Read of size 4 at addr ffff8880be453004 by task swapper/0/1
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc1-7a7fd0d #1
Call Trace:
dump_stack+0xf6/0x158
print_address_description.constprop.9+0x41/0x60
kasan_report.cold.14+0x7b/0xd4
__asan_report_load_n_noabort+0xf/0x20
ibft_init+0x134/0xc49
do_one_initcall+0xc4/0x3e0
kernel_init_freeable+0x5af/0x66b
kernel_init+0x16/0x1d0
ret_from_fork+0x22/0x30
ACPI tables mapped via kmap() do not have their mapped pages
reserved and the pages can be "stolen" by the buddy allocator.
Apparently, on the affected system, the ACPI table in question is
not located in "reserved" memory, like ACPI NVS or ACPI Data, that
will not be used by the buddy allocator, so the memory occupied by
that table has to be explicitly reserved to prevent the buddy
allocator from using it.
In order to address this problem, rearrange the initialization of the
ACPI tables on x86 to locate the initial tables earlier and reserve
the memory occupied by them.
The other architectures using ACPI should not be affected by this
change.
Link: https://lore.kernel.org/linux-acpi/1614802160-29362-1-git-send-email-george.kennedy@oracle.com/
Reported-by: George Kennedy <george.kennedy@oracle.com>
Tested-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
- Fix build failure on Ubuntu with new GCC packages that turn on -fcf-protection
- Fix SME memory encryption PTE encoding bug - AFAICT the code worked on
4K page sizes (level 1) but had the wrong shift at higher page level orders
(level 2 and higher).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmBgXdERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gJWxAAgAOWwGY3yq3kUEtIExosXZPlHCFjal3N
UXoVpQde4aBeZ9A4flMjkSZmTF5PVEN2npMz8ltnxU8NUJg4QR68UYiIE8BReARg
+JuyNXGdAu1XyT+dWdTFqL9xgA9t8dG13o4WbBqGDZagnLNuvjYhzJtsgw9FbNWZ
a1abBbcxpoZvSyQBHyqtuwoiWeeeFJiQZ02wZwxtonYHWVbBXEN5WhFL9Tc2kDJc
/Ic09O+FDhpe3I/PvCiMrkpVJuBnaDdve5zDPDzR+FRMeAj4AhNLIJiMFj17bJWD
eR6vCDoFz3EsbSdJz0XvHIZOSZnaiiC0ybTEv5nJTiRgDk+s6JDXWwDcJG+3yKJR
Fm5TLlnaU++E9lYLpyCbgrWkrv0F2u3GmnieFnOOyzRv8NlkZqrThApf3xGsavy+
qJZnXe5ftWp+mmIDw4DZDBVsJ8rBIflvURQxfG3SHkUc0iVsyUCrAK2eKYewk/dN
eC6FVPkCdx4Ys50wb+OR9Enhq3yKFyRuZ2zIeguUX30sqoapJL85M1vglS5DFoX/
pHcigRzBzFQOZhOh8Kq3VREOx0F+ioUfcZzmYdzjWSfXfpvqWFcLAIFgOv1hDfms
XQ60X/voG0tWd0ODKXqyx6oa0rqamigPjLJp/gtDKpQHORFaabvnTJTLwN6n8N1Q
syTWRiHMhi0=
=tM9n
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Two fixes:
- Fix build failure on Ubuntu with new GCC packages that turn
on -fcf-protection
- Fix SME memory encryption PTE encoding bug - AFAICT the code
worked on 4K page sizes (level 1) but had the wrong shift at
higher page level orders (level 2 and higher)"
* tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Turn off -fcf-protection for realmode targets
x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYF37OgAKCRCAXGG7T9hj
vp8hAP4h7mvjfkntbFXagrJK9pi2xVC9d/YO5nfa8/K3LcGVnQD/fKcU9ggPN9vI
GLnhyprGLcCA4aTL6Ogb37o9fDd4Yws=
=joIg
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains a small series with a more elegant fix of a problem
which was originally fixed in rc2"
* tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
Pull networking fixes from David Miller:
"Various fixes, all over:
1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu.
2) Always store the rx queue mapping in veth, from Maciej
Fijalkowski.
3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov.
4) Fix memory leak in octeontx2-af from Colin Ian King.
5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from
Yonghong Song.
6) Fix tx ptp stats in mlx5, from Aya Levin.
7) Check correct ip version in tun decap, fropm Roi Dayan.
8) Fix rate calculation in mlx5 E-Switch code, from arav Pandit.
9) Work item memork leak in mlx5, from Shay Drory.
10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann.
11) Lack of preemptrion awareness in macvlan, from Eric Dumazet.
12) Fix data race in pxa168_eth, from Pavel Andrianov.
13) Range validate stab in red_check_params(), from Eric Dumazet.
14) Inherit vlan filtering setting properly in b53 driver, from
Florian Fainelli.
15) Fix rtnl locking in igc driver, from Sasha Neftin.
16) Pause handling fixes in igc driver, from Muhammad Husaini
Zulkifli.
17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits.
18) Use after free in qlcnic, from Lv Yunlong.
19) fix crash in fritzpci mISDN, from Tong Zhang.
20) Premature rx buffer reuse in igb, from Li RongQing.
21) Missing termination of ip[a driver message handler arrays, from
Alex Elder.
22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25
driver, from Xie He.
23) Use after free in c_can_pci_remove(), from Tong Zhang.
24) Uninitialized variable use in nl80211, from Jarod Wilson.
25) Off by one size calc in bpf verifier, from Piotr Krysiuk.
26) Use delayed work instead of deferrable for flowtable GC, from
Yinjun Zhang.
27) Fix infinite loop in NPC unmap of octeontx2 driver, from
Hariprasad Kelam.
28) Fix being unable to change MTU of dwmac-sun8i devices due to lack
of fifo sizes, from Corentin Labbe.
29) DMA use after free in r8169 with WoL, fom Heiner Kallweit.
30) Mismatched prototypes in isdn-capi, from Arnd Bergmann.
31) Fix psample UAPI breakage, from Ido Schimmel"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits)
psample: Fix user API breakage
math: Export mul_u64_u64_div_u64
ch_ktls: fix enum-conversion warning
octeontx2-af: Fix memory leak of object buf
ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation
net: bridge: don't notify switchdev for local FDB addresses
net/sched: act_ct: clear post_ct if doing ct_clear
net: dsa: don't assign an error value to tag_ops
isdn: capi: fix mismatched prototypes
net/mlx5: SF, do not use ecpu bit for vhca state processing
net/mlx5e: Fix division by 0 in mlx5e_select_queue
net/mlx5e: Fix error path for ethtool set-priv-flag
net/mlx5e: Offload tuple rewrite for non-CT flows
net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
net/mlx5: Add back multicast stats for uplink representor
net: ipconfig: ic_dev can be NULL in ic_close_devs
MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one
docs: networking: Fix a typo
r8169: fix DMA being used after buffer free if WoL is enabled
net: ipa: fix init header command validation
...
This partially reverts commit 882213990d ("xen: fix p2m size in dom0
for disabled memory hotplug case")
There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.
Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in
MEMORY_HOTPLUG.
Leave the check added to __set_phys_to_machine and the adjusted
comment about EXTRA_MEM_RATIO.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210324122424.58685-3-roger.pau@citrix.com
[boris: fixed formatting issues]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The Xen memory hotplug limit should depend on the memory hotplug
generic option, rather than the Xen balloon configuration. It's
possible to have a kernel with generic memory hotplug enabled, but
without Xen balloon enabled, at which point memory hotplug won't work
correctly due to the size limitation of the p2m.
Rename the option to XEN_MEMORY_HOTPLUG_LIMIT since it's no longer
tied to ballooning.
Fixes: 9e2369c06c ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210324122424.58685-2-roger.pau@citrix.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The new Ubuntu GCC packages turn on -fcf-protection globally,
which causes a build failure in the x86 realmode code:
cc1: error: ‘-fcf-protection’ is not compatible with this target
Turn it off explicitly on compilers that understand this option.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210323124846.1584944-1-arnd@kernel.org
The pfn variable contains the page frame number as returned by the
pXX_pfn() functions, shifted to the right by PAGE_SHIFT to remove the
page bits. After page protection computations are done to it, it gets
shifted back to the physical address using page_level_shift().
That is wrong, of course, because that function determines the shift
length based on the level of the page in the page table but in all the
cases, it was shifted by PAGE_SHIFT before.
Therefore, shift it back using PAGE_SHIFT to get the correct physical
address.
[ bp: Rewrite commit message. ]
Fixes: dfaaec9033 ("x86: Add support for changing memory encryption attribute in early boot")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/81abbae1657053eccc535c16151f63cd049dcb97.1616098294.git.isaku.yamahata@intel.com
devicetree-node lookups.
- Restore the IRQ2 ignore logic
- Fix get_nr_restart_syscall() to return the correct restart syscall number.
Split in a 4-patches set to avoid kABI breakage when backporting to dead
kernels.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmBXJu0ACgkQEsHwGGHe
VUrCkQ/9Et5W76HMQfHccluks2i2yNXgd7nROhIt0iMS1Ph86AWYJZmMZ2dbaqW8
nORU20ziHme+9PScmcJb2LdJxIRDtYNs1J811IYeKNpvj8KHXtV2VYCVG9UcL21E
FmUlZf5oINiDMzu3q4SuqHw9t7X6RCItolQIRmQHDXqPraFhBxji2VOFXDIg+qhf
a4sBz6UfxA4a/b7d/KxHxNvuQE5Cluc9gninhtaYh1b7OQZJX4+vTa3W5V4kK0df
ohOH5pnJp9V7qH2CmB3UcGWJTxHeLbm4E0KYkyasnKG9M0KmIvJ6jNARlRAo3hAF
hn9D4xLtsnIWjtO6xEVdF7kSizkYZRPay5kX88quvlSa0FkkPnsUvFtW79Yi3ZNy
vL2NAu2biqNQyo7ZWVffJns2DrJwYZ6KOGA6oUBwTUBfieF9KMdDew8IXRUMYNdO
LzW87Irf9eZj9c+b7Rtr0VofmKgRYwy1Lo8eVT+VGkV+nOTOB9rlAll2lYBq3aNA
W6ei0S5/1zaRF5aU6Qmnap4eb1X/tp845q6CPYa9kIsZwVyGFOa7iLeYcNn9qHdB
G6RW6CUh97A7wwxUYt5VGUscjYV2V9Ycv9HvIwrG/T7aezWnhI9ODtggzDgCnbls
og6N/+heLZ9G/DyxAEmHuazV2ItDPJq69gag/POHhXJaSUGbdbA=
=WfC4
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The freshest pile of shiny x86 fixes for 5.12:
- Add the arch-specific mapping between physical and logical CPUs to
fix devicetree-node lookups
- Restore the IRQ2 ignore logic
- Fix get_nr_restart_syscall() to return the correct restart syscall
number. Split in a 4-patches set to avoid kABI breakage when
backporting to dead kernels"
* tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/of: Fix CPU devicetree-node lookups
x86/ioapic: Ignore IRQ2 again
x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
x86: Move TS_COMPAT back to asm/thread_info.h
kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
I have handful of fixes for 5.12:
* A fix to the SBI remote fence numbers for hypervisor fences, which had
been transcribed in the wrong order in Linux. These fences are only
used with the KVM patches applied.
* A whole host of build warnings have been fixed, these should have no
functional change.
* A fix to init_resources() that prevents an off-by-one error from
causing an out-of-bounds array reference. This is manifesting during
boot on vexriscv.
* A fix to ensure the KASAN mappings are visible before proceeding to
use them.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmBVgV4THHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiTOWD/4l+uRCwTelZqm/G0yKSSAevAv5Crsc
Nzsa1uq7dOC+JLZ5y96SUng825WdGX+HiIf7QyUFPzpnqyYc4+ROwNb80ObPWQZU
dctatP2g9Jk2ImmJbGQVeDXKAiqrMM3hf1bOF3N3VV9DpqID0z/S8l8H9mz7x9yl
opd6kXxCPFKLgmAbMxcsytUduxZrJEcCpy3jPpIvjJ3BrzaGZlgjytqc2tYvbv/L
9i//evmGTCNXfQPrWEcMpBPbMf+aSzb/9Im8THB42jpJVQ7kx3txVg6d+wb73oGf
XHkm5mwrESAcnVGfxY5xRaaSK/L2k5Lg98J1K/BIHIKskjCTg5FdyrgeGwdtLg6T
FuXEvK29FJgfMb7k2Mf25l/Lglzi4q4LxBO4wcAUb1OpaVeK2kgYJr1eniSKrE/v
NF5/bD9h7sD1qbZLfk+lsTggBGfMBmthwp59jNb7V4cLkIFXwopgx2h/73jm6kn8
8fMCTlwOoktewbv0DdWCy0Sfaa0iCXMSJy+Y13GWlcEMvQn1VLtX7RbQzZq9X+tV
C/qkp1SdXfPG3vJbkNnZh/eS12F6vDauYJ814s3VAeJKOoMJWABB6Jm2SoBwFM6v
kpIRNzDyJ1oKhF4PxIrmGkv6PvRM/j5akspOwy/zdHB3FBVCGmyuoB9GE8Bg1Rw7
xyfdZthPDdvGyQ==
=XhDE
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
"A handful of fixes for 5.12:
- fix the SBI remote fence numbers for hypervisor fences, which had
been transcribed in the wrong order in Linux. These fences are only
used with the KVM patches applied.
- fix a whole host of build warnings, these should have no functional
change.
- fix init_resources() to prevent an off-by-one error from causing an
out-of-bounds array reference. This was manifesting during boot on
vexriscv.
- ensure the KASAN mappings are visible before proceeding to use
them"
* tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Correct SPARSEMEM configuration
RISC-V: kasan: Declare kasan_shallow_populate() static
riscv: Ensure page table writes are flushed when initializing KASAN vmalloc
RISC-V: Fix out-of-bounds accesses in init_resources()
riscv: Fix compilation error with Canaan SoC
ftrace: Fix spelling mistake "disabed" -> "disabled"
riscv: fix bugon.cocci warnings
riscv: process: Fix no prototype for arch_dup_task_struct
riscv: ftrace: Use ftrace_get_regs helper
riscv: process: Fix no prototype for show_regs
riscv: syscall_table: Reduce W=1 compilation warnings noise
riscv: time: Fix no prototype for time_init
riscv: ptrace: Fix no prototype warnings
riscv: sbi: Fix comment of __sbi_set_timer_v01
riscv: irq: Fix no prototype warning
riscv: traps: Fix no prototype warnings
RISC-V: correct enum sbi_ext_rfence_fid
__bpf_arch_text_poke does rewrite only for atomic nop5, emit_nops(xxx, 5)
emits non-atomic one which breaks fentry/fexit with k8 atomics:
P6_NOP5 == P6_NOP5_ATOMIC (0f1f440000 == 0f1f440000)
K8_NOP5 != K8_NOP5_ATOMIC (6666906690 != 6666666690)
Can be reproduced by doing "ideal_nops = k8_nops" in "arch_init_ideal_nops()
and running fexit_bpf2bpf selftest.
Fixes: e21aa34178 ("bpf: Fix fexit trampoline.")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210320000001.915366-1-sdf@google.com
Architectures that describe the CPU topology in devicetree and do not have
an identity mapping between physical and logical CPU ids must override the
default implementation of arch_match_cpu_phys_id().
Failing to do so breaks CPU devicetree-node lookups using of_get_cpu_node()
and of_cpu_device_node_get() which several drivers rely on. It also causes
the CPU struct devices exported through sysfs to point to the wrong
devicetree nodes.
On x86, CPUs are described in devicetree using their APIC ids and those
do not generally coincide with the logical ids, even if CPU0 typically
uses APIC id 0.
Add the missing implementation of arch_match_cpu_phys_id() so that CPU-node
lookups work also with SMP.
Apart from fixing the broken sysfs devicetree-node links this likely does
not affect current users of mainline kernels on x86.
Fixes: 4e07db9c8d ("x86/devicetree: Use CPU description from Device Tree")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210312092033.26317-1-johan@kernel.org
* new selftests
* fixes for migration with HyperV re-enlightenment enabled
* fix RCU/SRCU usage
* fixes for local_irq_restore misuse false positive
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmBUpO8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPj6Af+LSkDniR08Eh/x4GHdX+ZSA9EhNuP
PMqL+nDYvLXqc0XaErbZQpQbSP4aK7Tjly0LguZmNkBk17pnbjLb5Vv9hqJ30pM/
pI8bGgdh+KDO9LClfrgsaYgC+B4R+fwqqTIvtBYMilVZ96JwixFiODB4ntRQmZgd
xJS99jwjD8TO9pTYskKPf8y8yv5W9RH+wVQGXwc+T/sSzK/rcL4Jwt/ibO2FLcJK
gBRXJDVjMIlpxPrqqoejVB2FHQQe36Bns85QU3dz0QuXfDuuEvbShY/f4R1z32fT
RaccrvdMQtvgwS0l9Ij06PT0BdiG0EdZv/gOBUq5gVgx4XZyJTleJaVURw==
=WZP4
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Fixes for kvm on x86:
- new selftests
- fixes for migration with HyperV re-enlightenment enabled
- fix RCU/SRCU usage
- fixes for local_irq_restore misuse false positive"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
documentation/kvm: additional explanations on KVM_SET_BOOT_CPU_ID
x86/kvm: Fix broken irq restoration in kvm_wait
KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs
KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
selftests: kvm: add set_boot_cpu_id test
selftests: kvm: add _vm_ioctl
selftests: kvm: add get_msr_index_features
selftests: kvm: Add basic Hyper-V clocksources tests
KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment
KVM: x86: hyper-v: Track Hyper-V TSC page status
KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs
KVM: x86: hyper-v: Limit guest to writing zero to HV_X64_MSR_TSC_EMULATION_STATUS
KVM: x86/mmu: Store the address space ID in the TDP iterator
KVM: x86/mmu: Factor out tdp_iter_return_to_root
KVM: x86/mmu: Fix RCU usage when atomically zapping SPTEs
KVM: x86/mmu: Fix RCU usage in handle_removed_tdp_mmu_page
Vitaly ran into an issue with hotplugging CPU0 on an Amazon instance where
the matrix allocator claimed to be out of vectors. He analyzed it down to
the point that IRQ2, the PIC cascade interrupt, which is supposed to be not
ever routed to the IO/APIC ended up having an interrupt vector assigned
which got moved during unplug of CPU0.
The underlying issue is that IRQ2 for various reasons (see commit
af174783b9 ("x86: I/O APIC: Never configure IRQ2" for details) is treated
as a reserved system vector by the vector core code and is not accounted as
a regular vector. The Amazon BIOS has an routing entry of pin2 to IRQ2
which causes the IO/APIC setup to claim that interrupt which is granted by
the vector domain because there is no sanity check. As a consequence the
allocation counter of CPU0 underflows which causes a subsequent unplug to
fail with:
[ ... ] CPU 0 has 4294967295 vectors, 589 available. Cannot disable CPU
There is another sanity check missing in the matrix allocator, but the
underlying root cause is that the IO/APIC code lost the IRQ2 ignore logic
during the conversion to irqdomains.
For almost 6 years nobody complained about this wreckage, which might
indicate that this requirement could be lifted, but for any system which
actually has a PIC IRQ2 is unusable by design so any routing entry has no
effect and the interrupt cannot be connected to a device anyway.
Due to that and due to history biased paranoia reasons restore the IRQ2
ignore logic and treat it as non existent despite a routing entry claiming
otherwise.
Fixes: d32932d02e ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210318192819.636943062@linutronix.de
After commit 997acaf6b4 (lockdep: report broken irq restoration), the guest
splatting below during boot:
raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 1 PID: 169 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x26/0x30
Modules linked in: hid_generic usbhid hid
CPU: 1 PID: 169 Comm: systemd-udevd Not tainted 5.11.0+ #25
RIP: 0010:warn_bogus_irq_restore+0x26/0x30
Call Trace:
kvm_wait+0x76/0x90
__pv_queued_spin_lock_slowpath+0x285/0x2e0
do_raw_spin_lock+0xc9/0xd0
_raw_spin_lock+0x59/0x70
lockref_get_not_dead+0xf/0x50
__legitimize_path+0x31/0x60
legitimize_root+0x37/0x50
try_to_unlazy_next+0x7f/0x1d0
lookup_fast+0xb0/0x170
path_openat+0x165/0x9b0
do_filp_open+0x99/0x110
do_sys_openat2+0x1f1/0x2e0
do_sys_open+0x5c/0x80
__x64_sys_open+0x21/0x30
do_syscall_64+0x32/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xae
The new consistency checking, expects local_irq_save() and
local_irq_restore() to be paired and sanely nested, and therefore expects
local_irq_restore() to be called with irqs disabled.
The irqflags handling in kvm_wait() which ends up doing:
local_irq_save(flags);
safe_halt();
local_irq_restore(flags);
instead triggers it. This patch fixes it by using
local_irq_disable()/enable() directly.
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615791328-2735-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to deal with noncoherent DMA, we should execute wbinvd on
all dirty pCPUs when guest wbinvd exits to maintain data consistency.
smp_call_function_many() does not execute the provided function on the
local core, therefore replace it by on_each_cpu_mask().
Reported-by: Nadav Amit <namit@vmware.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a plethora of issues with MSR filtering by installing the resulting
filter as an atomic bundle instead of updating the live filter one range
at a time. The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
the hardware MSR bitmaps won't be updated until the next VM-Enter, but
the relevant software struct is atomically updated, which is what KVM
really needs.
Similar to the approach used for modifying memslots, make arch.msr_filter
a SRCU-protected pointer, do all the work configuring the new filter
outside of kvm->lock, and then acquire kvm->lock only when the new filter
has been vetted and created. That way vCPU readers either see the old
filter or the new filter in their entirety, not some half-baked state.
Yuan Yao pointed out a use-after-free in ksm_msr_allowed() due to a
TOCTOU bug, but that's just the tip of the iceberg...
- Nothing is __rcu annotated, making it nigh impossible to audit the
code for correctness.
- kvm_add_msr_filter() has an unpaired smp_wmb(). Violation of kernel
coding style aside, the lack of a smb_rmb() anywhere casts all code
into doubt.
- kvm_clear_msr_filter() has a double free TOCTOU bug, as it grabs
count before taking the lock.
- kvm_clear_msr_filter() also has memory leak due to the same TOCTOU bug.
The entire approach of updating the live filter is also flawed. While
installing a new filter is inherently racy if vCPUs are running, fixing
the above issues also makes it trivial to ensure certain behavior is
deterministic, e.g. KVM can provide deterministic behavior for MSRs with
identical settings in the old and new filters. An atomic update of the
filter also prevents KVM from getting into a half-baked state, e.g. if
installing a filter fails, the existing approach would leave the filter
in a half-baked state, having already committed whatever bits of the
filter were already processed.
[*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com
Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210316184436.2544875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When guest opts for re-enlightenment notifications upon migration, it is
in its right to assume that TSC page values never change (as they're only
supposed to change upon migration and the host has to keep things as they
are before it receives confirmation from the guest). This is mostly true
until the guest is migrated somewhere. KVM userspace (e.g. QEMU) will
trigger masterclock update by writing to HV_X64_MSR_REFERENCE_TSC, by
calling KVM_SET_CLOCK,... and as TSC value and kvmclock reading drift
apart (even slightly), the update causes TSC page values to change.
The issue at hand is that when Hyper-V is migrated, it uses stale (cached)
TSC page values to compute the difference between its own clocksource
(provided by KVM) and its guests' TSC pages to program synthetic timers
and in some cases, when TSC page is updated, this puts all stimer
expirations in the past. This, in its turn, causes an interrupt storm
and L2 guests not making much forward progress.
Note, KVM doesn't fully implement re-enlightenment notification. Basically,
the support for reenlightenment MSRs is just a stub and userspace is only
expected to expose the feature when TSC scaling on the expected destination
hosts is available. With TSC scaling, no real re-enlightenment is needed
as TSC frequency doesn't change. With TSC scaling becoming ubiquitous, it
likely makes little sense to fully implement re-enlightenment in KVM.
Prevent TSC page from being updated after migration. In case it's not the
guest who's initiating the change and when TSC page is already enabled,
just keep it as it is: TSC value is supposed to be preserved across
migration and TSC frequency can't change with re-enlightenment enabled.
The guest is doomed anyway if any of this is not true.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create an infrastructure for tracking Hyper-V TSC page status, i.e. if it
was updated from guest/host side or if we've failed to set it up (because
e.g. guest wrote some garbage to HV_X64_MSR_REFERENCE_TSC) and there's no
need to retry.
Also, in a hypothetical situation when we are in 'always catchup' mode for
TSC we can now avoid contending 'hv->hv_lock' on every guest enter by
setting the state to HV_TSC_PAGE_BROKEN after compute_tsc_page_parameters()
returns false.
Check for HV_TSC_PAGE_SET state instead of '!hv->tsc_ref.tsc_sequence' in
get_time_ref_counter() to properly handle the situation when we failed to
write the updated TSC page values to the guest.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory causing the crash.
Solve this by adding percpu_ref_get/put for the duration of trampoline
and separate trampoline vs its image life times.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.
Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul E. McKenney <paulmck@kernel.org> # for RCU
Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
MODULE_SUPPORTED_DEVICE was added in pre-git era and never was
implemented. We can safely remove it, because the kernel has grown
to have many more reliable mechanisms to determine if device is
supported or not.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When KVM_REQ_MASTERCLOCK_UPDATE request is issued (e.g. after migration)
we need to make sure no vCPU sees stale values in PV clock structures and
thus all vCPUs are kicked with KVM_REQ_CLOCK_UPDATE. Hyper-V TSC page
clocksource is global and kvm_guest_time_update() only updates in on vCPU0
but this is not entirely correct: nothing blocks some other vCPU from
entering the guest before we finish the update on CPU0 and it can read
stale values from the page.
Invalidate TSC page in kvm_gen_update_masterclock() to switch all vCPUs
to using MSR based clocksource (HV_X64_MSR_TIME_REF_COUNT).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_TSC_EMULATION_STATUS indicates whether TSC accesses are emulated
after migration (to accommodate for a different host TSC frequency when TSC
scaling is not supported; we don't implement this in KVM). Guest can use
the same MSR to stop TSC access emulation by writing zero. Writing anything
else is forbidden.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a spelling mistake in a comment, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Save the current_thread_info()->status of X86 in the new
restart_block->arch_data field so TS_COMPAT_RESTART can be removed again.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210201174716.GA17898@redhat.com
The comment in get_nr_restart_syscall() says:
* The problem is that we can get here when ptrace pokes
* syscall-like values into regs even if we're not in a syscall
* at all.
Yes, but if not in a syscall then the
status & (TS_COMPAT|TS_I386_REGS_POKED)
check below can't really help:
- TS_COMPAT can't be set
- TS_I386_REGS_POKED is only set if regs->orig_ax was changed by
32bit debugger; and even in this case get_nr_restart_syscall()
is only correct if the tracee is 32bit too.
Suppose that a 64bit debugger plays with a 32bit tracee and
* Tracee calls sleep(2) // TS_COMPAT is set
* User interrupts the tracee by CTRL-C after 1 sec and does
"(gdb) call func()"
* gdb saves the regs by PTRACE_GETREGS
* does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1
* PTRACE_CONT // TS_COMPAT is cleared
* func() hits int3.
* Debugger catches SIGTRAP.
* Restore original regs by PTRACE_SETREGS.
* PTRACE_CONT
get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, the
tracee calls ia32_sys_call_table[219] == sys_madvise.
Add the sticky TS_COMPAT_RESTART flag which survives after return to user
mode. It's going to be removed in the next step again by storing the
information in the restart block. As a further cleanup it might be possible
to remove also TS_I386_REGS_POKED with that.
Test-case:
$ cvs -d :pserver:anoncvs:anoncvs@sourceware.org:/cvs/systemtap co ptrace-tests
$ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c --m32
$ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil
$ ./erestartsys-trap-debugger
Unexpected: retval 1, errno 22
erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421
Fixes: 609c19a385 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174709.GA17895@redhat.com
Move TS_COMPAT back to asm/thread_info.h, close to TS_I386_REGS_POKED.
It was moved to asm/processor.h by b9d989c721 ("x86/asm: Move the
thread_info::status field to thread_struct"), then later 37a8f7c383
("x86/asm: Move 'status' from thread_struct to thread_info") moved the
'status' field back but TS_COMPAT was forgotten.
Preparatory patch to fix the COMPAT case for get_nr_restart_syscall()
Fixes: 609c19a385 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210201174649.GA17880@redhat.com
On a Haswell machine, the perf_fuzzer managed to trigger this message:
[117248.075892] unchecked MSR access error: WRMSR to 0x3f1 (tried to
write 0x0400000000000000) at rIP: 0xffffffff8106e4f4
(native_write_msr+0x4/0x20)
[117248.089957] Call Trace:
[117248.092685] intel_pmu_pebs_enable_all+0x31/0x40
[117248.097737] intel_pmu_enable_all+0xa/0x10
[117248.102210] __perf_event_task_sched_in+0x2df/0x2f0
[117248.107511] finish_task_switch.isra.0+0x15f/0x280
[117248.112765] schedule_tail+0xc/0x40
[117248.116562] ret_from_fork+0x8/0x30
A fake event called VLBR_EVENT may use the bit 58 of the PEBS_ENABLE, if
the precise_ip is set. The bit 58 is reserved by the HW. Accessing the
bit causes the unchecked MSR access error.
The fake event doesn't support PEBS. The case should be rejected.
Fixes: 097e4311cd ("perf/x86: Add constraint to create guest LBR event without hw counter")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1615555298-140216-2-git-send-email-kan.liang@linux.intel.com
A repeatable crash can be triggered by the perf_fuzzer on some Haswell
system.
https://lore.kernel.org/lkml/7170d3b-c17f-1ded-52aa-cc6d9ae999f4@maine.edu/
For some old CPUs (HSW and earlier), the PEBS status in a PEBS record
may be mistakenly set to 0. To minimize the impact of the defect, the
commit was introduced to try to avoid dropping the PEBS record for some
cases. It adds a check in the intel_pmu_drain_pebs_nhm(), and updates
the local pebs_status accordingly. However, it doesn't correct the PEBS
status in the PEBS record, which may trigger the crash, especially for
the large PEBS.
It's possible that all the PEBS records in a large PEBS have the PEBS
status 0. If so, the first get_next_pebs_record_by_bit() in the
__intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large
PEBS, the 'count' parameter must > 1. The second
get_next_pebs_record_by_bit() will crash.
Besides the local pebs_status, correct the PEBS status in the PEBS
record as well.
Fixes: 01330d7288 ("perf/x86: Allow zero PEBS status with only single active event")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1615555298-140216-1-git-send-email-kan.liang@linux.intel.com
Store the address space ID in the TDP iterator so that it can be
retrieved without having to bounce through the root shadow page. This
streamlines the code and fixes a Sparse warning about not properly using
rcu_dereference() when grabbing the ID from the root on the fly.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-5-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In tdp_mmu_iter_cond_resched there is a call to tdp_iter_start which
causes the iterator to continue its walk over the paging structure from
the root. This is needed after a yield as paging structure could have
been freed in the interim.
The tdp_iter_start call is not very clear and something of a hack. It
requires exposing tdp_iter fields not used elsewhere in tdp_mmu.c and
the effect is not obvious from the function name. Factor a more aptly
named function out of tdp_iter_start and call it from
tdp_mmu_iter_cond_resched and tdp_iter_start.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-4-bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a missing rcu_dereference in tdp_mmu_zap_spte_atomic.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-3-bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>