Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e0 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When stuffing the allowed secondary execution controls for nested VMX in
response to CPUID updates, don't set the allowed-1 bit for a feature that
isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.
WARN if KVM attempts to manipulate a feature that isn't supported. All
features that are currently stuffed are always advertised to L1 for
nested VMX if they are supported in KVM's base configuration, and no
additional features should ever be added to the CPUID-induced stuffing
(updating VMX MSRs in response to CPUID updates is a long-standing KVM
flaw that is slowly being fixed).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213062306.667649-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
feature is supported in hardware and enabled in KVM's base, non-nested
configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
obviously allows L1 to enable the feature for L2.
KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
that the guest value must be a strict subset of the supported host value.
Although no past commit explicitly enabled nested support for WAITPKG,
doing so is safe and functionally correct from an architectural
perspective as no additional KVM support is needed to virtualize TPAUSE,
UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
VM-Exits to L1 as necessary (commit bf653b78f9, "KVM: vmx: Introduce
handle_unexpected_vmexit and handle WAITPKG vmexit").
Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in
hardware, i.e. always runs both L1 and L2 with the host's power management
settings for TPAUSE and UMWAIT. See commit bf09fb6cba ("KVM: VMX: Stop
context switching MSR_IA32_UMWAIT_CONTROL") for more details.
Fixes: e69e72faa3 ("KVM: x86: Add support for user wait instructions")
Cc: stable@vger.kernel.org
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20221213062306.667649-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly drop the result of kvm_vcpu_write_guest() when writing the
"launch state" as part of VMCLEAR emulation, and add a comment to call
out that KVM's behavior is architecturally valid. Intel's pseudocode
effectively says that VMCLEAR is a nop if the target VMCS address isn't
in memory, e.g. if the address points at MMIO.
Add a FIXME to call out that suppressing failures on __copy_to_user() is
wrong, as memory (a memslot) does exist in that case. Punt the issue to
the future as open coding kvm_vcpu_write_guest() just to make sure the
guest dies with -EFAULT isn't worth the extra complexity. The flaw will
need to be addressed if KVM ever does something intelligent on uaccess
failures, e.g. to support post-copy demand paging, but in that case KVM
will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open
code a core KVM helper.
No functional change intended.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527765 ("Error handling issues")
Fixes: 587d7e72ae ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220154224.526568-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zero out the valid_bank_mask when using the fast variant of
HVCALL_SEND_IPI_EX to send IPIs to all vCPUs. KVM requires the "var_cnt"
and "valid_bank_mask" inputs to be consistent even when targeting all
vCPUs. See commit bd1ba5732b ("KVM: x86: Get the number of Hyper-V
sparse banks from the VARHEAD field").
Fixes: 998489245d ("KVM: selftests: Hyper-V PV IPI selftest")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221219220416.395329-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a sanity check in kvm_handle_memory_failure() to assert that a valid
x86_exception structure is provided if the memory "failure" wants to
propagate a fault into the guest. If a memory failure happens during a
direct guest physical memory access, e.g. for nested VMX, KVM hardcodes
the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer
(because the exception struct would just be filled with garbage).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220153427.514032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_apic_hw_enabled() only needs to return bool, there is no place
to use the return value of MSR_IA32_APICBASE_ENABLE.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables
are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM
halves' and increase 'data_offset' otherwise. Coverity discovered, that in
one case both variables are incremented unconditionally. This doesn't seem
to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves'
data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also
takes into account 'hc->fast' but is still worth fixing.
To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to
'struct kvm_hv_hcall' as a union and use at call sites. This allows to
remove explicit 'data_offset'/'consumed_xmm_halves' parameters from
kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries()
helpers.
Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and
is not zeroed, consumers are supposed to initialize the appropriate field
if needed.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527764 ("Uninitialized variables")
Fixes: 260970862c ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208102700.959630-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When scanning userspace I/OAPIC entries, intercept EOI for level-triggered
IRQs if the current vCPU has a pending and/or in-service IRQ for the
vector in its local API, even if the vCPU doesn't match the new entry's
destination. This fixes a race between userspace I/OAPIC reconfiguration
and IRQ delivery that results in the vector's bit being left set in the
remote IRR due to the eventual EOI not being forwarded to the userspace
I/OAPIC.
Commit 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC
reconfigure race") fixed the in-kernel IOAPIC, but not the userspace
IOAPIC configuration, which has a similar race.
Fixes: 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208094415.12723-1-attofari@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current vPMU can reuse the same pmc->perf_event for the same
hardware event via pmc_pause/resume_counter(), but this optimization
does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1,
in_tx_cp=1"), where event->attr.sample_period is legally zero at creation,
thus making the perf call to perf_event_period() meaningless (no need to
adjust sample period in this case), and instead causing such reusable
perf_events to be repeatedly released and created.
Avoid releasing zero sample_period events by checking is_sampling_event()
to follow the previously enable/disable optimization.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20221207071506.15733-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mostly small bug fixes and small updates. The only things of note is
a qla2xxx fix for crash on hotplug and timeout and the addition of a
user exposed abstraction layer for persistent reservation error return
handling (which necessitates the conversion of nvme.c as well as
SCSI).
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCY6SZISYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQkdAP9Juri0
ihkyA9tVx1ZslVOp8V8mWK3P2VROA4ArvcMRVwD/Qxf2REP8Fx2GIgC0sNaRedg3
+ncveg3EpZ1n/NXXeDw=
=q+XO
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"Mostly small bug fixes and small updates.
The only things of note is a qla2xxx fix for crash on hotplug and
timeout and the addition of a user exposed abstraction layer for
persistent reservation error return handling (which necessitates the
conversion of nvme.c as well as SCSI)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix crash when I/O abort times out
nvme: Convert NVMe errors to PR errors
scsi: sd: Convert SCSI errors to PR errors
scsi: core: Rename status_byte to sg_status_byte
block: Add error codes for common PR failures
scsi: sd: sd_zbc: Trace zone append emulation
scsi: libfc: Include the correct header
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmOkQmcACgkQ+7dXa6fL
C2vjNg/8CWHpUQj32SSASt5uQvndqBe3xyr+NPYRdNddcu/gS82UoMSuOdMh+afb
OhZ/yrkWzkraJVMgEc2mbe0xfGEN9TRQnld+/oy5Co2dxlLAtA/Iw3xZKKG5V5J1
CVE8V2SUtPC0ycJ4XLNuwfmaTEGxZjKju832V4qWvT8oz299Xl4MTsu3zN+Rqpih
TAAfokfMVTN57x6PTd+KCl8dmExRyRIq70Iu9OwHPF9lFFDVqGlzGPYJ+gPqSKxV
B0F/sW6y1djuyL8wFuZn+W1ECf3DnA9Ol2cSP6qEsWrymQkjY/9tntN52Hu22y9x
xP6MHXKQXF+gjmX7aokivTTcOSw6/ript1ykcaNlz7ZX31mxKQIsb++jHSWshs6f
7Ncbjffqg+L8CmgVvaQ63dNVBvHa+Y+9Os8H0t8DZ0DoY6Crv+W8ssQkjW3Lqdoq
DIlOFRKEbeXO0+hTM00te3NhP8sYKGtjup8Xuv8TMqye2hE8DvBu80qdvISBmglP
P8odB7Rlwxp9n7jkBUFdc86IrQOHchao1Q7xNY4RDe/CZc6smNBgwf7aK5TONZkk
qQGmVk2Ca/rFNQxXAV/iHRFCPJtTcdOk7b6kYWHFVj0E+r0iYNeMD4+hKfQK6W4X
u4MzrmX9qm8+zN4e+FSpMU7OEDw2Yi87KmGrz4nbvK6wNW7o6Go=
=B9ez
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20221222' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs update from David Howells:
"A fix for a couple of missing resource counter decrements, two small
cleanups of now-unused bits of code and a patch to remove writepage
support from afs"
* tag 'afs-next-20221222' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Stop implementing ->writepage()
afs: remove afs_cache_netfs and afs_zap_permits() declarations
afs: remove variable nr_servers
afs: Fix lost servers_outstanding count
- Don't stop building perf if python setuptools isn't installed, just
disable the affected perf feature.
- Remove explicit reference to python 2.x devel files, that warning is
about python-devel, no matter what version, being unavailable and thus
disabling the linking with libpython.
- Don't use -Werror=switch-enum when building the python support that
handles libtraceevent enumerations, as there is no good way to test
if some specific enum entry is available with the libtraceevent
installed on the system.
- Introduce 'perf lock contention' --type-filter and --lock-filter, to
filter by lock type and lock name:
$ sudo ./perf lock record -a -- ./perf bench sched messaging
$ sudo ./perf lock contention -E 5 -Y spinlock
contended total wait max wait avg wait type caller
802 1.26 ms 11.73 us 1.58 us spinlock __wake_up_common_lock+0x62
13 787.16 us 105.44 us 60.55 us spinlock remove_wait_queue+0x14
12 612.96 us 78.70 us 51.08 us spinlock prepare_to_wait+0x27
114 340.68 us 12.61 us 2.99 us spinlock try_to_wake_up+0x1f5
83 226.38 us 9.15 us 2.73 us spinlock folio_lruvec_lock_irqsave+0x5e
$ sudo ./perf lock contention -l
contended total wait max wait avg wait address symbol
57 1.11 ms 42.83 us 19.54 us ffff9f4140059000
15 280.88 us 23.51 us 18.73 us ffffffff9d007a40 jiffies_lock
1 20.49 us 20.49 us 20.49 us ffffffff9d0d50c0 rcu_state
1 9.02 us 9.02 us 9.02 us ffff9f41759e9ba0
$ sudo ./perf lock contention -L jiffies_lock,rcu_state
contended total wait max wait avg wait type caller
15 280.88 us 23.51 us 18.73 us spinlock tick_sched_do_timer+0x93
1 20.49 us 20.49 us 20.49 us spinlock __softirqentry_text_start+0xeb
$ sudo ./perf lock contention -L ffff9f4140059000
contended total wait max wait avg wait type caller
38 779.40 us 42.83 us 20.51 us spinlock worker_thread+0x50
11 216.30 us 39.87 us 19.66 us spinlock queue_work_on+0x39
8 118.13 us 20.51 us 14.77 us spinlock kthread+0xe5
- Fix splitting CC into compiler and options when checking if a option
is present in clang to build the python binding, needed in systems
such as yocto that set CC to, e.g.: "gcc --sysroot=/a/b/c".
- Refresh metris and events for Intel systems: alderlake. alderlake-n,
bonnell, broadwell, broadwellde, broadwellx, cascadelakex,
elkhartlake, goldmont, goldmontplus, haswell, haswellx, icelake,
icelakex, ivybridge, ivytown, jaketown, knightslanding, meteorlake,
nehalemep, nehalemex, sandybridge, sapphirerapids, silvermont, skylake,
skylakex, snowridgex, tigerlake, westmereep-dp, westmereep-sp,
westmereex.
- Add vendor events files (JSON) for AMD Zen 4, from sections 2.1.15.4
"Core Performance Monitor Counters", 2.1.15.5 "L3 Cache Performance
Monitor Counter"s and Section 7.1 "Fabric Performance Monitor Counter
(PMC) Events" in the Processor Programming Reference (PPR) for AMD
Family 19h Model 11h Revision B1 processors.
This constitutes events which capture op dispatch, execution and
retirement, branch prediction, L1 and L2 cache activity, TLB activity,
L3 cache activity and data bandwidth for various links and interfaces in
the Data Fabric.
- Also, from the same PPR are metrics taken from Section 2.1.15.2
"Performance Measurement", including pipeline utilization, which are
new to Zen 4 processors and useful for finding performance bottlenecks
by analyzing activity at different stages of the pipeline.
- Greatly improve the 'srcline', 'srcline_from', 'srcline_to' and
'srcfile' sort keys performance by postponing calling the external
addr2line utility to the collapse phase of histogram bucketing.
- Fix 'perf test' "all PMU test" to skip parametrized events, that
requires setting up and are not supported by this test.
- Update tools/ copies of kernel headers: features, disabled-features,
fscrypt.h, i915_drm.h, msr-index.h, power pc syscall table and kvm.h.
- Add .DELETE_ON_ERROR special Makefile target to clean up partially
updated files on error.
- Simplify the mksyscalltbl script for arm64 by avoiding to run the host
compiler to create the syscall table, do it all just with the shell
script.
- Further fixes to honour quiet mode (-q).
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCY6SJ+gAKCRCyPKLppCJ+
J5JSAQCSokw2lsIqelDfoBfOQcMwah4ogW1vuO5KiepHgGOjuwD/d+65IxFIRA/h
tJjAtq4fReyi4u4eTc1aLgUwFh7V0ws=
=rneN
-----END PGP SIGNATURE-----
Merge tag 'perf-tools-for-v6.2-2-2022-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull more perf tools updates from Arnaldo Carvalho de Melo:
"perf tools fixes and improvements:
- Don't stop building perf if python setuptools isn't installed, just
disable the affected perf feature.
- Remove explicit reference to python 2.x devel files, that warning
is about python-devel, no matter what version, being unavailable
and thus disabling the linking with libpython.
- Don't use -Werror=switch-enum when building the python support that
handles libtraceevent enumerations, as there is no good way to test
if some specific enum entry is available with the libtraceevent
installed on the system.
- Introduce 'perf lock contention' --type-filter and --lock-filter,
to filter by lock type and lock name:
$ sudo ./perf lock record -a -- ./perf bench sched messaging
$ sudo ./perf lock contention -E 5 -Y spinlock
contended total wait max wait avg wait type caller
802 1.26 ms 11.73 us 1.58 us spinlock __wake_up_common_lock+0x62
13 787.16 us 105.44 us 60.55 us spinlock remove_wait_queue+0x14
12 612.96 us 78.70 us 51.08 us spinlock prepare_to_wait+0x27
114 340.68 us 12.61 us 2.99 us spinlock try_to_wake_up+0x1f5
83 226.38 us 9.15 us 2.73 us spinlock folio_lruvec_lock_irqsave+0x5e
$ sudo ./perf lock contention -l
contended total wait max wait avg wait address symbol
57 1.11 ms 42.83 us 19.54 us ffff9f4140059000
15 280.88 us 23.51 us 18.73 us ffffffff9d007a40 jiffies_lock
1 20.49 us 20.49 us 20.49 us ffffffff9d0d50c0 rcu_state
1 9.02 us 9.02 us 9.02 us ffff9f41759e9ba0
$ sudo ./perf lock contention -L jiffies_lock,rcu_state
contended total wait max wait avg wait type caller
15 280.88 us 23.51 us 18.73 us spinlock tick_sched_do_timer+0x93
1 20.49 us 20.49 us 20.49 us spinlock __softirqentry_text_start+0xeb
$ sudo ./perf lock contention -L ffff9f4140059000
contended total wait max wait avg wait type caller
38 779.40 us 42.83 us 20.51 us spinlock worker_thread+0x50
11 216.30 us 39.87 us 19.66 us spinlock queue_work_on+0x39
8 118.13 us 20.51 us 14.77 us spinlock kthread+0xe5
- Fix splitting CC into compiler and options when checking if a
option is present in clang to build the python binding, needed in
systems such as yocto that set CC to, e.g.: "gcc --sysroot=/a/b/c".
- Refresh metris and events for Intel systems: alderlake.
alderlake-n, bonnell, broadwell, broadwellde, broadwellx,
cascadelakex, elkhartlake, goldmont, goldmontplus, haswell,
haswellx, icelake, icelakex, ivybridge, ivytown, jaketown,
knightslanding, meteorlake, nehalemep, nehalemex, sandybridge,
sapphirerapids, silvermont, skylake, skylakex, snowridgex,
tigerlake, westmereep-dp, westmereep-sp, westmereex.
- Add vendor events files (JSON) for AMD Zen 4, from sections
2.1.15.4 "Core Performance Monitor Counters", 2.1.15.5 "L3 Cache
Performance Monitor Counter"s and Section 7.1 "Fabric Performance
Monitor Counter (PMC) Events" in the Processor Programming
Reference (PPR) for AMD Family 19h Model 11h Revision B1
processors.
This constitutes events which capture op dispatch, execution and
retirement, branch prediction, L1 and L2 cache activity, TLB
activity, L3 cache activity and data bandwidth for various links
and interfaces in the Data Fabric.
- Also, from the same PPR are metrics taken from Section 2.1.15.2
"Performance Measurement", including pipeline utilization, which
are new to Zen 4 processors and useful for finding performance
bottlenecks by analyzing activity at different stages of the
pipeline.
- Greatly improve the 'srcline', 'srcline_from', 'srcline_to' and
'srcfile' sort keys performance by postponing calling the external
addr2line utility to the collapse phase of histogram bucketing.
- Fix 'perf test' "all PMU test" to skip parametrized events, that
requires setting up and are not supported by this test.
- Update tools/ copies of kernel headers: features,
disabled-features, fscrypt.h, i915_drm.h, msr-index.h, power pc
syscall table and kvm.h.
- Add .DELETE_ON_ERROR special Makefile target to clean up partially
updated files on error.
- Simplify the mksyscalltbl script for arm64 by avoiding to run the
host compiler to create the syscall table, do it all just with the
shell script.
- Further fixes to honour quiet mode (-q)"
* tag 'perf-tools-for-v6.2-2-2022-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (67 commits)
perf python: Fix splitting CC into compiler and options
perf scripting python: Don't be strict at handling libtraceevent enumerations
perf arm64: Simplify mksyscalltbl
perf build: Remove explicit reference to python 2.x devel files
perf vendor events amd: Add Zen 4 mapping
perf vendor events amd: Add Zen 4 metrics
perf vendor events amd: Add Zen 4 uncore events
perf vendor events amd: Add Zen 4 core events
perf vendor events intel: Refresh westmereex events
perf vendor events intel: Refresh westmereep-sp events
perf vendor events intel: Refresh westmereep-dp events
perf vendor events intel: Refresh tigerlake metrics and events
perf vendor events intel: Refresh snowridgex events
perf vendor events intel: Refresh skylakex metrics and events
perf vendor events intel: Refresh skylake metrics and events
perf vendor events intel: Refresh silvermont events
perf vendor events intel: Refresh sapphirerapids metrics and events
perf vendor events intel: Refresh sandybridge metrics and events
perf vendor events intel: Refresh nehalemex events
perf vendor events intel: Refresh nehalemep events
...
After we introduced a module parameter and quirk infrastructure for
picking the Microsoft GUID over the SOC vendor GUID we discovered
that lots and lots of systems are getting this wrong.
The table continues to grow, and is becoming unwieldy.
We don't really have any benefit to forcing vendors to populate the
AMD GUID. This is just extra work, and more and more vendors seem
to mess it up. As the Microsoft GUID is used by Windows as well,
it's very likely that it won't be messed up like this.
So drop all the quirks forcing it and the Rembrandt behavior. This
means that Cezanne or later effectively only run the Microsoft GUID
codepath with the exception of HP Elitebook 8*5 G9.
Fixes: fd894f05cf ("ACPI: x86: s2idle: If a new AMD _HID is missing assume Rembrandt")
Cc: stable@vger.kernel.org # 6.1
Reported-by: Benjamin Cheng <ben@bcheng.me>
Reported-by: bilkow@tutanota.com
Reported-by: Paul <paul@zogpog.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2292
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216768
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Philipp Zabel <philipp.zabel@gmail.com>
Tested-by: Philipp Zabel <philipp.zabel@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
HP Elitebook 865 supports both the AMD GUID w/ _REV 2 and Microsoft
GUID with _REV 0. Both have very similar code but the AMD GUID
has a special workaround that is specific to a problem with
spurious wakeups on systems with Qualcomm WLAN.
This is believed to be a bug in the Qualcomm WLAN F/W (it doesn't
affect any other WLAN H/W). If this WLAN firmware is fixed this
quirk can be dropped.
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The apple-gmux driver only binds to old GMUX devices which have an
IORESOURCE_IO resource (using inb()/outb()) rather then memory-mapped
IO (IORESOURCE_MEM).
T2 MacBooks use the new style GMUX devices (with IORESOURCE_MEM access),
so these are not supported by the apple-gmux driver. This is not a problem
since they have working ACPI video backlight support.
But the apple_gmux_present() helper only checks if an ACPI device with
the "APP000B" HID is present, causing acpi_video_get_backlight_type()
to return acpi_backlight_apple_gmux disabling the acpi_video backlight
device.
Add a new apple_gmux_backlight_present() helper which checks that
the "APP000B" device actually is an old GMUX device with an IORESOURCE_IO
resource.
This fixes the acpi_video0 backlight no longer registering on T2 MacBooks.
Note people are working to add support for the new style GMUX to Linux:
https://github.com/kekrby/linux-t2/commits/wip/hybrid-graphics
Once this lands this patch should be reverted so that
acpi_video_get_backlight_type() also prefers the gmux on new style GMUX
MacBooks, but for now this is necessary to avoid regressing backlight
control on T2 Macs.
Fixes: 21245df307 ("ACPI: video: Add Apple GMUX brightness control detection")
Reported-and-tested-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Asus ExpertBook B2502 has the same keyboard issue as Asus Vivobook
K3402ZA/K3502ZA. The kernel overrides IRQ 1 to Edge_High when it
should be Active_Low.
This patch adds the ExpertBook B2502 model to the existing
quirk list of Asus laptops with this issue.
Fixes: b5f9223a10 ("ACPI: resource: Skip IRQ override on Asus Vivobook S5602ZA")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2142574
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit bfcdf58380 ("ACPI: resource: do IRQ override on LENOVO IdeaPad")
added an override for Lenovo IdeaPad 5 16ALC7. The 14ALC7 variant also
suffers from a broken touchscreen and trackpad.
Fixes: 9946e39fe8 ("ACPI: resource: skip IRQ override on AMD Zen platforms")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216804
Signed-off-by: Adrian Freund <adrian@freund.io>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Schenker XMG CORE 15 (M22) is Ryzen-6 based and needs IRQ overriding
for the keyboard to work. Adding an entry for this laptop to the
override_table makes the internal keyboard functional again.
Signed-off-by: Erik Schumacher <ofenfisch@googlemail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ACPI video detection code has a module parameter
`register_backlight_delay` which is currently configured to 8 seconds.
This means that if after 8 seconds of booting no native driver has created
a backlight device then the code will attempt to make an ACPI video
backlight device.
This was intended as a safety mechanism with the backlight overhaul that
occurred in kernel 6.1, but as it doesn't appear necesssary set it to be
disabled by default.
Suggested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On desktop APUs amdgpu doesn't create a native backlight device
as no eDP panels are found. However if the BIOS has reported
backlight control methods in the ACPI tables then an acpi_video0
backlight device will be made 8 seconds after boot.
This has manifested in a power slider on a number of desktop APUs
ranging from Ryzen 5000 through Ryzen 7000 on various motherboard
manufacturers. To avoid this, report to the acpi video detection
that the system does not have any panel connected in the native
driver.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783786
Reported-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current logic for the ACPI backlight detection will create
a backlight device if no native or vendor drivers have created
8 seconds after the system has booted if the ACPI tables
included backlight control methods.
If the GPU drivers have loaded, they may be able to report whether
any LCD panels were found. Allow using this information to factor
in whether to enable the fallback logic for making an acpi_video0
backlight device.
Suggested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, we shut down the filecache before trying to clean up the
stateids that depend on it. This leads to the kernel trying to free an
nfsd_file twice, and a refcount overput on the nf_mark.
Change the shutdown procedure to tear down all of the stateids prior
to shutting down the filecache.
Reported-and-tested-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Fixes: 5e113224c1 ("nfsd: nfsd_file cache entries should be per net namespace")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Noticed this build failure on archlinux:base when building with clang:
clang-14: error: optimization flag '-ffat-lto-objects' is not supported [-Werror,-Wignored-optimization-argument]
In tools/perf/util/setup.py we check if clang supports that option, but
since commit 3cad53a6f9 ("perf python: Account for multiple words
in CC") this got broken as in the common case where CC="clang":
>>> cc="clang"
>>> print(cc.split()[0])
clang
>>> option="-ffat-lto-objects"
>>> print(str(cc.split()[1:]) + option)
[]-ffat-lto-objects
>>>
And then the Popen will call clang with that bogus option name that in
turn will not produce the b"unknown argument" or b"is not supported"
that this function uses to detect if the option is not available and
thus later on clang will be called with an unknown/unsupported option.
Fix it by looking if really there are options in the provided CC
variable, and if so override 'cc' with the first token and append the
options to the 'option' variable.
Fixes: 3cad53a6f9 ("perf python: Account for multiple words in CC")
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Keeping <john@metanate.com>
Cc: Khem Raj <raj.khem@gmail.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Link: http://lore.kernel.org/lkml/Y6Rq5F5NI0v1QQHM@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We're trying to get rid of the ->writepage() hook[1]. Stop afs from using
it by unlocking the page and calling afs_writepages_region() rather than
folio_write_one().
A flag is passed to afs_writepages_region() to indicate that it should only
write a single region so that we don't flush the entire file in
->write_begin(), but do add other dirty data to the region being written to
try and reduce the number of RPC ops.
This requires ->migrate_folio() to be implemented, so point that at
filemap_migrate_folio() for files and also for symlinks and directories.
This can be tested by turning on the afs_folio_dirty tracepoint and then
doing something like:
xfs_io -c "w 2223 7000" -c "w 15000 22222" -c "w 23 7" /afs/my/test/foo
and then looking in the trace to see if the write at position 15000 gets
stored before page 0 gets dirtied for the write at position 23.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20221113162902.883850-1-hch@lst.de/ [1]
Link: https://lore.kernel.org/r/166876785552.222254.4403222906022558715.stgit@warthog.procyon.org.uk/ # v1
afs_zap_permits() has been removed since
commit be080a6f43 ("afs: Overhaul permit caching").
afs_cache_netfs has been removed since
commit 523d27cda1 ("afs: Convert afs to use the new fscache API").
so remove the declare for them from header file.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20220909070353.1160228-1-cuigaosheng1@huawei.com/
The afs_fs_probe_dispatcher() work function is passed a count on
net->servers_outstanding when it is scheduled (which may come via its
timer). This is passed back to the work_item, passed to the timer or
dropped at the end of the dispatcher function.
But, at the top of the dispatcher function, there are two checks which
skip the rest of the function: if the network namespace is being destroyed
or if there are no fileservers to probe. These two return paths, however,
do not drop the count passed to the dispatcher, and so, sometimes, the
destruction of a network namespace, such as induced by rmmod of the kafs
module, may get stuck in afs_purge_servers(), waiting for
net->servers_outstanding to become zero.
Fix this by adding the missing decrements in afs_fs_probe_dispatcher().
Fixes: f6cbb368bc ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/167164544917.2072364.3759519569649459359.stgit@warthog.procyon.org.uk/
Some more small fixes and board quirks that came in since my last
update, the main one being the fixes from Kai for issues around the
attempts to get kexec working well on SOF based systems.
-----BEGIN PGP SIGNATURE-----
iQEyBAABCgAdFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAmOhn2kACgkQJNaLcl1U
h9Dfdgf47os8jUAaEuV3/pFl7OOh+L2jR2P5yCK60VHu0CfuHo3lynwpYvS/8wKN
XqYz0eeuYOWpFeZ12wZBY/Dnk2dwkXiqpv7e0ID0szAH9TezSlQ3MRno9hwWGloU
w3ntU5VeIYTKl91E2y5X9GMoDsfnfh751MsjXOcP40npjGEJpOtAO0z1sIXANSKz
ftceXGapvTokSp7mbk68BM5ivom4TM3eDSlQiOeMj2OeOhXRylx5tHeQV3FzVeB+
4K7bECzveDn/hTYBX2Lopn2stR1RF5S9HDynjo83YDKXOKUp8bJfEHK7R/3y3u56
eIwKgMmxb2eK2IwgjU/7sKP87ARz
=OaSY
-----END PGP SIGNATURE-----
Merge tag 'asoc-v6.2-3' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Updates for v6.2
Some more small fixes and board quirks that came in since my last
update, the main one being the fixes from Kai for issues around the
attempts to get kexec working well on SOF based systems.
It seems that the firmware is broken and does not accept
the UAC_EP_CS_ATTR_SAMPLE_RATE URB. There is only one rate (48000Hz)
available in the descriptors for the output endpoint.
Create a new quirk QUIRK_FLAG_FIXED_RATE to skip the rate setup
when only one rate is available (fixed).
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216798
Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20221215153037.1163786-1-perex@perex.cz
Signed-off-by: Takashi Iwai <tiwai@suse.de>
- Make monitor structures read only
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY6J+vxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qohJAP9Yx3A4xmopkMjpfK1HBzuB7j4U7blN
2NhqKM626unbeQEAi3FhPRc5N/sGBdsUClYZIKau0p3ip1TVfYbhk8vSgwg=
=VcGm
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"I missed this minor hardening of the kernel in the first pull.
- Make monitor structures read only"
* tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv/monitors: Move monitor structure in rodata
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5yAHxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoWoAP9ZLmqgIqlH3Zcms31SR250kLXxsxT3
JHe82hiuI1I3fAD/Z93QLHw9wngLqIMx/wXsdFjTNOGGWdxfclSWI2qI6Q0=
=KaJg
-----END PGP SIGNATURE-----
Merge tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace probes updates from Steven Rostedt:
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
* tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Reject symbol/symstr type for uprobe
tracing/probes: Add symstr type for dynamic events
kprobes: kretprobe events missing on 2-core KVM guest
kprobes: Fix check for probe enabled in kill_kprobe()
test_kprobes: Fix implicit declaration error of test_kprobes
tracing: Fix race where eprobes can be called before the event
* Allow unloading KVM module
* Allow KVM user-space to set mvendorid, marchid, and mimpid
* Several fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOhy+QUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOdUwf+K3i8RHW1H8TF/JSrn1I6nURNLYhb
2wXzl3esOsfswtn6dxEvLEXivcKmD2G9bLpa2UIa3vw1Plg9tdce9IJ5qDodtxVL
mlISMUSgMNy+lelKJiG+l5Ld4oJ4HUY0yw/p3Ml9WUpra98UCB0sJ+FsqXr4ndi9
LxkQJrNyZkQcRH2IXjQhKjdjkepFTmkhKs/uCxAZvW9zfUmGX0dcp9W22PTbsapQ
IcaBKdVaNN3TXNSIdDCM2Iv+oBN7gJn1CbgFxhkp4L8eE5PvRjFw0QooFMn2TjDw
VflP3gIs/41+5tnoPWXGAkKFe/Z5aJjGjx6Yx0WnEEgoAG47RUHYsKIUjw==
=8ejV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull RISC-V kvm updates from Paolo Bonzini:
- Allow unloading KVM module
- Allow KVM user-space to set mvendorid, marchid, and mimpid
- Several fixes and cleanups
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
RISC-V: KVM: Add ONE_REG interface for mvendorid, marchid, and mimpid
RISC-V: KVM: Save mvendorid, marchid, and mimpid when creating VCPU
RISC-V: Export sbi_get_mvendorid() and friends
RISC-V: KVM: Move sbi related struct and functions to kvm_vcpu_sbi.h
RISC-V: KVM: Use switch-case in kvm_riscv_vcpu_set/get_reg()
RISC-V: KVM: Remove redundant includes of asm/csr.h
RISC-V: KVM: Remove redundant includes of asm/kvm_vcpu_timer.h
RISC-V: KVM: Fix reg_val check in kvm_riscv_vcpu_set_reg_config()
RISC-V: KVM: Simplify kvm_arch_prepare_memory_region()
RISC-V: KVM: Exit run-loop immediately if xfer_to_guest fails
RISC-V: KVM: use vma_lookup() instead of find_vma_intersection()
RISC-V: KVM: Add exit logic to main.c
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOgp5AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm5SD/9tduSZQW00aDm83HbEikWdCgQm0w37tyYl
C2+IwRwLF8pnAoSb6yaO7LZM9ZUYfoIfIlkHXkKhT1xNJ/XdeGDgwjOHi106iaEx
kG08DcFnUjyJ4Yh6hnnpnSepIo0ckwa18pSaE4smvmKZirj3it3O6xSspyBxtUcv
q6PvJDMN15aG6uLHq3xNZPzoI2KYXBDgwanyImRhdvLoOTiS9rok+F9e2ob3lzAa
PB+FOipQoKb7M6jbyfZe4KbeTiJh4EYEl5Qa6ebrDIkOTm7zjc8sQbCkNeI7osh+
D0FvEQ1Vsrjj5Bp6N9CmZcrmNagjEcAPbzguxAilrgw2/XvA8d0fymziGXvuyUEv
bSAx6lyJzfMLrvtubSqMhIF+8DlccQnnXz2ccacwvAfayytzNJjC9serU+czHA4O
ZkPTwZFjAmbn6q6SK3qaOCB9IgITHipj8R/ncGu9KjNvM2QgzM+OIrP0xGxtk6uI
ZGrt9nGMUmgjtaliQjiDVZomMewru1lRWPRAjfQ995gmVkejgapUHYoaDtDzaLKZ
Q9BaK5CC2jltGUuuoFEnXnwu/Eyvp9y++pKkz4Esb+/Wkst4qyGtr9DOSTnv1wKN
W20h3Z5vOAXXquvUJ5S3mQl8TNJHiBz+/CRB9PZG8XFtn8ubGo8XttGdgjQgyLM3
6FHzcZgeWw==
=TSec
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Various fixes for BFQ (Yu, Yuwei)
- Fix for loop command line parsing (Isaac)
- No need to specifically clear REQ_ALLOC_CACHE on IOPOLL downgrade
anymore (me)
- blk-iocost enum fix for newer gcc (Jiri)
- UAF fix for queue release (Ming)
- blk-iolatency error handling memory leak fix (Tejun)
* tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux:
block: don't clear REQ_ALLOC_CACHE for non-polled requests
block: fix use-after-free of q->q_usage_counter
block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHED
blk-iolatency: Fix memory leak on add_disk() failures
loop: Fix the max_loop commandline argument treatment when it is set to 0
block/blk-iocost (gcc13): keep large values in a new enum
block, bfq: replace 0/1 with false/true in bic apis
block, bfq: don't return bfqg from __bfq_bic_change_cgroup()
block, bfq: fix possible uaf for 'bfqq->bic'
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOgp3oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvjeD/4w17ERLignAko51qJFS+lcpjEWFYk63XZN
tFaZqGOscH9PertlQu5IstORa/OWY2iCzhi2waMvtHAI9YaT7jpxkgrUdfEoGyNL
6Ij5DIqnlIkZG+cUBXKq+xLhThssJECkqckcVPgtZIbCZzDAL/ffghH94sZY/LxA
+cwsloA24s0hjZX3Cm/RNQIgEBf2g4HNNA09Ft3Idd9tSL0WndqcHTasEGAC8K+Z
r9ZFsKCSVKB+6wUCYawO5xF+zfm5wA4sD1PXjVA1q++mwDm8BKmpcBG10v3grJ24
qzh+k8wMeiD7BJLDekEyWvklV7bIpbMZ2dzkdMI7n0Cs6WRumhLo+Enrh1l5l9YJ
wizqwWykGjWWyLm9QP5R249o7n/T6q7jKqmsBzN+3wWYasq4W7PIjr4hQZ2hqiAW
pUdaqvb0V91OQHjDHi4wI4xZnsmgr6eJhDz0JAd5wzc+g2Uav8GWs6wEaW++4HkL
IHWggX51oF3Mzjo+Lx0pfs3dkcA5vQ85KDcICeLXnv6HPm90ImZY4cTdSW+YYzlK
351GwaPOTepm4M+hLZHZYVj5pTQPIKAspxwbSNcYlQ4nLhVPfcKkQGZ6di4yHZaC
j8zT1opSmh4OqKA9mE/tCUf3s3e2YDmemDuyUKD56luAIw+rsxScC6HEPJPBxrmm
hZfEkgw/vQ==
=Jq/9
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.2-2022-12-19' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Improve the locking for timeouts. This was originally queued up for
the initial pull, but I messed up and it got missed. (Pavel)
- Fix an issue with running task_work from the wait path, causing some
inefficiencies (me)
- Add a clear of ->free_iov upfront in the 32-bit compat data
importing, so we ensure that it's always sane at completion time (me)
- Use call_rcu_hurry() for the eventfd signaling (Dylan)
- Ordering fix for multishot recv completions (Pavel)
- Add the io_uring trace header to the MAINTAINERS entry (Ammar)
* tag 'io_uring-6.2-2022-12-19' of git://git.kernel.dk/linux:
MAINTAINERS: io_uring: Add include/trace/events/io_uring.h
io_uring/net: fix cleanup after recycle
io_uring/net: ensure compat import handlers clear free_iov
io_uring: include task_work run after scheduling in wait for events
io_uring: don't use TIF_NOTIFY_SIGNAL to test for availability of task_work
io_uring: use call_rcu_hurry if signaling an eventfd
io_uring: fix overflow handling regression
io_uring: ease timeout flush locking requirements
io_uring: revise completion_lock locking
io_uring: protect cq_timeouts with timeout_lock
In GCC version 12.1 a checksum field was added.
This patch fixes a kernel crash occurring during boot when using
gcov-kernel with GCC version 12.2. The crash occurred on a system running
on i.MX6SX.
Link: https://lkml.kernel.org/r/20221220102318.3418501-1-rickaran@axis.com
Fixes: 977ef30a7d ("gcov: support GCC 12.1 and newer compilers")
Signed-off-by: Rickard x Andersson <rickaran@axis.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Martin Liska <mliska@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a test to the maple tree test suite for the spanning rebalance
insufficient node issue does not go undetected again.
Link: https://lkml.kernel.org/r/20221219161922.2708732-3-Liam.Howlett@oracle.com
Fixes: 54a611b605 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport contacted me off-list with a regression in running criu.
Periodic tests fail with an RCU stall during execution. Although rare, it
is possible to hit this with other uses so this patch should be backported
to fix the regression.
This patchset adds the fix and a test case to the maple tree test
suite.
This patch (of 2):
An insufficient node was causing an out-of-bounds access on the node in
mas_leaf_max_gap(). The cause was the faulty detection of the new node
being a root node when overwriting many entries at the end of the tree.
Fix the detection of a new root and ensure there is sufficient data prior
to entering the spanning rebalance loop.
Link: https://lkml.kernel.org/r/20221219161922.2708732-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20221219161922.2708732-2-Liam.Howlett@oracle.com
Fixes: 54a611b605 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Mike Rapoport <rppt@kernel.org>
Tested-by: Mike Rapoport <rppt@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit bbff39cc6c ("hugetlb: allocate vma lock for all sharable vmas")
removed the pmd sharable checks in the vma lock helper routines. However,
it left the functional version of helper routines behind #ifdef
CONFIG_ARCH_WANT_HUGE_PMD_SHARE. Therefore, the vma lock is not being
used for sharable vmas on architectures that do not support pmd sharing.
On these architectures, a potential fault/truncation race is exposed that
could leave pages in a hugetlb file past i_size until the file is removed.
Move the functional vma lock helpers outside the ifdef, and remove the
non-functional stubs. Since the vma lock is not just for pmd sharing,
rename the routine __vma_shareable_flags_pmd.
Link: https://lkml.kernel.org/r/20221212235042.178355-1-mike.kravetz@oracle.com
Fixes: bbff39cc6c ("hugetlb: allocate vma lock for all sharable vmas")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
USB support can be in a loadable module, and this causes a link failure
with KMSAN:
ERROR: modpost: "kmsan_handle_urb" [drivers/usb/core/usbcore.ko] undefined!
Export the symbol so it can be used by this module.
Link: https://lkml.kernel.org/r/20221215162710.3802378-1-arnd@kernel.org
Fixes: 553a80188a ("kmsan: handle memory sent to/from USB")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is needed for the vmap/vunmap declarations:
mm/kmsan/kmsan_test.c:316:9: error: implicit declaration of function 'vmap' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
^
mm/kmsan/kmsan_test.c:316:29: error: use of undeclared identifier 'VM_MAP'
vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
^
mm/kmsan/kmsan_test.c:322:3: error: implicit declaration of function 'vunmap' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
vunmap(vbuf);
^
Link: https://lkml.kernel.org/r/20221215163046.4079767-1-arnd@kernel.org
Fixes: 8ed691b02a ("kmsan: add tests for KMSAN")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When encountering any vma in the range with policy other than MPOL_BIND or
MPOL_PREFERRED_MANY, an error is returned without issuing a mpol_put on
the policy just allocated with mpol_dup().
This allows arbitrary users to leak kernel memory.
Link: https://lkml.kernel.org/r/20221215194621.202816-1-mathieu.desnoyers@efficios.com
Fixes: c6018b4b25 ("mm/mempolicy: add set_mempolicy_home_node syscall")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since 6.1 we have noticed random rpm install failures that were tracked to
mremap() returning -ENOMEM and to commit ca3d76b0aa ("mm: add merging
after mremap resize").
The problem occurs when mremap() expands a VMA in place, but using an
starting address that's not vma->vm_start, but somewhere in the middle.
The extension_pgoff calculation introduced by the commit is wrong in that
case, so vma_merge() fails due to pgoffs not being compatible. Fix the
calculation.
By the way it seems that the situations, where rpm now expands a vma from
the middle, were made possible also due to that commit, thanks to the
improved vma merging. Yet it should work just fine, except for the buggy
calculation.
Link: https://lkml.kernel.org/r/20221216163227.24648-1-vbabka@suse.cz
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359
Fixes: ca3d76b0aa ("mm: add merging after mremap resize")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jakub Matěna <matenajakub@gmail.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>