Граф коммитов

39673 Коммитов

Автор SHA1 Сообщение Дата
Thomas Gleixner b579d0c375 x86/fpu: Make os_xrstor_booting() private
It's only required in the xstate init code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.455836597@linutronix.de
2021-10-20 15:27:28 +02:00
Thomas Gleixner d06241f52c x86/fpu: Clean up CPU feature tests
Further disintegration of internal.h:

Move the CPU feature tests to a core header and remove the unused one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.401510559@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner 63e81807c1 x86/fpu: Move context switch and exit to user inlines into sched.h
internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.

Move the context switch and exit to user inlines into a separate header,
which is all that code needs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner 9603445549 x86/fpu: Mark fpu__init_prepare_fx_sw_frame() as __init
No need to keep it around.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.296435736@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner ca834defd3 x86/fpu: Rework copy_xstate_to_uabi_buf()
Prepare for replacing the KVM copy xstate to user function by extending
copy_xstate_to_uabi_buf() with a pkru argument which allows the caller to
hand in the pkru value, which is required for KVM because the guest PKRU is
not accessible via current. Fixup all callsites accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.191902137@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner ea4d6938d4 x86/fpu: Replace KVMs home brewed FPU copy from user
Copying a user space buffer to the memory buffer is already available in
the FPU core. The copy mechanism in KVM lacks sanity checks and needs to
use cpuid() to lookup the offset of each component, while the FPU core has
this information cached.

Make the FPU core variant accessible for KVM and replace the home brewed
mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.134065207@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner a0ff0611c2 x86/fpu: Move KVMs FPU swapping to FPU core
Swapping the host/guest FPU is directly fiddling with FPU internals which
requires 5 exports. The upcoming support of dynamically enabled states
would even need more.

Implement a swap function in the FPU core code and export that instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner 63cf05a19a x86/fpu/xstate: Mark all init only functions __init
No point to keep them around after boot.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.017919252@linutronix.de
2021-10-20 15:27:27 +02:00
Thomas Gleixner ffd3e504c9 x86/fpu/xstate: Provide and use for_each_xfeature()
These loops evaluating xfeature bits are really hard to read. Create an
iterator and use for_each_set_bit_from() inside which already does the right
thing.

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.958107505@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner 126fe04018 x86/fpu: Cleanup xstate xcomp_bv initialization
No point in having this duplicated all over the place with needlessly
different defines.

Provide a proper initialization function which initializes user buffers
properly and make KVM use it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.897664678@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner 509e7a30cd x86/fpu: Do not inherit FPU context for kernel and IO worker threads
There is no reason why kernel and IO worker threads need a full clone of
the parent's FPU state. Both are kernel threads which are not supposed to
use FPU. So copying a large state or doing XSAVE() is pointless. Just clean
out the minimally required state for those tasks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.839822981@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner 2d16a1876f x86/process: Clone FPU in copy_thread()
There is no reason to clone FPU in arch_dup_task_struct(). Quite the
contrary - it prevents optimizations. Move it to copy_thread().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.780714235@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner 01f9f62d3a x86/fpu: Remove pointless memset in fpu_clone()
Zeroing the forked task's FPU registers buffer to avoid leaking init
optimized stale data into the clone is a pointless exercise for the case
where the current task has TIF_NEED_FPU_LOAD set. In that case, the FPU
registers state is copied from current's FPU register buffer which can
contain stale init optimized data as well.

The alledged information leak is non-existant because this stale init
optimized data is used nowhere and cannot leak anywhere.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.722854569@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner dc2f39fd1b x86/fpu: Cleanup the on_boot_cpu clutter
Defensive programming is useful, but this on_boot_cpu debug is really
silly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.665080855@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner f5daf836f2 x86/fpu: Restrict xsaves()/xrstors() to independent states
These interfaces are really only valid for features which are independently
managed and not part of the task context state for various reasons.

Tighten the checks and adjust the misleading comments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.608492174@linutronix.de
2021-10-20 15:27:26 +02:00
Thomas Gleixner b50854eca0 x86/pkru: Remove useless include
PKRU code does not need anything from FPU headers. Include cpufeature.h
instead and fixup the resulting fallout in perf.

This is a preparation for FPU changes in order to prevent recursive include
hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.551522694@linutronix.de
2021-10-20 15:27:25 +02:00
Thomas Gleixner d2d926482c x86/fpu: Update stale comments
copy_fpstate_to_sigframe() does not have a slow path anymore. Neither does
the !ia32 restore in __fpu_restore_sig().

Update the comments accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.493570236@linutronix.de
2021-10-20 15:27:25 +02:00
Thomas Gleixner 9568bfb4f0 x86/fpu: Remove pointless argument from switch_fpu_finish()
Unused since the FPU switching rework.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.433135710@linutronix.de
2021-10-20 15:27:25 +02:00
Steven Rostedt (VMware) 8646698aef x86/ftrace: Remove fault protection code in prepare_ftrace_return
Removing the fault protection code when writing return_hooker
to stack. As Steven noted:

> That protection was there from the beginning due to being "paranoid",
> considering ftrace was bricking network cards. But that protection
> would not have even protected against that.

Link: https://lkml.kernel.org/r/20211008091336.33616-3-jolsa@kernel.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-19 20:33:21 -04:00
Jiri Olsa 1e85010e17 x86/ftrace: Remove extra orig rax move
There's identical move 2 lines earlier.

Link: https://lkml.kernel.org/r/20211008091336.33616-2-jolsa@kernel.org

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-19 20:33:21 -04:00
Tom Lendacky e7d445ab26 x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
When runtime support for converting between 4-level and 5-level pagetables
was added to the kernel, the SME code that built pagetables was updated
to use the pagetable functions, e.g. p4d_offset(), etc., in order to
simplify the code. However, the use of the pagetable functions in early
boot code requires the use of the USE_EARLY_PGTABLE_L5 #define in order to
ensure that the proper definition of pgtable_l5_enabled() is used.

Without the #define, pgtable_l5_enabled() is #defined as
cpu_feature_enabled(X86_FEATURE_LA57). In early boot, the CPU features
have not yet been discovered and populated, so pgtable_l5_enabled() will
return false even when 5-level paging is enabled. This causes the SME code
to always build 4-level pagetables to perform the in-place encryption.
If 5-level paging is enabled, switching to the SME pagetables results in
a page-fault that kills the boot.

Adding the #define results in pgtable_l5_enabled() using the
__pgtable_l5_enabled variable set in early boot and the SME code building
pagetables for the proper paging level.

Fixes: aad983913d ("x86/mm/encrypt: Simplify sme_populate_pgd() and sme_populate_pgd_large()")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> # 4.18.x
Link: https://lkml.kernel.org/r/2cb8329655f5c753905812d951e212022a480475.1634318656.git.thomas.lendacky@amd.com
2021-10-19 14:07:17 +02:00
Borislav Petkov c688bd5dc9 x86/sev: Carve out HV call's return value verification
Carve out the verification of the HV call return value into a separate
helper and make it more readable.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/YVbYWz%2B8J7iMTJjc@zn.tnic
2021-10-19 13:54:47 +02:00
Paolo Bonzini 9f1ee7b169 KVM: SEV-ES: reduce ghcb_sa_len to 32 bits
The size of the GHCB scratch area is limited to 16 KiB (GHCB_SCRATCH_AREA_LIMIT),
so there is no need for it to be a u64.  This fixes a build error on 32-bit
systems:

i686-linux-gnu-ld: arch/x86/kvm/svm/sev.o: in function `sev_es_string_io:
sev.c:(.text+0x110f): undefined reference to `__udivdi3'

Cc: stable@vger.kernel.org
Fixes: 019057bd73 ("KVM: SEV-ES: fix length of string I/O")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:19 -04:00
Hao Xiang d61863c66f KVM: VMX: Remove redundant handling of bus lock vmexit
Hardware may or may not set exit_reason.bus_lock_detected on BUS_LOCK
VM-Exits. Dealing with KVM_RUN_X86_BUS_LOCK in handle_bus_lock_vmexit
could be redundant when exit_reason.basic is EXIT_REASON_BUS_LOCK.

We can remove redundant handling of bus lock vmexit. Unconditionally Set
exit_reason.bus_lock_detected in handle_bus_lock_vmexit(), and deal with
KVM_RUN_X86_BUS_LOCK only in vmx_handle_exit().

Signed-off-by: Hao Xiang <hao.xiang@linux.alibaba.com>
Message-Id: <1634299161-30101-1-git-send-email-hao.xiang@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:19 -04:00
Sean Christopherson 9139a7a645 KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload
WARN if the static keys used to track if any vCPU has disabled its APIC
are left elevated at module exit.  Unlike the underflow case, nothing in
the static key infrastructure will complain if a key is left elevated,
and because an elevated key only affects performance, nothing in KVM will
fail if either key is improperly incremented.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211013003554.47705-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:18 -04:00
Sean Christopherson f7d8a19f9a Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET"
Revert a change to open code bits of kvm_lapic_set_base() when emulating
APIC RESET to fix an apic_hw_disabled underflow bug due to arch.apic_base
and apic_hw_disabled being unsyncrhonized when the APIC is created.  If
kvm_arch_vcpu_create() fails after creating the APIC, kvm_free_lapic()
will see the initialized-to-zero vcpu->arch.apic_base and decrement
apic_hw_disabled without KVM ever having incremented apic_hw_disabled.

Using kvm_lapic_set_base() in kvm_lapic_reset() is also desirable for a
potential future where KVM supports RESET outside of vCPU creation, in
which case all the side effects of kvm_lapic_set_base() are needed, e.g.
to handle the transition from x2APIC => xAPIC.

Alternatively, KVM could temporarily increment apic_hw_disabled (and call
kvm_lapic_set_base() at RESET), but that's a waste of cycles and would
impact the performance of other vCPUs and VMs.  The other subtle side
effect is that updating the xAPIC ID needs to be done at RESET regardless
of whether the APIC was previously enabled, i.e. kvm_lapic_reset() needs
an explicit call to kvm_apic_set_xapic_id() regardless of whether or not
kvm_lapic_set_base() also performs the update.  That makes stuffing the
enable bit at vCPU creation slightly more palatable, as doing so affects
only the apic_hw_disabled key.

Opportunistically tweak the comment to explicitly call out the connection
between vcpu->arch.apic_base and apic_hw_disabled, and add a comment to
call out the need to always do kvm_apic_set_xapic_id() at RESET.

Underflow scenario:

  kvm_vm_ioctl() {
    kvm_vm_ioctl_create_vcpu() {
      kvm_arch_vcpu_create() {
        if (something_went_wrong)
          goto fail_free_lapic;
        /* vcpu->arch.apic_base is initialized when something_went_wrong is false. */
        kvm_vcpu_reset() {
          kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) {
            vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
          }
        }
        return 0;
      fail_free_lapic:
        kvm_free_lapic() {
          /* vcpu->arch.apic_base is not yet initialized when something_went_wrong is true. */
          if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE))
            static_branch_slow_dec_deferred(&apic_hw_disabled); // <= underflow bug.
        }
        return r;
      }
    }
  }

This (mostly) reverts commit 421221234a.

Fixes: 421221234a ("KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET")
Reported-by: syzbot+9fc046ab2b0cf295a063@syzkaller.appspotmail.com
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211013003554.47705-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:18 -04:00
Paolo Bonzini fa13843d15 KVM: X86: fix lazy allocation of rmaps
If allocation of rmaps fails, but some of the pointers have already been written,
those pointers can be cleaned up when the memslot is freed, or even reused later
for another attempt at allocating the rmaps.  Therefore there is no need to
WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be
skipped lest KVM will overwrite the previous pointer and will indeed leak memory.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:17 -04:00
Peter Gonda baa1e5ca17 KVM: SEV-ES: Set guest_state_protected after VMSA update
The refactoring in commit bb18a67774 ("KVM: SEV: Acquire
vcpu mutex when updating VMSA") left behind the assignment to
svm->vcpu.arch.guest_state_protected; add it back.

Signed-off-by: Peter Gonda <pgonda@google.com>
[Delta between v2 and v3 of Peter's patch, which had already been
 committed; the commit message is my own. - Paolo]
Fixes: bb18a67774 ("KVM: SEV: Acquire vcpu mutex when updating VMSA")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:07:17 -04:00
Linus Torvalds 60ebc28b07 - Add Sapphire Rapids to the list of CPUs supporting the SMI count MSR
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFr9jUACgkQEsHwGGHe
 VUp23g//XVP10DoYN3kX+wNre2fUDVjwRQ15s0fpLkCyCCYqlSyfj0ijozmvefY1
 jvTkfsQUVPmFDsdh7n0GOrZJ02WKxZR9it5NVANc6uDI7uUf1oKRq/+JVF3eoU/s
 r5wZcgj53qr697gNi5Edg80P2QUeh/vBBFSI67b7q/+oymQuWjau4GHyLLWM8ekO
 7D8D0AKgYCvUluGfbC7Z78Zu9E1Kf5a0WyZofAmXu9sxwVkmNs7rrK/GWNnxp3NR
 H9LjJaaSMeOF43DtXC5O0Ag4TJpQiRIZNiysaljrqEPbqLwX/YZhjOeKzLZQHjvU
 8vq9p8IrXC4aQ4ZLivsaQBmmcihEJP2Es0YDj73SUlhC03Dxuq8opiFHehPm3mXA
 BZ5FqldwWHrqX1I1i45tvA56qi4XhGKcB8lquGTtZ7WGqMB1xqKqg3oW2T/ocD1r
 5TaNZ7O0c/1Uo1+OWX6aHETn2pUrk+e1Fn8rT/sUVjfdV+4+LJiNkpleS2n5nfdQ
 a7cX3XO9wC7LunA3okyhnM4e9ON7eXfdFCWFV8yZqiybskp9kvgHjLpYWxWn6L0B
 uDwOj0lgDQOOoh2focPryBxYFN9NYOibUXCGICFJlup9dowpCsMUOoWm8CYP8x2p
 j3YfMzX6WUEKwUOhwRj6Ugfn+fwmWR4ckeWT8ff1lHFxhA6gPMM=
 =yLQd
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Add Sapphire Rapids to the list of CPUs supporting the SMI count MSR

* tag 'perf_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/msr: Add Sapphire Rapids CPU support
2021-10-17 17:34:18 -10:00
Ingo Molnar 082f20b21d Merge branch 'x86/urgent' into x86/fpu, to resolve a conflict
Resolve the conflict between these commits:

   x86/fpu:      1193f408cd ("x86/fpu/signal: Change return type of __fpu_restore_sig() to boolean")

   x86/urgent:   d298b03506 ("x86/fpu: Restore the masking out of reserved MXCSR bits")
                 b2381acd3f ("x86/fpu: Mask out the invalid MXCSR bits properly")

 Conflicts:
        arch/x86/kernel/fpu/signal.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-10-16 15:17:46 +02:00
Borislav Petkov b2381acd3f x86/fpu: Mask out the invalid MXCSR bits properly
This is a fix for the fix (yeah, /facepalm).

The correct mask to use is not the negation of the MXCSR_MASK but the
actual mask which contains the supported bits in the MXCSR register.

Reported and debugged by Ville Syrjälä <ville.syrjala@linux.intel.com>

Fixes: d298b03506 ("x86/fpu: Restore the masking out of reserved MXCSR bits")
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ser Olmy <ser.olmy@protonmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/YWgYIYXLriayyezv@intel.com
2021-10-16 12:37:50 +02:00
Adrian Hunter 8b8ff8cc3b perf/x86: Add new event for AUX output counter index
PEBS-via-PT records contain a mask of applicable counters. To identify
which event belongs to which counter, a side-band event is needed. Until
now, there has been no side-band event, and consequently users were limited
to using a single event.

Add such a side-band event. Note the event is optimised to output only
when the counter index changes for an event. That works only so long as
all PEBS-via-PT events are scheduled together, which they are for a
recording session because they are in a single group.

Also no attribute bit is used to select the new event, so a new
kernel is not compatible with older perf tools.  The assumption
being that PEBS-via-PT is sufficiently esoteric that users will not
be troubled by this.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210907163903.11820-2-adrian.hunter@intel.com
2021-10-15 11:25:31 +02:00
Kan Liang 71920ea97d perf/x86/msr: Add Sapphire Rapids CPU support
SMI_COUNT MSR is supported on Sapphire Rapids CPU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1633551137-192083-1-git-send-email-kan.liang@linux.intel.com
2021-10-15 11:25:26 +02:00
Tim Chen 66558b730f sched: Add cluster scheduler level for x86
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.

To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a boost
to performance especially on medium load system on Jacobsville.  on a
Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
cores each, the benchmark number is as follow:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.

Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
2021-10-15 11:25:16 +02:00
Kees Cook 42a20f86dc sched: Add wrapper for get_wchan() to keep task blocked
Having a stable wchan means the process must be blocked and for it to
stay that way while performing stack unwinding.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
2021-10-15 11:25:14 +02:00
Qi Zheng bc9bbb8173 x86: Fix get_wchan() to support the ORC unwinder
Currently, the kernel CONFIG_UNWINDER_ORC option is enabled by default
on x86, but the implementation of get_wchan() is still based on the frame
pointer unwinder, so the /proc/<pid>/wchan usually returned 0 regardless
of whether the task <pid> is running.

Reimplement get_wchan() by calling stack_trace_save_tsk(), which is
adapted to the ORC and frame pointer unwinders.

Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211008111626.271115116@infradead.org
2021-10-15 11:25:14 +02:00
Paolo Bonzini 019057bd73 KVM: SEV-ES: fix length of string I/O
The size of the data in the scratch buffer is not divided by the size of
each port I/O operation, so vcpu->arch.pio.count ends up being larger
than it should be by a factor of size.

Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-15 04:47:36 -04:00
Jakub Kicinski e15f5972b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/ioam6.sh
  7b1700e009 ("selftests: net: modify IOAM tests for undef bits")
  bf77b1400a ("selftests: net: Test for the IOAM encapsulation with IPv6")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14 16:50:14 -07:00
Borislav Petkov 711885906b x86/Kconfig: Do not enable AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT automatically
This Kconfig option was added initially so that memory encryption is
enabled by default on machines which support it.

However, devices which have DMA masks that are less than the bit
position of the encryption bit, aka C-bit, require the use of an IOMMU
or the use of SWIOTLB.

If the IOMMU is disabled or in passthrough mode, the kernel would switch
to SWIOTLB bounce-buffering for those transfers.

In order to avoid that,

  2cc13bb4f5 ("iommu: Disable passthrough mode when SME is active")

disables the default IOMMU passthrough mode so that devices for which the
default 256K DMA is insufficient, can use the IOMMU instead.

However 2, there are cases where the IOMMU is disabled in the BIOS, etc.
(think the usual hardware folk "oops, I dropped the ball there" cases) or a
driver doesn't properly use the DMA APIs or a device has a firmware or
hardware bug, e.g.:

  ea68573d40 ("drm/amdgpu: Fail to load on RAVEN if SME is active")

However 3, in the above GPU use case, there are APIs like Vulkan and
some OpenGL/OpenCL extensions which are under the assumption that
user-allocated memory can be passed in to the kernel driver and both the
GPU and CPU can do coherent and concurrent access to the same memory.
That cannot work with SWIOTLB bounce buffers, of course.

So, in order for those devices to function, drop the "default y" for the
SME by default active option so that users who want to have SME enabled,
will need to either enable it in their config or use "mem_encrypt=on" on
the kernel command line.

 [ tlendacky: Generalize commit message. ]

Fixes: 7744ccdbc1 ("x86/mm: Add Secure Memory Encryption (SME) support")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/8bbacd0e-4580-3194-19d2-a0ecad7df09c@molgen.mpg.de
2021-10-11 19:14:22 +02:00
Linus Torvalds c22ccc4a3e - A FPU fix to properly handle invalid MXCSR values: 32-bit masks them
out due to histerical reasons and 64-bit kernels reject them
 
 - A fix to clear X86_FEATURE_SMAP when support for is not config-enabled
 
 - Three fixes correcting misspelled Kconfig symbols used in code
 
 - Two resctrl object cleanup fixes
 
 - Yet another attempt at fixing the neverending saga of botched x86
 timers, this time because some incredibly smart hardware decides to turn
 off the HPET timer in a low power state - who cares if the OS is relying
 on it...
 
 - Check the full return value range of an SEV VMGEXIT call to determine
 whether it returned an error
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFiqboACgkQEsHwGGHe
 VUpIFA//cLWfa1vvamCcLjW0lruQVzHrZesO4Cbti3Fyp2at/Dtwt9w/uZPu9NAa
 +sreJBdrkZfo9lmKW6/E1MmLT/YlLg8YHsylKn9d+XSdcy0qWXLYdVVm7bb4teJf
 XxRQfYNQrwfpjFNnt+7NUcaqte2zUo7K16CctJF5+E6SGUn+hlu6zK15tf6MMAM1
 TFHsQWEuRW5Mgc7eD734cNGDOJgzvb4IACn5BNfKR1+jD1ANfutytXjGqcveJ/sg
 lBoWMCU47vo5/uoW516oBK6PfQ/+s1OvYAx2G4DMQSC7WpEWpxnJUoszj9umu+jE
 VndS8jQ4WGXcVmfkkwUHbVxcJzsPEzZ/5m+nER9hrGOykKWTajzi2MirBHju5EKv
 xfYLqEJHNG9YulxKy2wIW0VRmXDE3wFZfaPAmQbLXud1KfzlC/EpEaloZSJSgqyG
 L4uOKk8CBumYJzgVCfTFAqqr1HhmeylYSxHmOUEzTm0sEJX2HuodGcl+sPI/LDPW
 MkjVYXq2sOUEVLmk5xyJIkbAUcK2X/Fzt3rKS4CVsjfzWRW67o1oopMy6ZrQ0o/h
 Dt/fHub/+Pke5sbB2+RiRsvq3aDftRkvaZK05pTiqlE9gFlKaCVwxDQqvmTnY0oa
 PkPzauXRC4qjNsdDMGHaiclm/fk/nlLM9vxXGJ+oTXP6snC4OhQ=
 =kKOw
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.15_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - A FPU fix to properly handle invalid MXCSR values: 32-bit masks them
   out due to historical reasons and 64-bit kernels reject them

 - A fix to clear X86_FEATURE_SMAP when support for is not
   config-enabled

 - Three fixes correcting misspelled Kconfig symbols used in code

 - Two resctrl object cleanup fixes

 - Yet another attempt at fixing the neverending saga of botched x86
   timers, this time because some incredibly smart hardware decides to
   turn off the HPET timer in a low power state - who cares if the OS is
   relying on it...

 - Check the full return value range of an SEV VMGEXIT call to determine
   whether it returned an error

* tag 'x86_urgent_for_v5.15_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Restore the masking out of reserved MXCSR bits
  x86/Kconfig: Correct reference to MWINCHIP3D
  x86/platform/olpc: Correct ifdef symbol to intended CONFIG_OLPC_XO15_SCI
  x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=n
  x86/entry: Correct reference to intended CONFIG_64_BIT
  x86/resctrl: Fix kfree() of the wrong type in domain_add_cpu()
  x86/resctrl: Free the ctrlval arrays when domain_setup_mon_state() fails
  x86/hpet: Use another crystalball to evaluate HPET usability
  x86/sev: Return an error on a returned non-zero SW_EXITINFO1[31:0]
2021-10-10 10:00:51 -07:00
Weizhao Ouyang 6644c654ea ftrace: Cleanup ftrace_dyn_arch_init()
Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
ftrace_dyn_arch_init() to cleanup them.

Link: https://lkml.kernel.org/r/20210909090216.1955240-1-o451686892@gmail.com

Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Acked-by: Helge Deller <deller@gmx.de> (parisc)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-08 19:41:39 -04:00
Linus Torvalds 3946b46cab xen: branch for v5.15-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYWBSIwAKCRCAXGG7T9hj
 vrXxAP9na1EqRJ+SpWyvxHY1jMaIrbg1bgnOc+GsnWxU5liW5AEA4h1HjHtVtrzL
 3vweIS6u2fanrWlYML/daQ3r6EuLPQc=
 =iXsP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.15b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - fix two minor issues in the Xen privcmd driver plus a cleanup patch
   for that driver

 - fix multiple issues related to running as PVH guest and some related
   earlyprintk fixes for other Xen guest types

 - fix an issue introduced in 5.15 the Xen balloon driver

* tag 'for-linus-5.15b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/balloon: fix cancelled balloon action
  xen/x86: adjust data placement
  x86/PVH: adjust function/data placement
  xen/x86: hook up xen_banner() also for PVH
  xen/x86: generalize preferred console model from PV to PVH Dom0
  xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU
  xen/x86: allow "earlyprintk=xen" to work for PV Dom0
  xen/x86: make "earlyprintk=xen" work better for PVH Dom0
  xen/x86: allow PVH Dom0 without XEN_PV=y
  xen/x86: prevent PVH type from getting clobbered
  xen/privcmd: drop "pages" parameter from xen_remap_pfn()
  xen/privcmd: fix error handling in mmap-resource processing
  xen/privcmd: replace kcalloc() by kvcalloc() when allocating empty pages
2021-10-08 12:55:23 -07:00
Linus Torvalds 0dcf60d001 asm-generic: build fixes for v5.15
There is one build fix for Arm platforms that ended up impacting most
 architectures because of the way the drivers/firmware Kconfig file is
 wired up:
 
 The CONFIG_QCOM_SCM dependency have caused a number of randconfig
 regressions over time, and some still remain in v5.15-rc4. The
 fix we agreed on in the end is to make this symbol selected by any
 driver using it, and then building it even for non-Arm platforms with
 CONFIG_COMPILE_TEST.
 
 To make this work on all architectures, the drivers/firmware/Kconfig
 file needs to be included for all architectures to make the symbol
 itself visible.
 
 In a separate discussion, we found that a sound driver patch that is
 pending for v5.16 needs the same change to include this Kconfig file,
 so the easiest solution seems to have my Kconfig rework included in v5.15.
 
 There is a small merge conflict against an earlier partial fix for the
 QCOM_SCM dependency problems.
 
 Finally, the branch also includes a small unrelated build fix for NOMMU
 architectures.
 
 Link: https://lore.kernel.org/all/20210928153508.101208f8@canb.auug.org.au/
 Link: https://lore.kernel.org/all/20210928075216.4193128-1-arnd@kernel.org/
 Link: https://lore.kernel.org/all/20211007151010.333516-1-arnd@kernel.org/
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmFgVp8ACgkQmmx57+YA
 GNlQoA/+O0ljtTy5D0MjRGmFDs11M5AtKNrfys82lm2GeEnc4lnxn722jLk8kR6s
 y6DSOWFs7w1bqhKExQNehZYtJO3sgW/9qiLMV9qfOx1Nc6WwhDPcYM9bMyGlpTmL
 M456nh8NopixV7slanNtfz1e0kbMKoK+4Ub7M5OHepK6x9FKQXQYQpeoBxaXHmWZ
 9eaRiL/CsRHO/cSkvpq1GtL7IVrudvij3FDHzxoDGFFjkCUm9LiN/8yrnVxHA9G7
 3EPyJazI559SsnxXJR32udGPJWZV1HZ7D5gbxDvzr5rZ9EX0JpyPGJsuXUR1wqlS
 UB2Y7AUTSxkwDiZ8UhPoXn6i67WAirzEsP2WmdS4v6NEbxlNloLGTIeGwcwkCRMU
 DBvMtDW8kKusgVu/OkEUgoC6MTRt+Mg+gZcQI/C4sp0MqZGaMY6c7abnYjqwEzBV
 ARS7bUYyME2GL6wNDPFB8esuD9jjdFXy96bGHATmzMxT3012K3X7ufFOzJZ+GOF9
 pan00fgoC17oiI+Xu/sZEHns6KvMTSE11Aw3uk+yhHxYtZbzWi2B5Nk+4tBdsOxF
 PAZdZ5qsyuEcBw+PyfbyZIHWOrlbvZkrmjiIsMJo63cIXuOtgraCjvRRAwe/ZwoU
 PXgPcUmrlAs06WjKhuQAZWt6bww7cEP2XyOYlDqwZ4Vj0dqav6g=
 =187C
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-fixes-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic fixes from Arnd Bergmann:
 "There is one build fix for Arm platforms that ended up impacting most
  architectures because of the way the drivers/firmware Kconfig file is
  wired up:

  The CONFIG_QCOM_SCM dependency have caused a number of randconfig
  regressions over time, and some still remain in v5.15-rc4. The fix we
  agreed on in the end is to make this symbol selected by any driver
  using it, and then building it even for non-Arm platforms with
  CONFIG_COMPILE_TEST.

  To make this work on all architectures, the drivers/firmware/Kconfig
  file needs to be included for all architectures to make the symbol
  itself visible.

  In a separate discussion, we found that a sound driver patch that is
  pending for v5.16 needs the same change to include this Kconfig file,
  so the easiest solution seems to have my Kconfig rework included in
  v5.15.

  Finally, the branch also includes a small unrelated build fix for
  NOMMU architectures"

Link: https://lore.kernel.org/all/20210928153508.101208f8@canb.auug.org.au/
Link: https://lore.kernel.org/all/20210928075216.4193128-1-arnd@kernel.org/
Link: https://lore.kernel.org/all/20211007151010.333516-1-arnd@kernel.org/

* tag 'asm-generic-fixes-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  asm-generic/io.h: give stub iounmap() on !MMU same prototype as elsewhere
  qcom_scm: hide Kconfig symbol
  firmware: include drivers/firmware/Kconfig unconditionally
2021-10-08 11:57:54 -07:00
Borislav Petkov d298b03506 x86/fpu: Restore the masking out of reserved MXCSR bits
Ser Olmy reported a boot failure:

  init[1] bad frame in sigreturn frame:(ptrval) ip:b7c9fbe6 sp:bf933310 orax:ffffffff \
	  in libc-2.33.so[b7bed000+156000]
  Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
  CPU: 0 PID: 1 Comm: init Tainted: G        W         5.14.9 #1
  Hardware name: Hewlett-Packard HP PC/HP Board, BIOS  JD.00.06 12/06/2001
  Call Trace:
   dump_stack_lvl
   dump_stack
   panic
   do_exit.cold
   do_group_exit
   get_signal
   arch_do_signal_or_restart
   ? force_sig_info_to_task
   ? force_sig
   exit_to_user_mode_prepare
   syscall_exit_to_user_mode
   do_int80_syscall_32
   entry_INT80_32

on an old 32-bit Intel CPU:

  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 6
  model name      : Celeron (Mendocino)
  stepping        : 5
  microcode       : 0x3

Ser bisected the problem to the commit in Fixes.

tglx suggested reverting the rejection of invalid MXCSR values which
this commit introduced and replacing it with what the old code did -
simply masking them out to zero.

Further debugging confirmed his suggestion:

  fpu->state.fxsave.mxcsr: 0xb7be13b4, mxcsr_feature_mask: 0xffbf
  WARNING: CPU: 0 PID: 1 at arch/x86/kernel/fpu/signal.c:384 __fpu_restore_sig+0x51f/0x540

so restore the original behavior only for 32-bit kernels where you have
ancient machines with buggy hardware. For 32-bit programs on 64-bit
kernels, user space which supplies wrong MXCSR values is considered
malicious so fail the sigframe restoration there.

Fixes: 6f9866a166 ("x86/fpu/signal: Let xrstor handle the features to init")
Reported-by: Ser Olmy <ser.olmy@protonmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Ser Olmy <ser.olmy@protonmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YVtA67jImg3KlBTw@zn.tnic
2021-10-08 11:12:17 +02:00
Jakub Kicinski 9fe1155233 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07 15:24:06 -07:00
Jie Meng 6364d7d75a bpf, x64: Factor out emission of REX byte in more cases
Introduce a single reg version of maybe_emit_mod() and factor out
common code in more cases.

Signed-off-by: Jie Meng <jmeng@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211006194135.608932-1-jmeng@fb.com
2021-10-07 21:45:06 +02:00
Linus Torvalds 52bf8031c0 hyperv-fixes for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmFeykwTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXhRLCADXOOSGKk4L1vWssRRhLmMXI45ElocY
 EbZ/mXcQhxKnlVhdMNnupGjz+lU5FQGkCCWlhmt9Ml2O6R+lDx+zIUS8BK3Nkom9
 twWjueMtum6yFwDMGYALhptVLjDqVFG71QcW0incghpnAx4s2FVE8h38md5MuUFY
 Kqqf/dRkppSePldHFrRG/e4c6r0WyTsJ6Z9LTU0UYp5GqJcmUJlx7TxxqzGk5Fti
 GpQ5cFS7JX8xHAkRROk/dvwJte1RRnBAW6lIWxwAaDJ6Gbg7mNfOQe7n+/KRO7ZG
 gC5hbkP9tMv2nthLxaFbpu791U4lMZ2WiTLZvbgCseO3FCmToXWZ6TDd
 =1mdq
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Replace uuid.h with types.h in a header (Andy Shevchenko)

 - Avoid sleeping in atomic context in PCI driver (Long Li)

 - Avoid sending IPI to self when it shouldn't (Vitaly Kuznetsov)

* tag 'hyperv-fixes-signed-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Avoid erroneously sending IPI to 'self'
  hyper-v: Replace uuid.h with types.h
  PCI: hv: Fix sleep while in non-sleep context when removing child devices from the bus
2021-10-07 09:44:48 -07:00
Arnd Bergmann 951cd3a086
firmware: include drivers/firmware/Kconfig unconditionally
Compile-testing drivers that require access to a firmware layer
fails when that firmware symbol is unavailable. This happened
twice this week:

 - My proposed to change to rework the QCOM_SCM firmware symbol
   broke on ppc64 and others.

 - The cs_dsp firmware patch added device specific firmware loader
   into drivers/firmware, which broke on the same set of
   architectures.

We should probably do the same thing for other subsystems as well,
but fix this one first as this is a dependency for other patches
getting merged.

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Charles Keepax <ckeepax@opensource.cirrus.com>
Cc: Simon Trimmer <simont@opensource.cirrus.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-10-07 16:51:26 +02:00
André Almeida 039c0ec9bb futex,x86: Wire up sys_futex_waitv()
Wire up syscall entry point for x86 arch, for both i386 and x86_64.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210923171111.300673-18-andrealmeid@collabora.com
2021-10-07 13:51:11 +02:00
Peter Zijlstra b08cadbd3b Merge branch 'objtool/urgent'
Fixup conflicts.

# Conflicts:
#	tools/objtool/check.c
2021-10-07 00:40:17 +02:00
Borislav Petkov 541ac97186 x86/sev: Make the #VC exception stacks part of the default stacks storage
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.

Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.

Rip out the custom VC stacks mapping and storage code.

 [ bp: Steal and adapt Tom's commit message. ]

Fixes: 7fae4c24a2 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
2021-10-06 21:48:27 +02:00
Lukas Bulwahn 225bac2dc5 x86/Kconfig: Correct reference to MWINCHIP3D
Commit in Fixes intended to exclude the Winchip series and referred to
CONFIG_WINCHIP3D, but the config symbol is called CONFIG_MWINCHIP3D.

Hence, scripts/checkkconfigsymbols.py warns:

WINCHIP3D
Referencing files: arch/x86/Kconfig

Correct the reference to the intended config symbol.

Fixes: 69b8d3fcab ("x86/Kconfig: Exclude i586-class CPUs lacking PAE support from the HIGHMEM64G Kconfig group")
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210803113531.30720-4-lukas.bulwahn@gmail.com
2021-10-06 18:46:06 +02:00
Lukas Bulwahn 4758fd801f x86/platform/olpc: Correct ifdef symbol to intended CONFIG_OLPC_XO15_SCI
The refactoring in the commit in Fixes introduced an ifdef
CONFIG_OLPC_XO1_5_SCI, however the config symbol is actually called
"CONFIG_OLPC_XO15_SCI".

Fortunately, ./scripts/checkkconfigsymbols.py warns:

OLPC_XO1_5_SCI
Referencing files: arch/x86/platform/olpc/olpc.c

Correct this ifdef condition to the intended config symbol.

Fixes: ec9964b480 ("Platform: OLPC: Move EC-specific functionality out from x86")
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210803113531.30720-3-lukas.bulwahn@gmail.com
2021-10-06 18:46:06 +02:00
Vegard Nossum 3958b9c34c x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=n
Commit

  3c73b81a91 ("x86/entry, selftests: Further improve user entry sanity checks")

added a warning if AC is set when in the kernel.

Commit

  662a022189 ("x86/entry: Fix AC assertion")

changed the warning to only fire if the CPU supports SMAP.

However, the warning can still trigger on a machine that supports SMAP
but where it's disabled in the kernel config and when running the
syscall_nt selftest, for example:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 49 at irqentry_enter_from_user_mode
  CPU: 0 PID: 49 Comm: init Tainted: G                T 5.15.0-rc4+ #98 e6202628ee053b4f310759978284bd8bb0ce6905
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  RIP: 0010:irqentry_enter_from_user_mode
  ...
  Call Trace:
   ? irqentry_enter
   ? exc_general_protection
   ? asm_exc_general_protection
   ? asm_exc_general_protectio

IS_ENABLED(CONFIG_X86_SMAP) could be added to the warning condition, but
even this would not be enough in case SMAP is disabled at boot time with
the "nosmap" parameter.

To be consistent with "nosmap" behaviour, clear X86_FEATURE_SMAP when
!CONFIG_X86_SMAP.

Found using entry-fuzz + satrandconfig.

 [ bp: Massage commit message. ]

Fixes: 3c73b81a91 ("x86/entry, selftests: Further improve user entry sanity checks")
Fixes: 662a022189 ("x86/entry: Fix AC assertion")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211003223423.8666-1-vegard.nossum@oracle.com
2021-10-06 18:46:06 +02:00
Lukas Bulwahn 2c861f2b85 x86/entry: Correct reference to intended CONFIG_64_BIT
Commit in Fixes adds a condition with IS_ENABLED(CONFIG_64_BIT),
but the intended config item is called CONFIG_64BIT, as defined in
arch/x86/Kconfig.

Fortunately, scripts/checkkconfigsymbols.py warns:

64_BIT
Referencing files: arch/x86/include/asm/entry-common.h

Correct the reference to the intended config symbol.

Fixes: 662a022189 ("x86/entry: Fix AC assertion")
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210803113531.30720-2-lukas.bulwahn@gmail.com
2021-10-06 18:46:02 +02:00
James Morse d4ebfca26d x86/resctrl: Fix kfree() of the wrong type in domain_add_cpu()
Commit in Fixes separated the architecture specific and filesystem parts
of the resctrl domain structures.

This left the error paths in domain_add_cpu() kfree()ing the memory with
the wrong type.

This will cause a problem if someone adds a new member to struct
rdt_hw_domain meaning d_resctrl is no longer the first member.

Fixes: 792e0f6f78 ("x86/resctrl: Split struct rdt_domain")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20210917165924.28254-1-james.morse@arm.com
2021-10-06 18:45:27 +02:00
James Morse 64e87d4bd3 x86/resctrl: Free the ctrlval arrays when domain_setup_mon_state() fails
domain_add_cpu() is called whenever a CPU is brought online. The
earlier call to domain_setup_ctrlval() allocates the control value
arrays.

If domain_setup_mon_state() fails, the control value arrays are not
freed.

Add the missing kfree() calls.

Fixes: 1bd2a63b4f ("x86/intel_rdt/mba_sc: Add initialization support")
Fixes: edf6fa1c4a ("x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210917165958.28313-1-james.morse@arm.com
2021-10-06 18:45:21 +02:00
Vitaly Kuznetsov f5c20e4a5f x86/hyperv: Avoid erroneously sending IPI to 'self'
__send_ipi_mask_ex() uses an optimization: when the target CPU mask is
equal to 'cpu_present_mask' it uses 'HV_GENERIC_SET_ALL' format to avoid
converting the specified cpumask to VP_SET. This case was overlooked when
'exclude_self' parameter was added. As the result, a spurious IPI to
'self' can be send.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: dfb5c1e12c ("x86/hyperv: remove on-stack cpumask from hv_send_ipi_mask_allbutself")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211006125016.941616-1-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-10-06 15:56:45 +00:00
Jie Meng 57a610f1c5 bpf, x64: Save bytes for DIV by reducing reg copies
Instead of unconditionally performing push/pop on %rax/%rdx in case of
division/modulo, we can save a few bytes in case of destination register
being either BPF r0 (%rax) or r3 (%rdx) since the result is written in
there anyway.

Also, we do not need to copy the source to %r11 unless the source is either
%rax, %rdx or an immediate.

For example, before the patch:

  22:   push   %rax
  23:   push   %rdx
  24:   mov    %rsi,%r11
  27:   xor    %edx,%edx
  29:   div    %r11
  2c:   mov    %rax,%r11
  2f:   pop    %rdx
  30:   pop    %rax
  31:   mov    %r11,%rax

After:

  22:   push   %rdx
  23:   xor    %edx,%edx
  25:   div    %rsi
  28:   pop    %rdx

Signed-off-by: Jie Meng <jmeng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211002035626.2041910-1-jmeng@fb.com
2021-10-06 15:24:36 +02:00
Borislav Petkov f96b467583 x86/insn: Use get_unaligned() instead of memcpy()
Use get_unaligned() instead of memcpy() to access potentially unaligned
memory, which, when accessed through a pointer, leads to undefined
behavior. get_unaligned() describes much better what is happening there
anyway even if memcpy() does the job.

In addition, since perf tool builds with -Werror, it would fire with:

  util/intel-pt-decoder/../../../arch/x86/lib/insn.c: In function '__insn_get_emulate_prefix':
  tools/include/../include/asm-generic/unaligned.h:10:15: error: packed attribute is unnecessary [-Werror=packed]
     10 |  const struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \

because -Werror=packed would complain if the packed attribute would have
no effect on the layout of the structure.

In this case, that is intentional so disable the warning only for that
compilation unit.

That part is Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>

No functional changes.

Fixes: 5ba1071f75 ("x86/insn, tools/x86: Fix undefined behavior due to potential unaligned accesses")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/YVSsIkj9Z29TyUjE@zn.tnic
2021-10-06 11:56:37 +02:00
Lukas Bulwahn 6bf8a55d83 x86: Fix misspelled Kconfig symbols
Fix misspelled Kconfig symbols as detected by
scripts/checkkconfigsymbols.py.

 [ bp: Combine into a single patch. ]

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210803113531.30720-7-lukas.bulwahn@gmail.com
2021-10-05 21:48:30 +02:00
Lukas Bulwahn 3fd3590b53 x86/Kconfig: Remove references to obsolete Kconfig symbols
Remove two symbols referenced in Kconfig which have been removed
previously by:

  ef3c67b645 ("mfd: intel_msic: Remove driver for deprecated platform")
  1b79fc4f2b ("x86/apb_timer: Remove driver for deprecated platform")

Detected by scripts/checkkconfigsymbols.py.

  [ bp: Merge into a single patch. ]

Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210803113531.30720-5-lukas.bulwahn@gmail.com
2021-10-05 21:36:58 +02:00
Ricardo Neri 183b8ec38f x86/sched: Decrease further the priorities of SMT siblings
When scheduling, it is better to prefer a separate physical core rather
than the SMT sibling of a high priority core. The existing formula to
compute priorities takes such fact in consideration. There may exist,
however, combinations of priorities (i.e., maximum frequencies) in which
the priority of high-numbered SMT siblings of high-priority cores collides
with the priority of low-numbered SMT siblings of low-priority cores.

Consider for instance an SMT2 system with CPUs [0, 1] with priority 60 and
[2, 3] with priority 30(CPUs in brackets are SMT siblings. In such a case,
the resulting priorities would be [120, 60], [60, 30]. Thus, to ensure
that CPU2 has higher priority than CPU1, divide the raw priority by the
squared SMT iterator. The resulting priorities are [120, 30]. [60, 15].

Originally-by: Len Brown <len.brown@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-2-ricardo.neri-calderon@linux.intel.com
2021-10-05 15:51:59 +02:00
Jan Beulich 9c11112c0e xen/x86: adjust data placement
Both xen_pvh and xen_start_flags get written just once early during
init. Using the respective annotation then allows the open-coded placing
in .data to go away.

Additionally the former, like the latter, wants exporting, or else
xen_pvh_domain() can't be used from modules.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/8155ed26-5a1d-c06f-42d8-596d26e75849@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:36:19 +02:00
Jan Beulich 59f7e53741 x86/PVH: adjust function/data placement
Two of the variables can live in .init.data, allowing the open-coded
placing in .data to go away. Another "variable" is used to communicate a
size value only to very early assembly code, which hence can be both
const and live in .init.*. Additionally two functions were lacking
__init annotations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

Link: https://lore.kernel.org/r/3b0bb22e-43f4-e459-c5cb-169f996b5669@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:36:17 +02:00
Jan Beulich 079c4baa2a xen/x86: hook up xen_banner() also for PVH
This was effectively lost while dropping PVHv1 code. Move the function
and arrange for it to be called the same way as done in PV mode. Clearly
this then needs re-introducing the XENFEAT_mmu_pt_update_preserve_ad
check that was recently removed, as that's a PV-only feature.

Since the string pointed at by pv_info.name describes the mode, drop
"paravirtualized" from the log message while moving the code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/de03054d-a20d-2114-bb86-eec28e17b3b8@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:36:14 +02:00
Jan Beulich 4d1ab432ac xen/x86: generalize preferred console model from PV to PVH Dom0
Without announcing hvc0 as preferred it won't get used as long as tty0
gets registered earlier. This is particularly problematic with there not
being any screen output for PVH Dom0 when the screen is in graphics
mode, as the necessary information doesn't get conveyed yet from the
hypervisor.

Follow PV's model, but be conservative and do this for Dom0 only for
now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/582328b6-c86c-37f3-d802-5539b7a86736@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:36:12 +02:00
Jan Beulich 8e24d9bfc4 xen/x86: allow "earlyprintk=xen" to work for PV Dom0
With preferred consoles "tty" and "hvc" announced as preferred,
registering "xenboot" early won't result in use of the console: It also
needs to be registered as preferred. Generalize this from being DomU-
only so far.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/d4a34540-a476-df2c-bca6-732d0d58c5f0@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:36:02 +02:00
Jan Beulich cae7d81a37 xen/x86: allow PVH Dom0 without XEN_PV=y
Decouple XEN_DOM0 from XEN_PV, converting some existing uses of XEN_DOM0
to a new XEN_PV_DOM0. (I'm not convinced all are really / should really
be PV-specific, but for starters I've tried to be conservative.)

For PVH Dom0 the hypervisor populates MADT with only x2APIC entries, so
without x2APIC support enabled in the kernel things aren't going to work
very well. (As opposed, DomU-s would only ever see LAPIC entries in MADT
as of now.) Note that this then requires PVH Dom0 to be 64-bit, as
X86_X2APIC depends on X86_64.

In the course of this xen_running_on_version_or_later() needs to be
available more broadly. Move it from a PV-specific to a generic file,
considering that what it does isn't really PV-specific at all anyway.

Note that xen/interface/version.h cannot be included on its own; in
enlighten.c, which uses SCHEDOP_* anyway, include xen/interface/sched.h
first to resolve the apparently sole missing type (xen_ulong_t).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

Link: https://lore.kernel.org/r/983bb72f-53df-b6af-14bd-5e088bd06a08@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:35:56 +02:00
Jan Beulich 9172b5c4a7 xen/x86: prevent PVH type from getting clobbered
Like xen_start_flags, xen_domain_type gets set before .bss gets cleared.
Hence this variable also needs to be prevented from getting put in .bss,
which is possible because XEN_NATIVE is an enumerator evaluating to
zero. Any use prior to init_hvm_pv_info() setting the variable again
would lead to wrong decisions; one such case is xenboot_console_setup()
when called as a result of "earlyprintk=xen".

Use __ro_after_init as more applicable than either __section(".data") or
__read_mostly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

Link: https://lore.kernel.org/r/d301677b-6f22-5ae6-bd36-458e1f323d0b@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:35:48 +02:00
Jan Beulich 97315723c4 xen/privcmd: drop "pages" parameter from xen_remap_pfn()
The function doesn't use it and all of its callers say in a comment that
their respective arguments are to be non-NULL only in auto-translated
mode. Since xen_remap_domain_mfn_array() isn't supposed to be used by
non-PV, drop the parameter there as well. It was bogusly passed as non-
NULL (PRIV_VMA_LOCKED) by its only caller anyway. For
xen_remap_domain_gfn_range(), otoh, it's not clear at all why this
wouldn't want / might not need to gain auto-translated support down the
road, so the parameter is retained there despite now remaining unused
(and the only caller passing NULL); correct a respective comment as
well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Link: https://lore.kernel.org/r/036ad8a2-46f9-ac3d-6219-bdc93ab9e10b@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-10-05 08:20:27 +02:00
Andrea Arcangeli 2f46993d83 x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
Switch the kernel default of SSBD and STIBP to the ones with
CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl
spectre_v2_user=prctl) even if CONFIG_SECCOMP=y.

Several motivations listed below:

- If SMT is enabled the seccomp jail can still attack the rest of the
  system even with spectre_v2_user=seccomp by using MDS-HT (except on
  XEON PHI where MDS can be tamed with SMT left enabled, but that's a
  special case). Setting STIBP become a very expensive window dressing
  after MDS-HT was discovered.

- The seccomp jail cannot attack the kernel with spectre-v2-HT
  regardless (even if STIBP is not set), but with MDS-HT the seccomp
  jail can attack the kernel too.

- With spec_store_bypass_disable=prctl the seccomp jail can attack the
  other userland (guest or host mode) using spectre-v2-HT, but the
  userland attack is already mitigated by both ASLR and pid namespaces
  for host userland and through virt isolation with libkrun or
  kata. (if something if somebody is worried about spectre-v2-HT it's
  best to mount proc with hidepid=2,gid=proc on workstations where not
  all apps may run under container runtimes, rather than slowing down
  all seccomp jails, but the best is to add pid namespaces to the
  seccomp jail). As opposed MDS-HT is not mitigated and the seccomp
  jail can still attack all other host and guest userland if SMT is
  enabled even with spec_store_bypass_disable=seccomp.

- If full security is required then MDS-HT must also be mitigated with
  nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp
  would become identical.

- Setting spectre_v2_user=seccomp is overall lower priority than to
  setting javascript.options.wasm false in about:config to protect
  against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT
  and STIBP which again is already statistically well mitigated by
  other means in userland and it's fully mitigated in kernel with
  retpolines (unlike the wasm assist call with MDS-HT).

- SSBD is needed to prevent reading the JIT memory and the primary
  user being the OpenJDK. However the primary user of SSBD wouldn't be
  covered by spec_store_bypass_disable=seccomp because it doesn't use
  seccomp and the primary user also explicitly declined to set
  PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily
  could. In fact it would need to set it only when the sandboxing
  mechanism is enabled for javaws applets, but it still declined it by
  declaring security within the same user address space as an
  untenable objective for their JIT, even in the sandboxing case where
  performance would be a lesser concern (for the record: I kind of
  disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and
  I prefer to run javaws through a wrapper that sets
  PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that
  even if the primary user of SSBD would use seccomp, they would
  invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now.

- runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s
  and podman have a default json seccomp allowlist that cannot be
  slowed down, so for the #1 seccomp user this change is already a
  noop.

- systemd/sshd or other apps that use seccomp, if they really need
  STIBP or SSBD, they need to explicitly set the
  PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind
  catch-all approach was done probably initially with a wishful
  thinking objective to pretend to have a peace of mind that it could
  magically fix it all. That was wishful thinking before MDS-HT was
  discovered, but after MDS-HT has been discovered it become just
  window dressing.

- For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP
  or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's
  needed with TCG it should be an opt-in with
  PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't
  slowdown KVM for nothing). For qemu+KVM STIBP would be even more
  window dressing than it is for all other apps, because in the
  qemu+KVM case there's not only the MDS attack to worry about with
  SMT enabled. Even after disabling SMT, there's still a theoretical
  spectre-v2 attack possible within the same thread context from guest
  mode to host ring3 that the host kernel retpoline mitigation has no
  theoretical chance to mitigate. On some kernels a
  ibrs-always/ibrs-retpoline opt-in model is provided that will
  enabled IBRS in the qemu host ring3 userland which fixes this
  theoretical concern. Only after enabling IBRS in the host userland
  it would then make sense to proceed and worry about STIBP and an
  attack on the other host userland, but then again SMT would need to
  be disabled for full security anyway, so that would render STIBP
  again a noop.

- last but not the least: the lack of "spec_store_bypass_disable=prctl
  spectre_v2_user=prctl" means the moment a guest boots and
  sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR
  which will make the guest vmexit forever slower, forcing KVM to
  issue a very slow rdmsr instruction at every vmexit. So the end
  result is that SPEC_CTRL MSR is only available in GCE. Most other
  public cloud providers don't expose SPEC_CTRL, which means that not
  only STIBP/SSBD isn't available, but IBPB isn't available either
  (which would cause no overhead to the guest or the hypervisor
  because it's write only and requires no reading during vmexit). So
  the current default already net loss in security (missing IBPB)
  which means most public cloud providers cannot achieve a fully
  secure guest with nosmt (and nosmt is enough to fully mitigate
  MDS-HT). It also means GCE and is unfairly penalized in performance
  because it provides the option to enable full security in the guest
  as an opt-in (i.e. nosmt and IBPB). So this change will allow all
  cloud providers to expose SPEC_CTRL without incurring into any
  hypervisor slowdown and at the same time it will remove the unfair
  penalization of GCE performance for doing the right thing and it'll
  allow to get full security with nosmt with IBPB being available (and
  STIBP becoming meaningless).

Example to put things in prospective: the STIBP enabled in seccomp has
never been about protecting apps using seccomp like sshd from an
attack from a malicious userland, but to the contrary it has always
been about protecting the system from an attack from sshd, after a
successful remote network exploit against sshd. In fact initially it
wasn't obvious STIBP would work both ways (STIBP was about preventing
the task that runs with STIBP to be attacked with spectre-v2-HT, but
accidentally in the STIBP case it also prevents the attack in the
other direction). In the hypothetical case that sshd has been remotely
exploited the last concern should be STIBP being set, because it'll be
still possible to obtain info even from the kernel by using MDS if
nosmt wasn't set (and if it was set, STIBP is a noop in the first
place). As opposed kernel cannot leak anything with spectre-v2 HT
because of retpolines and the userland is mitigated by ASLR already
and ideally PID namespaces too. If something it'd be worth checking if
sshd run the seccomp thread under pid namespaces too if available in
the running kernel. SSBD also would be a noop for sshd, since sshd
uses no JIT. If sshd prefers to keep doing the STIBP window dressing
exercise, it still can even after this change of defaults by opting-in
with PR_SPEC_INDIRECT_BRANCH.

Ultimately setting SSBD and STIBP by default for all seccomp jails is
a bad sweet spot and bad default with more cons than pros that end up
reducing security in the public cloud (by giving an huge incentive to
not expose SPEC_CTRL which would be needed to get full security with
IBPB after setting nosmt in the guest) and by excessively hurting
performance to more secure apps using seccomp that end up having to
opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW.

The following is the verified result of the new default with SMT
enabled:

(gdb) print spectre_v2_user_stibp
$1 = SPECTRE_V2_USER_PRCTL
(gdb) print spectre_v2_user_ibpb
$2 = SPECTRE_V2_USER_PRCTL
(gdb) print ssb_mode
$3 = SPEC_STORE_BYPASS_PRCTL

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com
2021-10-04 12:12:57 -07:00
Borislav Petkov c7419a6e1a Merge branch x86/cc into x86/core
Pick up dependent cc_platform_has() changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-10-04 17:37:22 +02:00
Tom Lendacky e9d1d2bb75 treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
Replace uses of mem_encrypt_active() with calls to cc_platform_has() with
the CC_ATTR_MEM_ENCRYPT attribute.

Remove the implementation of mem_encrypt_active() across all arches.

For s390, since the default implementation of the cc_platform_has()
matches the s390 implementation of mem_encrypt_active(), cc_platform_has()
does not need to be implemented in s390 (the config option
ARCH_HAS_CC_PLATFORM is not set).

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-9-bp@alien8.de
2021-10-04 11:47:24 +02:00
Tom Lendacky 6283f2effb x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
Replace uses of sev_es_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_STATE_ENCRYPT. If future support is added for other
memory encyrption techonologies, the use of CC_ATTR_GUEST_STATE_ENCRYPT
can be updated, as required.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-8-bp@alien8.de
2021-10-04 11:47:09 +02:00
Tom Lendacky 4d96f91091 x86/sev: Replace occurrences of sev_active() with cc_platform_has()
Replace uses of sev_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_GUEST_MEM_ENCRYPT
can be updated, as required.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-7-bp@alien8.de
2021-10-04 11:46:58 +02:00
Tom Lendacky 32cb4d02fb x86/sme: Replace occurrences of sme_active() with cc_platform_has()
Replace uses of sme_active() with the more generic cc_platform_has()
using CC_ATTR_HOST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_HOST_MEM_ENCRYPT
can be updated, as required.

This also replaces two usages of sev_active() that are really geared
towards detecting if SME is active.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-6-bp@alien8.de
2021-10-04 11:46:46 +02:00
Tom Lendacky aa5a461171 x86/sev: Add an x86 version of cc_platform_has()
Introduce an x86 version of the cc_platform_has() function. This will be
used to replace vendor specific calls like sme_active(), sev_active(),
etc.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de
2021-10-04 11:46:20 +02:00
Tom Lendacky 402fe0cb71 x86/ioremap: Selectively build arch override encryption functions
In preparation for other uses of the cc_platform_has() function
besides AMD's memory encryption support, selectively build the
AMD memory encryption architecture override functions only when
CONFIG_AMD_MEM_ENCRYPT=y. These functions are:

- early_memremap_pgprot_adjust()
- arch_memremap_can_ram_remap()

Additionally, routines that are only invoked by these architecture
override functions can also be conditionally built. These functions are:

- memremap_should_map_decrypted()
- memremap_is_efi_data()
- memremap_is_setup_data()
- early_memremap_is_setup_data()

And finally, phys_mem_access_encrypted() is conditionally built as well,
but requires a static inline version of it when CONFIG_AMD_MEM_ENCRYPT is
not set.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-2-bp@alien8.de
2021-10-04 11:45:49 +02:00
Randy Dunlap ef775a0e36 x86/Kconfig: Fix an unused variable error in dell-smm-hwmon
When CONFIG_PROC_FS is not set, there is a build warning (turned
into an error):

  ../drivers/hwmon/dell-smm-hwmon.c: In function 'i8k_init_procfs':
  ../drivers/hwmon/dell-smm-hwmon.c:624:24: error: unused variable 'data' [-Werror=unused-variable]
    struct dell_smm_data *data = dev_get_drvdata(dev);

Make I8K depend on PROC_FS and HWMON (instead of selecting HWMON -- it
is strongly preferred to not select entire subsystems).

Build tested in all possible combinations of SENSORS_DELL_SMM, I8K, and
PROC_FS.

Fixes: 039ae58503 ("hwmon: Allow to compile dell-smm-hwmon driver without /proc/i8k")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Pali Rohár <pali@kernel.org>
Link: https://lkml.kernel.org/r/20210910071921.16777-1-rdunlap@infradead.org
2021-10-04 11:11:48 +02:00
Linus Torvalds 291073a566 kvm: fix objtool relocation warning
The recent change to make objtool aware of more symbol relocation types
(commit 24ff65257375: "objtool: Teach get_alt_entry() about more
relocation types") also added another check, and resulted in this
objtool warning when building kvm on x86:

    arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception

The reason seems to be that kvm_fastop_exception() is marked as a global
symbol, which causes the relocation to ke kept around for objtool.  And
at the same time, the kvm_fastop_exception definition (which is done as
an inline asm statement) doesn't actually set the type of the global,
which then makes objtool unhappy.

The minimal fix is to just not mark kvm_fastop_exception as being a
global symbol.  It's only used in that one compilation unit anyway, so
it was always pointless.  That's how all the other local exception table
labels are done.

I'm not entirely happy about the kinds of games that the kvm code plays
with doing its own exception handling, and the fact that it confused
objtool is most definitely a symptom of the code being a bit too subtle
and ad-hoc.  But at least this trivial one-liner makes objtool no longer
upset about what is going on.

Fixes: 24ff652573 ("objtool: Teach get_alt_entry() about more relocation types")
Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
Cc: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-03 13:34:19 -07:00
Linus Torvalds 3a399a2bc4 - Make sure the destroy callback is reset when a event initialization fails
- Update the event constraints for Icelake
 
 - Make sure the active time of an event is updated even for inactive events
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFZfZoACgkQEsHwGGHe
 VUqonA//SskDtPRTrbHoc1pDZEBOoYmuYZpsVIXMTcjxt8DkdGZcmgSRzJ53QdGT
 aOv+Oc7qIsOa7qcRPaUxlyjnqbWEynOmeHaIN2PtKebOKsXsvdf0ehrrOYjgoD5g
 1ENCisvV6TyyPtQPDv9QkYkGScZXg5/D8VyrlV5GsRj+MslWsfVWaJvUCHQtJcnz
 YPbcOFw1MHCBtUuJXtXuzA2JDp0Lfr6aX5Zl6hXqP8k69Y+Ob0bjA3jopo29l5rU
 lp/y4MQcFHuEAbQ7AdVeKg3Yi/RnSEI0w3HT1xDeKyXlMGp+Xz5rDBg1I+WREDri
 u7QuT9Mx6XSzGHGeg/Y5XbMN+M5WCQUxH5gJMLiduG0YYULWhq/j4f5tA0U0cQ80
 qVKRH8luh+0R+RLZtqno9pNg0pSdB47X8lqXOX0/DyAA7f6xPm17VBvyDeuF75AL
 vYuhdtu0vB5Iwsccj+h5cXGsaMWCzcriGABWZL/1JZFwRT8kRadiybJkv0n6R77F
 qDqUMu6cLM9kDhTeggDffoQZoxLiEL6pfeWFuWcaWdVXa8veX4lZZNnPFbqNhnJm
 zRm758r3jdHie1BAgFXym6Gt5EhkBhtofbUBaw3T2SdxFTs0csKzF5tzCXIvIyz2
 LH3Is3ce0pw+A72barhbLua2IJ2h895DGE529sQBiaSibB53tYk=
 =TQvw
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Make sure the destroy callback is reset when a event initialization
   fails

 - Update the event constraints for Icelake

 - Make sure the active time of an event is updated even for inactive
   events

* tag 'perf_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: fix userpage->time_enabled of inactive events
  perf/x86/intel: Update event constraints for ICX
  perf/x86: Reset destroy callback on event init failure
2021-10-03 10:32:27 -07:00
Richard Guy Briggs 1c30e3af8a audit: add support for the openat2 syscall
The openat2(2) syscall was added in kernel v5.6 with commit
fddb5d430a ("open: introduce openat2(2) syscall").

Add the openat2(2) syscall to the audit syscall classifier.

Link: https://github.com/linux-audit/audit-kernel/issues/67
Link: https://lore.kernel.org/r/f5f1a4d8699613f8c02ce762807228c841c2e26f.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: merge fuzz due to previous header rename, commit line wraps]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-01 16:52:48 -04:00
Richard Guy Briggs 42f355ef59 audit: replace magic audit syscall class numbers with macros
Replace audit syscall class magic numbers with macros.

This required putting the macros into new header file
include/linux/audit_arch.h since the syscall macros were
included for both 64 bit and 32 bit in any compat code, causing
redefinition warnings.

Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: renamed header to audit_arch.h after consulting with Richard]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-01 16:41:33 -04:00
Linus Torvalds b2626f1e32 Small x86 fixes.
-----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFXQUoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMglgf/egh3zb9/+BUQWe0xWfhcINNzpsVk
 PJtiBmJc3nQLbZbTSLp63rouy1lNgR0s2DiMwP7G1u39OwW8W3LHMrBUSqF1F01+
 gntb4GGiRTiTPJI64K4z6ytORd3tuRarHq8TUIa2zvki9ZW5Obgkm1i1RsNMOo+s
 AOA7whhpS8e/a5fBbtbS9bTZb30PKTZmbW4oMjvO9Sw4Eb76IauqPSEtRPSuCAc7
 r7z62RTlm10Qk0JR3tW1iXMxTJHZk+tYPJ8pclUAWVX5bZqWa/9k8R0Z5i/miFiZ
 glW/y3R4+aUwIQV2v7V3Jx9MOKDhZxniMtnqZG/Hp9NVDtWIz37V/U37vw==
 =zQQ1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm fixes from Paolo Bonzini:
 "Small x86 fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: selftests: Ensure all migrations are performed when test is affined
  KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
  ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
  x86/kvmclock: Move this_cpu_pvti into kvmclock.h
  selftests: KVM: Don't clobber XMM register when read
  KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
2021-10-01 11:08:07 -07:00
Kan Liang ecc2123e09 perf/x86/intel: Update event constraints for ICX
According to the latest event list, the event encoding 0xEF is only
available on the first 4 counters. Add it into the event constraints
table.

Fixes: 6017608936 ("perf/x86/intel: Add Icelake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1632842343-25862-1-git-send-email-kan.liang@linux.intel.com
2021-10-01 13:57:54 +02:00
Anand K Mistry 02d029a41d perf/x86: Reset destroy callback on event init failure
perf_init_event tries multiple init callbacks and does not reset the
event state between tries. When x86_pmu_event_init runs, it
unconditionally sets the destroy callback to hw_perf_event_destroy. On
the next init attempt after x86_pmu_event_init, in perf_try_init_event,
if the pmu's capabilities includes PERF_PMU_CAP_NO_EXCLUDE, the destroy
callback will be run. However, if the next init didn't set the destroy
callback, hw_perf_event_destroy will be run (since the callback wasn't
reset).

Looking at other pmu init functions, the common pattern is to only set
the destroy callback on a successful init. Resetting the callback on
failure tries to replicate that pattern.

This was discovered after commit f11dd0d805 ("perf/x86/amd/ibs: Extend
PERF_PMU_CAP_NO_EXCLUDE to IBS Op") when the second (and only second)
run of the perf tool after a reboot results in 0 samples being
generated. The extra run of hw_perf_event_destroy results in
active_events having an extra decrement on each perf run. The second run
has active_events == 0 and every subsequent run has active_events < 0.
When active_events == 0, the NMI handler will early-out and not record
any samples.

Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210929170405.1.I078b98ee7727f9ae9d6df8262bad7e325e40faf0@changeid
2021-10-01 13:57:54 +02:00
Thomas Gleixner 6e3cd95234 x86/hpet: Use another crystalball to evaluate HPET usability
On recent Intel systems the HPET stops working when the system reaches PC10
idle state.

The approach of adding PCI ids to the early quirks to disable HPET on
these systems is a whack a mole game which makes no sense.

Check for PC10 instead and force disable HPET if supported. The check is
overbroad as it does not take ACPI, intel_idle enablement and command
line parameters into account. That's fine as long as there is at least
PMTIMER available to calibrate the TSC frequency. The decision can be
overruled by adding "hpet=force" on the kernel command line.

Remove the related early PCI quirks for affected Ice Cake and Coffin Lake
systems as they are not longer required. That should also cover all
other systems, i.e. Tiger Rag and newer generations, which are most
likely affected by this as well.

Fixes: Yet another hardware trainwreck
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
Cc: stable@vger.kernel.org
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
2021-10-01 13:38:13 +02:00
Tom Lendacky 06f2ac3d42 x86/sev: Return an error on a returned non-zero SW_EXITINFO1[31:0]
After returning from a VMGEXIT NAE event, SW_EXITINFO1[31:0] is checked
for a value of 1, which indicates an error and that SW_EXITINFO2
contains exception information. However, future versions of the GHCB
specification may define new values for SW_EXITINFO1[31:0], so really
any non-zero value should be treated as an error.

Fixes: 597cfe4821 ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 5.10+
Link: https://lkml.kernel.org/r/efc772af831e9e7f517f0439b13b41f56bad8784.1633063321.git.thomas.lendacky@amd.com
2021-10-01 11:14:41 +02:00
Masami Hiramatsu bf094cffea x86/kprobes: Fixup return address in generic trampoline handler
In x86, the fake return address on the stack saved by
__kretprobe_trampoline() will be replaced with the real return
address after returning from trampoline_handler(). Before fixing
the return address, the real return address can be found in the
'current->kretprobe_instances'.

However, since there is a window between updating the
'current->kretprobe_instances' and fixing the address on the stack,
if an interrupt happens at that timing and the interrupt handler
does stacktrace, it may fail to unwind because it can not get
the correct return address from 'current->kretprobe_instances'.

This will eliminate that window by fixing the return address
right before updating 'current->kretprobe_instances'.

Link: https://lkml.kernel.org/r/163163057094.489837.9044470370440745866.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:08 -04:00
Masami Hiramatsu 19138af1bd x86/unwind: Recover kretprobe trampoline entry
Since the kretprobe replaces the function return address with
the kretprobe_trampoline on the stack, x86 unwinders can not
continue the stack unwinding at that point, or record
kretprobe_trampoline instead of correct return address.

To fix this issue, find the correct return address from task's
kretprobe_instances as like as function-graph tracer does.

With this fix, the unwinder can correctly unwind the stack
from kretprobe event on x86, as below.

           <...>-135     [003] ...1     6.722338: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read)
           <...>-135     [003] ...1     6.722377: <stack trace>
 => kretprobe_trace_func+0x209/0x2f0
 => kretprobe_dispatcher+0x4a/0x70
 => __kretprobe_trampoline_handler+0xca/0x150
 => trampoline_handler+0x44/0x70
 => kretprobe_trampoline+0x2a/0x50
 => vfs_read+0xab/0x1a0
 => ksys_read+0x5f/0xe0
 => do_syscall_64+0x33/0x40
 => entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/163163055130.489837.5161749078833497255.stgit@devnote2

Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:07 -04:00
Masami Hiramatsu 1f36839308 x86/kprobes: Push a fake return address at kretprobe_trampoline
Change __kretprobe_trampoline() to push the address of the
__kretprobe_trampoline() as a fake return address at the bottom
of the stack frame. This fake return address will be replaced
with the correct return address in the trampoline_handler().

With this change, the ORC unwinder can check whether the return
address is modified by kretprobes or not.

Link: https://lkml.kernel.org/r/163163054185.489837.14338744048957727386.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:07 -04:00
Josh Poimboeuf eb4a3f7d78 x86/kprobes: Add UNWIND_HINT_FUNC on kretprobe_trampoline()
Add UNWIND_HINT_FUNC on __kretprobe_trampoline() code so that ORC
information is generated on the __kretprobe_trampoline() correctly.
Also, this uses STACK_FRAME_NON_STANDARD_FP(), CONFIG_FRAME_POINTER-
-specific version of STACK_FRAME_NON_STANDARD().

Link: https://lkml.kernel.org/r/163163049242.489837.11970969750993364293.stgit@devnote2

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:07 -04:00
Masami Hiramatsu adf8a61a94 kprobes: treewide: Make it harder to refer kretprobe_trampoline directly
Since now there is kretprobe_trampoline_addr() for referring the
address of kretprobe trampoline code, we don't need to access
kretprobe_trampoline directly.

Make it harder to refer by renaming it to __kretprobe_trampoline().

Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:06 -04:00
Masami Hiramatsu 96fed8ac2b kprobes: treewide: Remove trampoline_address from kretprobe_trampoline_handler()
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:

  dereference_kernel_function_descriptor(kretprobe_trampoline)

In fact we had bugs in arch code where it wasn't replicated correctly.

So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.

Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:06 -04:00
Masami Hiramatsu c42421e205 kprobes: treewide: Use 'kprobe_opcode_t *' for the code address in get_optimized_kprobe()
Since get_optimized_kprobe() is only used inside kprobes,
it doesn't need to use 'unsigned long' type for 'addr' parameter.
Make it use 'kprobe_opcode_t *' for the 'addr' parameter and
subsequent call of arch_within_optimized_kprobe() also should use
'kprobe_opcode_t *'.

Note that MAX_OPTIMIZED_LENGTH and RELATIVEJUMP_SIZE are defined
by byte-size, but the size of 'kprobe_opcode_t' depends on the
architecture. Therefore, we must be careful when calculating
addresses using those macros.

Link: https://lkml.kernel.org/r/163163040680.489837.12133032364499833736.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30 21:24:05 -04:00
Jakub Kicinski dd9a887b35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/phy/bcm7xxx.c
  d88fd1b546 ("net: phy: bcm7xxx: Fixed indirect MMD operations")
  f68d08c437 ("net: phy: bcm7xxx: Add EPHY entry for 72165")

net/sched/sch_api.c
  b193e15ac6 ("net: prevent user from passing illegal stab size")
  69508d4333 ("net_sched: Use struct_size() and flex_array_size() helpers")

Both cases trivial - adjacent code additions.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-30 14:49:21 -07:00
Linus Torvalds 4de593fb96 Networking fixes for 5.15-rc4, including fixes from mac80211, netfilter
and bpf.
 
 Current release - regressions:
 
  - bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from
    interrupt
 
  - mdio: revert mechanical patches which broke handling of optional
    resources
 
  - dev_addr_list: prevent address duplication
 
 Previous releases - regressions:
 
  - sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
    (NULL deref)
 
  - Revert "mac80211: do not use low data rates for data frames with no
    ack flag", fixing broadcast transmissions
 
  - mac80211: fix use-after-free in CCMP/GCMP RX
 
  - netfilter: include zone id in tuple hash again, minimize collisions
 
  - netfilter: nf_tables: unlink table before deleting it (race -> UAF)
 
  - netfilter: log: work around missing softdep backend module
 
  - mptcp: don't return sockets in foreign netns
 
  - sched: flower: protect fl_walk() with rcu (race -> UAF)
 
  - ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup
 
  - smsc95xx: fix stalled rx after link change
 
  - enetc: fix the incorrect clearing of IF_MODE bits
 
  - ipv4: fix rtnexthop len when RTA_FLOW is present
 
  - dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this SKU
 
  - e100: fix length calculation & buffer overrun in ethtool::get_regs
 
 Previous releases - always broken:
 
  - mac80211: fix using stale frag_tail skb pointer in A-MSDU tx
 
  - mac80211: drop frames from invalid MAC address in ad-hoc mode
 
  - af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
    (race -> UAF)
 
  - bpf, x86: Fix bpf mapping of atomic fetch implementation
 
  - bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog
 
  - netfilter: ip6_tables: zero-initialize fragment offset
 
  - mhi: fix error path in mhi_net_newlink
 
  - af_unix: return errno instead of NULL in unix_create1() when
    over the fs.file-max limit
 
 Misc:
 
  - bpf: exempt CAP_BPF from checks against bpf_jit_limit
 
  - netfilter: conntrack: make max chain length random, prevent guessing
    buckets by attackers
 
  - netfilter: nf_nat_masquerade: make async masq_inet6_event handling
    generic, defer conntrack walk to work queue (prevent hogging RTNL lock)
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmFV5KYACgkQMUZtbf5S
 Irs3cRAAqNsgaQXSVOXcPKsndeDDKHIv1Ktes8aOP9tgEXZw5rpsbct7g9Yxc0os
 Oolyt6HThjWr3u1/e3HHJO9I5Klr/J8eQReoRZnKW+6TYZflmmzfuf8u1nx6SLP/
 tliz5y8wKbp8BqqNuTMdRpm+R1QQcNkXTeruUoR1PgREcY4J7bC2BRrqeZhBGHSR
 Z5yPOIietFN3nITxNwbe4AYJXlesMc6QCWhBtXjMPGQ4Zc4/sjDNfqi7eHJi2H2y
 kW2dHeXG86gnlgFllOBBWP85ptxynyxoNQJuhrxgC9T+/FpSVST7cwKbtmkwDI3M
 5WGmeE6B3yfF8iOQuR8fbKQmsnLgQlYhjpbbhgN0GxzkyI7RpGYOFroX0Pht4IVZ
 mwprDOtvoLs4UeDjULRMB0JZfRN75PCtVlhfUkhhJxXGCCmnhGYaxG/pE+6OQWlr
 +n8RXYYMoOzPaHIYTS9NGSGqT0r32IUy/W5Yfv3rEzSeehy2/fxzGr2fOyBGs+q7
 xrnqpsOnM8cODDwGMy3TclCI4Dd72WoHNCHPhA/bk/ZMjHpBd4CSEZPm8IROY3Ja
 g1t68cncgL8fB7TSD9WLFgYu67Lg5j0gC/BHOzUQDQMX5IbhOq/fj1xRy5Lc6SYp
 mqW1f7LdnixBe4W61VjDAYq5jJRqrwEWedx+rvV/ONLvr77KULk=
 =rSti
 -----END PGP SIGNATURE-----

Merge tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from mac80211, netfilter and bpf.

  Current release - regressions:

   - bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from
     interrupt

   - mdio: revert mechanical patches which broke handling of optional
     resources

   - dev_addr_list: prevent address duplication

  Previous releases - regressions:

   - sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
     (NULL deref)

   - Revert "mac80211: do not use low data rates for data frames with no
     ack flag", fixing broadcast transmissions

   - mac80211: fix use-after-free in CCMP/GCMP RX

   - netfilter: include zone id in tuple hash again, minimize collisions

   - netfilter: nf_tables: unlink table before deleting it (race -> UAF)

   - netfilter: log: work around missing softdep backend module

   - mptcp: don't return sockets in foreign netns

   - sched: flower: protect fl_walk() with rcu (race -> UAF)

   - ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup

   - smsc95xx: fix stalled rx after link change

   - enetc: fix the incorrect clearing of IF_MODE bits

   - ipv4: fix rtnexthop len when RTA_FLOW is present

   - dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this
     SKU

   - e100: fix length calculation & buffer overrun in ethtool::get_regs

  Previous releases - always broken:

   - mac80211: fix using stale frag_tail skb pointer in A-MSDU tx

   - mac80211: drop frames from invalid MAC address in ad-hoc mode

   - af_unix: fix races in sk_peer_pid and sk_peer_cred accesses (race
     -> UAF)

   - bpf, x86: Fix bpf mapping of atomic fetch implementation

   - bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog

   - netfilter: ip6_tables: zero-initialize fragment offset

   - mhi: fix error path in mhi_net_newlink

   - af_unix: return errno instead of NULL in unix_create1() when over
     the fs.file-max limit

  Misc:

   - bpf: exempt CAP_BPF from checks against bpf_jit_limit

   - netfilter: conntrack: make max chain length random, prevent
     guessing buckets by attackers

   - netfilter: nf_nat_masquerade: make async masq_inet6_event handling
     generic, defer conntrack walk to work queue (prevent hogging RTNL
     lock)"

* tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
  af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
  net: stmmac: fix EEE init issue when paired with EEE capable PHYs
  net: dev_addr_list: handle first address in __hw_addr_add_ex
  net: sched: flower: protect fl_walk() with rcu
  net: introduce and use lock_sock_fast_nested()
  net: phy: bcm7xxx: Fixed indirect MMD operations
  net: hns3: disable firmware compatible features when uninstall PF
  net: hns3: fix always enable rx vlan filter problem after selftest
  net: hns3: PF enable promisc for VF when mac table is overflow
  net: hns3: fix show wrong state when add existing uc mac address
  net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE
  net: hns3: don't rollback when destroy mqprio fail
  net: hns3: remove tc enable checking
  net: hns3: do not allow call hns3_nic_net_open repeatedly
  ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup
  net: bridge: mcast: Associate the seqcount with its protecting lock.
  net: mdio-ipq4019: Fix the error for an optional regs resource
  net: hns3: fix hclge_dbg_dump_tm_pg() stack usage
  net: mdio: mscc-miim: Fix the mdio controller
  af_unix: Return errno instead of NULL in unix_create1().
  ...
2021-09-30 14:28:05 -07:00
Ard Biesheuvel 5443f98fb9 x86: add CPU field to struct thread_info
The CPU field will be moved back into thread_info even when
THREAD_INFO_IN_TASK is enabled, so add it back to x86's definition of
struct thread_info.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
2021-09-30 14:40:30 +02:00
Sean Christopherson e8a747d088 KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
Check whether a CPUID entry's index is significant before checking for a
matching index to hack-a-fix an undefined behavior bug due to consuming
uninitialized data.  RESET/INIT emulation uses kvm_cpuid() to retrieve
CPUID.0x1, which does _not_ have a significant index, and fails to
initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
ECX, a.k.a. index, input.

Practically speaking, it's _extremely_  unlikely any compiler will yield
code that causes problems, as the compiler would need to inline the
kvm_cpuid() call to detect the uninitialized data, and intentionally hose
the kernel, e.g. insert ud2, instead of simply ignoring the result of
the index comparison.

Although the sketchy "dummy" pattern was introduced in SVM by commit
66f7b72e11 ("KVM: x86: Make register state after reset conform to
specification"), it wasn't actually broken until commit 7ff6c03503
("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
order of operations such that "index" was checked before the significant
flag.

Avoid consuming uninitialized data by reverting to checking the flag
before the index purely so that the fix can be easily backported; the
offending RESET/INIT code has been refactored, moved, and consolidated
from vendor code to common x86 since the bug was introduced.  A future
patch will directly address the bad RESET/INIT behavior.

The undefined behavior was detected by syzbot + KernelMemorySanitizer.

  BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
  BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
  BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
   cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
   kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
   kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
   kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
   kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
   vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
   vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
   kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020

  Local variable ----dummy@kvm_vcpu_reset created at:
   kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
   kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923

Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 2a24be79b6 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
Fixes: 7ff6c03503 ("KVM: x86: Remove stateful CPUID handling")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:20:33 -04:00
Zelin Deng ad9af93068 x86/kvmclock: Move this_cpu_pvti into kvmclock.h
There're other modules might use hv_clock_per_cpu variable like ptp_kvm,
so move it into kvmclock.h and export the symbol to make it visiable to
other modules.

Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:08:01 -04:00
Linus Torvalds 6e439bbd43 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This contains fixes for a resource leak in ccp as well as stack
  corruption in x86/sm4"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: x86/sm4 - Fix frame pointer stack corruption
  crypto: ccp - fix resource leaks in ccp_run_aes_gcm_cmd()
2021-09-29 07:37:46 -07:00
H. Peter Anvin (Intel) 067595d728 x86/boot: Fix make hdimage with older versions of mtools
Some versions of mtools (fixed somewhere between 4.0.31 and 4.0.35)
generate bad output for mformat when used with the partition= option.
Use the offset= option instead. An mtools.conf entry is *also* needed
with partition= to support mpartition; combining them in one entry does
not work either.

Don't specify the -t option to mpartition; it is unnecessary and seems
to confuse mpartition under some circumstances.

Also do a few minor optimizations:

Use a larger cluster size; there is no reason for the typical 4K
clusters when we are dealing mainly with comparatively huge files.

Start the partition at 32K. There is no reason to align it more than
that, since the internal FAT filesystem structures will at best be
cluster-aligned, and 32K is the maximum FAT cluster size.

 [ bp: Remove "we". ]

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210911003906.2700218-1-hpa@zytor.com
2021-09-29 11:06:47 +02:00
David S. Miller 4ccb9f03fe Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-09-28

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 14 day(s) which contain
a total of 11 files changed, 139 insertions(+), 53 deletions(-).

The main changes are:

1) Fix MIPS JIT jump code emission for too large offsets, from Piotr Krysiuk.

2) Fix x86 JIT atomic/fetch emission when dst reg maps to rax, from Johan Almbladh.

3) Fix cgroup_sk_alloc corner case when called from interrupt, from Daniel Borkmann.

4) Fix segfault in libbpf's linker for objects without BTF, from Kumar Kartikeya Dwivedi.

5) Fix bpf_jit_charge_modmem for applications with CAP_BPF, from Lorenz Bauer.

6) Fix return value handling for struct_ops BPF programs, from Hou Tao.

7) Various fixes to BPF selftests, from Jiri Benc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
,
2021-09-28 13:52:46 +01:00
Johan Almbladh ced185824c bpf, x86: Fix bpf mapping of atomic fetch implementation
Fix the case where the dst register maps to %rax as otherwise this produces
an incorrect mapping with the implementation in 981f94c3e9 ("bpf: Add
bitwise atomic instructions") as %rax is clobbered given it's part of the
cmpxchg as operand.

The issue is similar to b29dd96b90 ("bpf, x86: Fix BPF_FETCH atomic and/or/
xor with r0 as src") just that the case of dst register was missed.

Before, dst=r0 (%rax) src=r2 (%rsi):

  [...]
  c5:   mov    %rax,%r10
  c8:   mov    0x0(%rax),%rax       <---+ (broken)
  cc:   mov    %rax,%r11                |
  cf:   and    %rsi,%r11                |
  d2:   lock cmpxchg %r11,0x0(%rax) <---+
  d8:   jne    0x00000000000000c8       |
  da:   mov    %rax,%rsi                |
  dd:   mov    %r10,%rax                |
  [...]                                 |
                                        |
After, dst=r0 (%rax) src=r2 (%rsi):     |
                                        |
  [...]                                 |
  da:	mov    %rax,%r10                |
  dd:	mov    0x0(%r10),%rax       <---+ (fixed)
  e1:	mov    %rax,%r11                |
  e4:	and    %rsi,%r11                |
  e7:	lock cmpxchg %r11,0x0(%r10) <---+
  ed:	jne    0x00000000000000dd
  ef:	mov    %rax,%rsi
  f2:	mov    %r10,%rax
  [...]

The remaining combinations were fine as-is though:

After, dst=r9 (%r15) src=r0 (%rax):

  [...]
  dc:	mov    %rax,%r10
  df:	mov    0x0(%r15),%rax
  e3:	mov    %rax,%r11
  e6:	and    %r10,%r11
  e9:	lock cmpxchg %r11,0x0(%r15)
  ef:	jne    0x00000000000000df      _
  f1:	mov    %rax,%r10                | (unneeded, but
  f4:	mov    %r10,%rax               _|  not a problem)
  [...]

After, dst=r9 (%r15) src=r4 (%rcx):

  [...]
  de:	mov    %rax,%r10
  e1:	mov    0x0(%r15),%rax
  e5:	mov    %rax,%r11
  e8:	and    %rcx,%r11
  eb:	lock cmpxchg %r11,0x0(%r15)
  f1:	jne    0x00000000000000e1
  f3:	mov    %rax,%rcx
  f6:	mov    %r10,%rax
  [...]

The case of dst == src register is rejected by the verifier and
therefore not supported, but x86 JIT also handles this case just
fine.

After, dst=r0 (%rax) src=r0 (%rax):

  [...]
  eb:	mov    %rax,%r10
  ee:	mov    0x0(%r10),%rax
  f2:	mov    %rax,%r11
  f5:	and    %r10,%r11
  f8:	lock cmpxchg %r11,0x0(%r10)
  fe:	jne    0x00000000000000ee
 100:	mov    %rax,%r10
 103:	mov    %r10,%rax
  [...]

Fixes: 981f94c3e9 ("bpf: Add bitwise atomic instructions")
Reported-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-09-28 12:10:29 +02:00
Linus Torvalds 9cccec2bf3 x86:
- missing TLB flush
 
 - nested virtualization fixes for SMM (secure boot on nested hypervisor)
   and other nested SVM fixes
 
 - syscall fuzzing fixes
 
 - live migration fix for AMD SEV
 
 - mirror VMs now work for SEV-ES too
 
 - fixes for reset
 
 - possible out-of-bounds access in IOAPIC emulation
 
 - fix enlightened VMCS on Windows 2022
 
 ARM:
 
 - Add missing FORCE target when building the EL2 object
 
 - Fix a PMU probe regression on some platforms
 
 Generic:
 
 - KCSAN fixes
 
 selftests:
 
 - random fixes, mostly for clang compilation
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFN0EwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNqaQf/Vx7ePFTqwWpo+8wKapnc6JN9SLjC
 hM4jipxfc1WyQWcfCt8ZuPhCnhF7o8mG/mrqTm+JB+oGqIsydHW19DiUT8ekv09F
 dQ+XYSiR4B547wUH5XLQc4xG9imwYlXGEOHqrE7eJvGH3LOqVFX2fLRBnFefZbO8
 GKhRJrGXwG3/JSAP6A0c22iVU+pLbfV9gpKwrAj0V7o8nzT2b3Wmh74WBNb47BzE
 a4+AwKpWO4rqJGOwdYwy67pdFHh1YmrlZ59cFZc7fzlXE+o0D0bitaJyioZALpOl
 4mRGdzoYkNB++ZjDzVFnAClCYQV/oNxCNGFaFF2mh/gzXG1TLmN7B8zGDg==
 =7oVh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "A bit late... I got sidetracked by back-from-vacation routines and
  conferences. But most of these patches are already a few weeks old and
  things look more calm on the mailing list than what this pull request
  would suggest.

  x86:

   - missing TLB flush

   - nested virtualization fixes for SMM (secure boot on nested
     hypervisor) and other nested SVM fixes

   - syscall fuzzing fixes

   - live migration fix for AMD SEV

   - mirror VMs now work for SEV-ES too

   - fixes for reset

   - possible out-of-bounds access in IOAPIC emulation

   - fix enlightened VMCS on Windows 2022

  ARM:

   - Add missing FORCE target when building the EL2 object

   - Fix a PMU probe regression on some platforms

  Generic:

   - KCSAN fixes

  selftests:

   - random fixes, mostly for clang compilation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
  selftests: KVM: Explicitly use movq to read xmm registers
  selftests: KVM: Call ucall_init when setting up in rseq_test
  KVM: Remove tlbs_dirty
  KVM: X86: Synchronize the shadow pagetable before link it
  KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
  KVM: x86: nSVM: don't copy virt_ext from vmcb12
  KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
  KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
  KVM: x86: nSVM: restore int_vector in svm_clear_vintr
  kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
  KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
  KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
  KVM: x86: VMX: synthesize invalid VM exit when emulating invalid guest state
  KVM: x86: nSVM: refactor svm_leave_smm and smm_enter_smm
  KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode
  KVM: x86: reset pdptrs_from_userspace when exiting smm
  KVM: x86: nSVM: restore the L1 host state prior to resuming nested guest on SMM exit
  KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
  KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUs
  KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
  ...
2021-09-27 13:58:23 -07:00
Zhenzhong Duan 5c49d1850d KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry,
clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i].
Modifying guest_uret_msrs directly is completely broken as 'i' does not
point at the MSR_IA32_TSX_CTRL entry.  In fact, it's guaranteed to be an
out-of-bounds accesses as is always set to kvm_nr_uret_msrs in a prior
loop. By sheer dumb luck, the fallout is limited to "only" failing to
preserve the host's TSX_CTRL_CPUID_CLEAR.  The out-of-bounds access is
benign as it's guaranteed to clear a bit in a guest MSR value, which are
always zero at vCPU creation on both x86-64 and i386.

Cc: stable@vger.kernel.org
Fixes: 8ea8b8d6f8 ("KVM: VMX: Use common x86's uret MSR list as the one true list")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-27 11:25:40 -04:00
Thomas Gleixner 441e903693 x86/softirq: Disable softirq stacks on PREEMPT_RT
PREEMPT_RT preempts softirqs and the current implementation avoids
do_softirq_own_stack() and only uses __do_softirq().

Disable the unused softirqs stacks on PREEMPT_RT to safe some memory and
ensure that do_softirq_own_stack() is not used which is not expected.

 [ bigeasy: commit description. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210924161245.2357247-1-bigeasy@linutronix.de
2021-09-27 12:28:32 +02:00
Linus Torvalds 5bb7b2107f A set of fixes for X86:
- Prevent sending the wrong signal when protection keys are enabled and
    the kernel handles a fault in the vsyscall emulation.
 
  - Invoke early_reserve_memory() before invoking e820_memory_setup() which
    is required to make the Xen dom0 e820 hooks work correctly.
 
  - Use the correct data type for the SETZ operand in the EMQCMDS
    instruction wrapper.
 
  - Prevent undefined behaviour to the potential unaligned accesss in the
    instroction decoder library.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmFQQaITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaZjD/0TF0mE8QUhI4tyGELdNgwvje5iZ9vg
 Nd9KJpR4hUALHgfUD44NVl9JWawFY2d8FXyIPoAFEcvmy6o4f1w0ia8US3hQWA0Y
 EdLSigXi/eYSstkONaJUEBCxlLbwy7JDzaazA9DeKOEuRc7NWSyZURYvzTAkPK1Y
 mbE9kjKhjFa5NGnSB8HbSF2yEzFsKaTo4nreWP/OkzDjnEMshLR1/FUOUvZmlsgA
 CWjMxAVYFqeJN3QhDgR/vRKPoz1sOjDL1s4AsU+xdy63WyFJZ7Z1b8t6bOBoYh6w
 UztkuOkzZ6pIdzz4O1WGoFx4/FJ74qNx0vO/hOB+cKH6rgJs6AkHAvwlnjI/fE2C
 Y+IsuE4PBXMRpkaayTCsAq/enabwgKsmLSUu916APrhVvuUtb3GJgyhedLE3mEBw
 yZXezzRDhNpYop2yQSRXDeKebpoQgl+zqEP5g1O8pAFnud8FGHnz64eJV7Su7Y7C
 BCac0hmv+drlqb/jOSYqjsfo6QfhvR60WwDIgTplOMMLa3plEJFx/rIuU2xVg5g9
 w0m2QUsZboyT2yBnl8gRrqrcQmv2t4iX6TAj9Wm23Lx41h94JQMRtZyJT9bcNqY9
 jMJu27BcNSveciZA7W2DVUlFf/gTF3bwpF7ZDWRt/VSrHPtkI9WKlERhQaywo1L0
 rF8SGCEuNU2ktw==
 =h7v1
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for X86:

   - Prevent sending the wrong signal when protection keys are enabled
     and the kernel handles a fault in the vsyscall emulation.

   - Invoke early_reserve_memory() before invoking e820_memory_setup()
     which is required to make the Xen dom0 e820 hooks work correctly.

   - Use the correct data type for the SETZ operand in the EMQCMDS
     instruction wrapper.

   - Prevent undefined behaviour to the potential unaligned accesss in
     the instruction decoder library"

* tag 'x86-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/insn, tools/x86: Fix undefined behavior due to potential unaligned accesses
  x86/asm: Fix SETZ size enqcmds() build failure
  x86/setup: Call early_reserve_memory() earlier
  x86/fault: Fix wrong signal when vsyscall fails with pkey
2021-09-26 10:09:20 -07:00
Linus Torvalds 5739844347 xen: branch for v5.15-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYU8xegAKCRCAXGG7T9hj
 vvk2APwM85dFSMNHQo5+S35X+h+M8uKuqqvPYxVtKqQEwu3LXAEAg0cVgr1lWegI
 X98f07/+M0rPJl24kgW/EIxAD9fsWAw=
 =JILz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.15b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "Some minor cleanups and fixes of some theoretical bugs, as well as a
  fix of a bug introduced in 5.15-rc1"

* tag 'for-linus-5.15b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/x86: fix PV trap handling on secondary processors
  xen/balloon: fix balloon kthread freezing
  swiotlb-xen: this is PV-only on x86
  xen/pci-swiotlb: reduce visibility of symbols
  PCI: only build xen-pcifront in PV-enabled environments
  swiotlb-xen: ensure to issue well-formed XENMEM_exchange requests
  Xen/gntdev: don't ignore kernel unmapping error
  xen/x86: drop redundant zeroing from cpu_initialize_context()
2021-09-25 15:37:31 -07:00
Kees Cook cfecea6ead lib/string: Move helper functions out of string.c
The core functions of string.c are those that may be implemented by
per-architecture functions, or overloaded by FORTIFY_SOURCE. As a
result, it needs to be built with __NO_FORTIFY. Without this, macros
will collide with function declarations. This was accidentally working
due to -ffreestanding (on some architectures). Make this deterministic
by explicitly setting __NO_FORTIFY and move all the helper functions
into string_helpers.c so that they gain the fortification coverage they
had been missing.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Andy Lavr <andy.lavr@gmail.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-09-25 08:20:49 -07:00
Borislav Petkov f3f07ae425 x86/umip: Downgrade warning messages to debug loglevel
After four years in the wild, those have not fullfilled their
initial purpose of pushing people to fix their software to not use
UMIP-emulated instructions, and to warn users about the degraded
emulation performance.

Yet, the only thing that "degrades" performance is overflowing dmesg
with those:

  [Di Sep  7 00:24:05 2021] umip_printk: 1345 callbacks suppressed
  [Di Sep  7 00:24:05 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b7c0: SIDT instruction cannot be used by applications.
  [Di Sep  7 00:24:05 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b7c0: For now, expensive software emulation returns the result.
  ...
  [Di Sep  7 00:26:06 2021] umip_printk: 2227 callbacks suppressed
  [Di Sep  7 00:26:06 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b940: SIDT instruction cannot be used by applications.

and users don't really care about that - they just want to play their
games in wine.

So convert those to debug loglevel - in case someone is still interested
in them, someone can boot with "debug" on the kernel cmdline.

Reported-by: Marcus Rückert <mrueckert@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20210907200454.30458-1-bp@alien8.de
2021-09-25 13:23:28 +02:00
Numfor Mbiziwo-Tiapo 5ba1071f75 x86/insn, tools/x86: Fix undefined behavior due to potential unaligned accesses
Don't perform unaligned loads in __get_next() and __peek_nbyte_next() as
these are forms of undefined behavior:

"A pointer to an object or incomplete type may be converted to a pointer
to a different object or incomplete type. If the resulting pointer
is not correctly aligned for the pointed-to type, the behavior is
undefined."

(from http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)

These problems were identified using the undefined behavior sanitizer
(ubsan) with the tools version of the code and perf test.

 [ bp: Massage commit message. ]

Signed-off-by: Numfor Mbiziwo-Tiapo <nums@google.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210923161843.751834-1-irogers@google.com
2021-09-24 12:37:38 +02:00
Josh Poimboeuf 0e14ef3866 crypto: x86/sm4 - Fix frame pointer stack corruption
sm4_aesni_avx_crypt8() sets up the frame pointer (which includes pushing
RBP) before doing a conditional sibling call to sm4_aesni_avx_crypt4(),
which sets up an additional frame pointer.  Things will not go well when
sm4_aesni_avx_crypt4() pops only the innermost single frame pointer and
then tries to return to the outermost frame pointer.

Sibling calls need to occur with an empty stack frame.  Do the
conditional sibling call *before* setting up the stack pointer.

This fixes the following warning:

  arch/x86/crypto/sm4-aesni-avx-asm_64.o: warning: objtool: sm4_aesni_avx_crypt8()+0x8: sibling call from callable instruction with modified stack frame

Fixes: a7ee22ee14 ("crypto: x86/sm4 - add AES-NI/AVX/x86_64 implementation")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-09-24 15:58:50 +08:00
Jakub Kicinski 2fcd14d0f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/mptcp/protocol.c
  977d293e23 ("mptcp: ensure tx skbs always have the MPTCP ext")
  efe686ffce ("mptcp: ensure tx skbs always have the MPTCP ext")

same patch merged in both trees, keep net-next.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-23 11:19:49 -07:00
Lai Jiangshan 65855ed8b0 KVM: X86: Synchronize the shadow pagetable before link it
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM.  So the host must synchronze sp before
link it.  Otherwise the guest might use a wrong mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb.  The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.

Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 11:01:00 -04:00
Lai Jiangshan f81602958c KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without flushing tlb remotely after kvm_sync_page().  If @gfn
was writable before and it's rmaps was deleted in kvm_sync_page(),
and if the tlb entry is still in a remote running VCPU,  the @gfn
is not safely protected.

To fix the problem, kvm_sync_page() does the remote flush when
needed to avoid the problem.

Fixes: a4ee1ca4a3 ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:28:44 -04:00
Maxim Levitsky faf6b75562 KVM: x86: nSVM: don't copy virt_ext from vmcb12
These field correspond to features that we don't expose yet to L2

While currently there are no CVE worthy features in this field,
if AMD adds more features to this field, that could allow guest
escapes similar to CVE-2021-3653 and CVE-2021-3656.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-6-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:06:46 -04:00
Maxim Levitsky d1cba6c922 KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
GP SVM errata workaround made the #GP handler always emulate
the SVM instructions.

However these instructions #GP in case the operand is not 4K aligned,
but the workaround code didn't check this and we ended up
emulating these instructions anyway.

This is only an emulation accuracy check bug as there is no harm for
KVM to read/write unaligned vmcb images.

Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:05:29 -04:00
Maxim Levitsky aee77e1169 KVM: x86: nSVM: restore int_vector in svm_clear_vintr
In svm_clear_vintr we try to restore the virtual interrupt
injection that might be pending, but we fail to restore
the interrupt vector.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:04:40 -04:00
Borislav Petkov 15802468a9 x86/mce: Sort mca_config members to get rid of unnecessary padding
$ pahole -C mca_config arch/x86/kernel/cpu/mce/core.o

before:

   /* size: 40, cachelines: 1, members: 16 */
   /* sum members: 21, holes: 1, sum holes: 3 */
   /* sum bitfield members: 64 bits, bit holes: 2, sum bit holes: 32 bits */
   /* padding: 4 */
   /* last cacheline: 40 bytes */

after:

   /* size: 32, cachelines: 1, members: 16 */
   /* padding: 3 */
   /* last cacheline: 32 bytes */

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-6-bp@alien8.de
2021-09-23 11:38:04 +02:00
Borislav Petkov cc466666ab x86/mce: Get rid of the ->quirk_no_way_out() indirect call
Use a flag setting to call the only quirk function for that.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-5-bp@alien8.de
2021-09-23 11:34:49 +02:00
Borislav Petkov 8121b8f947 x86/mce: Get rid of msr_ops
Avoid having indirect calls and use a normal function which returns the
proper MSR address based on ->smca setting.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-4-bp@alien8.de
2021-09-23 11:20:20 +02:00
Borislav Petkov cbe1de162d x86/mce: Get rid of machine_check_vector
Get rid of the indirect function pointer and use flags settings instead
to steer execution.

Now that it is not an indirect call any longer, drop the instrumentation
annotation for objtool too.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-3-bp@alien8.de
2021-09-23 11:15:49 +02:00
Borislav Petkov 631adc7b0b x86/mce: Get rid of the mce_severity function pointer
Turn it into a normal function which calls an AMD- or Intel-specific
variant depending on the CPU it runs on.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-2-bp@alien8.de
2021-09-23 11:03:51 +02:00
Anders Roxell 724fc0248d x86/fpu/signal: Fix missed conversion to correct boolean retval in save_xstate_epilog()
Fix the missing return code polarity in save_xstate_epilog().

 [ bp: Massage, use the right commit in the Fixes: tag ]

Fixes: 2af07f3a6e ("x86/fpu/signal: Change return type of copy_fpregs_to_sigframe() helpers to boolean")
Reported-by: Remi Duraffort <remi.duraffort@linaro.org>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1461
Link: https://lkml.kernel.org/r/20210922200901.1823741-1-anders.roxell@linaro.org
2021-09-23 10:52:20 +02:00
Kees Cook d81ff5fe14 x86/asm: Fix SETZ size enqcmds() build failure
When building under GCC 4.9 and 5.5:

  arch/x86/include/asm/special_insns.h: Assembler messages:
  arch/x86/include/asm/special_insns.h:286: Error: operand size mismatch for `setz'

Change the type to "bool" for condition code arguments, as documented.

Fixes: 7f5933f81b ("x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction")
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910223332.3224851-1-keescook@chromium.org
2021-09-22 19:45:48 +02:00
Fares Mehanna e1fc1553cd kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a
consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.

We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.

Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 11:30:14 -04:00
Maxim Levitsky dbab610a5b KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
If L1 had invalid state on VM entry (can happen on SMM transactions
when we enter from real mode, straight to nested guest),

then after we load 'host' state from VMCS12, the state has to become
valid again, but since we load the segment registers with
__vmx_set_segment we weren't always updating emulation_required.

Update emulation_required explicitly at end of load_vmcs12_host_state.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Maxim Levitsky c8607e4a08 KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
It is possible that when non root mode is entered via special entry
(!from_vmentry), that is from SMM or from loading the nested state,
the L2 state could be invalid in regard to non unrestricted guest mode,
but later it can become valid.

(for example when RSM emulation restores segment registers from SMRAM)

Thus delay the check to VM entry, where we will check this and fail.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Maxim Levitsky c42dec148b KVM: x86: VMX: synthesize invalid VM exit when emulating invalid guest state
Since no actual VM entry happened, the VM exit information is stale.
To avoid this, synthesize an invalid VM guest state VM exit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:49 -04:00
Maxim Levitsky 136a55c054 KVM: x86: nSVM: refactor svm_leave_smm and smm_enter_smm
Use return statements instead of nested if, and fix error
path to free all the maps that were allocated.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:43 -04:00
Maxim Levitsky e85d3e7b49 KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode
Currently the KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads PDPTRs,
and MSR bitmap, with former not really needed for SMM as SMM exit code
reloads them again from SMRAM'S CR3, and later happens to work
since MSR bitmap isn't modified while in SMM.

Still it is better to be consistient with VMX.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:17 -04:00
Maxim Levitsky 37687c403a KVM: x86: reset pdptrs_from_userspace when exiting smm
When exiting SMM, pdpts are loaded again from the guest memory.

This fixes a theoretical bug, when exit from SMM triggers entry to the
nested guest which re-uses some of the migration
code which uses this flag as a workaround for a legacy userspace.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Maxim Levitsky e2e6e449d6 KVM: x86: nSVM: restore the L1 host state prior to resuming nested guest on SMM exit
Otherwise guest entry code might see incorrect L1 state (e.g paging state).

Fixes: 37be407b2c ("KVM: nSVM: Fix L1 state corruption upon return from SMM")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Vitaly Kuznetsov 8d68bad6d8 KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when
enlightened VMCS is advertised. Debugging revealed there are two exposed
secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and
SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported,
as there are no corresponding fields in eVMCSv1 (see the comment above
EVMCS1_UNSUPPORTED_2NDEXEC definition).

Previously, commit 31de3d2500 ("x86/kvm/hyper-v: move VMX controls
sanitization out of nested_enable_evmcs()") introduced the required
filtering mechanism for VMX MSRs but for some reason put only known
to be problematic (and not full EVMCS1_UNSUPPORTED_* lists) controls
there.

Note, Windows Server 2022 seems to have gained some sanity check for VMX
MSRs: it doesn't even try to launch a guest when there's something it
doesn't like, nested_evmcs_check_controls() mechanism can't catch the
problem.

Let's be bold this time and instead of playing whack-a-mole just filter out
all unsupported controls from VMX MSRs.

Fixes: 31de3d2500 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210907163530.110066-1-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:15 -04:00
Vitaly Kuznetsov 2f9b68f57c KVM: x86: Fix stack-out-of-bounds memory access from ioapic_write_indirect()
KASAN reports the following issue:

 BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
 Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798

 CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G               X --------- ---
 Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
 Call Trace:
  dump_stack+0xa5/0xe6
  print_address_description.constprop.0+0x18/0x130
  ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
  __kasan_report.cold+0x7f/0x114
  ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
  kasan_report+0x38/0x50
  kasan_check_range+0xf5/0x1d0
  kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
  kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
  ? kvm_arch_exit+0x110/0x110 [kvm]
  ? sched_clock+0x5/0x10
  ioapic_write_indirect+0x59f/0x9e0 [kvm]
  ? static_obj+0xc0/0xc0
  ? __lock_acquired+0x1d2/0x8c0
  ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]

The problem appears to be that 'vcpu_bitmap' is allocated as a single long
on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
the lower 16 bits of it with bitmap_zero() for no particular reason (my
guess would be that 'bitmap' and 'vcpu_bitmap' variables in
kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
16-bit long, the later should accommodate all possible vCPUs).

Fixes: 7ee30bc132 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Fixes: 9a2ae9f6b6 ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210827092516.1027264-7-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:14 -04:00
Peter Gonda 5b92b6ca92 KVM: SEV: Allow some commands for mirror VM
A mirrored SEV-ES VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
setup its vCPUs and have them measured, and their VMSAs encrypted. Without
this change, it is impossible to have mirror VMs as part of SEV-ES VMs.

Also allow the guest status check and debugging commands since they do
not change any guest state.

Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Nathan Tempelman <natet@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-3-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:13 -04:00
Peter Gonda f43c887cb7 KVM: SEV: Update svm_vm_copy_asid_from for SEV-ES
For mirroring SEV-ES the mirror VM will need more then just the ASID.
The FD and the handle are required to all the mirror to call psp
commands. The mirror VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
setup its vCPUs' VMSAs for SEV-ES.

Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Nathan Tempelman <natet@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-2-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:13 -04:00
Chenyi Qiang 24a996ade3 KVM: nVMX: Fix nested bus lock VM exit
Nested bus lock VM exits are not supported yet. If L2 triggers bus lock
VM exit, it will be directed to L1 VMM, which would cause unexpected
behavior. Therefore, handle L2's bus lock VM exits in L0 directly.

Fixes: fe6b6bc802 ("KVM: VMX: Enable bus lock VM exit")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210914095041.29764-1-chenyi.qiang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:12 -04:00
Sean Christopherson 94c245a245 KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry
Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU works,
but it's a rather odd and suboptimal method to check the index of a given
vCPU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:12 -04:00
Sean Christopherson 4eeef24241 KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor
Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper.  The wrapper is a
remnant of the original implementation and serves no purpose; remove it
before it gains more users.

Back when kvm_vcpu_get_idx() was added by commit 497d72d80a ("KVM: Add
kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation
was more than just a simple wrapper as vcpu->vcpu_idx did not exist and
retrieving the index meant walking over the vCPU array to find the given
vCPU.

When vcpu_idx was introduced by commit 8750e72a79 ("KVM: remember
position in kvm->vcpus array"), the helper was left behind, likely to
avoid extra thrash (but even then there were only two users, the original
arm usage having been removed at some point in the past).

No functional change intended.

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:11 -04:00
Hou Wenlong e9337c843c kvm: fix wrong exception emulation in check_rdtsc
According to Intel's SDM Vol2 and AMD's APM Vol3, when
CR4.TSD is set, use rdtsc/rdtscp instruction above privilege
level 0 should trigger a #GP.

Fixes: d7eb820306 ("KVM: SVM: Add intercept checks for remaining group7 instructions")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <1297c0dd3f1bb47a6d089f850b629c7aa0247040.1629257115.git.houwenlong93@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:11 -04:00
Sean Christopherson 50c038018d KVM: SEV: Pin guest memory for write for RECEIVE_UPDATE_DATA
Require the target guest page to be writable when pinning memory for
RECEIVE_UPDATE_DATA.  Per the SEV API, the PSP writes to guest memory:

  The result is then encrypted with GCTX.VEK and written to the memory
  pointed to by GUEST_PADDR field.

Fixes: 15fb7de1a7 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210914210951.2994260-2-seanjc@google.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:11 -04:00
Mingwei Zhang f1815e0aa7 KVM: SVM: fix missing sev_decommission in sev_receive_start
DECOMMISSION the current SEV context if binding an ASID fails after
RECEIVE_START.  Per AMD's SEV API, RECEIVE_START generates a new guest
context and thus needs to be paired with DECOMMISSION:

     The RECEIVE_START command is the only command other than the LAUNCH_START
     command that generates a new guest context and guest handle.

The missing DECOMMISSION can result in subsequent SEV launch failures,
as the firmware leaks memory and might not able to allocate more SEV
guest contexts in the future.

Note, LAUNCH_START suffered the same bug, but was previously fixed by
commit 934002cd66 ("KVM: SVM: Call SEV Guest Decommission if ASID
binding fails").

Cc: Alper Gun <alpergun@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: David Rienjes <rientjes@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: John Allen <john.allen@amd.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vipin Sharma <vipinsh@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Marc Orr <marcorr@google.com>
Acked-by: Brijesh Singh <brijesh.singh@amd.com>
Fixes: af43cbbf95 ("KVM: SVM: Add support for KVM_SEV_RECEIVE_START command")
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210912181815.3899316-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:10 -04:00
Peter Gonda bb18a67774 KVM: SEV: Acquire vcpu mutex when updating VMSA
The update-VMSA ioctl touches data stored in struct kvm_vcpu, and
therefore should not be performed concurrently with any VCPU ioctl
that might cause KVM or the processor to use the same data.

Adds vcpu mutex guard to the VMSA updating code. Refactors out
__sev_launch_update_vmsa() function to deal with per vCPU parts
of sev_launch_update_vmsa().

Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <20210915171755.3773766-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:10 -04:00
Yu Zhang ed7023a11b KVM: nVMX: fix comments of handle_vmon()
"VMXON pointer" is saved in vmx->nested.vmxon_ptr since
commit 3573e22cfe ("KVM: nVMX: additional checks on
vmxon region"). Also, handle_vmptrld() & handle_vmclear()
now have logic to check the VMCS pointer against the VMXON
pointer.

So just remove the obsolete comments of handle_vmon().

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210908171731.18885-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Haimin Zhang eb7511bf91 KVM: x86: Handle SRCU initialization failure during page track init
Check the return of init_srcu_struct(), which can fail due to OOM, when
initializing the page track mechanism.  Lack of checking leads to a NULL
pointer deref found by a modified syzkaller.

Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
[Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Sean Christopherson cd36ae8761 KVM: VMX: Remove defunct "nr_active_uret_msrs" field
Remove vcpu_vmx.nr_active_uret_msrs and its associated comment, which are
both defunct now that KVM keeps the list constant and instead explicitly
tracks which entries need to be loaded into hardware.

No functional change intended.

Fixes: ee9d22e08d ("KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210908002401.1947049-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Sean Christopherson 03a6e84069 KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
RESET, this is a nop as vcpu is zero-allocated.  For INIT, the bug has
likely escaped notice because no firmware/kernel puts its page tables root
at PA=0, let alone relies on INIT to get the desired CR3 for such page
tables.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:08 -04:00
Sean Christopherson 7117003fe4 KVM: x86: Mark all registers as avail/dirty at vCPU creation
Mark all registers as available and dirty at vCPU creation, as the vCPU has
obviously not been loaded into hardware, let alone been given the chance to
be modified in hardware.  On SVM, reading from "uninitialized" hardware is
a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
hardware does not allow for arbitrary field encoding schemes.

On VMX, backing memory for VMCSes is also zero allocated, but true
initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
architectural specification technically allows CPU implementations to
encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
the inverted value of every field, which would result in VMREAD to a
zero-allocated field returns all ones.

In practice, only the AR_BYTES fields are known to be manipulated by
hardware during VMREAD/VMREAD; no known hardware or VMM (for nested VMX)
does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
other words, this is technically a bug fix, but practically speakings it's
a glorified nop.

Failure to mark registers as available has been a lurking bug for quite
some time.  The original register caching supported only GPRs (+RIP, which
is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
worked because the two cacheable registers, RIP and RSP, are generally
speaking not read as side effects in other flows.

Arguably, commit aff48baa34 ("KVM: Fetch guest cr3 from hardware on
demand") was the first instance of failure to mark regs available.  While
_just_ marking CR3 available during vCPU creation wouldn't have fixed the
VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
reading GUEST_CR3 when it's available would have avoided VMREAD to a
technically-uninitialized VMCS.

Fixes: aff48baa34 ("KVM: Fetch guest cr3 from hardware on demand")
Fixes: 6de4f3ada4 ("KVM: Cache pdptrs")
Fixes: 6de12732c4 ("KVM: VMX: Optimize vmx_get_rflags()")
Fixes: 2fb92db1ec ("KVM: VMX: Cache vmcs segment fields")
Fixes: bd31fe495d ("KVM: VMX: Add proper cache tracking for CR0")
Fixes: f98c1e7712 ("KVM: VMX: Add proper cache tracking for CR4")
Fixes: 5addc23519 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
Fixes: 8791585837 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210921000303.400537-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:07 -04:00
Peter Zijlstra 7fae4c24a2 x86: Increase exception stack sizes
It turns out that a single page of stack is trivial to overflow with
all the tracing gunk enabled. Raise the exception stacks to 2 pages,
which is still half the interrupt stacks, which are at 4 pages.

Reported-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YUIO9Ye98S5Eb68w@hirez.programming.kicks-ass.net
2021-09-21 13:57:43 +02:00
Peter Zijlstra 44b979fa30 x86/mm/64: Improve stack overflow warnings
Current code has an explicit check for hitting the task stack guard;
but overflowing any of the other stacks will get you a non-descript
general #DF warning.

Improve matters by using get_stack_info_noinstr() to detetrmine if and
which stack guard page got hit, enabling a better stack warning.

In specific, Michael Wang reported what turned out to be an NMI
exception stack overflow, which is now clearly reported as such:

  [] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)

Reported-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Wang <yun.wang@linux.alibaba.com>
Link: https://lkml.kernel.org/r/YUTE/NuqnaWbST8n@hirez.programming.kicks-ass.net
2021-09-21 13:57:43 +02:00
Peter Zijlstra b968e84b50 x86/iopl: Fake iopl(3) CLI/STI usage
Since commit c8137ace56 ("x86/iopl: Restrict iopl() permission
scope") it's possible to emulate iopl(3) using ioperm(), except for
the CLI/STI usage.

Userspace CLI/STI usage is very dubious (read broken), since any
exception taken during that window can lead to rescheduling anyway (or
worse). The IOPL(2) manpage even states that usage of CLI/STI is highly
discouraged and might even crash the system.

Of course, that won't stop people and HP has the dubious honour of
being the first vendor to be found using this in their hp-health
package.

In order to enable this 'software' to still 'work', have the #GP treat
the CLI/STI instructions as NOPs when iopl(3). Warn the user that
their program is doing dubious things.

Fixes: a24ca99768 ("x86/iopl: Remove legacy IOPL option")
Reported-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org # v5.5+
Link: https://lkml.kernel.org/r/20210918090641.GD5106@worktop.programming.kicks-ass.net
2021-09-21 13:52:18 +02:00
Juergen Gross 8aa83e6395 x86/setup: Call early_reserve_memory() earlier
Commit in Fixes introduced early_reserve_memory() to do all needed
initial memblock_reserve() calls in one function. Unfortunately, the call
of early_reserve_memory() is done too late for Xen dom0, as in some
cases a Xen hook called by e820__memory_setup() will need those memory
reservations to have happened already.

Move the call of early_reserve_memory() before the call of
e820__memory_setup() in order to avoid such problems.

Fixes: a799c2bd29 ("x86/setup: Consolidate early memory reservations")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com
2021-09-21 09:52:08 +02:00
Jan Beulich 0594c58161 xen/x86: fix PV trap handling on secondary processors
The initial observation was that in PV mode under Xen 32-bit user space
didn't work anymore. Attempts of system calls ended in #GP(0x402). All
of the sudden the vector 0x80 handler was not in place anymore. As it
turns out up to 5.13 redundant initialization did occur: Once from
cpu_initialize_context() (through its VCPUOP_initialise hypercall) and a
2nd time while each CPU was brought fully up. This 2nd initialization is
now gone, uncovering that the 1st one was flawed: Unlike for the
set_trap_table hypercall, a full virtual IDT needs to be specified here;
the "vector" fields of the individual entries are of no interest. With
many (kernel) IDT entries still(?) (i.e. at that point at least) empty,
the syscall vector 0x80 ended up in slot 0x20 of the virtual IDT, thus
becoming the domain's handler for vector 0x20.

Make xen_convert_trap_info() fit for either purpose, leveraging the fact
that on the xen_copy_trap_info() path the table starts out zero-filled.
This includes moving out the writing of the sentinel, which would also
have lead to a buffer overrun in the xen_copy_trap_info() case if all
(kernel) IDT entries were populated. Convert the writing of the sentinel
to clearing of the entire table entry rather than just the address
field.

(I didn't bother trying to identify the commit which uncovered the issue
in 5.14; the commit named below is the one which actually introduced the
bad code.)

Fixes: f87e4cac4f ("xen: SMP guest support")
Cc: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/7a266932-092e-b68f-f2bb-1473b61adc6e@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-21 09:40:30 +02:00
Jiashuo Liang d4ffd5df9d x86/fault: Fix wrong signal when vsyscall fails with pkey
The function __bad_area_nosemaphore() calls kernelmode_fixup_or_oops()
with the parameter @signal being actually @pkey, which will send a
signal numbered with the argument in @pkey.

This bug can be triggered when the kernel fails to access user-given
memory pages that are protected by a pkey, so it can go down the
do_user_addr_fault() path and pass the !user_mode() check in
__bad_area_nosemaphore().

Most cases will simply run the kernel fixup code to make an -EFAULT. But
when another condition current->thread.sig_on_uaccess_err is met, which
is only used to emulate vsyscall, the kernel will generate the wrong
signal.

Add a new parameter @pkey to kernelmode_fixup_or_oops() to fix this.

 [ bp: Massage commit message, fix build error as reported by the 0day
   bot: https://lkml.kernel.org/r/202109202245.APvuT8BX-lkp@intel.com ]

Fixes: 5042d40a26 ("x86/fault: Bypass no_context() for implicit kernel faults from usermode")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jiashuo Liang <liangjs@pku.edu.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20210730030152.249106-1-liangjs@pku.edu.cn
2021-09-20 22:28:47 +02:00
Tony Luck 690658471b x86/mce: Drop copyin special case for #MC
Fixes to the iterator code to handle faults that are not on page
boundaries mean that the special case for machine check during copy from
user is no longer needed.

For a full list of those fixes, see the output of:

  git log --oneline v5.14 ^v5.13 -- lib/iov_iter.c

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-4-tony.luck@intel.com
2021-09-20 21:18:23 +02:00
Jan Beulich 794d5b8a49 swiotlb-xen: this is PV-only on x86
The code is unreachable for HVM or PVH, and it also makes little sense
in auto-translated environments. On Arm, with
xen_{create,destroy}_contiguous_region() both being stubs, I have a hard
time seeing what good the Xen specific variant does - the generic one
ought to be fine for all purposes there. Still Arm code explicitly
references symbols here, so the code will continue to be included there.

Instead of making PCI_XEN's "select" conditional, simply drop it -
SWIOTLB_XEN will be available unconditionally in the PV case anyway, and
is - as explained above - dead code in non-PV environments.

This in turn allows dropping the stubs for
xen_{create,destroy}_contiguous_region(), the former of which was broken
anyway - it failed to set the DMA handle output.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

Link: https://lore.kernel.org/r/5947b8ae-fdc7-225c-4838-84712265fc1e@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-20 17:01:21 +02:00
Jan Beulich 8e1034a526 xen/pci-swiotlb: reduce visibility of symbols
xen_swiotlb and pci_xen_swiotlb_init() are only used within the file
defining them, so make them static and remove the stubs. Otoh
pci_xen_swiotlb_detect() has a use (as function pointer) from the main
pci-swiotlb.c file - convert its stub to a #define to NULL.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Link: https://lore.kernel.org/r/aef5fc33-9c02-4df0-906a-5c813142e13c@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-20 17:01:19 +02:00
Jan Beulich 4403f8062a xen/x86: drop redundant zeroing from cpu_initialize_context()
Just after having obtained the pointer from kzalloc() there's no reason
at all to set part of the area to all zero yet another time. Similarly
there's no point explicitly clearing "ldt_ents".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky@oracle.com>

Link: https://lore.kernel.org/r/14881835-a48e-29fa-0870-e177b10fcf65@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-20 16:55:54 +02:00
Tony Luck a6e3cf70b7 x86/mce: Change to not send SIGBUS error during copy from user
Sending a SIGBUS for a copy from user is not the correct semantic.
System calls should return -EFAULT (or a short count for write(2)).

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
2021-09-20 10:55:41 +02:00
Linus Torvalds 20621d2f27 A set of x86 fixes:
- Prevent a infinite loop in the MCE recovery on return to user space,
     which was caused by a second MCE queueing work for the same page and
     thereby creating a circular work list.
 
   - Make kern_addr_valid() handle existing PMD entries, which are marked not
     present in the higher level page table, correctly instead of blindly
     dereferencing them.
 
   - Pass a valid address to sanitize_phys(). This was caused by the mixture
     of inclusive and exclusive ranges. memtype_reserve() expect 'end' being
     exclusive, but sanitize_phys() wants it inclusive. This worked so far,
     but with end being the end of the physical address space the fail is
     exposed.
 
  - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs exceed
    the previous maximum.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFHhPIACgkQEsHwGGHe
 VUqqQA/+MHQ2HxVOPxnJ0i/D1nK8ccNqTEkSN08z23RGnjqKQun/VaNIIceJY25f
 Abeb2tI+0qRrdWVPVd5YqcTHuBLmnPs6Je3MfOrG47eQNW4/SmkXYuOexK80Bew3
 YDgEV73d40rHcolXZCaonVajx+FmjoNvkDt5LpLvLcCxIyv0GClFBcZrFAm70AxI
 Feax30koh3/MIFxHoXyADN8D+MJu1GxA6QWuoTK40s3G/gTTAwimkDgnNU1JXbcj
 VvVVZaNnnAxjxrCa81blr9nDpHJCDinG9bdvDT3UDLous52hGMZTsHoHogxwfogT
 EhIgPvL8hf+wm1WXA4NyvSNKZxsGfdkvIXaUq9XYHpLRD6Ao6x7jQDL039imucqb
 9YtaH52GhG0SgJlYjkm/zrKezIjKLDen0ZYr/2iNTDM1p2GqQEFo07wC/ME8TkQ6
 /BvtbkIvOuUz3nJeV4/AO+O4kaNvto9O2eHq9oodIN9nrwmlO5fMg8XO9nrhWB11
 ChXEz6kPqta1nyZXy0mwOrlXlqzcusiroG4G9F7IBBz+t/gNwlu3uZuIgkQCXyYw
 DgKz9cnQ3RdgCFknbmEwV5oCjewm7UdcgwaDAaelHIDuWMcshZFvMf1uSjnyg4Z/
 39WI8W7W2aZnIoKWpvu8s7Gr8f1krE7C3xrkvl2WmbKPkxNAin8=
 =7cq3
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Prevent a infinite loop in the MCE recovery on return to user space,
   which was caused by a second MCE queueing work for the same page and
   thereby creating a circular work list.

 - Make kern_addr_valid() handle existing PMD entries, which are marked
   not present in the higher level page table, correctly instead of
   blindly dereferencing them.

 - Pass a valid address to sanitize_phys(). This was caused by the
   mixture of inclusive and exclusive ranges. memtype_reserve() expect
   'end' being exclusive, but sanitize_phys() wants it inclusive. This
   worked so far, but with end being the end of the physical address
   space the fail is exposed.

 - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs
   exceed the previous maximum.

* tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Avoid infinite loop for copy from user recovery
  x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
  x86/platform: Increase maximum GPIO number for X86_64
  x86/pat: Pass valid address to sanitize_phys()
2021-09-19 13:29:36 -07:00
Linus Torvalds 2f629969b0 Kbuild fixes for v5.15
- Fix bugs in checkkconfigsymbols.py
 
  - Fix missing sys import in gen_compile_commands.py
 
  - Fix missing FORCE warning for ARCH=sh builds
 
  - Fix -Wignored-optimization-argument warnings for Clang builds
 
  - Turn -Wignored-optimization-argument into an error in order to stop
    building instead of sprinkling warnings
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmFG8Q8VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGcgsP/RjnBZMNw7hXOzDOB64SrgyrmKkA
 tjfZn6uFI8RA4s/IwxQG0sPuM3Fg7YkGFNvfwwIq+c7aEhlW82NmrdxN4/0FBL4c
 4P97UUubEvQSL8o1mx5NwCdSOmatxlCQJ/lx43UrUVfxPmjug+fs5T3HzEMeWf6N
 CEoYtSq/dle6/8UJeSv7GVQGm3icvTzp8FsmNlDQTvJqNX5VgV6pXV8C3LcL+3/F
 BUI08lZz7DzPTsR0zTa5pTnCG0KRhGHIEB5m3KS79tETheOBHWHvDlHf4TSeHlSE
 9kyBDJHCuRuhfooGLXZv6CdKwsNFtLjl+xyyZLX+mawIpKvjwN+aPworfxpOfhxp
 ap6i5NaqWHsFuWPGyIpyvZytb4dljGv/c5c7tNtaTVCvz67jmUlFckPZ8wEFHr/P
 QnMvtKX7L7wwjwsqE5309UeCM2Zpn3PXrwABHJFQc5C15bp5fP1/PY0JCPkd7D5Z
 jqlzo1a8kcptSr6KHlS/k8nARoWyjhMkFMsY8587tv1d1L1zXaaqnhzz8C4v38AB
 N9Ctv6RiKhFGAIlw/AnBpywGf1EtycjbRdlmxFntw/TverQViej2AVDl1AjZhwwB
 aSh/VgJQ8Xf8lFoStl25k/oP5kjKUGOWZAfqvNQx68L0ZyoKH/okQhHj76uD/CVQ
 2pY+1ioaZLsZ4Kdj
 =EBGX
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix bugs in checkkconfigsymbols.py

 - Fix missing sys import in gen_compile_commands.py

 - Fix missing FORCE warning for ARCH=sh builds

 - Fix -Wignored-optimization-argument warnings for Clang builds

 - Turn -Wignored-optimization-argument into an error in order to stop
   building instead of sprinkling warnings

* tag 'kbuild-fixes-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: Add -Werror=ignored-optimization-argument to CLANG_FLAGS
  x86/build: Do not add -falign flags unconditionally for clang
  kbuild: Fix comment typo in scripts/Makefile.modpost
  sh: Add missing FORCE prerequisites in Makefile
  gen_compile_commands: fix missing 'sys' package
  checkkconfigsymbols.py: Remove skipping of help lines in parse_kconfig_file
  checkkconfigsymbols.py: Forbid passing 'HEAD' to --commit
2021-09-19 12:55:12 -07:00
Nathan Chancellor 7fa6a27466 x86/build: Do not add -falign flags unconditionally for clang
clang does not support -falign-jumps and only recently gained support
for -falign-loops. When one of the configuration options that adds these
flags is enabled, clang warns and all cc-{disable-warning,option} that
follow fail because -Werror gets added to test for the presence of this
warning:

clang-14: warning: optimization flag '-falign-jumps=0' is not supported
[-Wignored-optimization-argument]

To resolve this, add a couple of cc-option calls when building with
clang; gcc has supported these options since 3.2 so there is no point in
testing for their support. -falign-functions was implemented in clang-7,
-falign-loops was implemented in clang-14, and -falign-jumps has not
been implemented yet.

Link: https://lore.kernel.org/r/YSQE2f5teuvKLkON@Ryzen-9-3900X.localdomain/
Link: https://lore.kernel.org/r/20210824022640.2170859-2-nathan@kernel.org/
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-09-19 10:35:53 +09:00
Jakub Kicinski af54faab84 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-09-17

We've added 63 non-merge commits during the last 12 day(s) which contain
a total of 65 files changed, 2653 insertions(+), 751 deletions(-).

The main changes are:

1) Streamline internal BPF program sections handling and
   bpf_program__set_attach_target() in libbpf, from Andrii.

2) Add support for new btf kind BTF_KIND_TAG, from Yonghong.

3) Introduce bpf_get_branch_snapshot() to capture LBR, from Song.

4) IMUL optimization for x86-64 JIT, from Jie.

5) xsk selftest improvements, from Magnus.

6) Introduce legacy kprobe events support in libbpf, from Rafael.

7) Access hw timestamp through BPF's __sk_buff, from Vadim.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits)
  selftests/bpf: Fix a few compiler warnings
  libbpf: Constify all high-level program attach APIs
  libbpf: Schedule open_opts.attach_prog_fd deprecation since v0.7
  selftests/bpf: Switch fexit_bpf2bpf selftest to set_attach_target() API
  libbpf: Allow skipping attach_func_name in bpf_program__set_attach_target()
  libbpf: Deprecated bpf_object_open_opts.relaxed_core_relocs
  selftests/bpf: Stop using relaxed_core_relocs which has no effect
  libbpf: Use pre-setup sec_def in libbpf_find_attach_btf_id()
  bpf: Update bpf_get_smp_processor_id() documentation
  libbpf: Add sphinx code documentation comments
  selftests/bpf: Skip btf_tag test if btf_tag attribute not supported
  docs/bpf: Add documentation for BTF_KIND_TAG
  selftests/bpf: Add a test with a bpf program with btf_tag attributes
  selftests/bpf: Test BTF_KIND_TAG for deduplication
  selftests/bpf: Add BTF_KIND_TAG unit tests
  selftests/bpf: Change NAME_NTH/IS_NAME_NTH for BTF_KIND_TAG format
  selftests/bpf: Test libbpf API function btf__add_tag()
  bpftool: Add support for BTF_KIND_TAG
  libbpf: Add support for BTF_KIND_TAG
  libbpf: Rename btf_{hash,equal}_int to btf_{hash,equal}_int_tag
  ...
====================

Link: https://lore.kernel.org/r/20210917173738.3397064-1-ast@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-17 12:40:21 -07:00
Tim Gardner 85784470ef x86/smp: Remove unnecessary assignment to local var freq_scale
Coverity warns of an unused value in arch_scale_freq_tick():

  CID 100778 (#1 of 1): Unused value (UNUSED_VALUE)
  assigned_value: Assigning value 1024ULL to freq_scale here, but that stored
  value is overwritten before it can be used.

It was introduced by commit:

  e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")

Remove the variable initializer.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/20210910184405.24422-1-tim.gardner@canonical.com
2021-09-17 21:20:34 +02:00
Linus Torvalds c6460daea2 xen: branch for v5.15-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYUR8DAAKCRCAXGG7T9hj
 vjclAP0ekisZSabOsmzM0iz8rpuYj113/kQozokFeFD5eQlEiAD/b5LytLosic/1
 5XKa5fbOyimAyRBwblWhpLLhrW6uiAg=
 =WoAB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.15b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - The first hunk of a Xen swiotlb fixup series fixing multiple minor
   issues and doing some small cleanups

 - Some further Xen related fixes avoiding WARN() splats when running as
   Xen guests or dom0

 - A Kconfig fix allowing the pvcalls frontend to be built as a module

* tag 'for-linus-5.15b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  swiotlb-xen: drop DEFAULT_NSLABS
  swiotlb-xen: arrange to have buffer info logged
  swiotlb-xen: drop leftover __ref
  swiotlb-xen: limit init retries
  swiotlb-xen: suppress certain init retries
  swiotlb-xen: maintain slab count properly
  swiotlb-xen: fix late init retry
  swiotlb-xen: avoid double free
  xen/pvcalls: backend can be a module
  xen: fix usage of pmd_populate in mremap for pv guests
  xen: reset legacy rtc flag for PV domU
  PM: base: power: don't try to use non-existing RTC for storing data
  xen/balloon: use a kernel thread instead a workqueue
2021-09-17 08:31:49 -07:00
Leo Yan 41100833cd perf/x86: Add compiler barrier after updating BTS
Since BTS is coherent, simply add a compiler barrier to separate the BTS
update and aux_head store.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210809111407.596077-5-leo.yan@linaro.org
2021-09-17 15:08:38 +02:00
Peter Zijlstra 1462eb381b x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
In order to allow objtool to make sense of all the various paravirt
functions, it needs to either parse whole pv_ops[] tables, or observe
individual assignments in the form:

  bf87:       48 c7 05 00 00 00 00 00 00 00 00        movq   $0x0,0x0(%rip)
    bf92 <xen_init_spinlocks+0x5f>
    bf8a: R_X86_64_PC32     pv_ops+0x268

As is, xen_cpu_ops[] is at offset +0 in pv_ops[] and could thus be
parsed as a 'normal' pv_ops[] table, however xen_irq_ops[] and
xen_mmu_ops[] are not.

Worse, both the latter two are compiled into the individual assignment
for by current GCC, but that's not something one can rely on.

Therefore, convert all three into full pv_ops[] tables. This has the
benefit of not needing to teach objtool about the offsets and
resulting in more conservative code-gen.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095149.057262522@infradead.org
2021-09-17 13:20:26 +02:00
Peter Zijlstra 847d9317b2 x86/xen: Mark xen_force_evtchn_callback() noinstr
vmlinux.o: warning: objtool: check_events()+0xd: call to xen_force_evtchn_callback() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.996055323@infradead.org
2021-09-17 13:20:25 +02:00
Peter Zijlstra 09c413071e x86/xen: Make irq_disable() noinstr
vmlinux.o: warning: objtool: pv_ops[31]: native_irq_disable
vmlinux.o: warning: objtool: pv_ops[31]: __raw_callee_save_xen_irq_disable
vmlinux.o: warning: objtool: pv_ops[31]: xen_irq_disable_direct
vmlinux.o: warning: objtool: lock_is_held_type()+0x5b: call to pv_ops[31]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.933869441@infradead.org
2021-09-17 13:20:23 +02:00
Peter Zijlstra d7bfc7d57c x86/xen: Make irq_enable() noinstr
vmlinux.o: warning: objtool: pv_ops[32]: native_irq_enable
vmlinux.o: warning: objtool: pv_ops[32]: __raw_callee_save_xen_irq_enable
vmlinux.o: warning: objtool: pv_ops[32]: xen_irq_enable_direct
vmlinux.o: warning: objtool: lock_is_held_type()+0xfe: call to pv_ops[32]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.872254932@infradead.org
2021-09-17 13:17:12 +02:00
Peter Zijlstra 74ea805b79 x86/xen: Make hypercall_page noinstr
vmlinux.o: warning: objtool: xen_set_debugreg()+0x3: call to hypercall_page() leaves .noinstr.text section
vmlinux.o: warning: objtool: xen_get_debugreg()+0x3: call to hypercall_page() leaves .noinstr.text section
vmlinux.o: warning: objtool: xen_irq_enable()+0x24: call to hypercall_page() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.810950584@infradead.org
2021-09-17 13:14:44 +02:00
Peter Zijlstra 20125c872a x86/xen: Make save_fl() noinstr
vmlinux.o: warning: objtool: pv_ops[30]: native_save_fl
vmlinux.o: warning: objtool: pv_ops[30]: __raw_callee_save_xen_save_fl
vmlinux.o: warning: objtool: pv_ops[30]: xen_save_fl_direct
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x73: call to pv_ops[30]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.749712274@infradead.org
2021-09-17 13:14:44 +02:00
Peter Zijlstra 7361fac046 x86/xen: Make set_debugreg() noinstr
vmlinux.o: warning: objtool: pv_ops[2]: xen_set_debugreg
vmlinux.o: warning: objtool: pv_ops[2]: native_set_debugreg
vmlinux.o: warning: objtool: exc_debug()+0x3b: call to pv_ops[2]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.687755639@infradead.org
2021-09-17 13:14:39 +02:00
Peter Zijlstra f4afb713e5 x86/xen: Make get_debugreg() noinstr
vmlinux.o: warning: objtool: pv_ops[1]: xen_get_debugreg
vmlinux.o: warning: objtool: pv_ops[1]: native_get_debugreg
vmlinux.o: warning: objtool: exc_debug()+0x25: call to pv_ops[1]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.625523645@infradead.org
2021-09-17 13:12:34 +02:00
Peter Zijlstra 209cfd0cbb x86/xen: Make write_cr2() noinstr
vmlinux.o: warning: objtool: pv_ops[42]: native_write_cr2
vmlinux.o: warning: objtool: pv_ops[42]: xen_write_cr2
vmlinux.o: warning: objtool: exc_nmi()+0x127: call to pv_ops[42]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.563524913@infradead.org
2021-09-17 13:12:16 +02:00
Peter Zijlstra 0a53c9acf4 x86/xen: Make read_cr2() noinstr
vmlinux.o: warning: objtool: pv_ops[41]: native_read_cr2
vmlinux.o: warning: objtool: pv_ops[41]: xen_read_cr2
vmlinux.o: warning: objtool: pv_ops[41]: xen_read_cr2_direct
vmlinux.o: warning: objtool: exc_double_fault()+0x15: call to pv_ops[41]() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.500331616@infradead.org
2021-09-17 13:11:50 +02:00
Shreyansh Chouhan a2d3cbc80d crypto: aesni - check walk.nbytes instead of err
In the code for xts_crypt(), we check for the err value returned by
skcipher_walk_virt() and return from the function if it is non zero.
However, skcipher_walk_virt() can set walk.nbytes to 0, which would cause
us to call kernel_fpu_begin(), and then skip the kernel_fpu_end() call.

This patch checks for the walk.nbytes value instead, and returns if
walk.nbytes is 0. This prevents us from calling kernel_fpu_begin() in
the first place and also covers the case of having a non zero err value
returned from skcipher_walk_virt().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-09-17 11:06:15 +08:00
Linus Torvalds ff1ffd71d5 hyperv-fixes for 5.15-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmFB6pwTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXoo5CAChbzKMMbqBHArnNCO+pKkUWmc7eYqJ
 U368ux75wWEy6ywCUxCHqhwnTrp5KJhyjTPi89V8Vwh+aNG6q86g2dT3I6qsoIby
 Dav9yw1NiExxNzAEiJVH/WgE+WGZUvWqzbKixdZWjDk9DWhVv7h96chik9dvh9SW
 /nm27o4sNmnFETQ+kh/hmX+8T6V8HeqZuL9WrGw4EW9At/WE16vjk47Wm5gJRl+j
 Z1KylALvOiarzzMH3Qx1IxvZ1789JtCIr2b5rHJH8tCPvPF0P2dihm/Wjf6xguyT
 tDMvquBdQnfugbZXQDy58Agp34Dw+fHCFaOmoruJePa78qqBYzujHvW9
 =gBaz
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20210915' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Fix kernel crash caused by uio driver (Vitaly Kuznetsov)

 - Remove on-stack cpumask from HV APIC code (Wei Liu)

* tag 'hyperv-fixes-signed-20210915' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: remove on-stack cpumask from hv_send_ipi_mask_allbutself
  asm-generic/hyperv: provide cpumask_to_vpset_noself
  Drivers: hv: vmbus: Fix kernel crash upon unbinding a device from uio_hv_generic driver
2021-09-15 17:18:56 -07:00
Peter Zijlstra eac46b323b x86/paravirt: Use PVOP_* for paravirt calls
Doing unconditional indirect calls through the pv_ops vector is weird.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.437720419@infradead.org
2021-09-15 15:51:48 +02:00
Peter Zijlstra e9382440de x86/paravirt: Mark arch_local_irq_*() __always_inline
vmlinux.o: warning: objtool: lockdep_hardirqs_on()+0x72: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x73: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: match_held_lock()+0x11f: call to arch_local_save_flags() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0x4e: call to arch_local_irq_save() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0x65: call to arch_local_irq_disable() leaves .noinstr.text section

vmlinux.o: warning: objtool: lock_is_held_type()+0xfe: call to arch_local_irq_enable() leaves .noinstr.text section

It makes no sense to not inline these things.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210624095148.373073648@infradead.org
2021-09-15 15:51:47 +02:00
Peter Zijlstra 2c36d87be4 x86/sev: Fix noinstr for vc_ghcb_invalidate()
vmlinux.o: warning: objtool: __sev_put_ghcb()+0x88: call to __memset() leaves .noinstr.text section
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0x39: call to __memset() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095148.250770465@infradead.org
2021-09-15 15:51:47 +02:00
Peter Zijlstra c6b01dace2 x86: Always inline ip_within_syscall_gap()
vmlinux.o: warning: objtool: vc_switch_off_ist()+0x20: call to ip_within_syscall_gap.isra.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095148.188166492@infradead.org
2021-09-15 15:51:47 +02:00
Peter Zijlstra 010050a863 x86/kvm: Always inline evmcs_write64()
vmlinux.o: warning: objtool: vmx_update_host_rsp()+0x64: call to evmcs_write64() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095148.126956644@infradead.org
2021-09-15 15:51:46 +02:00
Peter Zijlstra aee045ed0a x86/kvm: Always inline to_svm()
vmlinux.o: warning: objtool: svm_vcpu_enter_exit()+0x13: call to to_svm() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095148.066347165@infradead.org
2021-09-15 15:51:46 +02:00
Peter Zijlstra a168233a44 x86/kvm: Always inline vmload() / vmsave()
vmlinux.o: warning: objtool: svm_vcpu_enter_exit()+0xea: call to vmload() leaves .noinstr.text section
vmlinux.o: warning: objtool: svm_vcpu_enter_exit()+0x133: call to vmsave() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095147.942250748@infradead.org
2021-09-15 15:51:45 +02:00
Peter Zijlstra 2b2f72d4d8 x86/kvm: Always inline sev_*guest()
vmlinux.o: warning: objtool: svm_vcpu_enter_exit()+0x4d: call to sev_es_guest() leaves .noinstr.text section
vmlinux.o: warning: objtool: svm_vcpu_enter_exit()+0x50: call to sev_guest() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210624095147.880513802@infradead.org
2021-09-15 15:51:45 +02:00
Josh Poimboeuf b7b205c3a0 x86/xen: Move hypercall_page to top of the file
Because hypercall_page is page-aligned, the assembler inexplicably adds
an unreachable jump from after the end of the previous code to the
beginning of hypercall_page.

That confuses objtool, understandably.  It also creates significant text
fragmentation.  As a result, much of the object file is wasted text
(nops).

Move hypercall_page to the beginning of the file to both prevent the
text fragmentation and avoid the dead jump instruction.

$ size /tmp/head_64.before.o /tmp/head_64.after.o
   text	   data	    bss	    dec	    hex	filename
  10924	 307252	   4096	 322272	  4eae0	/tmp/head_64.before.o
   6823	 307252	   4096	 318171	  4dadb	/tmp/head_64.after.o

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/20210820193107.omvshmsqbpxufzkc@treble
2021-09-15 15:51:44 +02:00
Juergen Gross 36c9b5929b xen: fix usage of pmd_populate in mremap for pv guests
Commit 0881ace292 ("mm/mremap: use pmd/pud_poplulate to update page
table entries") introduced a regression when running as Xen PV guest.

Today pmd_populate() for Xen PV assumes that the PFN inserted is
referencing a not yet used page table. In case of move_normal_pmd()
this is not true, resulting in WARN splats like:

[34321.304270] ------------[ cut here ]------------
[34321.304277] WARNING: CPU: 0 PID: 23628 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x176/0x1a0
[34321.304288] Modules linked in:
[34321.304291] CPU: 0 PID: 23628 Comm: apt-get Not tainted 5.14.1-20210906-doflr-mac80211debug+ #1
[34321.304294] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[34321.304296] RIP: e030:xen_mc_flush+0x176/0x1a0
[34321.304300] Code: 89 45 18 48 c1 e9 3f 48 89 ce e9 20 ff ff ff e8 60 03 00 00 66 90 5b 5d 41 5c 41 5d c3 48 c7 45 18 ea ff ff ff be 01 00 00 00 <0f> 0b 8b 55 00 48 c7 c7 10 97 aa 82 31 db 49 c7 c5 38 97 aa 82 65
[34321.304303] RSP: e02b:ffffc90000a97c90 EFLAGS: 00010002
[34321.304305] RAX: ffff88807d416398 RBX: ffff88807d416350 RCX: ffff88807d416398
[34321.304306] RDX: 0000000000000001 RSI: 0000000000000001 RDI: deadbeefdeadf00d
[34321.304308] RBP: ffff88807d416300 R08: aaaaaaaaaaaaaaaa R09: ffff888006160cc0
[34321.304309] R10: deadbeefdeadf00d R11: ffffea000026a600 R12: 0000000000000000
[34321.304310] R13: ffff888012f6b000 R14: 0000000012f6b000 R15: 0000000000000001
[34321.304320] FS:  00007f5071177800(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
[34321.304322] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[34321.304323] CR2: 00007f506f542000 CR3: 00000000160cc000 CR4: 0000000000000660
[34321.304326] Call Trace:
[34321.304331]  xen_alloc_pte+0x294/0x320
[34321.304334]  move_pgt_entry+0x165/0x4b0
[34321.304339]  move_page_tables+0x6fa/0x8d0
[34321.304342]  move_vma.isra.44+0x138/0x500
[34321.304345]  __x64_sys_mremap+0x296/0x410
[34321.304348]  do_syscall_64+0x3a/0x80
[34321.304352]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[34321.304355] RIP: 0033:0x7f507196301a
[34321.304358] Code: 73 01 c3 48 8b 0d 76 0e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 46 0e 0c 00 f7 d8 64 89 01 48
[34321.304360] RSP: 002b:00007ffda1eecd38 EFLAGS: 00000246 ORIG_RAX: 0000000000000019
[34321.304362] RAX: ffffffffffffffda RBX: 000056205f950f30 RCX: 00007f507196301a
[34321.304363] RDX: 0000000001a00000 RSI: 0000000001900000 RDI: 00007f506dc56000
[34321.304364] RBP: 0000000001a00000 R08: 0000000000000010 R09: 0000000000000004
[34321.304365] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f506dc56060
[34321.304367] R13: 00007f506dc56000 R14: 00007f506dc56060 R15: 000056205f950f30
[34321.304368] ---[ end trace a19885b78fe8f33e ]---
[34321.304370] 1 of 2 multicall(s) failed: cpu 0
[34321.304371]   call  2: op=12297829382473034410 arg=[aaaaaaaaaaaaaaaa] result=-22

Fix that by modifying xen_alloc_ptpage() to only pin the page table in
case it wasn't pinned already.

Fixes: 0881ace292 ("mm/mremap: use pmd/pud_poplulate to update page table entries")
Cc: <stable@vger.kernel.org>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210908073640.11299-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-15 08:42:04 +02:00
Juergen Gross f68aa100d8 xen: reset legacy rtc flag for PV domU
A Xen PV guest doesn't have a legacy RTC device, so reset the legacy
RTC flag. Otherwise the following WARN splat will occur at boot:

[    1.333404] WARNING: CPU: 1 PID: 1 at /home/gross/linux/head/drivers/rtc/rtc-mc146818-lib.c:25 mc146818_get_time+0x1be/0x210
[    1.333404] Modules linked in:
[    1.333404] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W         5.14.0-rc7-default+ #282
[    1.333404] RIP: e030:mc146818_get_time+0x1be/0x210
[    1.333404] Code: c0 64 01 c5 83 fd 45 89 6b 14 7f 06 83 c5 64 89 6b 14 41 83 ec 01 b8 02 00 00 00 44 89 63 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 30 0e ef 82 4c 89 e6 e8 71 2a 24 00 48 c7 c0 ff ff
[    1.333404] RSP: e02b:ffffc90040093df8 EFLAGS: 00010002
[    1.333404] RAX: 00000000000000ff RBX: ffffc90040093e34 RCX: 0000000000000000
[    1.333404] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000d
[    1.333404] RBP: ffffffff82ef0e30 R08: ffff888005013e60 R09: 0000000000000000
[    1.333404] R10: ffffffff82373e9b R11: 0000000000033080 R12: 0000000000000200
[    1.333404] R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff82cdc6d4
[    1.333404] FS:  0000000000000000(0000) GS:ffff88807d440000(0000) knlGS:0000000000000000
[    1.333404] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.333404] CR2: 0000000000000000 CR3: 000000000260a000 CR4: 0000000000050660
[    1.333404] Call Trace:
[    1.333404]  ? wakeup_sources_sysfs_init+0x30/0x30
[    1.333404]  ? rdinit_setup+0x2b/0x2b
[    1.333404]  early_resume_init+0x23/0xa4
[    1.333404]  ? cn_proc_init+0x36/0x36
[    1.333404]  do_one_initcall+0x3e/0x200
[    1.333404]  kernel_init_freeable+0x232/0x28e
[    1.333404]  ? rest_init+0xd0/0xd0
[    1.333404]  kernel_init+0x16/0x120
[    1.333404]  ret_from_fork+0x1f/0x30

Cc: <stable@vger.kernel.org>
Fixes: 8d152e7a5c ("x86/rtc: Replace paravirt rtc check with platform legacy quirk")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210903084937.19392-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-15 08:42:03 +02:00
Jie Meng c035407743 bpf,x64 Emit IMUL instead of MUL for x86-64
IMUL allows for multiple operands and saving and storing rax/rdx is no
longer needed. Signedness of the operands doesn't matter here because
the we only keep the lower 32/64 bit of the product for 32/64 bit
multiplications.

Signed-off-by: Jie Meng <jmeng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210913211337.1564014-1-jmeng@fb.com
2021-09-14 18:36:36 -07:00
Linus Torvalds 77e02cf57b memblock: introduce saner 'memblock_free_ptr()' interface
The boot-time allocation interface for memblock is a mess, with
'memblock_alloc()' returning a virtual pointer, but then you are
supposed to free it with 'memblock_free()' that takes a _physical_
address.

Not only is that all kinds of strange and illogical, but it actually
causes bugs, when people then use it like a normal allocation function,
and it fails spectacularly on a NULL pointer:

   https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/

or just random memory corruption if the debug checks don't catch it:

   https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/

I really don't want to apply patches that treat the symptoms, when the
fundamental cause is this horribly confusing interface.

I started out looking at just automating a sane replacement sequence,
but because of this mix or virtual and physical addresses, and because
people have used the "__pa()" macro that can take either a regular
kernel pointer, or just the raw "unsigned long" address, it's all quite
messy.

So this just introduces a new saner interface for freeing a virtual
address that was allocated using 'memblock_alloc()', and that was kept
as a regular kernel pointer.  And then it converts a couple of users
that are obvious and easy to test, including the 'xbc_nodes' case in
lib/bootconfig.c that caused problems.

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 40caa127f3 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-14 13:23:22 -07:00
Thomas Gleixner a2a8fd9a3e x86/fpu/signal: Change return code of restore_fpregs_from_user() to boolean
__fpu_sig_restore() only needs information about success or fail and no
real error code.

This cleans up the confusing conversion of the trap number, which is
returned by the *RSTOR() exception fixups, to an error code.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132526.084109938@linutronix.de
2021-09-14 21:10:04 +02:00
Thomas Gleixner be00401441 x86/fpu/signal: Change return code of check_xstate_in_sigframe() to boolean
__fpu_sig_restore() only needs success/fail information and no detailed
error code.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132526.024024598@linutronix.de
2021-09-14 21:10:04 +02:00
Thomas Gleixner 1193f408cd x86/fpu/signal: Change return type of __fpu_restore_sig() to boolean
Now that fpu__restore_sig() returns a boolean get rid of the individual
error codes in __fpu_restore_sig() as well.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.966197097@linutronix.de
2021-09-14 21:10:03 +02:00
Thomas Gleixner f3305be5fe x86/fpu/signal: Change return type of fpu__restore_sig() to boolean
None of the call sites cares about the error code. All they need to know is
whether the function succeeded or not.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.909065931@linutronix.de
2021-09-14 21:10:03 +02:00
Thomas Gleixner ee4ecdfbd2 x86/signal: Change return type of restore_sigcontext() to boolean
None of the call sites cares about the return code. All they are interested
in is success or fail.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.851280949@linutronix.de
2021-09-14 21:10:03 +02:00
Thomas Gleixner 2af07f3a6e x86/fpu/signal: Change return type of copy_fpregs_to_sigframe() helpers to boolean
Now that copy_fpregs_to_sigframe() returns boolean the individual return
codes in the related helper functions do not make sense anymore. Change
them to return boolean success/fail.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.794334915@linutronix.de
2021-09-14 21:10:03 +02:00