Pull s390 updates from Martin Schwidefsky:
"There is only one new feature in this pull for the 4.4 merge window,
most of it is small enhancements, cleanup and bug fixes:
- Add the s390 backend for the software dirty bit tracking. This
adds two new pgtable functions pte_clear_soft_dirty and
pmd_clear_soft_dirty which is why there is a hit to
arch/x86/include/asm/pgtable.h in this pull request.
- A series of cleanup patches for the AP bus, this includes the
removal of the support for two outdated crypto cards (PCICC and
PCICA).
- The irq handling / signaling on buffer full in the runtime
instrumentation code is dropped.
- Some micro optimizations: remove unnecessary memory barriers for a
couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for
spin_unlock. Use the builtin bswap if available and make
test_and_set_bit_lock more cache friendly.
- Statistics and a tracepoint for the diagnose calls to the
hypervisor.
- The CPU measurement facility support to sample KVM guests is
improved.
- The vector instructions are now always enabled for user space
processes if the hardware has the vector facility. This simplifies
the FPU handling code. The fpu-internal.h header is split into fpu
internals, api and types just like x86.
- Cleanup and improvements for the common I/O layer.
- Rework udelay to solve a problem with kprobe. udelay has busy loop
semantics but still uses an idle processor state for the wait"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits)
s390: remove runtime instrumentation interrupts
s390/cio: de-duplicate subchannel validation
s390/css: unneeded initialization in for_each_subchannel
s390/Kconfig: use builtin bswap
s390/dasd: fix disconnected device with valid path mask
s390/dasd: fix invalid PAV assignment after suspend/resume
s390/dasd: fix double free in dasd_eckd_read_conf
s390/kernel: fix ptrace peek/poke for floating point registers
s390/cio: move ccw_device_stlck functions
s390/cio: move ccw_device_call_handler
s390/topology: reduce per_cpu() invocations
s390/nmi: reduce size of percpu variable
s390/nmi: fix terminology
s390/nmi: remove casts
s390/nmi: remove pointless error strings
s390: don't store registers on disabled wait anymore
s390: get rid of __set_psw_mask()
s390/fpu: split fpu-internal.h into fpu internals, api, and type headers
s390/dasd: fix list_del corruption after lcu changes
s390/spinlock: remove unneeded serializations at unlock
...
Pull networking updates from David Miller:
Changes of note:
1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.
2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
David Ahern.
3) Allow the user to ask for the statistics to be filtered out of
ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.
4) More work to pass the network namespace context around deep into
various packet path APIs, starting with the netfilter hooks. From
Eric W Biederman.
5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
Richter.
6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.
7) Support Very High Throughput in wireless MESH code, from Bob
Copeland.
8) Allow setting the ageing_time in switchdev/rocker. From Scott
Feldman.
9) Properly autoload L2TP type modules, from Stephen Hemminger.
10) Fix and enable offload features by default in 8139cp driver, from
David Woodhouse.
11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
Jiri Benc.
12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
Opstad.
13) Fix IPSEC flowcache overflows on large systems, from Steffen
Klassert.
14) Convert bridging to track VLANs using rhashtable entries rather than
a bitmap. From Nikolay Aleksandrov.
15) Make TCP listener handling completely lockless, this is a major
accomplishment. Incoming request sockets now live in the
established hash table just like any other socket too.
From Eric Dumazet.
15) Provide more bridging attributes to netlink, from Nikolay
Aleksandrov.
16) Use hash based algorithm for ipv4 multipath routing, this was very
long overdue. From Peter Nørlund.
17) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.
18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.
19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
influences the port binding selection logic used by SO_REUSEPORT.
20) Add ipv6 support to VRF, from David Ahern.
21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.
22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.
23) Implement RACK loss recovery in TCP, from Yuchung Cheng.
24) Support multipath routes in MPLS, from Roopa Prabhu.
25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
Dumazet.
26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
Sudarsana Kalluru.
27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
Sowa.
28) Support ipv6 geneve tunnels, from John W Linville.
29) Add flood control support to switchdev layer, from Ido Schimmel.
30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
Hannes Frederic Sowa.
31) Support persistent maps and progs in bpf, from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
sh_eth: use DMA barriers
switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
net: sched: kill dead code in sch_choke.c
irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
net: dsa: mv88e6xxx: include DSA ports in VLANs
net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
net/core: fix for_each_netdev_feature
vlan: Invoke driver vlan hooks only if device is present
arcnet/com20020: add LEDS_CLASS dependency
bpf, verifier: annotate verbose printer with __printf
dp83640: Only wait for timestamps for packets with timestamping enabled.
ptp: Change ptp_class to a proper bitmask
dp83640: Prune rx timestamp list before reading from it
dp83640: Delay scheduled work.
dp83640: Include hash in timestamp/packet matching
ipv6: fix tunnel error handling
net/mlx5e: Fix LSO vlan insertion
net/mlx5e: Re-eanble client vlan TX acceleration
net/mlx5e: Return error in case mlx5e_set_features() fails
net/mlx5e: Don't allow more than max supported channels
...
Pull crypto update from Herbert Xu:
"API:
- Add support for cipher output IVs in testmgr
- Add missing crypto_ahash_blocksize helper
- Mark authenc and des ciphers as not allowed under FIPS.
Algorithms:
- Add CRC support to 842 compression
- Add keywrap algorithm
- A number of changes to the akcipher interface:
+ Separate functions for setting public/private keys.
+ Use SG lists.
Drivers:
- Add Intel SHA Extension optimised SHA1 and SHA256
- Use dma_map_sg instead of custom functions in crypto drivers
- Add support for STM32 RNG
- Add support for ST RNG
- Add Device Tree support to exynos RNG driver
- Add support for mxs-dcp crypto device on MX6SL
- Add xts(aes) support to caam
- Add ctr(aes) and xts(aes) support to qat
- A large set of fixes from Russell King for the marvell/cesa driver"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (115 commits)
crypto: asymmetric_keys - Fix unaligned access in x509_get_sig_params()
crypto: akcipher - Don't #include crypto/public_key.h as the contents aren't used
hwrng: exynos - Add Device Tree support
hwrng: exynos - Fix missing configuration after suspend to RAM
hwrng: exynos - Add timeout for waiting on init done
dt-bindings: rng: Describe Exynos4 PRNG bindings
crypto: marvell/cesa - use __le32 for hardware descriptors
crypto: marvell/cesa - fix missing cpu_to_le32() in mv_cesa_dma_add_op()
crypto: marvell/cesa - use memcpy_fromio()/memcpy_toio()
crypto: marvell/cesa - use gfp_t for gfp flags
crypto: marvell/cesa - use dma_addr_t for cur_dma
crypto: marvell/cesa - use readl_relaxed()/writel_relaxed()
crypto: caam - fix indentation of close braces
crypto: caam - only export the state we really need to export
crypto: caam - fix non-block aligned hash calculation
crypto: caam - avoid needlessly saving and restoring caam_hash_ctx
crypto: caam - print errno code when hash registration fails
crypto: marvell/cesa - fix memory leak
crypto: marvell/cesa - fix first-fragment handling in mv_cesa_ahash_dma_last_req()
crypto: marvell/cesa - rearrange handling for sw padded hashes
...
Commit b18d5431ac ("KVM: x86: fix CR0.CD virtualization") was
technically correct, but it broke OVMF guests by slowing down various
parts of the firmware.
Commit fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
first function modified by b18d5431ac, vmx_get_mt_mask(), for OVMF's
sake. This restored the speed of the OVMF code that runs before
PlatformPei (including the memory intensive LZMA decompression in SEC).
This patch extends the quirk to the second function modified by
b18d5431ac, kvm_set_cr0(). It eliminates the intrusive slowdown that
hits the EFI_MP_SERVICES_PROTOCOL implementation of edk2's
UefiCpuPkg/CpuDxe -- which is built into OVMF --, when CpuDxe starts up
all APs at once for initialization, in order to count them.
We also carry over the kvm_arch_has_noncoherent_dma() sub-condition from
the other half of the original commit b18d5431ac.
Fixes: b18d5431ac
Cc: stable@vger.kernel.org
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Tested-by: Janusz Mocek <januszmk6@gmail.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>#
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SDM says that exiting system management mode from 64-bit mode
is invalid, but that would be too good to be true. But actually,
most of the code is already there to support exiting from compat
mode (EFER.LME=1, EFER.LMA=0). Getting all the way from 64-bit
mode to real mode only requires clearing CS.L and CR4.PCIDE.
Cc: stable@vger.kernel.org
Fixes: 660a5d517a
Tested-by: Laszlo Ersek <lersek@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The comment in code had it mostly right, but we enable paging for
emulated real mode regardless of EPT.
Without EPT (which implies emulated real mode), secondary VCPUs won't
start unless we disable SM[AE]P when the guest doesn't use paging.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function is not used outside device assignment, and
kvm_arch_set_irq_inatomic has a different prototype. Move it here and
make it static to avoid confusion.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We do not want to do too much work in atomic context, in particular
not walking all the VCPUs of the virtual machine. So we want
to distinguish the architecture-specific injection function for irqfd
from kvm_set_msi. Since it's still empty, reuse the newly added
kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
BSP doesn't get INIT so its apic_arb_prio isn't zeroed after reboot.
BSP won't get lowest priority interrupts until other VCPUs get enough
interrupts to match their pre-reboot apic_arb_prio.
That behavior doesn't fit into KVM's round-robin-like interpretation of
lowest priority delivery ... userspace should KVM_SET_LAPIC on reset, so
just zero apic_arb_prio there.
Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
GET_SMSTATE depends on real mode to ensure that smbase+offset is treated
as a physical address, which has already caused a bug after shuffling
the code. Enforce physical addressing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to read the physical memory when emulating RSM.
X86EMUL_IO_NEEDED is returned on all errors for consistency with other
helpers.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
removing unused variables, found by coccinelle
Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWOhgAAAoJEEtpOizt6ddyKS8H/2ZHMTPjo6yChnrusNWy4Qbr
6laPDlzL+g45oMQRwNL7GnM1deRftaxvT2Wi+X84D/6Y/BD6MPds4HgtBfuWcSZ1
CyRJ0Ot/zrxenucSuJuOjq+a9gdizdAczkbB1MfYDULJH8fb6D+7RYLo3zgh4Xo4
pla3L9U6gSWe+YopBjZtZH43m3fwiwSM/v+uHOTIcXrsbR+fEgx/EFSKmA/DUCuo
P5cFO/ceUGu7nATCexu5V82TgR2hvurrsR7mqfwY8YcF6HRM+NEOoS29xWC77v5S
u/F08TKuKQLv0YTEFTyLETI/oEeuC0cHtrRQBNf4+9kXEOzKyXaae0wR/I6X2Ss=
=GMNk
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM Changes for v4.4-rc1
Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.
Conflicts:
arch/x86/include/asm/kvm_host.h
Pull x86 platform changes from Ingo Molnar:
"Misc updates to the Intel MID and SGI UV platforms"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel-mid: Make intel_mid_ops static
arch/x86/intel-mid: Use kmemdup rather than duplicating its implementation
x86/platform/uv: Implement simple dump failover if kdump fails
x86/platform/uv: Insert per_cpu accessor function on uv_hub_nmi
Pull x86 mm changes from Ingo Molnar:
"The main changes are: continued PAT work by Toshi Kani, plus a new
boot time warning about insecure RWX kernel mappings, by Stephen
Smalley.
The new CONFIG_DEBUG_WX=y warning is marked default-y if
CONFIG_DEBUG_RODATA=y is already eanbled, as a special exception, as
these bugs are hard to notice and this check already found several
live bugs"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Warn on W^X mappings
x86/mm: Fix no-change case in try_preserve_large_page()
x86/mm: Fix __split_large_page() to handle large PAT bit
x86/mm: Fix try_preserve_large_page() to handle large PAT bit
x86/mm: Fix gup_huge_p?d() to handle large PAT bit
x86/mm: Fix slow_virt_to_phys() to handle large PAT bit
x86/mm: Fix page table dump to show PAT bit
x86/asm: Add pud_pgprot() and pmd_pgprot()
x86/asm: Fix pud/pmd interfaces to handle large PAT bit
x86/asm: Add pud/pmd mask interfaces to handle large PAT bit
x86/asm: Move PUD_PAGE macros to page_types.h
x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO
Pull x86 sigcontext header cleanups from Ingo Molnar:
"This series reorganizes and cleans up various aspects of the main
sigcontext UAPI headers, such as unifying the data structures and
updating/adding lots of comments to explain all the ABI details and
quirks. The headers can now also be built in user-space standalone"
* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Clean up too long lines
x86/headers: Remove <asm/sigcontext.h> references on the kernel side
x86/headers: Remove direct sigcontext32.h uses
x86/headers: Convert sigcontext_ia32 uses to sigcontext_32
x86/headers: Unify 'struct sigcontext_ia32' and 'struct sigcontext_32'
x86/headers: Make sigcontext pointers bit independent
x86/headers: Move the 'struct sigcontext' definitions into the UAPI header
x86/headers: Clean up the kernel's struct sigcontext types to be ABI-clean
x86/headers: Convert uses of _fpstate_ia32 to _fpstate_32
x86/headers: Unify 'struct _fpstate_ia32' and i386 struct _fpstate
x86/headers: Unify register type definitions between 32-bit compat and i386
x86/headers: Use ABI types consistently in sigcontext*.h
x86/headers: Separate out legacy user-space structure definitions
x86/headers: Clean up and better document uapi/asm/sigcontext.h
x86/headers: Clean up uapi/asm/sigcontext32.h
x86/headers: Fix (old) header file dependency bug in uapi/asm/sigcontext32.h
Pull x86 fpu changes from Ingo Molnar:
"There are two main areas of changes:
- Rework of the extended FPU state code to robustify the kernel's
usage of cpuid provided xstate sizes - and related changes (Dave
Hansen)"
- math emulation enhancements: new modern FPU instructions support,
with testcases, plus cleanups (Denys Vlasnko)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/fpu: Fixup uninitialized feature_name warning
x86/fpu/math-emu: Add support for FISTTP instructions
x86/fpu/math-emu, selftests: Add test for FISTTP instructions
x86/fpu/math-emu: Add support for FCMOVcc insns
x86/fpu/math-emu: Add support for F[U]COMI[P] insns
x86/fpu/math-emu: Remove define layer for undocumented opcodes
x86/fpu/math-emu, selftests: Add tests for FCMOV and FCOMI insns
x86/fpu/math-emu: Remove !NO_UNDOC_CODE
x86/fpu: Check CPU-provided sizes against struct declarations
x86/fpu: Check to ensure increasing-offset xstate offsets
x86/fpu: Correct and check XSAVE xstate size calculations
x86/fpu: Add C structures for AVX-512 state components
x86/fpu: Rework YMM definition
x86/fpu/mpx: Rework MPX 'xstate' types
x86/fpu: Add xfeature_enabled() helper instead of test_bit()
x86/fpu: Remove 'xfeature_nr'
x86/fpu: Rework XSTATE_* macros to remove magic '2'
x86/fpu: Rename XFEATURES_NR_MAX
x86/fpu: Rename XSAVE macros
x86/fpu: Remove partial LWP support definitions
...
Pull x86 kgdb fixlet from Ingo Molnar:
"A single debugging related commit: compress the memory usage of a kgdb
data structure"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kgdb: Replace bool_int_array[NR_CPUS] with bitmap
Pull x86 cpu changes from Ingo Molnar:
"Two changes in this cycle: a Kconfig help text enhancement, and an AMD
CLZERO instruction capability detection and enumeration"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add CLZERO detection
x86/Kconfig/cpus: Fix/complete CPU type help texts
Pull x86 cleanups from Ingo Molnar:
"An early_printk cleanup plus deinlining enhancements"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/early_printk: Set __iomem address space for IO
x86/signal: Deinline get_sigframe, save 240 bytes
x86: Deinline early_console_register, save 403 bytes
x86/e820: Deinline e820_type_to_string, save 126 bytes
Pull x86 boot cleanup from Ingo Molnar:
"A single commit: remove an obsolete kcrash boot flag"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Remove obsolete 'in_crash_kexec' flag
Pull x86 asm changes from Ingo Molnar:
"The main change in this cycle is another step in the big x86 system
call interface rework by Andy Lutomirski, which moves most of the low
level x86 entry code from assembly to C, for all syscall entries
except native 64-bit system calls:
arch/x86/entry/entry_32.S | 182 ++++------
arch/x86/entry/entry_64_compat.S | 547 ++++++++-----------------------
194 insertions(+), 535 deletions(-)
... our hope is that the final remaining step (converting native
64-bit system calls) will be less painful as all the previous steps,
given that most of the legacies and quirks are concentrated around
native 32-bit and compat environments"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/entry/32: Fix FS and GS restore in opportunistic SYSEXIT
x86/entry/32: Fix entry_INT80_32() to expect interrupts to be on
um/x86: Fix build after x86 syscall changes
x86/asm: Remove the xyz_cfi macros from dwarf2.h
selftests/x86: Style fixes for the 'unwind_vdso' test
x86/entry/64/compat: Document sysenter_fix_flags's reason for existence
x86/entry: Split and inline syscall_return_slowpath()
x86/entry: Split and inline prepare_exit_to_usermode()
x86/entry: Use pt_regs_to_thread_info() in syscall entry tracing
x86/entry: Hide two syscall entry assertions behind CONFIG_DEBUG_ENTRY
x86/entry: Micro-optimize compat fast syscall arg fetch
x86/entry: Force inlining of 32-bit syscall code
x86/entry: Make irqs_disabled checks in exit code depend on lockdep
x86/entry: Remove unnecessary IRQ twiddling in fast 32-bit syscalls
x86/asm: Remove thread_info.sysenter_return
x86/entry/32: Re-implement SYSENTER using the new C path
x86/entry/32: Switch INT80 to the new C syscall path
x86/entry/32: Open-code return tracking from fork and kthreads
x86/entry/compat: Implement opportunistic SYSRETL for compat syscalls
x86/vdso/compat: Wire up SYSENTER and SYSCSALL for compat userspace
...
Pull x86 apic changes from Ingo Molnar:
"The main changes in this cycle were:
- Numachip updates: new hardware support, fixes and cleanups.
(Daniel J Blueman)
- misc smaller cleanups and fixlets"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/io_apic: Make eoi_ioapic_pin() static
x86/irq: Drop unlikely before IS_ERR_OR_NULL
x86/x2apic: Make stub functions available even if !CONFIG_X86_LOCAL_APIC
x86/apic: Deinline various functions
x86/numachip: Fix timer build conflict
x86/numachip: Introduce Numachip2 timer mechanisms
x86/numachip: Add Numachip IPI optimisations
x86/numachip: Add Numachip2 APIC support
x86/numachip: Cleanup Numachip support
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:
- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)
- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)
- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
Pull RAS changes from Ingo Molnar:
"The main system reliability related changes were from x86, but also
some generic RAS changes:
- AMD MCE error injection subsystem enhancements. (Aravind
Gopalakrishnan)
- Fix MCE and CPU hotplug interaction bug. (Ashok Raj)
- kcrash bootup robustness fix. (Baoquan He)
- kcrash cleanups. (Borislav Petkov)
- x86 microcode driver rework: simplify it by unmodularizing it and
other cleanups. (Borislav Petkov)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
x86/mce: Add a Scalable MCA vendor flags bit
MAINTAINERS: Unify the microcode driver section
x86/microcode/intel: Move #ifdef DEBUG inside the function
x86/microcode/amd: Remove maintainers from comments
x86/microcode: Remove modularization leftovers
x86/microcode: Merge the early microcode loader
x86/microcode: Unmodularize the microcode driver
x86/mce: Fix thermal throttling reporting after kexec
kexec/crash: Say which char is the unrecognized
x86/setup/crash: Check memblock_reserve() retval
x86/setup/crash: Cleanup some more
x86/setup/crash: Remove alignment variable
x86/setup: Cleanup crashkernel reservation functions
x86/amd_nb, EDAC: Rename amd_get_node_id()
x86/setup: Do not reserve crashkernel high memory if low reservation failed
x86/microcode/amd: Do not overwrite final patch levels
x86/microcode/amd: Extract current patch level read to a function
x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Improve accuracy of perf/sched clock on x86. (Adrian Hunter)
- Intel DS and BTS updates. (Alexander Shishkin)
- Intel cstate PMU support. (Kan Liang)
- Add group read support to perf_event_read(). (Peter Zijlstra)
- Branch call hardware sampling support, implemented on x86 and
PowerPC. (Stephane Eranian)
- Event groups transactional interface enhancements. (Sukadev
Bhattiprolu)
- Enable proper x86/intel/uncore PMU support on multi-segment PCI
systems. (Taku Izumi)
- ... misc fixes and cleanups.
The perf tooling team was very busy again with 200+ commits, the full
diff doesn't fit into lkml size limits. Here's an (incomplete) list
of the tooling highlights:
New features:
- Change the default event used in all tools (record/top): use the
most precise "cycles" hw counter available, i.e. when the user
doesn't specify any event, it will try using cycles:ppp, cycles:pp,
etc and fall back transparently until it finds a working counter.
(Arnaldo Carvalho de Melo)
- Integration of perf with eBPF that, given an eBPF .c source file
(or .o file built for the 'bpf' target with clang), will get it
automatically built, validated and loaded into the kernel via the
sys_bpf syscall, which can then be used and seen using 'perf trace'
and other tools.
(Wang Nan)
Various user interface improvements:
- Automatic pager invocation on long help output. (Namhyung Kim)
- Search for more options when passing args to -h, e.g.: (Arnaldo
Carvalho de Melo)
$ perf report -h interface
Usage: perf report [<options>]
--gtk Use the GTK2 interface
--stdio Use the stdio interface
--tui Use the TUI interface
- Show ordered command line options when -h is used or when an
unknown option is specified. (Arnaldo Carvalho de Melo)
- If options are passed after -h, show just its descriptions, not all
options. (Arnaldo Carvalho de Melo)
- Implement column based horizontal scrolling in the hists browser
(top, report), making it possible to use the TUI for things like
'perf mem report' where there are many more columns than can fit in
a terminal. (Arnaldo Carvalho de Melo)
- Enhance the error reporting of tracepoint event parsing, e.g.:
$ oldperf record -e sched:sched_switc usleep 1
event syntax error: 'sched:sched_switc'
\___ unknown tracepoint
Run 'perf list' for a list of valid events
Now we get the much nicer:
$ perf record -e sched:sched_switc ls
event syntax error: 'sched:sched_switc'
\___ can't access trace events
Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switc
Hint: Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'
And after we have those mount point permissions fixed:
$ perf record -e sched:sched_switc ls
event syntax error: 'sched:sched_switc'
\___ unknown tracepoint
Error: File /sys/kernel/debug/tracing/events/sched/sched_switc not found.
Hint: Perhaps this kernel misses some CONFIG_ setting to enable this feature?.
I.e. basically now the event parsing routing uses the strerror_open()
routines introduced by and used in 'perf trace' work. (Jiri Olsa)
- Fail properly when pattern matching fails to find a tracepoint,
i.e. '-e non:existent' was being correctly handled, with a proper
error message about that not being a valid event, but '-e
non:existent*' wasn't, fix it. (Jiri Olsa)
- Do event name substring search as last resort in 'perf list'.
(Arnaldo Carvalho de Melo)
E.g.:
# perf list clock
List of pre-defined events (to be used in -e):
cpu-clock [Software event]
task-clock [Software event]
uncore_cbox_0/clockticks/ [Kernel PMU event]
uncore_cbox_1/clockticks/ [Kernel PMU event]
kvm:kvm_pvclock_update [Tracepoint event]
kvm:kvm_update_master_clock [Tracepoint event]
power:clock_disable [Tracepoint event]
power:clock_enable [Tracepoint event]
power:clock_set_rate [Tracepoint event]
syscalls:sys_enter_clock_adjtime [Tracepoint event]
syscalls:sys_enter_clock_getres [Tracepoint event]
syscalls:sys_enter_clock_gettime [Tracepoint event]
syscalls:sys_enter_clock_nanosleep [Tracepoint event]
syscalls:sys_enter_clock_settime [Tracepoint event]
syscalls:sys_exit_clock_adjtime [Tracepoint event]
syscalls:sys_exit_clock_getres [Tracepoint event]
syscalls:sys_exit_clock_gettime [Tracepoint event]
syscalls:sys_exit_clock_nanosleep [Tracepoint event]
syscalls:sys_exit_clock_settime [Tracepoint event]
Intel PT hardware tracing enhancements:
- Accept a zero --itrace period, meaning "as often as possible". In
the case of Intel PT that is the same as a period of 1 and a unit
of 'instructions' (i.e. --itrace=i1i). (Adrian Hunter)
- Harmonize itrace's synthesized callchains with the existing
--max-stack tool option. (Adrian Hunter)
- Allow time to be displayed in nanoseconds in 'perf script'.
(Adrian Hunter)
- Fix potential infinite loop when handling Intel PT timestamps.
(Adrian Hunter)
- Slighly improve Intel PT debug logging. (Adrian Hunter)
- Warn when AUX data has been lost, just like when processing
PERF_RECORD_LOST. (Adrian Hunter)
- Further document export-to-postgresql.py script. (Adrian Hunter)
- Add option to synthesize branch stack from auxtrace data. (Adrian
Hunter)
Misc notable changes:
- Switch the default callchain output mode to 'graph,0.5,caller', to
make it look like the default for other tools, reducing the
learning curve for people used to 'caller' based viewing. (Arnaldo
Carvalho de Melo)
- various call chain usability enhancements. (Namhyung Kim)
- Introduce the 'P' event modifier, meaning 'max precision level,
please', i.e.:
$ perf record -e cycles:P usleep 1
Is now similar to:
$ perf record usleep 1
Useful, for instance, when specifying multiple events. (Jiri Olsa)
- Add 'socket' sort entry, to sort by the processor socket in 'perf
top' and 'perf report'. (Kan Liang)
- Introduce --socket-filter to 'perf report', for filtering by
processor socket. (Kan Liang)
- Add new "Zoom into Processor Socket" operation in the perf hists
browser, used in 'perf top' and 'perf report'. (Kan Liang)
- Allow probing on kmodules without DWARF. (Masami Hiramatsu)
- Fix 'perf probe -l' for probes added to kernel module functions.
(Masami Hiramatsu)
- Preparatory work for the 'perf stat record' feature that will allow
generating perf.data files with counting data in addition to the
sampling mode we have now (Jiri Olsa)
- Update libtraceevent KVM plugin. (Paolo Bonzini)
- ... plus lots of other enhancements that I failed to list properly,
by: Adrian Hunter, Alexander Shishkin, Andi Kleen, Andrzej Hajda,
Arnaldo Carvalho de Melo, Dima Kogan, Don Zickus, Geliang Tang, He
Kuang, Huaitong Han, Ingo Molnar, Jan Stancek, Jiri Olsa, Kan
Liang, Kirill Tkhai, Masami Hiramatsu, Matt Fleming, Namhyung Kim,
Paolo Bonzini, Peter Zijlstra, Rabin Vincent, Scott Wood, Stephane
Eranian, Sukadev Bhattiprolu, Taku Izumi, Vaishali Thakkar, Wang
Nan, Yang Shi and Yunlong Song"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (260 commits)
perf unwind: Pass symbol source to libunwind
tools build: Fix libiberty feature detection
perf tools: Compile scriptlets to BPF objects when passing '.c' to --event
perf record: Add clang options for compiling BPF scripts
perf bpf: Attach eBPF filter to perf event
perf tools: Make sure fixdep is built before libbpf
perf script: Enable printing of branch stack
perf trace: Add cmd string table to decode sys_bpf first arg
perf bpf: Collect perf_evsel in BPF object files
perf tools: Load eBPF object into kernel
perf tools: Create probe points for BPF programs
perf tools: Enable passing bpf object file to --event
perf ebpf: Add the libbpf glue
perf tools: Make perf depend on libbpf
perf symbols: Fix endless loop in dso__split_kallsyms_for_kcore
perf tools: Enable pre-event inherit setting by config terms
perf symbols: we can now read separate debug-info files based on a build ID
perf symbols: Fix type error when reading a build-id
perf tools: Search for more options when passing args to -h
perf stat: Cache aggregated map entries in extra cpumap
...
Pull locking changes from Ingo Molnar:
"The main changes in this cycle were:
- More gradual enhancements to atomic ops: new atomic*_read_ctrl()
ops, synchronize atomic_{read,set}() ordering requirements between
architectures, add atomic_long_t bitops. (Peter Zijlstra)
- Add _{relaxed|acquire|release}() variants for inc/dec atomics and
use them in various locking primitives: mutex, rtmutex, mcs, rwsem.
This enables weakly ordered architectures (such as arm64) to make
use of more locking related optimizations. (Davidlohr Bueso)
- Implement atomic[64]_{inc,dec}_relaxed() on ARM. (Will Deacon)
- Futex kernel data cache footprint micro-optimization. (Rasmus
Villemoes)
- pvqspinlock runtime overhead micro-optimization. (Waiman Long)
- misc smaller fixlets"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ARM, locking/atomics: Implement _relaxed variants of atomic[64]_{inc,dec}
locking/rwsem: Use acquire/release semantics
locking/mcs: Use acquire/release semantics
locking/rtmutex: Use acquire/release semantics
locking/mutex: Use acquire/release semantics
locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec atomics
atomic: Implement atomic_read_ctrl()
atomic, arch: Audit atomic_{read,set}()
atomic: Add atomic_long_t bitops
futex: Force hot variables into a single cache line
locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
locking/osq: Relax atomic semantics
locking/qrwlock: Rename ->lock to ->wait_lock
locking/Documentation/lockstat: Fix typo - lokcing -> locking
locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h
Pull EFI changes from Ingo Molnar:
"The main changes in this cycle were:
- further EFI code generalization to make it more workable for ARM64
- various extensions, such as 64-bit framebuffer address support,
UEFI v2.5 EFI_PROPERTIES_TABLE support
- code modularization simplifications and cleanups
- new debugging parameters
- various fixes and smaller additions"
* 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
efi: Use correct type for struct efi_memory_map::phys_map
x86/efi: Fix kernel panic when CONFIG_DEBUG_VIRTUAL is enabled
efi: Add "efi_fake_mem" boot option
x86/efi: Rename print_efi_memmap() to efi_print_memmap()
efi: Auto-load the efi-pstore module
efi: Introduce EFI_NX_PE_DATA bit and set it from properties table
efi: Add support for UEFIv2.5 Properties table
efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
efifb: Add support for 64-bit frame buffer addresses
efi/arm64: Clean up efi_get_fdt_params() interface
arm64: Use core efi=debug instead of uefi_debug command line parameter
efi/x86: Move efi=debug option parsing to core
drivers/firmware: Make efi/esrt.c driver explicitly non-modular
efi: Use the generic efi.memmap instead of 'memmap'
acpi/apei: Use appropriate pgprot_t to map GHES memory
arm64, acpi/apei: Implement arch_apei_get_mem_attributes()
arm64/mm: Add PROT_DEVICE_nGnRnE and PROT_NORMAL_WT
acpi, x86: Implement arch_apei_get_mem_attributes()
efi, x86: Rearrange efi_mem_attributes()
...
Pull timer updates from Thomas Gleixner:
"The timer departement provides:
- More y2038 work in the area of ntp and pps.
- Optimization of posix cpu timers
- New time related selftests
- Some new clocksource drivers
- The usual pile of fixes, cleanups and improvements"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
timeconst: Update path in comment
timers/x86/hpet: Type adjustments
clocksource/drivers/armada-370-xp: Implement ARM delay timer
clocksource/drivers/tango_xtal: Add new timer for Tango SoCs
clocksource/drivers/imx: Allow timer irq affinity change
clocksource/drivers/exynos_mct: Use container_of() instead of this_cpu_ptr()
clocksource/drivers/h8300_*: Remove unneeded memset()s
clocksource/drivers/sh_cmt: Remove unneeded memset() in sh_cmt_setup()
clocksource/drivers/em_sti: Remove unneeded memset()s
clocksource/drivers/mediatek: Use GPT as sched clock source
clockevents/drivers/mtk: Fix spurious interrupt leading to crash
posix_cpu_timer: Reduce unnecessary sighand lock contention
posix_cpu_timer: Convert cputimer->running to bool
posix_cpu_timer: Check thread timers only when there are active thread timers
posix_cpu_timer: Optimize fastpath_timer_check()
timers, kselftest: Add 'adjtick' test to validate adjtimex() tick adjustments
timers: Use __fls in apply_slack()
clocksource: Remove return statement from void functions
net: sfc: avoid using timespec
ntp/pps: use y2038 safe types in pps_event_time
...
* pci/aer:
PCI/AER: Clear error status registers during enumeration and restore
* pci/hotplug:
PCI: pciehp: Queue power work requests in dedicated function
* pci/misc:
PCI: Turn off Request Attributes to avoid Chelsio T5 Completion erratum
x86/PCI: Make pci_subsys_init() static
PCI: Add builtin_pci_driver() to avoid registration boilerplate
PCI: Remove unnecessary "if" statement
* pci/msi:
x86/PCI: Don't alloc pcibios-irq when MSI is enabled
PCI/MSI: Export all remapped MSIs to sysfs attributes
PCI: Disable MSI on SiS 761
* pci/resource:
sparc/PCI: Add mem64 resource parsing for root bus
PCI: Expand Enhanced Allocation BAR output
PCI: Make Enhanced Allocation bitmasks more obvious
PCI: Handle Enhanced Allocation capability for SR-IOV devices
PCI: Add support for Enhanced Allocation devices
PCI: Add Enhanced Allocation register entries
PCI: Handle IORESOURCE_PCI_FIXED when assigning resources
PCI: Handle IORESOURCE_PCI_FIXED when sizing resources
PCI: Clear IORESOURCE_UNSET when reverting to firmware-assigned address
* pci/virtualization:
PCI: Fix sriov_enable() error path for pcibios_enable_sriov() failures
PCI: Wait 1 second between disabling VFs and clearing NumVFs
PCI: Reorder pcibios_sriov_disable()
PCI: Remove VFs in reverse order if virtfn_add() fails
PCI: Remove redundant validation of SR-IOV offset/stride registers
PCI: Set SR-IOV NumVFs to zero after enumeration
PCI: Enable SR-IOV ARI Capable Hierarchy before reading TotalVFs
PCI: Don't try to restore VF BARs
On some NUMA system, after dom0 up, we see below warning even if there are
enough pfn ranges that could be used for remapping:
"Unable to find available pfn range, not remapping identity pages"
Fix it to avoid getting a memory region of zero size in xen_find_pfn_range.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
* pm-cpufreq:
cpufreq: postfix policy directory with the first CPU in related_cpus
cpufreq: create cpu/cpufreq/policyX directories
cpufreq: remove cpufreq_sysfs_{create|remove}_file()
cpufreq: create cpu/cpufreq at boot time
cpufreq: Use cpumask_copy instead of cpumask_or to copy a mask
cpufreq: ondemand: Drop unnecessary locks from update_sampling_rate()
cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value
cpufreq: intel_pstate: Avoid calculation for max/min
Documentation: kernel_parameters for Intel P state driver
cpufreq: intel_pstate: Use ACPI perf configuration
cpufreq: intel-pstate: Use separate max pstate for scaling
cpufreq: intel_pstate: get P1 from TAR when available
cpufreq: Drop redundant check for inactive policies
cpufreq : powernv: Report Pmax throttling if capped below nominal frequency
cpufreq: imx: update the clock switch flow to support imx6ul
cpufreq: tegra20: remove superfluous CONFIG_PM ifdefs
cpufreq: conservative: remove 'enable' field
cpufreq: integrator: Fix module autoload for OF platform driver
* pm-cpuidle:
cpuidle: mvebu: disable the bind/unbind attributes and use builtin_platform_driver
cpuidle: mvebu: clean up multiple platform drivers
AMD Fam17h processors introduce support for the CLZERO
instruction. It zeroes out the 64 byte cache line specified in
RAX.
Add the bit here to allow /proc/cpuinfo to list the feature.
Boris: we're adding this as a separate ->x86_capability leaf
because CPUID_80000008_EBX is going to contain more feature bits
and it will fill out with time.
Signed-off-by: Wan Zongshun <Vincent.Wan@amd.com>
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
[ Wrap code in patch form, fix comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1446207099-24948-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Scalable MCA (SMCA) is a new feature in AMD Fam17h processors
which indicates presence of MCA extensions.
MCA extensions expands existing register space for the MCE banks
and also introduces a new MSR range to accommodate new banks.
Add the detection bit.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
[ Reformat mce_vendor_flags definitions and save indentation levels. Improve comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1446207099-24948-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
thread.vm86 points to per-task information -- the pointer should not
be copied on clone.
Fixes: d4ce0f26c7 ("x86/vm86: Move fields from 'struct kernel_vm86_struct' to 'struct vm86'")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Stas Sergeev <stsp@list.ru>
Link: http://lkml.kernel.org/r/71c5d6985d70ec8197c8d72f003823c81b7dcf99.1446270067.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Mapping a large range of foreign GFNs can take a long time, add a
reschedule point after each batch of 16 GFNs.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
We have been getting away with using a void* for the physical
address of the UEFI memory map, since, even on 32-bit platforms
with 64-bit physical addresses, no truncation takes place if the
memory map has been allocated by the firmware (which only uses
1:1 virtually addressable memory), which is usually the case.
However, commit:
0f96a99dab ("efi: Add "efi_fake_mem" boot option")
adds code that clones and modifies the UEFI memory map, and the
clone may live above 4 GB on 32-bit platforms.
This means our use of void* for struct efi_memory_map::phys_map has
graduated from 'incorrect but working' to 'incorrect and
broken', and we need to fix it.
So redefine struct efi_memory_map::phys_map as phys_addr_t, and
get rid of a bunch of casts that are now unneeded.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: izumi.taku@jp.fujitsu.com
Cc: kamezawa.hiroyu@jp.fujitsu.com
Cc: linux-efi@vger.kernel.org
Cc: matt.fleming@intel.com
Link: http://lkml.kernel.org/r/1445593697-1342-1-git-send-email-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__pa() in the x86 pageattr code. Since these virtual addreses are
not part of the direct mapping or kernel text mapping, passing them
to __pa() will trigger a BUG_ON() when CONFIG_DEBUG_VIRTUAL is
enabled - Sai Praneeth Prakhya
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWLLEiAAoJEC84WcCNIz1VWL4P/3i6TI8IksbMkjLrUPS80g9Q
QgYw2U1r750JrVkYGMQ6VyE1nZtSI7sbXRAjhYGLOQvs6dnCcsmAcT8zPVNSNVs4
hxRW2+hXe8LfMqddpGhSN7/s1Ssg7weEynUvQ58F+qRI4e5Lbk2zz4pqO6IQQVv2
Q4jIXosy7VlDxbx7Nn0wUov6cHNTeLQhGLqUiDJEw32qUnWA6NhYGrQof98Pu+gi
vcigb62ebicvduv6TRm5zPYANYz8AJqt7cywL7DkjT/vSKF+k/l6W5KWfOoVPd8N
99z9uTSJa1j+LuRYkPr8+YSPW9OtXD55QPkGYAFo/OWTt7j/QBgbGj0zJKWayrdK
3JefcQlH5wau5adaAjJwCm9qbapRS8De1yCMEz05gW8g5eZfqo32dnMg7UAWbWh5
fT6dAWffSxbYBLqPKV18nMgWTopwXh4IGddhB6y811SImUXhOvA7jwjJHYei8dKf
3eBN5rQ4owo0GfIiBGpKOt4ET2klgSF8xnnlYmbA7QeMRPKySNcXowsJ9wPimfZS
MGhmpz+tv/030ROuBqVjrvkGyFdxKgPbtH2hw6Ka7t88YqlyUyUIiu7pxZM2LY9g
l+6tLRWmExF993Tk3S9vCSfYEZwDHU2aYsObscgy1YEdydX4858uTSH91sVkoyBR
dtv8PmonJp/z3aw7XZKU
=xdZV
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi
Pull EFI fix from Matt Fleming:
- Fix a kernel panic by not passing EFI virtual mapping addresses to
__pa() in the x86 pageattr code. Since these virtual addreses are
not part of the direct mapping or kernel text mapping, passing them
to __pa() will trigger a BUG_ON() when CONFIG_DEBUG_VIRTUAL is
enabled. (Sai Praneeth Prakhya)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4857c91f0d changed the way how irq affinity is setup in
setup_ioapic_dest() from using the core helper function to
unconditionally calling the irq_set_affinity() callback of the
underlying irq chip.
That results in a NULL pointer dereference for the rare case where the
underlying irq chip is lapic_chip which has no irq_set_affinity()
callback. lapic_chip is occasionally used for the timer interrupt (irq
0).
The fix is simple: Check the availability of the callback instead of
calling it unconditionally.
Fixes: 4857c91f0d "x86/ioapic: Force affinity setting in setup_ioapic_dest()"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Commit 6894258eda broke drivers that pass NULL as the device pointer
to dma_alloc. The reason is that arch_dma_alloc_attrs() now calls
dma_alloc_coherent_gfp_flags() which in turn calls
dma_alloc_coherent_mask(), where the device pointer is dereferenced
unconditionally.
Fix things by moving the ISA DMA fallback device assignment before the
call to dma_alloc_coherent_gfp_flags().
Fixes: 6894258eda ("dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}")
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1445807503-8920-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* acpi-pci:
ia64/PCI/ACPI: Use common interface to support PCI host bridge
x86/PCI/ACPI: Use common interface to support PCI host bridge
ACPI/PCI: Reset acpi_root_dev->domain to 0 when pci_ignore_seg is set
PCI/ACPI: Add interface acpi_pci_root_create()
ia64/PCI: Use common struct resource_entry to replace struct iospace_resource
ia64/PCI/ACPI: Use common ACPI resource parsing interface for host bridge
ACPI/PCI: Enhance ACPI core to support sparse IO space
When CONFIG_DEBUG_VIRTUAL is enabled, all accesses to __pa(address) are
monitored to see whether address falls in direct mapping or kernel text
mapping (see Documentation/x86/x86_64/mm.txt for details), if it does
not, the kernel panics. During 1:1 mapping of EFI runtime services we access
virtual addresses which are == physical addresses, thus the 1:1 mapping
and these addresses do not fall in either of the above two regions and
hence when passed as arguments to __pa() kernel panics as reported by
Dave Hansen here https://lkml.kernel.org/r/5462999A.7090706@intel.com.
So, before calling __pa() virtual addresses should be validated which
results in skipping call to split_page_count() and that should be fine
because it is used to keep track of everything *but* 1:1 mappings.
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Glenn P Williamson <glenn.p.williamson@intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Pull x86 fixes from Ingo Molnar:
"Misc fixes: two KASAN fixes, two EFI boot fixes, two boot-delay
optimization fixes, and a fix for a IRQ handling hang observed on
virtual platforms"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm, kasan: Silence KASAN warnings in get_wchan()
compiler, atomics, kasan: Provide READ_ONCE_NOCHECK()
x86, kasan: Fix build failure on KASAN=y && KMEMCHECK=y kernels
x86/smpboot: Fix CPU #1 boot timeout
x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior
x86/ioapic: Disable interrupts when re-routing legacy IRQs
x86/setup: Extend low identity map to cover whole kernel range
x86/efi: Fix multiple GOP device support
Build cpu_hotplug for ARM and ARM64 guests.
Rename arch_(un)register_cpu to xen_(un)register_cpu and provide an
empty implementation on ARM and ARM64. On x86 just call
arch_(un)register_cpu as we are already doing.
Initialize cpu_hotplug on ARM.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
With 64KB page granularity support, the frame number will be different.
It will be easier to modify the behavior in a single place rather than
in each caller.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Currently, a grant is always based on the Xen page granularity (i.e
4KB). When Linux is using a different page granularity, a single page
will be split between multiple grants.
The new helpers will be in charge of splitting the Linux page into grants
and call a function given by the caller on each grant.
Also provide an helper to count the number of grants within a given
contiguous region.
Note that the x86/include/asm/xen/page.h is now including
xen/interface/grant_table.h rather than xen/grant_table.h. It's
necessary because xen/grant_table.h depends on asm/xen/page.h and will
break the compilation. Furthermore, only definition in
interface/grant_table.h is required.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Rename alloc_p2m() to xen_alloc_p2m_entry() and export it.
This is useful for ensuring that a p2m entry is allocated (i.e., not a
shared missing or identity entry) so that subsequent set_phys_to_machine()
calls will require no further allocations.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Make xen_alloc_p2m_entry() a nop on auto-xlate guests.
All users of alloc_xenballoon_pages() wanted low memory pages, so
remove the option for high memory.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
During setup, discard RAM regions that are above the maximum
reservation (instead of marking them as E820_UNUSABLE). This allows
hotplug memory to be placed at these addresses.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Some times it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).
Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Now, ftrace only calculate the dyn_ftrace number in the adding
breakpoint loop, not in adding update and finish update loop.
Calculate the correct dyn_ftrace, once ftrace reports the failure message
to the userspace.
Link: http://lkml.kernel.org/r/1442420382-13130-1-git-send-email-mnfhuang@gmail.com
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The pcibios-irq and MSI both use dev->irq to store the IRQ number. While
the MSI code checks for that and frees the pcibios-irq before overwriting
dev->irq, the pcibios_alloc_irq() function does not.
Usually this is not a problem, as the pcibios-irq is allocated before probe
time of the device and the MSI IRQ is allocted from the driver's probe
path.
But there are PCI devices handled by the core kernel and not by a standard
PCI driver, like the AMD IOMMU for example. For the AMD IOMMU a normal PCI
device driver does not make sense, because a driver can be forcibly unbound
from its device, which is not a good idea for an IOMMU.
Nevertheless the PCI core code tries to match the PCI device implementing
the AMD IOMMU against drivers, and allocates/frees a pcibios IRQ every time
it tries out a new driver. This overwrites the dev->irq field set by
pci_enable_msi() and sets it to 0 in the end (because the probe fails and
the pcibios-irq is freed again).
On suspend/resume this breaks the kernel, because the IRQ descriptor for
IRQ 0 is NULL.
Fix this by not allocating a pcibios-irq when MSI is already active. This
also has the benefit, that a device claimed by the core kernel can not be
probed by a PCI driver later.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
... and save us the stub.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have the MAINTAINERS file for that. Also, Andreas doesn't
have the time for this work anymore.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the remaining module functionality leftovers. Make
"dis_ucode_ldr" an early_param and make it static again. Drop
module aliases, autoloading table, description, etc.
Bump version number, while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge the early loader functionality into the driver proper. The
diff is huge but logically, it is simply moving code from the
_early.c files into the main driver.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make CONFIG_MICROCODE a bool. It was practically a bool already anyway,
since early loader was forcing it to =y.
Regardless, there's no real reason to have something be a module which
gets built-in on the majority of installations out there. And its not
like there's noticeable change in functionality - we still can load late
microcode - just the module glue disappears.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Standardize on bool instead of an inconsistent mixture of u8 and plain 'int'.
Also use u32 or 'unsigned int' instead of 'unsigned long' when a 32-bit type
suffices, generating slightly better code on x86-64.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5624E3A002000078000AC49A@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the generic help text explaining each CPU type and what to
select under the "Processor Family" prompt and not under the
M486 option. Also, amend it with the missing options.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445244077-25120-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The per CPU thermal vector init code checks if the thermal
vector is already installed and complains and bails out if it
is.
This happens after kexec, as kernel shut down does not clear the
thermal vector APIC register.
This causes two problems:
1. So we always do not fully initialize thermal reports after
kexec. The CPU is still likely initialized, as the previous
kernel should have done it. But we don't set up the software
pointer to the thermal vector, so reporting may end up with a
unknown thermal interrupt message.
2. Also it complains for every logical CPU, even though the
value is actually derived from BP only.
The problem is that we end up with one message per CPU, so on
larger systems it becomes very noisy and messes up the otherwise
nicely formatted CPU bootup numbers in the kernel log.
Just remove the check. I checked the code and there's no valid
code paths where the thermal init code for a CPU could be called
multiple times.
Why the kernel does not clean up this value on shutdown:
The thermal monitoring is controlled per logical CPU thread.
Normal shutdown code is just running on one CPU. To disable it
we would need a broadcast NMI to all CPUs on shut down. That's
overkill for this. So we just ignore it after kexec.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1445246268-26285-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
memblock_reserve() can fail but the crashkernel reservation code
doesn't check that and this can lead the user into believing
that the crashkernel region was actually reserved. Make sure we
check that return value and we exit early with a failure message
in the error case.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function doesn't give us the "Node ID" as the function name
suggests. Rather, it receives a PCI device as argument, checks
the available F3 PCI device IDs in the system and returns the
index of the matching Bus/Device IDs.
Rename it to amd_pci_dev_to_node_id().
No functional change is introduced.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1445246268-26285-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
People reported that when allocating crashkernel memory using
the ",high" and ",low" syntax, there were cases where the
reservation of the high portion succeeds but the reservation of
the low portion fails.
Then kexec can load the kdump kernel successfully, but booting
the kdump kernel fails as there's no low memory.
The low memory allocation for the kdump kernel can fail on large
systems for a couple of reasons. For example, the manually
specified crashkernel low memory can be too large and thus no
adequate memblock region would be found.
Therefore, we try to reserve low memory for the crash kernel
*after* the high memory portion has been allocated. If that
fails, we free crashkernel high memory too and return. The user
can then take measures accordingly.
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Massage text. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1445246268-26285-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/usb/asix_common.c
net/ipv4/inet_connection_sock.c
net/switchdev/switchdev.c
In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.
The other two conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
get_wchan() is racy by design, it may access volatile stack
of running task, thus it may access redzone in a stack frame
and cause KASAN to warn about this.
Use READ_ONCE_NOCHECK() to silence these warnings.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Link: http://lkml.kernel.org/r/1445243838-17763-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables the suport for the PERF_SAMPLE_BRANCH_CALL
for Intel x86 processors. When the processor support LBR filtering
this the selection is done in hardware. Otherwise, the filter is
applied by software. Note that we chose to include zero length calls
because they also represent calls.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: khandual@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1444720151-10275-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
b20112edea ("perf/x86: Improve accuracy of perf/sched clock")
allowed the time_shift value in perf_event_mmap_page to be as much
as 32. Unfortunately the documented algorithms for using time_shift
have it shifting an integer, whereas to work correctly with the value
32, the type must be u64.
In the case of perf tools, Intel PT decodes correctly but the timestamps
that are output (for example by perf script) have lost 32-bits of
granularity so they look like they are not changing at all.
Fix by limiting the shift to 31 and adjusting the multiplier accordingly.
Also update the documentation of perf_event_mmap_page so that new code
based on it will be more future-proof.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: b20112edea ("perf/x86: Improve accuracy of perf/sched clock")
Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
modify_ldt() was declared as an external symbol. Despite the man
page for this syscall telling that there is no wrapper in glibc,
since version 2.1 there actually is, so linking to the glibc
works.
Since modify_ldt() is not a POSIX interface, other libc
implementations do not always provide a wrapper function.
Even glibc headers do not provide a corresponding declaration.
So go the recommended way to call this using syscall().
Signed-off-by: Hans-Werner Hilse <hwhilse@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Commit fd13690218 ("KVM: x86: MMU: Move mapping_level_dirty_bitmap()
call in mapping_level()") forgot to initialize force_pt_level to false
in FNAME(page_fault)() before calling mapping_level() like
nonpaging_map() does. This can sometimes result in forcing page table
level mapping unnecessarily.
Fix this and move the first *force_pt_level check in mapping_level()
before kvm_vcpu_gfn_to_memslot() call to make it a bit clearer that
the variable must be initialized before mapping_level() gets called.
This change can also avoid calling kvm_vcpu_gfn_to_memslot() when
!check_hugepage_cache_consistency() check in tdp_page_fault() forces
page table level mapping.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Not zeroing EFER means that a 32-bit firmware cannot enter paging mode
without clearing EFER.LME first (which it should not know about).
Yang Zhang from Intel confirmed that the manual is wrong and EFER is
cleared to zero on INIT.
Fixes: d28bc9dd25
Cc: stable@vger.kernel.org
Cc: Yang Z Zhang <yang.z.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Declaration of memcpy() is hidden under #ifndef CONFIG_KMEMCHECK.
In asm/efi.h under #ifdef CONFIG_KASAN we #undef memcpy(), due to
which the following happens:
In file included from arch/x86/kernel/setup.c:96:0:
./arch/x86/include/asm/desc.h: In function ‘native_write_idt_entry’:
./arch/x86/include/asm/desc.h:122:2: error: implicit declaration of function ‘memcpy’ [-Werror=implicit-function-declaration] memcpy(&idt[entry], gate, sizeof(*gate));
^
cc1: some warnings being treated as errors
make[2]: *** [arch/x86/kernel/setup.o] Error 1
We will get rid of that #undef in asm/efi.h eventually.
But in the meanwhile move memcpy() declaration out of #ifdefs
to fix the build.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444994933-28328-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
a9bcaa02a5 ("x86/smpboot: Remove SIPI delays from cpu_up()")
Caused some Intel Core2 processors to time-out when bringing up CPU #1,
resulting in the missing of that CPU after bootup.
That patch reduced the SIPI delays from udelay() 300, 200 to udelay() 0,
0 on modern processors.
Several Intel(R) Core(TM)2 systems failed to bring up CPU #1 10/10 times
after that change.
Increasing either of the SIPI delays to udelay(1) results in
success. So here we increase both to udelay(10). While this may
be 20x slower than the absolute minimum, it is still 20x to 30x
faster than the original code.
Tested-by: Donald Parsons <dparsons@brightdsl.net>
Tested-by: Shane <shrybman@teksavvy.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/6dd554ee8945984d85aafb2ad35793174d068af0.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For legacy machines cpu_init_udelay defaults to 10,000.
For modern machines it is set to 0.
The user should be able to set cpu_init_udelay to
any value on the cmdline, including 10,000.
Before this patch, that was seen as "unchanged from default"
and thus on a modern machine, the user request was ignored
and the delay was set to 0.
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/de363cdbbcfcca1d22569683f7eb9873e0177251.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We either need to restore them before popping and thus changing
ESP, or we need to adjust the offsets. The former is simpler.
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5f310f739b x86/entry/32: ("Re-implement SYSENTER using the new C path")
Link: http://lkml.kernel.org/r/461e5c7d8fa3821529893a4893ac9c4bc37f9e17.1445035014.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I rewrote entry_INT80_32, I thought that int80 was an
interrupt gate. It's a trap gate. *facepalm*
Thanks to Brian Gerst for pointing out that it's better to
change the entry code than to change the gate type.
Suggested-by: Brian Gerst <brgerst@gmail.com>
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 150ac78d63 ("x86/entry/32: Switch INT80 to the new C syscall path")
Link: http://lkml.kernel.org/r/dc09d9b574a5c1dcca996847875c73f8341ce0ad.1445035014.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use common interface to simplify ACPI PCI host bridge implementation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reset acpi_root_dev->domain to 0 when pci_ignore_seg is set to keep
consistence between ACPI PCI root device and PCI host bridge device.
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A sporadic hang with consequent crash is observed when booting Hyper-V Gen1
guests:
Call Trace:
<IRQ>
[<ffffffff810ab68d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8107b616>] queue_work_on+0x46/0x90
[<ffffffff81365696>] ? add_interrupt_randomness+0x176/0x1d0
...
<EOI>
[<ffffffff81471ddb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60
[<ffffffff810c295e>] __irq_put_desc_unlock+0x1e/0x40
[<ffffffff810c5c35>] irq_modify_status+0xb5/0xd0
[<ffffffff8104adbb>] mp_register_handler+0x4b/0x70
[<ffffffff8104c55a>] mp_irqdomain_alloc+0x1ea/0x2a0
[<ffffffff810c7f10>] irq_domain_alloc_irqs_recursive+0x40/0xa0
[<ffffffff810c860c>] __irq_domain_alloc_irqs+0x13c/0x2b0
[<ffffffff8104b070>] alloc_isa_irq_from_domain.isra.1+0xc0/0xe0
[<ffffffff8104bfa5>] mp_map_pin_to_irq+0x165/0x2d0
[<ffffffff8104c157>] pin_2_irq+0x47/0x80
[<ffffffff81744253>] setup_IO_APIC+0xfe/0x802
...
[<ffffffff814631c0>] ? rest_init+0x140/0x140
The issue is easily reproducible with a simple instrumentation: if
mdelay(10) is put between mp_setup_entry() and mp_register_handler() calls
in mp_irqdomain_alloc() Hyper-V guest always fails to boot when re-routing
IRQ0. The issue seems to be caused by the fact that we don't disable
interrupts while doing IOPIC programming for legacy IRQs and IRQ0 actually
happens.
Protect the setup sequence against concurrent interrupts.
[ tglx: Make the protection unconditional and not only for legacy
interrupts ]
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Link: http://lkml.kernel.org/r/1444930943-19336-1-git-send-email-vkuznets@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap. efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.
For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table. This makes the EFI stub's trick "just work".
However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault). Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.
For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM. However, even under KVM one can clearly
see that the page table is bogus:
$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.
Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 000000ff
IDT= fffbb000 000007ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
...
The page directory is sane:
(gdb) x/4wx 0x32b7000
0x32b7000: 0x03398063 0x03399063 0x0339a063 0x0339b063
(gdb) x/4wx 0x3398000
0x3398000: 0x00000163 0x00001163 0x00002163 0x00003163
(gdb) x/4wx 0x3399000
0x3399000: 0x00400003 0x00401003 0x00402003 0x00403003
but our particular page directory entry is empty:
(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070: 0x00000000
[ It appears that you can skate past this issue if you don't receive
any interrupts while the bogus GDT pointer is loaded, or if you avoid
reloading the segment registers in general.
Andy Lutomirski provides some additional insight:
"AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory. Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."
Up until commit 23a0d4e8fa ("efi: Disable interrupts around EFI
calls, not in the epilog/prolog calls") interrupts were disabled
around the prolog and epilog calls, and the functional GDT was
re-installed before interrupts were re-enabled.
Which explains why no one has hit this issue until now. ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
[ Updated changelog. ]
As reported at https://bugs.launchpad.net/qemu/+bug/1494350,
it is possible to have vcpu->arch.st.last_steal initialized
from a thread other than vcpu thread, say the iothread, via
KVM_SET_MSRS.
Which can cause an overflow later (when subtracting from vcpu threads
sched_info.run_delay).
To avoid that, move steal time accumulation to vcpu entry time,
before copying steal time data to guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calling kvm_vcpu_gfn_to_memslot() twice in mapping_level() should be
avoided since getting a slot by binary search may not be negligible,
especially for virtual machines with many memory slots.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that it has only one caller, and its name is not so helpful for
readers, remove it. The new memslot_valid_for_gpte() function
makes it possible to share the common code between
gfn_to_memslot_dirty_bitmap() and mapping_level().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is necessary to eliminate an extra memory slot search later.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a bonus, an extra memory slot search can be eliminated when
is_self_change_mapping is true.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will be passed to a function later.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently we always write the next_rip of the shadow vmcb to
the guests vmcb when we emulate a vmexit. This could confuse
the guest when its cpuid indicated no support for the
next_rip feature.
Fix this by only propagating next_rip if the guest actually
supports it.
Cc: Bandan Das <bsd@redhat.com>
Cc: Dirk Mueller <dmueller@suse.com>
Tested-By: Dirk Mueller <dmueller@suse.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose VPID capability to L1. For nested guests, we don't do anything
specific for single context invalidation. Hence, only advertise support
for global context invalidation. The major benefit of nested VPID comes
from having separate vpids when switching between L1 and L2, and also
when L2's vCPUs not sched in/out on L1.
Reviewed-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>