- Use a single source list of hypercalls, generating other tables
etc. at build time.
- Add a "Xen PV" APIC driver to support >255 VCPUs in PV guests.
- Significant performance improve to guest save/restore/migration.
- scsiback/front save/restore support.
- Infrastructure for multi-page xenbus rings.
- Misc fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVL6OZAAoJEFxbo/MsZsTRs3YH/2AycBHs129DNnc2OLmOklBz
AdD43k+FOfZlv0YU80WmPVmOpGGHGB5Pqkix2KtnvPYmtx3pb/5ikhDwSTWZpqBl
Qq6/RgsRjYZ8VMKqrMTkJMrJWHQYbg8lgsP5810nsFBn/Qdbxms+WBqpMkFVo3b2
rvUZj8QijMJPS3qr55DklVaOlXV4+sTAytTdCiubVnaB/agM2jjRflp/lnJrhtTg
yc4NTrIlD1RsMV/lNh92upBP/pCm6Bs0zQ2H1v3hkdhBBmaO0IVXpSheYhfDOHfo
9v209n137N7X86CGWImFk6m2b+EfiFnLFir07zKSA+iZwkYKn75znSdPfj0KCc0=
=bxTm
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen features and fixes from David Vrabel:
- use a single source list of hypercalls, generating other tables etc.
at build time.
- add a "Xen PV" APIC driver to support >255 VCPUs in PV guests.
- significant performance improve to guest save/restore/migration.
- scsiback/front save/restore support.
- infrastructure for multi-page xenbus rings.
- misc fixes.
* tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pci: Try harder to get PXM information for Xen
xenbus_client: Extend interface to support multi-page ring
xen-pciback: also support disabling of bus-mastering and memory-write-invalidate
xen: support suspend/resume in pvscsi frontend
xen: scsiback: add LUN of restored domain
xen-scsiback: define a pr_fmt macro with xen-pvscsi
xen/mce: fix up xen_late_init_mcelog() error handling
xen/privcmd: improve performance of MMAPBATCH_V2
xen: unify foreign GFN map/unmap for auto-xlated physmap guests
x86/xen/apic: WARN with details.
x86/xen: Provide a "Xen PV" APIC driver to support >255 VCPUs
xen/pciback: Don't print scary messages when unsupported by hypervisor.
xen: use generated hypercall symbols in arch/x86/xen/xen-head.S
xen: use generated hypervisor symbols in arch/x86/xen/trace.c
xen: synchronize include/xen/interface/xen.h with xen
xen: build infrastructure for generating hypercall depending symbols
xen: balloon: Use static attribute groups for sysfs entries
xen: pcpu: Use static attribute groups for sysfs entry
Merge first patchbomb from Andrew Morton:
- arch/sh updates
- ocfs2 updates
- kernel/watchdog feature
- about half of mm/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
Documentation: update arch list in the 'memtest' entry
Kconfig: memtest: update number of test patterns up to 17
arm: add support for memtest
arm64: add support for memtest
memtest: use phys_addr_t for physical addresses
mm: move memtest under mm
mm, hugetlb: abort __get_user_pages if current has been oom killed
mm, mempool: do not allow atomic resizing
memcg: print cgroup information when system panics due to panic_on_oom
mm: numa: remove migrate_ratelimited
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
s390: standardize mmap_rnd() usage
powerpc: standardize mmap_rnd() usage
mips: extract logic for mmap_rnd()
arm64: standardize mmap_rnd() usage
x86: standardize mmap_rnd() usage
arm: factor out mmap ASLR into mmap_rnd
...
We would want to use number of page table level to define mm_struct.
Let's expose it as CONFIG_PGTABLE_LEVELS.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull RCU changes from Ingo Molnar:
"The main changes in this cycle were:
- changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.
- add in-kernel API to enable and disable expediting of normal RCU
grace periods.
- improve RCU's handling of (hotplug-) outgoing CPUs.
- NO_HZ_FULL_SYSIDLE fixes.
- tiny-RCU updates to make it more tiny.
- documentation updates.
- miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
cpu: Defer smpboot kthread unparking until CPU known to scheduler
rcu: Associate quiescent-state reports with grace period
rcu: Yet another fix for preemption and CPU hotplug
rcu: Add diagnostics to grace-period cleanup
rcutorture: Default to grace-period-initialization delays
rcu: Handle outgoing CPUs on exit from idle loop
cpu: Make CPU-offline idle-loop transition point more precise
rcu: Eliminate ->onoff_mutex from rcu_node structure
rcu: Process offlining and onlining only at grace-period start
rcu: Move rcu_report_unblock_qs_rnp() to common code
rcu: Rework preemptible expedited bitmask handling
rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
rcutorture: Enable slow grace-period initializations
rcu: Provide diagnostic option to slow down grace-period initialization
rcu: Detect stalls caused by failure to propagate up rcu_node tree
rcu: Eliminate empty HOTPLUG_CPU ifdef
rcu: Simplify sync_rcu_preempt_exp_init()
rcu: Put all orphan-callback-related code under same comment
rcu: Consolidate offline-CPU callback initialization
...
Pull x86 asm changes from Ingo Molnar:
"There were lots of changes in this development cycle:
- over 100 separate cleanups, restructuring changes, speedups and
fixes in the x86 system call, irq, trap and other entry code, part
of a heroic effort to deobfuscate a decade old spaghetti asm code
and its C code dependencies (Denys Vlasenko, Andy Lutomirski)
- alternatives code fixes and enhancements (Borislav Petkov)
- simplifications and cleanups to the compat code (Brian Gerst)
- signal handling fixes and new x86 testcases (Andy Lutomirski)
- various other fixes and cleanups
By their nature many of these changes are risky - we tried to test
them well on many different x86 systems (there are no known
regressions), and they are split up finely to help bisection - but
there's still a fair bit of residual risk left so caveat emptor"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (148 commits)
perf/x86/64: Report regs_user->ax too in get_regs_user()
perf/x86/64: Simplify regs_user->abi setting code in get_regs_user()
perf/x86/64: Do report user_regs->cx while we are in syscall, in get_regs_user()
perf/x86/64: Do not guess user_regs->cs, ss, sp in get_regs_user()
x86/asm/entry/32: Tidy up JNZ instructions after TESTs
x86/asm/entry/64: Reduce padding in execve stubs
x86/asm/entry/64: Remove GET_THREAD_INFO() in ret_from_fork
x86/asm/entry/64: Simplify jumps in ret_from_fork
x86/asm/entry/64: Remove a redundant jump
x86/asm/entry/64: Optimize [v]fork/clone stubs
x86/asm/entry: Zero EXTRA_REGS for stub32_execve() too
x86/asm/entry/64: Move stub_x32_execvecloser() to stub_execveat()
x86/asm/entry/64: Use common code for rt_sigreturn() epilogue
x86/asm/entry/64: Add forgotten CFI annotation
x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layout
x86/asm/entry/64: Move opportunistic sysret code to syscall code path
x86, selftests: Add sigreturn selftest
x86/alternatives: Guard NOPs optimization
x86/asm/entry: Clear EXTRA_REGS for all executable formats
x86/signal: Remove pax argument from restore_sigcontext
...
Pull timer updates from Ingo Molnar:
"The main changes in this cycle were:
- clockevents state machine cleanups and enhancements (Viresh Kumar)
- clockevents broadcast notifier horror to state machine conversion
and related cleanups (Thomas Gleixner, Rafael J Wysocki)
- clocksource and timekeeping core updates (John Stultz)
- clocksource driver updates and fixes (Ben Dooks, Dmitry Osipenko,
Hans de Goede, Laurent Pinchart, Maxime Ripard, Xunlei Pang)
- y2038 fixes (Xunlei Pang, John Stultz)
- NMI-safe ktime_get_raw_fast() and general refactoring of the clock
code, in preparation to perf's per event clock ID support (Peter
Zijlstra)
- generic sched/clock fixes, optimizations and cleanups (Daniel
Thompson)
- clockevents cpu_down() race fix (Preeti U Murthy)"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits)
timers/PM: Drop unnecessary braces from tick_freeze()
timers/PM: Fix up tick_unfreeze()
timekeeping: Get rid of stale comment
clockevents: Cleanup dead cpu explicitely
clockevents: Make tick handover explicit
clockevents: Remove broadcast oneshot control leftovers
sched/idle: Use explicit broadcast oneshot control function
ARM: Tegra: Use explicit broadcast oneshot control function
ARM: OMAP: Use explicit broadcast oneshot control function
intel_idle: Use explicit broadcast oneshot control function
ACPI/idle: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast oneshot control function
x86/amd/idle, clockevents: Use explicit broadcast oneshot control functions
clockevents: Provide explicit broadcast oneshot control functions
clockevents: Remove the broadcast control leftovers
ARM: OMAP: Use explicit broadcast control function
intel_idle: Use explicit broadcast control function
cpuidle: Use explicit broadcast control function
ACPI/processor: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast control function
...
Some of x86 bare-metal and Xen CPU initialization code is common
between the two and therefore can be factored out to avoid code
duplication.
As a side effect, doing so will also extend the fix provided by
commit a7fcf28d43 ("x86/asm/entry: Replace this_cpu_sp0() with
current_top_of_stack() to x86_32") to 32-bit Xen PV guests.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: konrad.wilk@oracle.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1427897534-5086-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Xen calls on every cpu into tick_resume() which is just wrong.
tick_resume() is for the syscore global suspend/resume
invocation. What XEN really wants is a per cpu local resume
function.
Provide a tick_resume_local() function and use it in XEN.
Also provide a complementary tick_suspend_local() and modify
tick_unfreeze() and tick_freeze(), respectively, to use the
new local tick resume/suspend functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Combined two patches, rebased, modified subject/changelog. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1698741.eezk9tnXtG@vostro.rjw.lan
[ Merged to latest timers/core. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
clockevents_notify() is a leftover from the early design of the
clockevents facility. It's really not a notification mechanism,
it's a multiplex call.
We are way better off to have explicit calls instead of this
monstrosity. Split out the suspend/resume() calls and invoke
them directly from the call sites.
No locking required at this point because these calls happen
with interrupts disabled and a single cpu online.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebased on top of 4.0-rc5. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/713674030.jVm1qaHuPf@vostro.rjw.lan
[ Rebased on top of latest timers/core. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.
- Miscellaneous fixes.
- Add in-kernel API to enable and disable expediting of normal RCU
grace periods.
- Improve RCU's handling of (hotplug-) outgoing CPUs.
Note: ARM support is lagging a bit here, and these improved
diagnostics might generate (harmless) splats.
- NO_HZ_FULL_SYSIDLE fixes.
- Tiny RCU updates to make it more tiny.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PER_CPU_VAR(kernel_stack) was set up in a way where it points
five stack slots below the top of stack.
Presumably, it was done to avoid one "sub $5*8,%rsp"
in syscall/sysenter code paths, where iret frame needs to be
created by hand.
Ironically, none of them benefits from this optimization,
since all of them need to allocate additional data on stack
(struct pt_regs), so they still have to perform subtraction.
This patch eliminates KERNEL_STACK_OFFSET.
PER_CPU_VAR(kernel_stack) now points directly to top of stack.
pt_regs allocations are adjusted to allocate iret frame as well.
Hopefully we can merge it later with 32-bit specific
PER_CPU_VAR(cpu_current_top_of_stack) variable...
Net result in generated code is that constants in several insns
are changed.
This change is necessary for changing struct pt_regs creation
in SYSCALL64 code path from MOV to PUSH instructions.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426785469-15125-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 054954eb05 ("xen: switch to linear
virtual mapped sparse p2m list") introduced a regression regarding to
memory hotplug for a pv-domain: as the virtual space for the p2m list
is allocated for the to be expected memory size of the domain only,
hotplugged memory above that size will not be usable by the domain.
Correct this by using a configurable size for the p2m list in case of
memory hotplug enabled (default supported memory size is 512 GB for
64 bit domains and 4 GB for 32 bit domains).
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org> # 3.19+
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Make clear that the usage of PER_CPU(old_rsp) is purely temporary,
by renaming it to 'rsp_scratch'.
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the IOCTL_PRIVCMD_MMAPBATCH_V2 (and older V1 version) map
multiple frames at a time rather than one at a time, despite the pages
being non-consecutive GFNs.
xen_remap_foreign_mfn_array() is added which maps an array of GFNs
(instead of a consecutive range of GFNs).
Since per-frame errors are returned in an array, privcmd must set the
MMAPBATCH_V1 error bits as part of the "report errors" phase, after
all the frames are mapped.
Migrate times are significantly improved (when using a PV toolstack
domain). For example, for an idle 12 GiB PV guest:
Before After
real 0m38.179s 0m26.868s
user 0m15.096s 0m13.652s
sys 0m28.988s 0m18.732s
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Auto-translated physmap guests (arm, arm64 and x86 PVHVM/PVH) map and
unmap foreign GFNs using the same method (updating the physmap).
Unify the two arm and x86 implementations into one commont one.
Note that on arm and arm64, the correct error code will be returned
(instead of always -EFAULT) and map/unmap failure warnings are no
longer printed. These changes are required if the foreign domain is
paging (-ENOENT failures are expected and must be propagated up to the
caller).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
We should not be writting to the APIC registers under
Xen PV. But if we do instead of just giving an blanket
warning - include some details to help troubleshoot.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Instead of mangling the default APIC driver, provide a Xen PV guest
specific one that explicitly provides appropriate methods.
This allows use to report that all APIC IDs are valid, allowing dom0
to boot with more than 255 VCPUs.
Since the probe order of APIC drivers is link dependent, we add in an
late probe function to change to the Xen PV if it hadn't been done
during bootup.
Suggested-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Cathy Avery <cathy.avery@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Instead of manually list each hypercall in arch/x86/xen/xen-head.S
use the auto generated symbol list.
This also corrects the wrong address of xen_hypercall_mca which was
located 32 bytes higher than it should.
Symbol addresses have been verified to match the correct ones via
objdump output.
Based-on-patch-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Instead of manually list all hypervisor calls in arch/x86/xen/trace.c
use the auto generated list.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The header include/xen/interface/xen.h doesn't contain all definitions
from Xen's version of that header. Update it accordingly.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
This commit removes the open-coded CPU-offline notification with new
common code. Among other things, this change avoids calling scheduler
code using RCU from an offline CPU that RCU is ignoring. It also allows
Xen to notice at online time that the CPU did not go offline correctly.
Note that Xen has the surviving CPU carry out some cleanup operations,
so if the surviving CPU times out, these cleanup operations might have
been carried out while the outgoing CPU was still running. It might
therefore be unwise to bring this CPU back online, and this commit
avoids doing so.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <x86@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: <xen-devel@lists.xenproject.org>
We currently store references to the top of the kernel stack in
multiple places: kernel_stack (with an offset) and
init_tss.x86_tss.sp0 (no offset). The latter is defined by
hardware and is a clean canonical way to find the top of the
stack. Add an accessor so we can start using it.
This needs minor paravirt tweaks. On native, sp0 defines the
top of the kernel stack and is therefore always correct. On Xen
and lguest, the hypervisor tracks the top of the stack, but we
want to start reading sp0 in the kernel. Fixing this is simple:
just update our local copy of sp0 as well as the hypervisor's
copy on task switches.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8d675581859712bee09a055ed8f785d80dac1eca.1425611534.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 054954eb05 ("xen: switch to
linear virtual mapped sparse p2m list") introduced an error.
During initialization of the p2m list a p2m identity area mapped by
a complete identity pmd entry has to be split up into smaller chunks
sometimes, if a non-identity pfn is introduced in this area.
If this non-identity pfn is not at index 0 of a p2m page the new
p2m page needed is initialized with wrong identity entries, as the
identity pfns don't start with the value corresponding to index 0,
but with the initial non-identity pfn. This results in weird wrong
mappings.
Correct the wrong initialization by starting with the correct pfn.
Cc: stable@vger.kernel.org # 3.19
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Commit 1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
introduced CR4 shadows.
These shadows are initialized in early boot code. The commit missed
initialization for 64-bit PV(H) guests that this patch adds.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Commit d524165cb8 ("x86/apic: Check x2apic early") tests X2APIC_ENABLE
bit of MSR_IA32_APICBASE when CONFIG_X86_X2APIC is off and panics
the kernel when this bit is set.
Xen's PV guests will pass this MSR read to the hypervisor which will
return its version of the MSR, where this bit might be set. Make sure
we clear it before returning MSR value to the caller.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Pull locking fixes from Ingo Molnar:
"Two fixes: the paravirt spin_unlock() corruption/crash fix, and an
rtmutex NULL dereference crash fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/spinlocks/paravirt: Fix memory corruption on unlock
locking/rtmutex: Avoid a NULL pointer dereference on deadlock
Paravirt spinlock clears slowpath flag after doing unlock.
As explained by Linus currently it does:
prev = *lock;
add_smp(&lock->tickets.head, TICKET_LOCK_INC);
/* add_smp() is a full mb() */
if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);
which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.
Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.
So this patch implements the fix with:
1. Moving slowpath flag to head (Oleg):
Unlocked locks don't care about the slowpath flag; therefore we can keep
it set after the last unlock, and clear it again on the first (try)lock.
-- this removes the write after unlock. note that keeping slowpath flag would
result in unnecessary kicks.
By moving the slowpath flag from the tail to the head ticket we also avoid
the need to access both the head and tail tickets on unlock.
2. use xadd to avoid read/write after unlock that checks the need for
unlock_kick (Linus):
We further avoid the need for a read-after-release by using xadd;
the prev head value will include the slowpath flag and indicate if we
need to do PV kicking of suspended spinners -- on modern chips xadd
isn't (much) more expensive than an add + load.
Result:
setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
benchmark overcommit %improve
kernbench 1x -0.13
kernbench 2x 0.02
dbench 1x -1.77
dbench 2x -0.63
[Jeremy: Hinted missing TICKET_LOCK_INC for kick]
[Oleg: Moved slowpath flag to head, ticket_equals idea]
[PeterZ: Added detailed changelog]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Jones <drjones@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: a.ryabinin@samsung.com
Cc: dave@stgolabs.net
Cc: hpa@zytor.com
Cc: jasowang@redhat.com
Cc: jeremy@goop.org
Cc: paul.gortmaker@windriver.com
Cc: riel@redhat.com
Cc: tglx@linutronix.de
Cc: waiman.long@hp.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 perf updates from Ingo Molnar:
"This series tightens up RDPMC permissions: currently even highly
sandboxed x86 execution environments (such as seccomp) have permission
to execute RDPMC, which may leak various perf events / PMU state such
as timing information and other CPU execution details.
This 'all is allowed' RDPMC mode is still preserved as the
(non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is
that RDPMC access is only allowed if a perf event is mmap-ed (which is
needed to correctly interpret RDPMC counter values in any case).
As a side effect of these changes CR4 handling is cleaned up in the
x86 code and a shadow copy of the CR4 value is added.
The extra CR4 manipulation adds ~ <50ns to the context switch cost
between rdpmc-capable and rdpmc-non-capable mms"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
perf/x86: Only allow rdpmc if a perf_event is mapped
perf: Pass the event to arch_perf_update_userpage()
perf: Add pmu callbacks to track event mapping and unmapping
x86: Add a comment clarifying LDT context switching
x86: Store a per-cpu shadow copy of CR4
x86: Clean up cr4 manipulation
This series tightens the rules for ACCESS_ONCE to only work
on scalar types. It also contains the necessary fixups as
indicated by build bots of linux-next.
Now everything is in place to prevent new non-scalar users
of ACCESS_ONCE and we can continue to convert code to
READ_ONCE/WRITE_ONCE.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJU2H5MAAoJEBF7vIC1phx8Jm4QALPqKOMDSUBCrqJFWJeujtv2
ILxJKsnjrAlt3dxnlVI3q6e5wi896hSce75PcvZ/vs/K3GdgMxOjrakBJGTJ2Qjg
5njW9aGJDDr/SYFX33MLWfqy222TLtpxgSz379UgXjEzB0ymMWbJJ3FnGjVqQJdp
RXDutpncRySc/rGHh9UPREIRR5GvimONsWE2zxgXjUzB8vIr2fCGvHTXfIb6RKbQ
yaFoihzn0m+eisc5Gy4tQ1qhhnaYyWEGrINjHTjMFTQOWTlH80BZAyQeLdbyj2K5
qloBPS/VhBTr/5TxV5onM+nVhu0LiblVNrdMHVeb7jyST4LeFOCaWK98lB3axSB5
v/2D1YKNb3g1U1x3In/oNGQvs36zGiO1uEdMF1l8ZFXgCvHmATSFSTWBtqUhb5Ew
JA3YyqMTG6dpRTMSnmu3/frr4wDqnxlB/ktQC1pf3tDp87mr1ZYEy/dQld+tltjh
9Z5GSdrw0nf91wNI3DJf+26ZDdz5B+EpDnPnOKG8anI1lc/mQneI21/K/xUteFXw
UZ1XGPLV2vbv9/a13u44SdjenHvQs1egsGeebMxVPoj6WmDLVmcIqinyS6NawYzn
IlDGy/b3bSnXWMBP0ZVBX94KWLxqDDc4a/ayxsmxsP1tPZ+jDXjVDa7E3zskcHxG
Uj5ULCPyU087t8Sl76mv
=Dj70
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
"Tighten rules for ACCESS_ONCE
This series tightens the rules for ACCESS_ONCE to only work on scalar
types. It also contains the necessary fixups as indicated by build
bots of linux-next. Now everything is in place to prevent new
non-scalar users of ACCESS_ONCE and we can continue to convert code to
READ_ONCE/WRITE_ONCE"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
kernel: Fix sparse warning for ACCESS_ONCE
next: sh: Fix compile error
kernel: tighten rules for ACCESS ONCE
mm/gup: Replace ACCESS_ONCE with READ_ONCE
x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4). Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.
This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the "foreign" page flag to mark pages that have a grant map. Use
page->private to store information of the grant (the granting domain
and the grant reference).
Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Ballooned pages are always used for grant maps which means the
original frame does not need to be saved in page->index nor restored
after the grant unmap.
This allows the workaround in netback for the conflicting use of the
(unionized) page->index and page->pfmemalloc to be removed.
Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The scratch frame mappings for ballooned pages and the m2p override
are broken. Remove them in preparation for replacing them with
simpler mechanisms that works.
The scratch pages did not ensure that the page was not in use. In
particular, the foreign page could still be in use by hardware. If
the guest reused the frame the hardware could read or write that
frame.
The m2p override did not handle the same frame being granted by two
different grant references. Trying an M2P override lookup in this
case is impossible.
With the m2p override removed, the grant map/unmap for the kernel
mappings (for x86 PV) can be easily batched in
set_foreign_p2m_mapping() and clear_foreign_p2m_mapping().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
When unmapping grants, instead of converting the kernel map ops to
unmap ops on the fly, pre-populate the set of unmap ops.
This allows the grant unmap for the kernel mappings to be trivially
batched in the future.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
The file arch/x86/xen/mmu.c has some functions that can be annotated
with "__init".
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Some more functions in arch/x86/xen/setup.c can be made "__init".
xen_ignore_unusable() can be made "static".
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
In many places in arch/x86/xen/setup.c wrong types are used for
physical addresses (u64 or unsigned long long). Use phys_addr_t
instead.
Use macros already defined instead of open coding them.
Correct some other type mismatches.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Remove extern declarations in arch/x86/xen/setup.c which are either
not used or redundant. Move needed other extern declarations to
xen-ops.h
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Call __set_current_state() instead of assigning the new state directly.
These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments,
keeping track of who changed the state.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
In Dom0's the use of the TSC clocksource (whenever it is stable enough to
be used) instead of the Xen clocksource should not cause any issues, as
Dom0 VMs never live-migrated. The TSC clocksource is somewhat more
efficient than the Xen paravirtualised clocksource, thus it should have
higher rating.
This patch decreases the rating of the Xen clocksource in Dom0s to 275.
Which is half-way between the rating of the TSC clocksource (300) and the
hpet clocksource (250).
Cc: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Imre Palik <imrep@amazon.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the p2m code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
- Several critical linear p2m fixes that prevented some hosts from
booting.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJUtQHNAAoJEFxbo/MsZsTR/qgH/iiW4k2T8dBGZ7TPyzt88iyT
4caWjuujp2OUaRqhBQdY7z05uai6XxgJLwDyqiO+qHaRUj+ZWCrjh/ZFPU1+09hK
GdwPMWU7xMRs/7F2ANO03jJ/ktvsYXtazcVrV89Q3t+ZZJIQ/THovDkaoa+dF2lh
W8d5H7N2UNCJLe9w2fm5iOq4SKoTsJOq6pVQ6gUBqJcgkSDWavd6bowXnTlcepZN
tNaSMZsOt4CAvYQIa0nKPJo6Q4QN3buRQMWEOAOmGVT/RkVi68wirwk59uNzcS7E
HjhqxFjhXYamNTuwHYZlchBrZutdbymSlucVucb1wAoxRAX+Wd1jk5EPl6zLv4w=
=kFSE
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
"Several critical linear p2m fixes that prevented some hosts from
booting"
* tag 'stable/for-linus-3.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: properly retrieve NMI reason
xen: check for zero sized area when invalidating memory
xen: use correct type for physical addresses
xen: correct race in alloc_p2m_pmd()
xen: correct error for building p2m list on 32 bits
x86/xen: avoid freeing static 'name' when kasprintf() fails
x86/xen: add extra memory for remapped frames during setup
x86/xen: don't count how many PFNs are identity mapped
x86/xen: Free bootmem in free_p2m_page() during early boot
x86/xen: Remove unnecessary BUG_ON(preemptible()) in xen_setup_timer()
Using the native code here can't work properly, as the hypervisor would
normally have cleared the two reason bits by the time Dom0 gets to see
the NMI (if passed to it at all). There's a shared info field for this,
and there's an existing hook to use - just fit the two together. This
is particularly relevant so that NMIs intended to be handled by APEI /
GHES actually make it to the respective handler.
Note that the hook can (and should) be used irrespective of whether
being in Dom0, as accessing port 0x61 in a DomU would be even worse,
while the shared info field would just hold zero all the time. Note
further that hardware NMI handling for PVH doesn't currently work
anyway due to missing code in the hypervisor (but it is expected to
work the native rather than the PV way).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
With the introduction of the linear mapped p2m list setting memory
areas to "invalid" had to be delayed. When doing the invalidation
make sure no zero sized areas are processed.
Signed-off-by: Juegren Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
When converting a pfn to a physical address be sure to use 64 bit
wide types or convert the physical address to a pfn if possible.
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
When allocating a new pmd for the linear mapped p2m list a check is
done for not introducing another pmd when this just happened on
another cpu. In this case the old pte pointer was returned which
points to the p2m_missing or p2m_identity page. The correct value
would be the pointer to the found new page.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
In xen_rebuild_p2m_list() for large areas of invalid or identity
mapped memory the pmd entries on 32 bit systems are initialized
wrong. Correct this error.
Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>