Large page first chunk allocator is primarily used for NUMA machines;
however, its NUMA handling is extremely simplistic. Regardless of
their proximity, each cpu is put into separate large page just to
return most of the allocated space back wasting large amount of
vmalloc space and increasing cache footprint.
This patch teachs NUMA details to large page allocator. Given
processor proximity information, pcpu_lpage_build_unit_map() will find
fitting cpu -> unit mapping in which cpus in LOCAL_DISTANCE share the
same large page and not too much virtual address space is wasted.
This greatly reduces the unit and thus chunk size and wastes much less
address space for the first chunk. For example, on 4/4 NUMA machine,
the original code occupied 16MB of virtual space for the first chunk
while the new code only uses 4MB - one 2MB page for each node.
[ Impact: much better space efficiency on NUMA machines ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Miller <davem@davemloft.net>
Generalize and move x86 setup_pcpu_lpage() into
pcpu_lpage_first_chunk(). setup_pcpu_lpage() now is a simple wrapper
around the generalized version. Other than taking size parameters and
using arch supplied callbacks to allocate/free/map memory,
pcpu_lpage_first_chunk() is identical to the original implementation.
This simplifies arch code and will help converting more archs to
dynamic percpu allocator.
While at it, factor out pcpu_calc_fc_sizes() which is common to
pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
[ Impact: code reorganization and generalization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
setup_pcpu_4k() now is a simple wrapper around the generalized
version. Other than taking size parameters and using arch supplied
callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
to the original implementation.
This simplifies arch code and will help converting more archs to
dynamic percpu allocator.
While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
consistency.
[ Impact: code reorganization and generalization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
The only extra feature @unit_size provides is making dead space at the
end of the first chunk which doesn't have any valid usecase. Drop the
parameter. This will increase consistency with generalized 4k
allocator.
James Bottomley spotted missing conversion for the default
setup_per_cpu_areas() which caused build breakage on all arcsh which
use it.
[ Impact: drop unused code path ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ingo Molnar <mingo@elte.hu>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
* git://git.infradead.org/iommu-2.6: (38 commits)
intel-iommu: Don't keep freeing page zero in dma_pte_free_pagetable()
intel-iommu: Introduce first_pte_in_page() to simplify PTE-setting loops
intel-iommu: Use cmpxchg64_local() for setting PTEs
intel-iommu: Warn about unmatched unmap requests
intel-iommu: Kill superfluous mapping_lock
intel-iommu: Ensure that PTE writes are 64-bit atomic, even on i386
intel-iommu: Make iommu=pt work on i386 too
intel-iommu: Performance improvement for dma_pte_free_pagetable()
intel-iommu: Don't free too much in dma_pte_free_pagetable()
intel-iommu: dump mappings but don't die on pte already set
intel-iommu: Combine domain_pfn_mapping() and domain_sg_mapping()
intel-iommu: Introduce domain_sg_mapping() to speed up intel_map_sg()
intel-iommu: Simplify __intel_alloc_iova()
intel-iommu: Performance improvement for domain_pfn_mapping()
intel-iommu: Performance improvement for dma_pte_clear_range()
intel-iommu: Clean up iommu_domain_identity_map()
intel-iommu: Remove last use of PHYSICAL_PAGE_MASK, for reserving PCI BARs
intel-iommu: Make iommu_flush_iotlb_psi() take pfn as argument
intel-iommu: Change aligned_size() to aligned_nrpages()
intel-iommu: Clean up intel_map_sg(), remove domain_page_mapping()
...
fix hang with HIGHMEM_64G and 32bit resource. According to hpa and
Linus, use (resource_size_t)-1 to fend off big ranges.
Analyzed by hpa
Reported-and-tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These macros had two bugs:
- the type of the mask was not correctly expanded to the full size of
the argument being expanded, resulting in possible loss of high bits
when mixing types.
- the alignment argument was evaluated twice, despite the macro looking
like a fancy function (but it really does need to be a macro, since
it works on arbitrary integer types)
Noticed by Peter Anvin, and with a fix that is a modification of his
suggestion (bug noticed by Yinghai Lu).
Cc: Peter Anvin <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can run a 32-bit kernel on boxes with an IOMMU, so we need
pci_unmap_addr() etc. to work -- without it, drivers will leak mappings.
To be honest, this whole thing looks like it's more pain than it's
worth; I'm half inclined to remove the no-op #else case altogether.
But this is the minimal fix, which just does the right thing if
CONFIG_DMAR is set.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org [ for 2.6.30 ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (47 commits)
perf report: Add --symbols parameter
perf report: Add --comms parameter
perf report: Add --dsos parameter
perf_counter tools: Adjust only prelinked symbol's addresses
perf_counter: Provide a way to enable counters on exec
perf_counter tools: Reduce perf stat measurement overhead/skew
perf stat: Use percentages for scaling output
perf_counter, x86: Update x86_pmu after WARN()
perf stat: Micro-optimize the code: memcpy is only required if no event is selected and !null_run
perf stat: Improve output
perf stat: Fix multi-run stats
perf stat: Add -n/--null option to run without counters
perf_counter tools: Remove dead code
perf_counter: Complete counter swap
perf report: Print sorted callchains per histogram entries
perf_counter tools: Prepare a small callchain framework
perf record: Fix unhandled io return value
perf_counter tools: Add alias for 'l1d' and 'l1i'
perf-report: Add bare minimum PERF_EVENT_READ parsing
perf-report: Add modes for inherited stats and no-samples
...
Nathan reported that
| commit 73d60b7f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date: Tue Jun 16 15:33:00 2009 -0700
|
| page-allocator: clear N_HIGH_MEMORY map before we set it again
|
| SRAT tables may contains nodes of very small size. The arch code may
| decide to not activate such a node. However, currently the early boot
| code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
| active although these nodes have no present pages.
|
| For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
an i386 kvm guest, meaning that cpuset.mems can not be used.
Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn
Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "x86: cap iomem_resource to addressable physical memory"
There's no need for the GFX workaround now we have 'iommu=pt' for the
cases where people really care about performance. There's no need to
have a special case for just one type of device.
This also speeds up the iommu=pt path and reduces memory usage by
setting up the si_domain _once_ and then using it for all devices,
rather than giving each device its own private page tables.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The print out should read the value before changing the value.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4A487017.4090007@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: shut up uninit compiler warning in paging_tmpl.h
KVM: Ignore reads to K7 EVNTSEL MSRs
KVM: VMX: Handle vmx instruction vmexits
KVM: s390: Allow stfle instruction in the guest
KVM: kvm/x86_emulate.c toggle_interruptibility() should be static
KVM: ia64: fix ia64 build due to missing kallsyms_lookup() and double export
KVM: protect concurrent make_all_cpus_request
KVM: MMU: Allow 4K ptes with bit 7 (PAT) set
KVM: Fix dirty bit tracking for slots with large pages
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, delay: tsc based udelay should have rdtsc_barrier
x86, setup: correct include file in <asm/boot.h>
x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
x86, mce: percpu mcheck_timer should be pinned
x86: Add sysctl to allow panic on IOCK NMI error
x86: Fix uv bau sending buffer initialization
x86, mce: Fix mce resume on 32bit
x86: Move init_gbpages() to setup_arch()
x86: ensure percpu lpage doesn't consume too much vmalloc space
x86: implement percpu_alloc kernel parameter
x86: fix pageattr handling for lpage percpu allocator and re-enable it
x86: reorganize cpa_process_alias()
x86: prepare setup_pcpu_lpage() for pageattr fix
x86: rename remap percpu first chunk allocator to lpage
x86: fix duplicate free in setup_pcpu_remap() failure path
percpu: fix too lazy vunmap cache flushing
x86: Set cpu_llc_id on AMD CPUs
Dixes compilation warning:
CC arch/x86/kernel/io_delay.o
arch/x86/kvm/paging_tmpl.h: In function ‘paging64_fetch’:
arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function
arch/x86/kvm/paging_tmpl.h: In function ‘paging32_fetch’:
arch/x86/kvm/paging_tmpl.h:279: warning: ‘sptep’ may be used uninitialized in this function
warning is bogus (always have a least one level), but need to shut the compiler
up.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In commit 7fe29e0faa we ignored the
reads to the P6 EVNTSEL MSRs. That fixed crashes on Intel machines.
Ignore the reads to K7 EVNTSEL MSRs as well to fix this on AMD
hosts.
This fixes Kaspersky antivirus crashing Windows guests on AMD hosts.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
IF a guest tries to use vmx instructions, inject a #UD to let it know the
instruction is not implemented, rather than crashing.
This prevents guest userspace from crashing the guest kernel.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
toggle_interruptibility() is used only by same file, it should be static.
Fixed following sparse warning :
arch/x86/kvm/x86_emulate.c:1364:6: warning: symbol 'toggle_interruptibility' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This reverts commit 95ee14e437.
Mikael Petterson <mikepe@it.uu.se> reported that at least one of his
systems will not boot as a result. We have ruled out the detection
algorithm malfunctioning, so it is not a matter of producing the
incorrect bitmasks; rather, something in the application of them
fails.
Revert the commit until we can root cause and correct this problem.
-stable team: this means the underlying commit should be rejected.
Reported-and-isolated-by: Mikael Petterson <mikpe@it.uu.se>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <200906261559.n5QFxJH8027336@pilspetsen.it.uu.se>
Cc: stable@kernel.org
Cc: Grant Grundler <grundler@parisc-linux.org>
<asm/boot.h> needs <asm/pgtable_types.h>, not <asm/page_types.h> in
order to resolve PMD_SHIFT. Also, correct a +1 which really should be
+ THREAD_ORDER.
This is a build error which was masked by a typoed #ifdef.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
CONFIG_X86_64 was misspelled (wrong case), which caused the x86-64
kernel to advertise itself as more relocatable than it really is.
This could in theory cause boot failures once bootloaders start
support the new relocation fields.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If CONFIG_NO_HZ + CONFIG_SMP, timer added via add_timer() might
be migrated on other cpu. Use add_timer_on() instead.
Avoids the following failure:
Maciej Rutecki wrote:
> > After normal boot I try:
> >
> > echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
> >
> > I found this in dmesg:
> >
> > [ 141.704025] ------------[ cut here ]------------
> > [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> > mcheck_timer+0xf5/0x100()
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch introduces a new sysctl:
/proc/sys/kernel/panic_on_io_nmi
which defaults to 0 (off).
When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.
The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.
This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.
[ SuSE has been shipping it for a while, it was done at the
request of a large database vendor, for their users. ]
Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update the mmap control page with the needed information to
use the userspace RDPMC instruction for self monitoring.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 9e9f46c44e.
Quoting from the commit message:
"At this point, it seems to solve more problems than it causes, so let's
try using it by default. It's an easy revert if it ends up causing
trouble."
And guess what? The _CRS code causes trouble.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (72 commits)
asus-laptop: remove EXPERIMENTAL dependency
asus-laptop: use pr_fmt and pr_<level>
eeepc-laptop: cpufv updates
eeepc-laptop: sync eeepc-laptop with asus_acpi
asus_acpi: Deprecate in favor of asus-laptop
acpi4asus: update MAINTAINER and KConfig links
asus-laptop: platform dev as parent for led and backlight
eeepc-laptop: enable camera by default
ACPI: Rename ACPI processor device bus ID
acerhdf: Acer Aspire One fan control
ACPI: video: DMI workaround broken Acer 7720 BIOS enabling display brightness
ACPI: run ACPI device hot removal in kacpi_hotplug_wq
ACPI: Add the reference count to avoid unloading ACPI video bus twice
ACPI: DMI to disable Vista compatibility on some Sony laptops
ACPI: fix a deadlock in hotplug case
Show the physical device node of backlight class device.
ACPI: pdc init related memory leak with physical CPU hotplug
ACPI: pci_root: remove unused dev/fn information
ACPI: pci_root: simplify list traversals
ACPI: pci_root: use driver data rather than list lookup
...
The initialization of the UV Broadcast Assist Unit's sending
buffers was making an invalid assumption about the
initialization of an MMR that defines its address.
The BIOS will not be providing that MMR. So
uv_activation_descriptor_init() should unconditionally set it.
Tested on UV simulator.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org> # for v2.6.30.x
LKML-Reference: <E1MJTfj-0005i1-W8@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previous code made an assumption that the power on value of global
control MSR has enabled all fixed and general purpose counters properly.
However, this is not the case for certain Intel processors, such as
Atom - and it might also be firmware dependent.
Each enable bit in IA32_PERF_GLOBAL_CTRL is AND'ed with the
enable bits for all privilege levels in the respective IA32_PERFEVTSELx
or IA32_PERF_FIXED_CTR_CTRL MSRs to start/stop the counting of
respective counters. Counting is enabled if the AND'ed results is true;
counting is disabled when the result is false.
The end result is that all fixed counters are always disabled on Atom
processors because the assumption is just invalid.
Fix this by not initializing the ctrl-mask out of the global MSR,
but setting it to perf_counter_mask.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090624021324.GA2788@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Currently, the following three different ways to define percpu arrays
are in use.
1. DEFINE_PER_CPU(elem_type[array_len], array_name);
2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
3. DEFINE_PER_CPU(elem_type, array_name)[array_len];
Unify to #1 which correctly separates the roles of the two parameters
and thus allows more flexibility in the way percpu variables are
defined.
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
dynamic percpu allocator. The first chunk is allocated using
embedding helper and 8k is reserved for modules. This ensures that
the new allocator behaves almost identically to the original allocator
as long as static percpu variables are concerned, so it shouldn't
introduce much breakage.
s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing
range limit the addressing model imposes. Unfortunately, this breaks
if the address is specified using a variable, so for now, the two
archs aren't converted.
The following architectures are affected by this change.
* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc (broken, under investigation)
* m32r
* powerpc(32)
As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs. These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.
* powerpc(64)
* sparc(64)
* ia64
* alpha
* s390
Boot and batch alloc/free tests on x86_32 with debug code (x86_32
doesn't use default first chunk initialization). Compile tested on
sparc(32), powerpc(32), arm and alpha.
Kyle McMartin reported that this change breaks parisc. The problem is
still under investigation and he is okay with pushing this patch
forward and fixing parisc later.
[ Impact: use dynamic allocator for most archs w/o custom percpu setup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
To support domain-isolation usages, the platform hardware must be
capable of uniquely identifying the requestor (source-id) for each
interrupt message. Without source-id checking for interrupt remapping
, a rouge guest/VM with assigned devices can launch interrupt attacks
to bring down anothe guest/VM or the VMM itself.
This patch adds source-id checking for interrupt remapping, and then
really isolates interrupts for guests/VMs with assigned devices.
Because PCI subsystem is not initialized yet when set up IOAPIC
entries, use read_pci_config_byte to access PCI config space directly.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Calling mcheck_init() on resume is required only with
CONFIG_X86_OLD_MCE=y.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The init_gbpages() function is conditionally called from
init_memory_mapping() function. There are two call-sites where
this 'after_bootmem' condition can be true: setup_arch() and
mem_init() via pci_iommu_alloc().
Therefore, it's safe to move the call to init_gbpages() to
setup_arch() as it's always called before mem_init().
This removes an after_bootmem use - paving the way to remove
all uses of that state variable.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <Pine.LNX.4.64.0906221731210.19474@melkki.cs.Helsinki.FI>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.infradead.org/~dwmw2/iommu-2.6.31:
intel-iommu: Fix one last ia64 build problem in Pass Through Support
VT-d: support the device IOTLB
VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
VT-d: add device IOTLB invalidation support
VT-d: parse ATSR in DMA Remapping Reporting Structure
PCI: handle Virtual Function ATS enabling
PCI: support the ATS capability
intel-iommu: dmar_set_interrupt return error value
intel-iommu: Tidy up iommu->gcmd handling
intel-iommu: Fix tiny theoretical race in write-buffer flush.
intel-iommu: Clean up handling of "caching mode" vs. IOTLB flushing.
intel-iommu: Clean up handling of "caching mode" vs. context flushing.
VT-d: fix invalid domain id for KVM context flush
Fix !CONFIG_DMAR build failure introduced by Intel IOMMU Pass Through Support
Intel IOMMU Pass Through Support
Fix up trivial conflicts in drivers/pci/{intel-iommu.c,intr_remapping.c}
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (74 commits)
PCI: make msi_free_irqs() to use msix_mask_irq() instead of open coded write
PCI: Fix the NIU MSI-X problem in a better way
PCI ASPM: remove get_root_port_link
PCI ASPM: cleanup pcie_aspm_sanity_check
PCI ASPM: remove has_switch field
PCI ASPM: cleanup calc_Lx_latency
PCI ASPM: cleanup pcie_aspm_get_cap_device
PCI ASPM: cleanup clkpm checks
PCI ASPM: cleanup __pcie_aspm_check_state_one
PCI ASPM: cleanup initialization
PCI ASPM: cleanup change input argument of aspm functions
PCI ASPM: cleanup misc in struct pcie_link_state
PCI ASPM: cleanup clkpm state in struct pcie_link_state
PCI ASPM: cleanup latency field in struct pcie_link_state
PCI ASPM: cleanup aspm state field in struct pcie_link_state
PCI ASPM: fix typo in struct pcie_link_state
PCI: drivers/pci/slot.c should depend on CONFIG_SYSFS
PCI: remove redundant __msi_set_enable()
PCI PM: consistently use type bool for wake enable variable
x86/ACPI: Correct maximum allowed _CRS returned resources and warn if exceeded
...
On extreme configuration (e.g. 32bit 32-way NUMA machine), lpage
percpu first chunk allocator can consume too much of vmalloc space.
Make it fall back to 4k allocator if the consumption goes over 20%.
[ Impact: add sanity check for lpage percpu first chunk allocator ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
According to Andi, it isn't clear whether lpage allocator is worth the
trouble as there are many processors where PMD TLB is far scarcer than
PTE TLB. The advantage or disadvantage probably depends on the actual
size of percpu area and specific processor. As performance
degradation due to TLB pressure tends to be highly workload specific
and subtle, it is difficult to decide which way to go without more
data.
This patch implements percpu_alloc kernel parameter to allow selecting
which first chunk allocator to use to ease debugging and testing.
While at it, make sure all the failure paths report why something
failed to help determining why certain allocator isn't working. Also,
kill the "Great future plan" comment which had already been realized
quite some time ago.
[ Impact: allow explicit percpu first chunk allocator selection ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
lpage allocator aliases a PMD page for each cpu and returns whatever
is unused to the page allocator. When the pageattr of the recycled
pages are changed, this makes the two aliases point to the overlapping
regions with different attributes which isn't allowed and known to
cause subtle data corruption in certain cases.
This can be handled in simliar manner to the x86_64 highmap alias.
pageattr code should detect if the target pages have PMD alias and
split the PMD alias and synchronize the attributes.
pcpur allocator is updated to keep the allocated PMD pages map sorted
in ascending address order and provide pcpu_lpage_remapped() function
which binary searches the array to determine whether the given address
is aliased and if so to which address. pageattr is updated to use
pcpu_lpage_remapped() to detect the PMD alias and split it up as
necessary from cpa_process_alias().
Jan Beulich spotted the original problem and incorrect usage of vaddr
instead of laddr for lookup.
With this, lpage percpu allocator should work correctly. Re-enable
it.
[ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reorganize cpa_process_alias() so that new alias condition can be
added easily.
Jan Beulich spotted problem in the original cleanup thread which
incorrectly assumed the two existing conditions were mutially
exclusive.
[ Impact: code reorganization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Make the following changes in preparation of coming pageattr updates.
* Define and use array of struct pcpul_ent instead of array of
pointers. The only difference is ->cpu field which is set but
unused yet.
* Rename variables according to the above change.
* Rename local variable vm to pcpul_vm and move it out of the
function.
[ Impact: no functional difference ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
The "remap" allocator remaps large pages to build the first chunk;
however, the name isn't very good because 4k allocator remaps too and
the whole point of the remap allocator is using large page mapping.
The allocator will be generalized and exported outside of x86, rename
it to lpage before that happens.
percpu_alloc kernel parameter is updated to accept both "remap" and
"lpage" for lpage allocator.
[ Impact: code cleanup, kernel parameter argument updated ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
In the failure path, setup_pcpu_remap() tries to free the area which
has already been freed to make holes in the large page. Fix it.
[ Impact: fix duplicate free in failure path ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>