Граф коммитов

12747 Коммитов

Автор SHA1 Сообщение Дата
Yinghai Lu 3b28cf32cc x86, numa: Fix numa_emulation code with memory-less node0
This crash happens on a system that does not have RAM on node0.

When numa_emulation is compiled in, and:

 1. we boot the system without numa=fake...
 2. or we boot the system with numa=fake=128 to make emulation fail

we will get:

[    0.076025] ------------[ cut here ]------------
[    0.080004] kernel BUG at arch/x86/mm/numa_64.c:788!
[    0.080004] invalid opcode: 0000 [#1] SMP
[...]

need to use early_cpu_to_node() directly, because cpu_to_apicid
and apicid_to_node will return node0 that is not onlined.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
LKML-Reference: <4D6ECF72.5010308@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 15:20:19 +01:00
David Rientjes c09cedf4f7 x86-64, NUMA: Clean up initmem_init()
This patch cleans initmem_init() so that it is more readable and doesn't
use an unnecessary array of function pointers to convolute the flow of
the code.  It also makes it obvious that dummy_numa_init() will always
succeed (and documents that requirement) so that the existing BUG() is
never actually reached.

No functional change.

-tj: Updated comment for dummy_numa_init() slightly.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04 15:17:21 +01:00
Yinghai Lu 51b361b400 x86-64, NUMA: Fix numa_emulation code with node0 without RAM
On one system that does not have RAM on node0.

When numa_emulation is compiled in, and
1. boot system without numa=fake...
2. or boot system with numa=fake=128 to make emulation fail

will get:

[    0.092026] ------------[ cut here ]------------
[    0.096005] kernel BUG at arch/x86/mm/numa_emulation.c:439!
[    0.096005] invalid opcode: 0000 [#1] SMP
[    0.096005] last sysfs file:
[    0.096005] CPU 0
[    0.096005] Modules linked in:
[    0.096005]
[    0.096005] Pid: 0, comm: swapper Not tainted 2.6.38-rc6-tip-yh-03869-gcb0491d-dirty #684 Sun Microsystems     Sun Fire X4240/Sun Fire X4240
[    0.096005] RIP: 0010:[<ffffffff81cdc65b>]  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
[    0.096005] RSP: 0000:ffffffff82437ed8  EFLAGS: 00010246
...
[    0.096005] Call Trace:
[    0.096005]  [<ffffffff81cd7931>] identify_cpu+0x2d7/0x2df
[    0.096005]  [<ffffffff827e54fa>] identify_boot_cpu+0x10/0x30
[    0.096005]  [<ffffffff827e5704>] check_bugs+0x9/0x2d
[    0.096005]  [<ffffffff827dceda>] start_kernel+0x3d7/0x3f1
[    0.096005]  [<ffffffff827dc2cc>] x86_64_start_reservations+0x9c/0xa0
[    0.096005]  [<ffffffff827dc4ad>] x86_64_start_kernel+0x1dd/0x1e8
[    0.096005] Code: 74 06 48 8d 04 90 eb 0f 48 c7 c0 30 d9 00 00 48 03 04 d5 90 0f 60 82 8b 00 83 f8 ff 74 0d 0f a3 05 8b 7e 92 00 19 d2 85 d2 75 02 <0f> 0b 48 98 be 00 01 00 00 48 c7 c7 e0 44 60 82 44 8b 2c 85 e0
[    0.096005] RIP  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
[    0.096005]  RSP <ffffffff82437ed8>
[    0.096026] ---[ end trace a7919e7f17c0a725 ]---

We need to use early_cpu_to_node() directly, because numa_cpu_node()
will return node0 that is not onlined.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04 14:49:28 +01:00
Andi Kleen e994d7d23a perf: Fix LLC-* events on Intel Nehalem/Westmere
On Intel Nehalem and Westmere CPUs the generic perf LLC-* events count the
L2 caches, not the real L3 LLC - this was inconsistent with behavior on
other CPUs.

Fixing this requires the use of the special OFFCORE_RESPONSE
events which need a separate mask register.

This has been implemented by the previous patch, now use this infrastructure
to set correct events for the LLC-* on Nehalem and Westmere.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-3-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 11:32:53 +01:00
Andi Kleen a7e3ed1e47 perf: Add support for supplementary event registers
Change logs against Andi's original version:

- Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
- Fixed a major event scheduling issue. There cannot be a ref++ on an
  event that has already done ref++ once and without calling
  put_constraint() in between. (Stephane Eranian)
- Use thread_cpumask for percore allocation. (Lin Ming)
- Use MSR names in the extra reg lists. (Lin Ming)
- Remove redundant "c = NULL" in intel_percore_constraints
- Fix comment of perf_event_attr::config1

Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
that can be used to monitor any offcore accesses from a core.
This is a very useful event for various tunings, and it's
also needed to implement the generic LLC-* events correctly.

Unfortunately this event requires programming a mask in a separate
register. And worse this separate register is per core, not per
CPU thread.

This patch:

- Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
  The extra parameters are passed by user space in the
  perf_event_attr::config1 field.

- Adds support to the Intel perf_event core to schedule per
  core resources. This adds fairly generic infrastructure that
  can be also used for other per core resources.
  The basic code has is patterned after the similar AMD northbridge
  constraints code.

Thanks to Stephane Eranian who pointed out some problems
in the original version and suggested improvements.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 11:32:53 +01:00
Stephane Eranian 17e3162972 perf_events: Update PEBS event constraints
This patch updates PEBS event constraints for Intel Atom, Nehalem, Westmere.

This patch also reorganizes the PEBS format/constraint detection code. It is
now based on processor model and not PEBS format. Two processors may use the
same PEBS format without have the same list of PEBS events.

In this second version, we simplified the initialization of the PEBS
constraints by leveraging the existing switch() statement in perf_event_intel.c.
We also renamed the constraint tables to be more consistent with regular
constraints.

In this 3rd version, we drop BR_INST_RETIRED.MISPRED from Intel Atom as it does
not seem to work. Use MISPREDICTED_BRANCH_RETIRED instead. Also add FP_ASSIST.*
o both Intel Nehalem and Westmere. I misssed those in the earlier patches.
Events were tested using libpfm4 perf_examples.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d6e6b02.815bdf0a.637b.07a7@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 11:32:52 +01:00
Ingo Molnar 888a8a3e9d Merge branch 'perf/urgent' into perf/core
Merge reason: Pick up updates before queueing up dependent patches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 10:40:25 +01:00
Tejun Heo f891125028 x86-64, NUMA: Revert NUMA affine page table allocation
This patch reverts NUMA affine page table allocation added by commit
1411e0ec31 (x86-64, numa: Put pgtable to local node memory).

The commit made an undocumented change where the kernel linear mapping
strictly follows intersection of e820 memory map and NUMA
configuration.  If the physical memory configuration has holes or NUMA
nodes are not properly aligned, this leads to using unnecessarily
smaller mapping size which leads to increased TLB pressure.  For
details,

  http://thread.gmane.org/gmane.linux.kernel/1104672

Patches to fix the problem have been proposed but the underlying code
needs more cleanup and the approach itself seems a bit heavy handed
and it has been determined to revert the feature for now and come back
to it in the next developement cycle.

  http://thread.gmane.org/gmane.linux.kernel/1105959

As init_memory_mapping_high() callsites have been consolidated since
the commit, reverting is done manually.  Also, the RED-PEN comment in
arch/x86/mm/init.c is not restored as the problem no longer exists
with memblock based top-down early memory allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2011-03-04 10:26:36 +01:00
Ian Campbell f611f2da99 xen/timer: Missing IRQF_NO_SUSPEND in timer code broke suspend.
The patches missed an indirect use of IRQF_NO_SUSPEND pulled in via
IRQF_TIMER. The following patch fixes the issue.

With this fixlet PV guest migration works just fine. I also booted the
entire series as a dom0 kernel and it appeared fine.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-03 12:00:31 -05:00
Ian Campbell 3f2a230caf xen: handled remapped IRQs when enabling a pcifront PCI device.
This happens to not be an issue currently because we take pains to try
to ensure that the GSI-IRQ mapping is 1-1 in a PV guest and that
regular event channels do not clash. However a subsequent patch is
going to break this 1-1 mapping.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
2011-03-03 11:56:57 -05:00
Konrad Rzeszutek Wilk 6eaa412f27 xen: Mark all initial reserved pages for the balloon as INVALID_P2M_ENTRY.
With this patch, we diligently set regions that will be used by the
balloon driver to be INVALID_P2M_ENTRY and under the ownership
of the balloon driver. We are OK using the __set_phys_to_machine
as we do not expect to be allocating any P2M middle or entries pages.
The set_phys_to_machine has the side-effect of potentially allocating
new pages and we do not want that at this stage.

We can do this because xen_build_mfn_list_list will have already
allocated all such pages up to xen_max_p2m_pfn.

We also move the check for auto translated physmap down the
stack so it is present in __set_phys_to_machine.

[v2: Rebased with mmu->p2m code split]
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-03 11:52:48 -05:00
Borislav Petkov 84fd1d35cc x86, amd-nb: Misc cleanliness fixes
Make functions used strictly in bool context return bool. Also,
fixup used types and comments, and make a local function static,
while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Borislav Petkov <bp@amd64.org>
LKML-Reference: <20110303115932.GA8603@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-03 13:06:20 +01:00
Jan Beulich d04c579f97 x86: Work around old gas bug
Add extra parentheses around a couple of definitions introduced
by "x86: Cleanup vector usage" and used in assembly macro
arguments, and remove spaces. Without that old (2.16.1) gas
would see more macro arguments than were actually specified.

Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <4D6F81B10200007800034B0B@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-03 12:47:08 +01:00
Linus Torvalds f7d222ea2a Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
  of/promtree: allow DT device matching by fixing 'name' brokenness (v5)
  x86: OLPC: have prom_early_alloc BUG rather than return NULL
  of/flattree: Drop an uninteresting message to pr_debug level
  of: Add missing of_address.h to xilinx ehci driver
2011-03-02 20:01:57 -08:00
Linus Torvalds ebff7c92ab Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] p4-clockmod: print EST-capable warning message only once
  [CPUFREQ] fix BUG on cpufreq policy init failure
  [CPUFREQ] Fix another notifier leak in powernow-k8.
  [CPUFREQ] Missing "unregister_cpu_notifier" in powernow-k8.c
2011-03-02 19:58:31 -08:00
Linus Torvalds c7b01d3dc2 Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  intel_idle: disable Atom/Lincroft HW C-state auto-demotion
  intel_idle: disable NHM/WSM HW C-state auto-demotion
2011-03-02 18:08:03 -08:00
Andres Salomon 60cba5a57b x86: OLPC: have prom_early_alloc BUG rather than return NULL
..similar to what sparc's prom_early_alloc does.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-03-02 13:45:18 -07:00
Tejun Heo eb8c1e2c83 x86-64, NUMA: Better explain numa_distance handling
Handling of out-of-bounds distances and allocation failure can use
better documentation.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
2011-03-02 16:34:21 +01:00
Yinghai Lu ce0033307f x86-64, NUMA: Fix distance table handling
NUMA distance table handling has the following problems.

* numa_reset_distance() uses numa_distance * sizeof(numa_distance[0])
  as the table size when it should be using the square of
  numa_distance.

* The same size miscalculation when allocation space for phys_dist in
  numa_emulation().

* In numa_emulation(), phys_dist must be reserved; otherwise, the new
  emulated distance table may overlap it.

Fix them and, while at it, take numa_distance_cnt resetting in
numa_reset_distance() out of the if block to simplify the code a bit.

David Rientjes reported incorrect handling of distance table during
emulation.

-tj: Edited out numa_alloc_distance() related changes which weren't
     necessary and rewrote patch description.

-v2: Ingo was unhappy with 80-column limit induced linebreaks.  Let
     lines run over 80-column.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: David Rientjes <rientjes@google.com>
2011-03-02 16:34:09 +01:00
Lin Ming b06b3d4969 perf, x86: Add Intel SandyBridge CPU support
This patch adds basic SandyBridge support, including hardware
cache events and PEBS events support.

It has been tested on SandyBridge CPUs with perf stat and also
with PEBS based profiling - both work fine.

The patch does not affect other models.

v2 -> v3:
 - fix PEBS event 0xd0 with right umask combinations
 - move snb pebs constraint assignment to intel_pmu_init

v1 -> v2:
 - add more raw and PEBS events constraints
 - use offcore events for LLC-* cache events
 - remove the call to Nehalem workaround enable_all function

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1299072424.2175.24.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-02 14:37:02 +01:00
Jan Beulich e938c287ea x86: Fix a bogus unwind annotation in lib/semaphore_32.S
'simple' would have required specifying current frame address
and return address location manually, but that's obviously not
the case (and not necessary) here.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6D1082020000780003454C@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-02 08:16:44 +01:00
Daniel J Blueman 6670e9cdaf x86, build: Make sure mkpiggy fails on read error
Ensure build doesn't silently continue despite read failure,
addressing a warning due to the unchecked call.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
LKML-Reference: <AANLkTimxxTMU3=4ry-_zbY6v1xiDi+hW9y1RegTr8vLK@mail.gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-03-01 16:32:03 -08:00
Naga Chumbalkar 853cee26e2 [CPUFREQ] p4-clockmod: print EST-capable warning message only once
Print the message only once. I see it 16 times on a 2P box with 16 logical CPUs.

Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
2011-03-01 18:49:45 -05:00
Dave Jones a536b126f2 [CPUFREQ] Fix another notifier leak in powernow-k8.
Do the notifier registration later, so we don't have to worry
about freeing it if we fail the msr allocation.

Signed-off-by: Dave Jones <davej@redhat.com>
2011-03-01 18:49:44 -05:00
Neil Brown ac81831449 [CPUFREQ] Missing "unregister_cpu_notifier" in powernow-k8.c
It appears that when powernow-k8 finds that

    No compatible ACPI _PSS objects found.

 and suggests

    Try again with latest BIOS.

 it fails the module load, but does not unregister the cpu_notifier that was
 registered in powernowk8_init

 This ends up leaving freed memory on the cpu notifier list for some other
 poor module (e.g. md/raid5) to come along and trip over.

 The following might be a partial fix, but I suspect there is probably other
 clean-up that is needed.

 ( https://bugzilla.novell.com/show_bug.cgi?id=655215 has full dmesg traces).

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2011-03-01 18:49:44 -05:00
Jan Beulich 039e13890b x86: Remove unused bits from lib/thunk_*.S
Some of the items removed were apparently never used, others
simply didn't get removed with their last user.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6BD3A002000078000341F1@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-28 18:06:22 +01:00
Jan Beulich 60cf637a13 x86: Use {push,pop}_cfi in more places
Cleaning up and shortening code...

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BD35002000078000341DA@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-28 18:06:22 +01:00
Jan Beulich 39f2205e1a x86-64: Add CFI annotations to lib/rwsem_64.S
These weren't part of the initial commit of this code.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BCDFF02000078000341B0@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-28 18:06:21 +01:00
Don Zickus 299c56966a x86: Use u32 instead of long to set reset vector back to 0
A customer of ours, complained that when setting the reset
vector back to 0, it trashed other data and hung their box.
They noticed when only 4 bytes were set to 0 instead of 8,
everything worked correctly.

Mathew pointed out:

 |
 | We're supposed to be resetting trampoline_phys_low and
 | trampoline_phys_high here, which are two 16-bit values.
 | Writing 64 bits is definitely going to overwrite space
 | that we're not supposed to be touching.
 |

So limit the area modified to u32.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Cc: <stable@kernel.org>
LKML-Reference: <1297139100-424-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-28 16:22:18 +01:00
Christoph Lameter b9ec40af0e percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
Support this_cpu_cmpxchg_double() using the cmpxchg16b and cmpxchg8b
instructions.

-tj: s/percpu_cmpxchg16b/percpu_cmpxchg16b_double/ for consistency and
     other cosmetic changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-28 11:20:49 +01:00
Stratos Psomadakis 7bf04be8f4 x86, asm: Cleanup unnecssary macros in asm-offsets.c
PAGE_SIZE_asm, PAGE_SHIFT_asm, THREAD_SIZE_asm can be safely removed from 
asm-offsets.c, and be replaced by their non-'_asm' counterparts in the code 
that uses them, since the _AC macro defined in include/linux/const.h makes
PAGE_SIZE/PAGE_SHIFT/THREAD_SIZE work with as.

Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
LKML-Reference: <1298666774-17646-2-git-send-email-psomas@cslab.ece.ntua.gr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-02-25 16:37:32 -08:00
Linus Torvalds 958ede7f1b Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems
  x86/mrst: Fix apb timer rating when lapic timer is used
  x86: Fix reboot problem on VersaLogic Menlow boards
2011-02-25 14:02:33 -08:00
Ian Campbell 03c8142bd2 xen: suspend: add "arch" to pre/post suspend hooks
xen_pre_device_suspend is unused on ia64.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-25 16:43:12 +00:00
Ian Campbell a8b7458363 xen: switch to new schedop hypercall by default.
Rename old interface to sched_op_compat and rename sched_op_new to
simply sched_op.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-25 16:43:10 +00:00
Ian Campbell 8e15597fa4 xen: use new schedop interface for suspend
Take the opportunity to comment on the semantics of the PV guest
suspend hypercall arguments.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-25 16:43:10 +00:00
Stefano Stabellini e057a4b6e0 xen: fix compile issue if XEN is enabled but XEN_PVHVM is disabled
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2011-02-25 16:43:06 +00:00
Stefano Stabellini 99bbb3a84a xen: PV on HVM: support PV spinlocks and IPIs
Initialize PV spinlocks on boot CPU right after native_smp_prepare_cpus
(that switch to APIC mode and initialize APIC routing); on secondary
CPUs on CPU_UP_PREPARE.

Enable the usage of event channels to send and receive IPIs when
running as a PV on HVM guest.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2011-02-25 16:43:06 +00:00
Stefano Stabellini cff520b9c2 xen: do not use xen_info on HVM, set pv_info name to "Xen HVM"
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
2011-02-25 16:43:04 +00:00
Thomas Gleixner a906fdaacc x86: dt: Cleanup local apic setup
Up to now we force enable the local apic in the devicetree setup
uncoditionally and set smp_found_config unconditionally to 1 when a
devicetree blob is available. This breaks, when local apic is disabled
in the Kconfig.

Make it consistent by initializing device tree explicitely before
smp_get_config() so a non lapic configuration could be used as well.
To be functional that would require to implement PIT as an interrupt
host, but the only user of this code until now is ce4100 which
requires apics to be available. So we leave this up to those who need
it.

Tested-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-25 16:18:52 +01:00
David Rientjes 1f565a896e x86-64, NUMA: Fix size of numa_distance array
numa_distance should be sized like the SLIT, an NxN matrix where N is
the highest node id + 1.  This patch fixes the calculation to avoid
overflowing the array on the subsequent iteration.

-tj: The original patch used last index to calculate size.  Yinghai
     pointed out it should be incremented so it is the number of
     elements instead of the last index to calculate the size of the
     table.  Updated accordingly.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-25 10:10:54 +01:00
Linus Torvalds 86e2fe9ff3 Merge branch 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Advance instruction pointer in dr_intercept
2011-02-24 12:22:14 -08:00
Andreas Herrmann 7f74f8f28a x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems
On some SB800 systems polarity for IOAPIC pin2 is wrongly
specified as low active by BIOS. This caused system hangs after
resume from S3 when HPET was used in one-shot mode on such
systems because a timer interrupt was missed (HPET signal is
high active).

For more details see:

  http://marc.info/?l=linux-kernel&m=129623757413868

Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 37.x, 32.x
LKML-Reference: <20110224145346.GD3658@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-24 20:30:21 +01:00
Yinghai Lu d1b19426b0 x86: Rename e820_table_* to pgt_buf_*
e820_table_{start|end|top}, which are used to buffer page table
allocation during early boot, are now derived from memblock and don't
have much to do with e820.  Change the names so that they reflect what
they're used for.

This patch doesn't introduce any behavior change.

-v2: Ingo found that earlier patch "x86: Use early pre-allocated page
     table buffer top-down" caused crash on 32bit and needed to be
     dropped.  This patch was updated to reflect the change.

-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-24 14:52:18 +01:00
Sebastian Andrzej Siewior 4a66b1d95a x86: dt: Fix OLPC=y/INTEL_CE=n build
Both OLPC and CE4100 activate CONFIG_OF. OLPC uses PROMTREE while CE
uses FLATTREE. Compiling for OLPC only breaks due to missing flat tree
functions and variables.

Use proper wrappers and provide an empty x86_flattree_get_config()
inline so OF=y FLATTREE=n builds and works.

[ tglx: Make it work with HPET_TIMER=n and make a function static ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-24 13:01:59 +01:00
Jacob Pan 7b62dbec90 x86/mrst: Fix apb timer rating when lapic timer is used
Need to adjust the clockevent device rating for the structure
that will be registered with clockevent system instead of the
temporary structure.

Without this fix, APB timer rating will be higher than LAPIC
timer such that it can not be released later to be used as the
broadcast timer.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
LKML-Reference: <1298506046-439-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-24 08:22:43 +01:00
Sebastian Andrzej Siewior 3bcbaf6e08 rtc: cmos: Add OF bindings
This allows to load the OF driver based informations from the device
tree. Systems without BIOS may need to perform some initialization.
PowerPC creates a PNP device from the OF information and performs this
kind of initialization in their private PCI quirk. This looks more
generic.

This patch also avoids registering the platform RTC driver on X86 if
we have a device tree blob. Otherwise we would setup the device based
on the hardcoded information in arch/x86 rather than the device tree
based one.

[ tglx: Changed "int of_have_populated_dt()" to bool as recommended by
        Grant ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: rtc-linux@googlegroups.com
Cc: Alessandro Zummo <a.zummo@towertech.it>
LKML-Reference: <1298405266-1624-12-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:55 +01:00
Sebastian Andrzej Siewior 1fa4163bdc x86: ce4100: Use OF to setup devices
Use device tree information to setup IO_APIC configuration, interrupt
routing, HPET and everything else which cannot be enumerated by other
means.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-11-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:54 +01:00
Sebastian Andrzej Siewior bcc7c1244f x86: ioapic: Add OF bindings for IO_APIC
ioapic_xlate provides a translation from the information in device tree
to ioapic related informations. This includes
- obtaining hw irq which is the vector number "=> pin number + gsi"
- obtaining type (level/edge/..)
- programming this information into ioapic

ioapic_add_ofnode adds an irq_domain based on informations from the device
tree. This information (irq_domain) is required in order to map a device to
its proper interrupt controller.

[ tglx: Adapted to the io_apic changes, which let us move that whole code
  	to devicetree.c ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-10-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:54 +01:00
Sebastian Andrzej Siewior 9079b35364 x86: dtb: Add generic bus probe
For now we probe these busses and we change this to board dependent
probes once we have to.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-9-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:54 +01:00
Sebastian Andrzej Siewior 96e0a0797e x86: dtb: Add support for PCI devices backed by dtb nodes
x86_of_pci_init() does two things:

- it provides a generic irq enable and disable function. enable queries
  the device tree for the interrupt information, calls ->xlate on the
  irq host and updates the pci->irq information for the device.

- it walks through PCI bus(es) in the device tree and adds its children
  (device) nodes to appropriate pci_dev nodes in kernel. So the dtb
  node information is available at probe time of the PCI device.

Adding a PCI bus based on the information in the device tree is
currently not supported. Right now direct access via ioports is used.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-8-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior ffb9fc68df x86: dtb: Add device tree support for HPET
Set hpet_address based on information provied form DTB

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
LKML-Reference: <1298405266-1624-7-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior 3879a6f329 x86: dtb: Add early parsing of IO_APIC
APIC and IO_APIC have to be added to the system early because
native_init_IRQ() requires it.

In order to obtain the address of the ioapic the device tree has to be
unflattened so of_address_to_resource() works.

The device tree is relocated to ensure it is always covered by the
kernel mapping. That way the boot loader does not have to make
any assumptions about kernel's memory layout.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
LKML-Reference: <1298405266-1624-6-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior 19c4f5f7f7 x86: dtb: Add irq domain abstraction
The here introduced irq_domain abstraction represents a generic irq
controller. It is a subset of powerpc's irq_host which is going to be
renamed to irq_domain and then become generic. This implementation will
be removed once it is generic.

The xlate callback is resposible to parse irq informations like irq type
and number and returns the hardware irq number which is reported by the
hardware as active.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-5-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior df2634f43f x86: dtb: Add a device tree for CE4100
History:
v1..v2:
- dropped device_type except for cpu & pci. I have the compatible string
  for pci so I can drop the device_type once it is possible
- I lowercased all compatible types. I will need to resend some patches
  which have upper case intel
- The cpu had the same compatible string as the soc node. So I added to
  the soc node -immr for internel memory mapped registers.
- I added generic names for all parts.
- I reworked the i2c bars matching the way you suggested. I added a
  compatible node for the PCI device which only the PCI ids in its
  compatible string. The bars (each represents a complete i2c
  controller) have a "intel,ce4100-i2c-controller" compatible node. It
  is not used by the driver.
  The driver is probed via PCI ids (by the pci subsystem not OF) and
  matches the bar address against the ressource in the child node. Once
  there is a hit the node is attached.
- The SPI driver is also probed via pci. However I also attached a
  compatible property based on PCI ids

v2..v3:
- intel,ce4100-immr become intel,ce4100-cp. cp stands for core
  peripherals. The Atom data sheet talks here about ACPI devices. Since
  we don't have ACPI this does not apply here.
- The interrupt map is gone. There are now plenty of device nodes.
- The "unit address string" got fixed, it uses not DD,V format.

v3..v4:
- added descriptions for compatible nodes introduced here:
  - intel,ce4100-ioapic
  - intel,ce4100-lapic
  - intel,ce4100-hpet
  - intel,ce4100
  - intel,ce4100-cp
  - intel,ce4100-pci
- added a description about I2C controller magic.
- Added gpio-controller and gpio-cells property to gpio devices. Those
  properties are not (yet) used.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-4-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:52 +01:00
Sebastian Andrzej Siewior da6b737b9a x86: Add device tree support
This patch adds minimal support for device tree on x86. The device
tree blob is passed to the kernel via setup_data which requires at
least boot protocol 2.09.

Memory size, restricted memory regions, boot arguments are gathered
the traditional way so things like cmd_line are just here to let the
code compile.

The current plan is use the device tree as an extension and to gather
information which can not be enumerated and would have to be hardcoded
otherwise. This includes things like 
   - which devices are on this I2C/SPI bus?
   - how are the interrupts wired to IO APIC?
   - where could my hpet be?

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-3-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:52 +01:00
Sebastian Andrzej Siewior f1c2b35714 x86: e820: Remove conditional early mapping in parse_e820_ext
This patch ensures that the memory passed from parse_setup_data() is
large enough to cover the complete data structure. That means that the
conditional mapping in parse_e820_ext() can go.

While here, I also attempt not to map two pages if the address is not
aligned to a page boundary.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-2-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 22:27:52 +01:00
Thomas Gleixner cb4cfd568c Merge branch 'x86/apic' into x86/platform
Reason: Devicetree based ioapic setup depends on the apic changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 20:01:01 +01:00
Thomas Gleixner abb0052289 x86: ioapic: Move trigger defines to io_apic.h
Required for devicetree based io_apic configuration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 19:58:09 +01:00
Thomas Gleixner 710dcda643 x86: ioapic: Implement and use io_apic_setup_irq_pin_once()
io_apic_set_pci_routing() and mp_save_irq() check the pin_programmed
bit before calling io_apic_setup_irq_pin() and set the bit when the
pin was setup.

Move that duplicated code into a separate function and use it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 18:58:09 +01:00
Thomas Gleixner b77cf6a860 x86: ioapic: Remove useless inlines
There is no point to have irq_trigger() and irq_polarity() as wrappers
around the MPBIOS_* camel case functions. Get rid of both the inlines
and the ugly camel case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:38:23 +01:00
Thomas Gleixner 41098ffe05 x86: ioapic: Make a few functions static
No users outside of io_apic.c. Mark bad_ioapic() __init while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:51 +01:00
Thomas Gleixner da1ad9d7b2 x86: ioapic: Use setup function in setup_IO_APIC_irq_extra()
Another version of the same thing. Only set the pin programmed, when
the setup function succeeds.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:50 +01:00
Thomas Gleixner 2d57e37dbf x86: ioapic: Use setup function in __io_apic_setup_irqs()
Replace the duplicated code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:50 +01:00
Thomas Gleixner e0799c04b2 x86: ioapic: Use setup function in __io_apic_set_pci_routing()
The only difference here is that we did not call
__add_pin_to_irq_node() for the legacy irqs, but that's not worth 30
lines of extra code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:49 +01:00
Thomas Gleixner f880ec78fa x86: ioapic: Use new setup function in pre_init_apic_IRQ0()
Remove the duplicated code and call the function. It does not matter
whether we allocated the cfg before calling setup_local_APIC() and we
can set the irq chip and handler after that as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:49 +01:00
Thomas Gleixner ff973d041e x86: ioapic: Add io_apic_setup_irq_pin()
There are about four places in the ioapic code which do exactly the
same setup sequence. Also the OF based ioapic setup needs that
function to avoid putting the OF specific code into ioapic.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:49 +01:00
Thomas Gleixner ed972ccf43 x86: ioapic: Split out the nested loop in setup_IO_APIC_irqs()
Two consecutive

    for(...)
    for(...)

lines to avoid an extra indentation are just horrible to read. I had
to look more than once to figure out what the code is doing.

Split out the inner loop into a separate function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:49 +01:00
Thomas Gleixner c8d6b8fe72 x86: ioapic: Remove silly debug bloat in setup_IOAPIC_irqs()
This is debug code and it does not matter at all whether we print each
not connected pin in an extra line or try to be extra clever.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 17:26:48 +01:00
Thomas Gleixner 939d578ecc x86: OLPC: Make OLPC=n build again
Stupid me missed the functions called from setup.c. Add the stubs back
for OLPC=n

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 11:54:02 +01:00
Henrik Kretzschmar 1444e0c9da x86: Fix deps of X86_UP_IOAPIC
Since commit 7cd92366a5
lAPIC enabled accidently the IOAPIC, which now gets fixed.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-5-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23 11:38:46 +01:00
Henrik Kretzschmar 7d0f192613 x86: Add dummy functions for compiling without IOAPIC
This patch adds IOAPIC dummy functions for compilation
with local APIC, but without IOAPIC.

The local variable ioapic_entries in enable_IR_x2apic()
does not need initialization anymore, since the dummy
returns NULL.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-4-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23 11:38:46 +01:00
Henrik Kretzschmar 7167d08e78 x86: Rework arch_disable_smp_support() for x86
Currently arch_disable_smp_support() on x86 disables only the
support for the IOAPIC and is also compiled in if SMP-support is
not.

Therefore this function is renamed to disable_ioapic_support(),
which meets its purpose and is only compiled in the kernel
when IOAPIC support is also.

A new arch_disable_smp_support() is created in smpboot.c,
which calls disable_ioapic_support() and gets only compiled
in the kernel when SMP support is also.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-3-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23 11:38:45 +01:00
Henrik Kretzschmar b6a1432da8 x86: Add dummy mp_save_irq()
This is a dummy function, used when no IOAPIC is compiled in.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-2-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23 11:38:45 +01:00
Henrik Kretzschmar 4e034b2451 x86: Move ioapic_irq_destination_types to apicdef.h
This enum is used by non IOAPIC code, so apicdef.h is
the best place for it.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23 11:38:44 +01:00
Thomas Gleixner c2a941fadb x86: OLPC: Remove extra OLPC_OPENFIRMWARE_DT indirection
OLPC_OPENFIRMWARE_DT is just there to be selected by OLPC and selects
OF_PROMTREE. So let OLPC select OF_PROMTREE and remove that extra
config indirection. Fixup code and Makefile and use CONFIG_OF_PROMTREE
instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@queued.net>
2011-02-23 10:40:45 +01:00
Thomas Gleixner dc3119e700 x86: OLPC: Cleanup config maze completely
Neither CONFIG_OLPC_OPENFIRMWARE nor CONFIG_OLPC_OPENFIRMWARE_DT are
really necessary.

OLPC selects OLPC_OPENFIRMWARE unconditionally, so move the "select
OF" part under OLPC config option and fixup the dependencies in
Makefiles and code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@queued.net>
2011-02-23 10:40:45 +01:00
Thomas Gleixner fe239545a1 x86: OLPC: Hide OLPC_OPENFIRMWARE config switch
OLPC selects OLPC_OPENFIRMWARE unconditionally. If OLPC=n then
the OLPC_OPENFIRMWARE functionality is pointless.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@queued.net>
2011-02-23 10:40:45 +01:00
Thomas Gleixner 540089798d x86: OLPC: Remove redundant !X64_64 config dependency
OLPC is under if X86_32 already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@queued.net>
2011-02-23 10:40:45 +01:00
Thomas Gleixner 7acdbb3f35 Merge branch 'linus' into x86/platform
Reason: Import mainline device tree changes on which further patches
        depend on or conflict.

Trivial conflict in: drivers/spi/pxa2xx_spi_pci.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-23 09:21:41 +01:00
Zhang, Fengzhe 2f14ddc3a7 xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps.
With the hypervisor argument of dom0_mem=X we iterate over the physical
(only for the initial domain) E820 and subtract the the size from each
E820_RAM region the delta so that the cumulative size of all E820_RAM regions
is equal to 'X'. This sometimes ends up with E820_RAM regions with zero size
(which are removed by e820_sanitize) and E820_RAM that are smaller
than physically.

Later on the PCI API looks at the E820 and attempts to set up an
resource region for the "PCI mem". The E820 (assume dom0_mem=1GB is
set) compared to the physical looks as so:

 [    0.000000] BIOS-provided physical RAM map:
 [    0.000000]  Xen: 0000000000000000 - 0000000000097c00 (usable)
 [    0.000000]  Xen: 0000000000097c00 - 0000000000100000 (reserved)
-[    0.000000]  Xen: 0000000000100000 - 00000000defafe00 (usable)
+[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
 [    0.000000]  Xen: 00000000defafe00 - 00000000defb1ea0 (ACPI NVS)
 [    0.000000]  Xen: 00000000defb1ea0 - 00000000e0000000 (reserved)
 [    0.000000]  Xen: 00000000f4000000 - 00000000f8000000 (reserved)
..
And we get
[    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:9efafe00)

while it should have started at e0000000 (a nice big gap up to
f4000000 exists). The "Allocating PCI" is part of the resource API.

The users that end up using those PCI I/O regions usually supply their
own BARs when calling the resource API (request_resource, or allocate_resource),
but there are exceptions which provide an empty 'struct resource' and
expect the API to provide the 'struct resource' to be populated with valid values.
The one that triggered this bug was the intel AGP driver that requested
a region for the flush page (intel_i9xx_setup_flush).

Before this patch, when running under Xen hypervisor, the 'struct resource'
returned could have (depending on the dom0_mem size) physical ranges of a 'System RAM'
instead of 'I/O' regions. This ended up with the Hypervisor failing a request
to populate PTE's with those PFNs as the domain did not have access to those
'System RAM' regions (rightly so).

After this patch, the left-over E820_RAM region from the truncation, will be
labeled as E820_UNUSABLE. The E820 will look as so:

 [    0.000000] BIOS-provided physical RAM map:
 [    0.000000]  Xen: 0000000000000000 - 0000000000097c00 (usable)
 [    0.000000]  Xen: 0000000000097c00 - 0000000000100000 (reserved)
-[    0.000000]  Xen: 0000000000100000 - 00000000defafe00 (usable)
+[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
+[    0.000000]  Xen: 0000000040000000 - 00000000defafe00 (unusable)
 [    0.000000]  Xen: 00000000defafe00 - 00000000defb1ea0 (ACPI NVS)
 [    0.000000]  Xen: 00000000defb1ea0 - 00000000e0000000 (reserved)
 [    0.000000]  Xen: 00000000f4000000 - 00000000f8000000 (reserved)

For more information:
http://mid.gmane.org/1A42CE6F5F474C41B63392A5F80372B2335E978C@shsmsx501.ccr.corp.intel.com

BugLink: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1726

Signed-off-by: Fengzhe Zhang <fengzhe.zhang@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-22 12:48:50 -05:00
Thomas Gleixner 695884fb8a Merge branch 'devicetree/for-x86' of git://git.secretlab.ca/git/linux-2.6 into x86/platform
Reason: x86 devicetree support for ce4100 depends on those device tree
	changes scheduled for .39.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-22 18:41:48 +01:00
Joerg Roedel 2c46d2aec0 KVM: SVM: Advance instruction pointer in dr_intercept
In the dr_intercept function a new cpu-feature called
decode-assists is implemented and used when available. This
code-path does not advance the guest-rip causing the guest
to dead-loop over mov-dr instructions. This is fixed by this
patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-02-22 16:01:44 +02:00
Yinghai Lu 2bf50555b0 x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance()
Alloc code is much bigger the distance setting.  Separate it out into
numa_alloc_distance() for readability.

-v2: Let alloc_numa_distance to return -ENOMEM on failing path,
     requested by tj.

-tj: Description update.  Minor tweaks including function name,
     location and return value check.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-22 11:18:49 +01:00
Tejun Heo 90e6b677b4 x86-64, NUMA: Add proper function comments to global functions
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Tejun Heo b8ef9172b2 x86-64, NUMA: Move NUMA emulation into numa_emulation.c
Create numa_emulation.c and move all NUMA emulation code there.  The
definitions of struct numa_memblk and numa_meminfo are moved to
numa_64.h.  Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
numa_reset_distance() along with numa_emulation() are made global.

- v2: Internal declarations moved to numa_internal.h as suggested by
      Yinghai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Tejun Heo fbe99959d1 x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
Update numa_emulation() such that, it

- takes @numa_meminfo and @numa_dist_cnt instead of directly
  referencing the global variables.

- copies the distance table by iterating each distance with
  node_distance() instead of memcpy'ing the distance table.

- tests emu_cmdline to determine whether emulation is requested and
  fills emu_nid_to_phys[] with identity mapping if emulation is not
  used.  This allows the caller to call numa_emulation()
  unconditionally and makes return value unncessary.

- defines dummy version if CONFIG_NUMA_EMU is disabled.

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Yinghai Lu 69efcc6d90 x86-64, NUMA: Do not scan two times for setup_node_bootmem()
By the time setup_node_bootmem() is called, all the memblocks are
already registered.  As node_data is allocated from these memblocks,
calling it more than once doesn't make any difference.  Drop the loop.

tj: Dropped comment referencing to the old behavior as suggested by
    David and rephrased the description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-21 11:23:31 +01:00
Kushal Koolwal e19e074b15 x86: Fix reboot problem on VersaLogic Menlow boards
VersaLogic Menlow based boards hang on reboot unless reboot=bios
is used. Add quirk to reboot through the BIOS.

Tested on at least four boards.

Signed-off-by: Kushal Koolwal <kushalkoolwal@gmail.com>
LKML-Reference: <1298152563-21594-1-git-send-email-kushalkoolwal@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-21 08:41:26 +01:00
Dan Carpenter 1396fa9cd2 x86, microcode, AMD: Fix signedness bug in generic_load_microcode()
install_equiv_cpu_table() returns type int.  It uses negative
error codes so using an unsigned type breaks the error handling.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: open list:AMD MICROCODE UPD... <amd64-microcode@amd64.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20110218091716.GA4384@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-20 14:01:32 +01:00
Borislav Petkov 2b15cd96e5 x86, system.h: Drop unused __SAVE/__RESTORE macros
Those are unused since at least the beginning of git history.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <1298044056-31104-1-git-send-email-bp@amd64.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-20 13:38:47 +01:00
H. Peter Anvin 1c4badbdea x86-64, trampoline: Remove unused variable
Removed unused variable left over from development.

Reported-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <AANLkTik6UJ680mWJcu_W+jerLcqPjwjvaXyxB1jAMaG0@mail.gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
2011-02-18 15:50:36 -08:00
H. Peter Anvin ee1b06ea6a x86, reboot: Fix the use of passed arguments in 32-bit BIOS reboot
The initial version of this patch had %eax being a segment and %ecx
being the mode.  I had changed the interfaces, but not the actual
implementation!

Reported-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <AANLkTikxqk=HEw9R-Du=v-1ti1HDGAY9vaNUep2XARaz@mail.gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
2011-02-18 15:47:42 -08:00
jacob.jun.pan@linux.intel.com 5df91509d3 x86: mrst: Remove apb timer read workaround
APB timer current count was unreliable in the earlier silicon, which
could result in time going backwards. This problem has been fixed in
the current silicon stepping. This patch removes the workaround which
was used to check and prevent timer rolling back when APB timer is
used as clocksource device.

The workaround code was also flawed by potential race condition
around the cached read value last_read. Though a fix can be done
by assigning last_read to a local variable at the beginning of
apbt_read_clocksource(), but this is not necessary anymore.

[ tglx: A sane timer on an Intel chip - I can't believe it ]

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
LKML-Reference: <1298065374-25532-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-18 23:14:54 +01:00
Konrad Rzeszutek Wilk 3d74a539ae pci/xen: When free-ing MSI-X/MSI irq->desc also use generic code.
This code path is only run when an MSI/MSI-X PCI device is passed
in to PV DomU.

In 2.6.37 time-frame we over-wrote the default cleanup handler for
MSI/MSI-X irq->desc to be "xen_teardown_msi_irqs". That function
calls the the xen-pcifront driver which can tell the backend to
cleanup/take back the MSI/MSI-X device.

However, we forgot to continue the process of free-ing the MSI/MSI-X
device resources (irq->desc) in the PV domU side. Which is what
the default cleanup handler: default_teardown_msi_irqs did.

Hence we would leak IRQ descriptors.

Without this patch, doing "rmmod igbvf;modprobe igbvf" multiple
times ends with abandoned IRQ descriptors:

 28:          5  xen-pirq-pcifront-msi-x
 29:          8  xen-pirq-pcifront-msi-x
...
130:         10  xen-pirq-pcifront-msi-x

with the end result of running out of IRQ descriptors.

Reviewed-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-18 12:41:53 -05:00
Konrad Rzeszutek Wilk cc0f89c4a4 pci/xen: Cleanup: convert int** to int[]
Cleanup code. Cosmetic change to make the code look easier
to read.

Reviewed-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-18 12:41:49 -05:00
Sebastian Andrzej Siewior 13884c6680 x86/pci: Remove unused variable
|arch/x86/pci/ce4100.c: In function `ce4100_conf_read':
|arch/x86/pci/ce4100.c:257:9: warning: unused variable `retval'

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: dirk.brandewie@gmail.com
LKML-Reference: <1292600033-12271-16-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-18 16:57:18 +01:00
Konrad Rzeszutek Wilk 55cb8cd45e pci/xen: Use xen_allocate_pirq_msi instead of xen_allocate_pirq
xen_allocate_pirq -> xen_map_pirq_gsi -> PHYSDEVOP_alloc_irq_vector IFF
xen_initial_domain() in addition to the kernel side book-keeping side of
things (set chip and handler, update irq_info etc) whereas
xen_allocate_pirq_msi just does the kernel book keeping.

Also xen_allocate_pirq allocates an IRQ in the 1-1 GSI space whereas
xen_allocate_pirq_msi allocates a dynamic one in the >GSI IRQ space.

All of this is uneccessary as this code path is only executed
when we run as a domU PV guest with an MSI/MSI-X PCI card passed in.
Hence we can jump straight to allocating an dynamic IRQ (and
binding it to the proper PIRQ) and skip the rest.

In short: this change is a cosmetic one.

Reviewed-by: Ian Campbell <Ian.Campbell@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-02-18 09:26:37 -05:00
Jan Beulich 58bff947e2 x86: Eliminate pointless adjustment attempts in fixup_irqs()
Not only when an IRQ's affinity equals cpu_online_mask is there
no need to actually try to adjust the affinity, but also when
it's a subset thereof. This particularly avoids adjustment
attempts during system shutdown to any IRQs bound to CPU#0.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Gary Hade <garyhade@us.ibm.com>
LKML-Reference: <4D5D52C2020000780003272C@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-18 08:58:00 +01:00
Jan Beulich 02ca752e41 x86: Remove die_nmi()
With no caller left, the function and the DIE_NMIWATCHDOG
enumerator can both go away.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
LKML-Reference: <4D5D521C0200007800032702@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-18 08:54:05 +01:00
Jan Beulich fd8fa4d3dd x86: Combine printk()s in show_regs_common()
Printing a single character alone when there's an immediately
following printk() is pretty pointless (and wasteful).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D5D535A0200007800032730@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-18 08:52:30 +01:00
Jan Beulich bb3e6251a6 x86: Don't call dump_stack() from arch_trigger_all_cpu_backtrace_handler()
show_regs() already prints two(!) stack traces, no need for a third one.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D5D512902000078000326EE@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-18 08:52:29 +01:00
H. Peter Anvin 3d35ac346e x86, reboot: Move the real-mode reboot code to an assembly file
Move the real-mode reboot code out to an assembly file (reboot_32.S)
which is allocated using the common lowmem trampoline allocator.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4D5DFBE4.7090104@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
2011-02-17 21:05:34 -08:00
H. Peter Anvin 014eea518a x86: Make the GDT_ENTRY() macro in <asm/segment.h> safe for assembly
Make the GDT_ENTRY() macro in  <asm/segment.h> safe for use in
assembly code by guarding the ULL suffixes with _AC() macros.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4D5DFBE4.7090104@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2011-02-17 21:05:13 -08:00
H. Peter Anvin d1ee433539 x86, trampoline: Use the unified trampoline setup for ACPI wakeup
Use the unified trampoline allocation setup to allocate and install
the ACPI wakeup code in low memory.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4D5DFBE4.7090104@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2011-02-17 21:05:06 -08:00
H. Peter Anvin 4822b7fc6d x86, trampoline: Common infrastructure for low memory trampolines
Common infrastructure for low memory trampolines.  This code installs
the trampolines permanently in low memory very early.  It also permits
multiple pieces of code to be used for this purpose.

This code also introduces a standard infrastructure for computing
symbol addresses in the trampoline code.

The only change to the actual SMP trampolines themselves is that the
64-bit trampoline has been made reusable -- the previous version would
overwrite the code with a status variable; this moves the status
variable to a separate location.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4D5DFBE4.7090104@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2011-02-17 21:02:43 -08:00
Len Brown bfb53ccf1c intel_idle: disable Atom/Lincroft HW C-state auto-demotion
Just as we had to disable auto-demotion for NHM/WSM,
we need to do the same for Atom (Lincroft version).

In particular, auto-demotion will prevent Lincroft
from entering the S0i3 idle power saving state.

https://bugzilla.kernel.org/show_bug.cgi?id=25252

Signed-off-by: Len Brown <len.brown@intel.com>
2011-02-17 17:08:48 -05:00
Len Brown 14796fca2b intel_idle: disable NHM/WSM HW C-state auto-demotion
Hardware C-state auto-demotion is a mechanism where the HW overrides
the OS C-state request, instead demoting to a shallower state,
which is less expensive, but saves less power.

Modern Linux should generally get exactly the states it requests.
In particular, when a CPU is taken off-line, it must not be demoted, else
it can prevent the entire package from reaching deep C-states.

https://bugzilla.kernel.org/show_bug.cgi?id=25252

Signed-off-by: Len Brown <len.brown@intel.com>
2011-02-17 17:08:46 -05:00
Yinghai Lu 6d496f9f23 x86-64, NUMA: Put dummy_numa_init() in the init section
dummy_numa_init() is used only during system boot.  Put it in .init
like other NUMA init functions.

- tj: Description update.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-17 15:04:20 +01:00
Yinghai Lu 2ca230baeb x86-64, NUMA: Don't call __pa() with invalid address in numa_reset_distance()
Do not call __pa(numa_distance) if it was not allocated before.
Calling with invalid address triggers VIRTUAL_BUG_ON() in
__phys_addr() if CONFIG_DEBUG_VIRTUAL.

Also reported by Ingo.

 http://thread.gmane.org/gmane.linux.kernel/1101306/focus=1101785

- v2: Change to check existing path as tj requested.
- tj: Description update.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
2011-02-17 15:03:43 +01:00
Akinobu Mita da1016df85 x86: Use bitmap library functions
Use bitmap_set()/bitmap_clear() to fill/zero a region of a
bitmap instead of doing set_bit()/clear_bit() each bit.

This change has been tested with ioperm() and there's no
change in behavior.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <1297867715-20394-1-git-send-email-akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-17 14:59:22 +01:00
Tejun Heo e23bba6044 x86-64, NUMA: Unify emulated distance mapping
NUMA emulation needs to update node distance information.  It did it
by remapping apicid to PXM mapping, even when amdtopology is being
used.  There is no reason to go through such convolution.  The generic
code has all the information necessary to transform the distance table
to the emulated nid space.

Implement generic distance table transformation in numa_emulation()
and drop private implementations in srat_64 and amdtopology_64.  This
makes find_node_by_addr() and fake_physnodes() and related functions
unnecessary, drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo 6b78cb549b x86-64, NUMA: Unify emulated apicid -> node mapping transformation
NUMA emulation changes node mappings and thus apicid -> node mapping
needs to be updated accordingly.  srat_64 and amdtopology_64 did this
separately; however, all the necessary information is the mapping from
emulated nodes to physical nodes which is available in
emu_nid_to_phys[].

Implement common __apicid_to_node[] transformation in numa_emulation()
and drop duplicate implementations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo 1cca534073 x86-64, NUMA: Emulate directly from numa_meminfo
NUMA emulation built physnodes[] array which could only represent
configurations from the physical meminfo and emulated nodes using the
information.  There's no reason to take this extra level of
indirection.  Update emulation functions so that they operate directly
on numa_meminfo.  This simplifies the code and makes emulation layout
behave better with interleaved physical nodes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo 775ee85d7b x86-64, NUMA: Wrap node ID during emulation
Both emulation layout functions - split_nodes[_size]_interleave() -
didn't wrap emulated nid while laying out the fake nodes and tried to
avoid interating over the specified number of nodes, which is fragile.

Now that the emulation code generates numa_meminfo, the node memblks
don't need to be consecutive and emulated node IDs can simply wrap.
This makes the code more robust and is necessary for updates to better
handle the cases where the physical nodes are interleaved.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo c88aea7a70 x86-64, NUMA: Make emulation code build numa_meminfo and share the registration path
NUMA emulation code built nodes[] array and had its own registration
path to set up the emulated nodes.  Update it such that it generates
emulated numa_meminfo and returns control to initmem_init() and shares
the same registration path with non-emulated cases.

Because {acpi|amd}_fake_nodes() expect nodes[] parameter,
fake_physnodes() now generates nodes[] from numa_meminfo.  This will
go away with further updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo 9d073caeb3 x86-64, NUMA: Build and use direct emulated nid -> phys nid mapping
NUMA emulation copied physical NUMA configuration into physnodes[] and
used it to reverse-map emulated nodes to physical nodes, which is
unnecessarily convoluted.  Build emu_nid_to_phys[] array to map
emulated nids directly to the matching physical nids and use it in
numa_add_cpu().

physnodes[] will be removed with further patches.

- v2: Build failure when CONFIG_DEBUG_PER_CPU_MAPS due to missing
  local variable definition fixed.  Reported by Ingo.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo d9c515eacb x86-64, NUMA: Trivial changes to prepare for emulation updates
* Separate out numa_add_memblk_to() from numa_add_memblk() so that
  different numa_meminfo can be used.

* Rename cmdline to emu_cmdline.

* Drop @start/last_pfn from numa_emulation() and use max_pfn directly.

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo ac7136b611 x86-64, NUMA: Implement generic node distance handling
Node distance either used direct node comparison, ACPI PXM comparison
or ACPI SLIT table lookup.  This patch implements generic node
distance handling.  NUMA init methods can call numa_set_distance() to
set distance between nodes and the common __node_distance()
implementation will report the set distance.

Due to the way NUMA emulation is implemented, the generic node
distance handling is used only when emulation is not used.  Later
patches will update NUMA emulation to use the generic distance
mechanism.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 4697bdcc94 x86-64, NUMA: Kill mem_nodes_parsed
With all memory configuration information now carried in numa_meminfo,
there's no need to keep mem_nodes_parsed separate.  Drop it and use
numa_nodes_parsed for CPU / memory-less nodes.

A new helper numa_nodemask_from_meminfo() is added to calculate
memnode mask on the fly which is currently used to set
node_possible_map.

This simplifies NUMA init methods a bit and removes a source of
possible inconsistencies.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 92d4a4371e x86-64, NUMA: Rename cpu_nodes_parsed to numa_nodes_parsed
It's no longer necessary to keep both cpu_nodes_parsed and
mem_nodes_parsed.  In preparation for merge, rename cpu_nodes_parsed
to numa_nodes_parsed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 91556237ec x86-64, NUMA: Kill numa_nodes[]
numa_nodes[] doesn't carry any information which isn't present in
numa_meminfo.  Each entry is simply min/max range of all the memblks
for the node.  This is not only redundant but also inaccurate when
memblks for different nodes interleave - for example,
find_node_by_addr() can return the wrong nodeid.

Kill numa_nodes[] and always use numa_meminfo instead.

* nodes_cover_memory() is renamed to numa_meminfo_cover_memory() and
  now operations on numa_meminfo and returns bool.

* setup_node_bootmem() needs min/max range.  Compute the range on the
  fly.  setup_node_bootmem() invocation is restructured to use outer
  loop instead of hardcoding the double invocations.

* find_node_by_addr() now operates on numa_meminfo.

* setup_physnodes() builds physnodes[] from memblks.  This will go
  away when emulation code is updated to use struct numa_meminfo.

This patch also makes the following misc changes.

* Clearing of nodes_add[] clearing is converted to memset().

* numa_add_memblk() in amd_numa_init() is moved down a bit for
  consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo a844ef46fa x86-64, NUMA: Add common find_node_by_addr()
srat_64.c and amdtopology_64.c had their own versions of
find_node_by_addr() which were basically the same.  Add common one in
numa_64.c and remove the duplicates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 56e827fbde x86-64, NUMA: consolidate and improve memblk sanity checks
memblk sanity check was scattered around and incomplete.  Consolidate
and improve.

* Confliction detection and cutoff_node() logic are moved to
  numa_cleanup_meminfo().

* numa_cleanup_meminfo() clears the unused memblks before returning.

* Check and warn about invalid input parameters in numa_add_memblk().

* Check the maximum number of memblk isn't exceeded in
  numa_add_memblk().

* numa_cleanup_meminfo() is now called before numa_emulation() so that
  the emulation code also uses the cleaned up version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 2e756be447 x86-64, NUMA: make numa_cleanup_meminfo() prettier
* Factor out numa_remove_memblk_from().

* Hole detection doesn't need separate start/end.  Calculate start/end
  once.

* Relocate comment.

* Define iterators at the top and remove unnecessary prefix
  increments.

This prepares for further improvements to the function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo f9c60251c3 x86-64, NUMA: Separate out numa_cleanup_meminfo()
Separate out numa_cleanup_meminfo() from numa_register_memblks().
node_possible_map initialization is moved to the top of the split
numa_register_memblks().

This patch doesn't cause behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:09 +01:00
Tejun Heo 97e7b78d06 x86-64, NUMA: Introduce struct numa_meminfo
Arrays for memblks and nodeids and their length lived in separate
variables making things unnecessarily cumbersome.  Introduce struct
numa_meminfo which contains all memory configuration info.  This patch
doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:08 +01:00
Tejun Heo 8968dab8ad x86-64, NUMA: Remove %NULL @nodeids handling from compute_hash_shift()
numa_emulation() called compute_hash_shift() with %NULL @nodeids which
meant identity mapping between index and nodeid.  Make
numa_emulation() build identity array and drop %NULL @nodeids handling
from populate_memnodemap() and thus from compute_hash_shift().  This
is to prepare for transition to using memblks instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:08 +01:00
Tejun Heo 5d371b08fe x86-64, NUMA: Kill {acpi|amd|dummy}_scan_nodes()
They are empty now.  Kill them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:08 +01:00
Tejun Heo fd0435d8fb x86-64, NUMA: Unify the rest of memblk registration
Move the remaining memblk registration logic from acpi_scan_nodes() to
numa_register_memblks() and initmem_init().

This applies nodes_cover_memory() sanity check, memory node sorting
and node_online() checking, which were only applied to acpi, to all
init methods.

As all memblk registration is moved to common code, active range
clearing is moved to initmem_init() too and removed from bad_srat().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:08 +01:00
Tejun Heo 43a662f04f x86-64, NUMA: Unify use of memblk in all init methods
Make both amd and dummy use numa_add_memblk() to describe the detected
memory blocks.  This allows initmem_init() to call
numa_register_memblk() regardless of init method in use.  Drop custom
memory registration codes from amd and dummy.

After this change, memblk merge/cleanup in numa_register_memblks() is
applied to all init methods.

As this makes compute_hash_shift() and numa_register_memblks() used
only inside numa_64.c, make them static.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:08 +01:00
Tejun Heo ef396ec96c x86-64, NUMA: Factor out memblk handling into numa_{add|register}_memblk()
Factor out memblk handling from srat_64.c into two functions in
numa_64.c.  This patch doesn't introduce any behavior change.  The
next patch will make all init methods use these functions.

- v2: Fixed build failure on 32bit due to misplaced NR_NODE_MEMBLKS.
      Reported by Ingo.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:07 +01:00
Ingo Molnar a3ec4a603f Merge commit 'v2.6.38-rc5' into core/locking
Merge reason: pick up upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:33:41 +01:00
Robert Richter 4979d2729a perf, x86: Add support for AMD family 15h core counters
This patch adds support for AMD family 15h core counters. There are
major changes compared to family 10h. First, there is a new perfctr
msr range for up to 6 counters. Northbridge counters are separate
now. This patch only adds support for core counters. Second, certain
events may only be scheduled on certain counters. For this we need to
extend the event scheduling and constraints.

We use cpu feature flags to calculate family 15h msr address offsets.
This way we later can implement a faster ALTERNATIVE() version for
this.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110215135210.GB5874@erda.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:30:53 +01:00
Robert Richter 73d6e52206 perf, x86: Store perfctr msr addresses in config_base/event_base
Instead of storing the base addresses we can store the counter's msr
addresses directly in config_base/event_base of struct hw_perf_event.
This avoids recalculating the address with each msr access. The
addresses are configured one time. We also need this change to later
modify the address calculation.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1296664860-10886-5-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:30:52 +01:00
Robert Richter 69d8e1e8ac perf, x86: Add new AMD family 15h msrs to perfctr reservation code
This patch allows the reservation of perfctrs with new msr addresses
introduced for AMD cpu family 15h (0xc0010200/0xc0010201, etc).

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1296664860-10886-4-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:30:50 +01:00
Robert Richter 41bf498949 perf, x86: Calculate perfctr msr addresses in helper functions
This patch adds helper functions to calculate perfctr msr addresses.
We need this to later add support for AMD family 15h cpus. For this we
have to change the algorithms to generate the perfctr's msr addresses.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1296664860-10886-3-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:30:50 +01:00
Robert Richter d45dd923fc perf, x86: Use helper function in x86_pmu_enable_all()
Use helper function in x86_pmu_enable_all() to minimize access to
x86_pmu.eventsel in the fast path. The counter's msr address is now
calculated using struct hw_perf_event. Later we add code that
calculates the msr addresses with a table lookup which shouldn't be
done in the fast path.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1296664860-10886-2-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:30:49 +01:00
Ingo Molnar b00560f2d4 Merge branch 'perf/urgent' into perf/core
Merge reason: we need to queue up dependent patch

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 13:27:23 +01:00
Cyrill Gorcunov 7d44ec193d perf, x86: P4 PMU: Fix spurious NMI messages
Several people have reported spurious unknown NMI
messages on some P4 CPUs.

This patch fixes it by checking for an overflow (negative
counter values) directly, instead of relying on the
P4_CCCR_OVF bit.

Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTinfuTfCck_FfaOHrDqQZZehtRzkBum4SpFoO=KJ@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16 12:26:12 +01:00
Tejun Heo 1909554870 x86-64, NUMA: Kill {acpi|amd}_get_nodes()
With common numa_nodes[], common code in numa_64.c can access it
directly.  Copy directly and kill {acpi|amd}_get_nodes().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:07 +01:00
Tejun Heo 206e42087a x86-64, NUMA: Use common numa_nodes[]
ACPI and amd are using separate nodes[] array.  Add numa_nodes[] and
use them in all NUMA init methods.  cutoff_node() cleanup is moved
from srat_64.c to numa_64.c and applied in initmem_init() regardless
of init methods.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:07 +01:00
Tejun Heo 45fe6c78c4 x86-64, NUMA: Move apicid to numa mapping initialization from amd_scan_nodes() to amd_numa_init()
This brings amd initialization behavior closer to that of acpi.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:07 +01:00
Tejun Heo 99df738cd2 x86-64, NUMA: Remove local variable found from amd_numa_init()
Use weight count on mem_nodes_parsed instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:07 +01:00
Tejun Heo ec8cf29b1d x86-64, NUMA: Use common {cpu|mem}_nodes_parsed
ACPI and amd are using separate nodes_parsed masks.  Add
{cpu|mem}_nodes_parsed and use them in all NUMA init methods.
Initialization of the masks and building node_possible_map are now
handled commonly by initmem_init().

dummy_numa_init() is updated to set node 0 on both masks.  While at
it, move the info messages from scan to init.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:07 +01:00
Tejun Heo ffe77a4605 x86-64, NUMA: Restructure initmem_init()
Reorganize initmem_init() such that,

* Different NUMA init methods are iterated in a consistent way.

* Each iteration re-initializes all the parameters and different
  method can be tried after a failure.

* Dummy init is handled the same as other methods.

Apart from how retry after failure, this patch doesn't change the
behavior.  The call sequences are kept equivalent across the
conversion.

After the change, bad_srat() doesn't need to clear apic to node
mapping or worry about numa_off.  Simplified accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00
Tejun Heo d8fc3afc49 x86, NUMA: Move *_numa_init() invocations into initmem_init()
There's no reason for these to live in setup_arch().  Move them inside
initmem_init().

- v2: x86-32 initmem_init() weren't updated breaking 32bit builds.
  Fixed.  Found by Ankita.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00
Tejun Heo a9aec56afa x86-64, NUMA: Wrap acpi_numa_init() so that failure can be indicated by return value
Because of the way ACPI tables are parsed, the generic
acpi_numa_init() couldn't return failure when error was detected by
arch hooks.  Instead, the failure state was recorded and later arch
dependent init hook - acpi_scan_nodes() - would fail.

Wrap acpi_numa_init() with x86_acpi_numa_init() so that failure can be
indicated as return value immediately.  This is in preparation for
further NUMA init cleanups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00
Tejun Heo 940fed2e79 x86-64, NUMA: Unify {acpi|amd}_{numa_init|scan_nodes}() arguments and return values
The functions used during NUMA initialization - *_numa_init() and
*_scan_nodes() - have different arguments and return values.  Unify
them such that they all take no argument and return 0 on success and
-errno on failure.  This is in preparation for further NUMA init
cleanups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00
Tejun Heo 86ef4dbf1f x86, NUMA: Drop @start/last_pfn from initmem_init()
initmem_init() extensively accesses and modifies global data
structures and the parameters aren't even followed depending on which
path is being used.  Drop @start/last_pfn and let it deal with
@max_pfn directly.  This is in preparation for further NUMA init
cleanups.

- v2: x86-32 initmem_init() weren't updated breaking 32bit builds.
  Fixed.  Found by Yinghai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00
Tejun Heo 13081df5dd x86-64, NUMA: Simplify hotplug node handling in acpi_numa_memory_affinity_init()
Hotplug node handling in acpi_numa_memory_affinity_init() was
unnecessarily complicated with storing the original nodes[] entry and
restoring it afterwards.  Simplify it by not modifying the nodes[]
entry for hotplug nodes from the beginning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 12:13:06 +01:00