The initial bootup code uses set_phys_to_machine quite a lot, and after
bootup it would be used by the balloon driver. The balloon driver does have
mutex lock so this should not be necessary - but just in case, add
a WARN_ON if we do hit this scenario. If we do fail this, it is OK
to continue as there is a backup mechanism (VM_IO) that can bypass
the P2M and still set the _PAGE_IOMAP flags.
[v2: Change from WARN to BUG_ON]
[v3: Rebased on top of xen->p2m code split]
[v4: Change from BUG_ON to WARN]
Reviewed-by: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If we find that the PFN is within the P2M as an identity
PFN make sure to tack on the _PAGE_IOMAP flag.
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Our P2M tree structure is a three-level. On the leaf nodes
we set the Machine Frame Number (MFN) of the PFN. What this means
is that when one does: pfn_to_mfn(pfn), which is used when creating
PTE entries, you get the real MFN of the hardware. When Xen sets
up a guest it initially populates a array which has descending
(or ascending) MFN values, as so:
idx: 0, 1, 2
[0x290F, 0x290E, 0x290D, ..]
so pfn_to_mfn(2)==0x290D. If you start, restart many guests that list
starts looking quite random.
We graft this structure on our P2M tree structure and stick in
those MFN in the leafs. But for all other leaf entries, or for the top
root, or middle one, for which there is a void entry, we assume it is
"missing". So
pfn_to_mfn(0xc0000)=INVALID_P2M_ENTRY.
We add the possibility of setting 1-1 mappings on certain regions, so
that:
pfn_to_mfn(0xc0000)=0xc0000
The benefit of this is, that we can assume for non-RAM regions (think
PCI BARs, or ACPI spaces), we can create mappings easily b/c we
get the PFN value to match the MFN.
For this to work efficiently we introduce one new page p2m_identity and
allocate (via reserved_brk) any other pages we need to cover the sides
(1GB or 4MB boundary violations). All entries in p2m_identity are set to
INVALID_P2M_ENTRY type (Xen toolstack only recognizes that and MFNs,
no other fancy value).
On lookup we spot that the entry points to p2m_identity and return the identity
value instead of dereferencing and returning INVALID_P2M_ENTRY. If the entry
points to an allocated page, we just proceed as before and return the PFN.
If the PFN has IDENTITY_FRAME_BIT set we unmask that in appropriate functions
(pfn_to_mfn).
The reason for having the IDENTITY_FRAME_BIT instead of just returning the
PFN is that we could find ourselves where pfn_to_mfn(pfn)==pfn for a
non-identity pfn. To protect ourselves against we elect to set (and get) the
IDENTITY_FRAME_BIT on all identity mapped PFNs.
This simplistic diagram is used to explain the more subtle piece of code.
There is also a digram of the P2M at the end that can help.
Imagine your E820 looking as so:
1GB 2GB
/-------------------+---------\/----\ /----------\ /---+-----\
| System RAM | Sys RAM ||ACPI| | reserved | | Sys RAM |
\-------------------+---------/\----/ \----------/ \---+-----/
^- 1029MB ^- 2001MB
[1029MB = 263424 (0x40500), 2001MB = 512256 (0x7D100), 2048MB = 524288 (0x80000)]
And dom0_mem=max:3GB,1GB is passed in to the guest, meaning memory past 1GB
is actually not present (would have to kick the balloon driver to put it in).
When we are told to set the PFNs for identity mapping (see patch: "xen/setup:
Set identity mapping for non-RAM E820 and E820 gaps.") we pass in the start
of the PFN and the end PFN (263424 and 512256 respectively). The first step is
to reserve_brk a top leaf page if the p2m[1] is missing. The top leaf page
covers 512^2 of page estate (1GB) and in case the start or end PFN is not
aligned on 512^2*PAGE_SIZE (1GB) we loop on aligned 1GB PFNs from start pfn to
end pfn. We reserve_brk top leaf pages if they are missing (means they point
to p2m_mid_missing).
With the E820 example above, 263424 is not 1GB aligned so we allocate a
reserve_brk page which will cover the PFNs estate from 0x40000 to 0x80000.
Each entry in the allocate page is "missing" (points to p2m_missing).
Next stage is to determine if we need to do a more granular boundary check
on the 4MB (or 2MB depending on architecture) off the start and end pfn's.
We check if the start pfn and end pfn violate that boundary check, and if
so reserve_brk a middle (p2m[x][y]) leaf page. This way we have a much finer
granularity of setting which PFNs are missing and which ones are identity.
In our example 263424 and 512256 both fail the check so we reserve_brk two
pages. Populate them with INVALID_P2M_ENTRY (so they both have "missing" values)
and assign them to p2m[1][2] and p2m[1][488] respectively.
At this point we would at minimum reserve_brk one page, but could be up to
three. Each call to set_phys_range_identity has at maximum a three page
cost. If we were to query the P2M at this stage, all those entries from
start PFN through end PFN (so 1029MB -> 2001MB) would return INVALID_P2M_ENTRY
("missing").
The next step is to walk from the start pfn to the end pfn setting
the IDENTITY_FRAME_BIT on each PFN. This is done in 'set_phys_range_identity'.
If we find that the middle leaf is pointing to p2m_missing we can swap it over
to p2m_identity - this way covering 4MB (or 2MB) PFN space. At this point we
do not need to worry about boundary aligment (so no need to reserve_brk a middle
page, figure out which PFNs are "missing" and which ones are identity), as that
has been done earlier. If we find that the middle leaf is not occupied by
p2m_identity or p2m_missing, we dereference that page (which covers
512 PFNs) and set the appropriate PFN with IDENTITY_FRAME_BIT. In our example
263424 and 512256 end up there, and we set from p2m[1][2][256->511] and
p2m[1][488][0->256] with IDENTITY_FRAME_BIT set.
All other regions that are void (or not filled) either point to p2m_missing
(considered missing) or have the default value of INVALID_P2M_ENTRY (also
considered missing). In our case, p2m[1][2][0->255] and p2m[1][488][257->511]
contain the INVALID_P2M_ENTRY value and are considered "missing."
This is what the p2m ends up looking (for the E820 above) with this
fabulous drawing:
p2m /--------------\
/-----\ | &mfn_list[0],| /-----------------\
| 0 |------>| &mfn_list[1],| /---------------\ | ~0, ~0, .. |
|-----| | ..., ~0, ~0 | | ~0, ~0, [x]---+----->| IDENTITY [@256] |
| 1 |---\ \--------------/ | [p2m_identity]+\ | IDENTITY [@257] |
|-----| \ | [p2m_identity]+\\ | .... |
| 2 |--\ \-------------------->| ... | \\ \----------------/
|-----| \ \---------------/ \\
| 3 |\ \ \\ p2m_identity
|-----| \ \-------------------->/---------------\ /-----------------\
| .. +->+ | [p2m_identity]+-->| ~0, ~0, ~0, ... |
\-----/ / | [p2m_identity]+-->| ..., ~0 |
/ /---------------\ | .... | \-----------------/
/ | IDENTITY[@0] | /-+-[x], ~0, ~0.. |
/ | IDENTITY[@256]|<----/ \---------------/
/ | ~0, ~0, .... |
| \---------------/
|
p2m_missing p2m_missing
/------------------\ /------------\
| [p2m_mid_missing]+---->| ~0, ~0, ~0 |
| [p2m_mid_missing]+---->| ..., ~0 |
\------------------/ \------------/
where ~0 is INVALID_P2M_ENTRY. IDENTITY is (PFN | IDENTITY_BIT)
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
[v5: Changed code to use ranges, added ASCII art]
[v6: Rebased on top of xen->p2m code split]
[v4: Squished patches in just this one]
[v7: Added RESERVE_BRK for potentially allocated pages]
[v8: Fixed alignment problem]
[v9: Changed 1<<3X to 1<<BITS_PER_LONG-X]
[v10: Copied git commit description in the p2m code + Add Review tag]
[v11: Title had '2-1' - should be '1-1' mapping]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Setting the pci ops on subsys initcall unconditionally will break
multi platform kernels on anything except ce4100.
Use x86_init.pci.init ops to call this only on real ce4100 platforms.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: sodaville@linutronix.de
LKML-Reference: <20110314093340.GA21026@www.tglx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The caller of ioapic_register_intr() has a pointer to the irq_cfg for
the irq already. Hand it in to avoid a full lookup.
In msi_compose_msg() the pointer to irq_cfg is already available. No
need to look it up again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the functions which take irq_data. We already have a pointer to
irq_data. That avoids a sparse irq lookup in move_*_irq.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the state information in irq_data. That avoids a radix-tree lookup
from apic_ack_level() and simplifies setup_ioapic_dest().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
genirq is switching to a consistent name space for the irq related
functions. Convert x86. Conversion was done with coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The distance transforming in numa_emulation() used to call
numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node
combinations regardless of which are enabled. As numa_set_distance()
ignores all out-of-bound distance settings, this doesn't cause any
problem other than looping unnecessarily many times during boot.
However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the
code such that it iterates through only the enabled combinations.
Yinghai Lu identified the issue and provided an initial patch to
address the issue; however, the patch was incorrect in that it didn't
build emulated distance table when there's no physical distance table
and unnecessarily complex.
http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
The latest binutils (2.21.0.20110302/Ubuntu) breaks the build
yet another time, under CONFIG_XEN=y due to a .size directive that
refers to a slightly differently named (hence, to the now very
strict and unforgiving assembler, non-existent) symbol.
[ mingo:
This unnecessary build breakage caused by new binutils
version 2.21 gets escallated back several kernel releases spanning
several years of Linux history, affecting over 130,000 upstream
kernel commits (!), on CONFIG_XEN=y 64-bit kernels (i.e. essentially
affecting all major Linux distro kernel configs).
Git annotate tells us that this slight debug symbol code mismatch
bug has been introduced in 2008 in commit 3d75e1b8:
3d75e1b8 (Jeremy Fitzhardinge 2008-07-08 15:06:49 -0700 1231) ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
The 'bug' is just a slight assymetry in ENTRY()/END()
debug-symbols sequences, with lots of assembly code between the
ENTRY() and the END():
ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
...
END(do_hypervisor_callback)
Human reviewers almost never catch such small mismatches, and binutils
never even warned about it either.
This new binutils version thus breaks the Xen build on all upstream kernels
since v2.6.27, out of the blue.
This makes a straightforward Git bisection of all 64-bit Xen-enabled kernels
impossible on such binutils, for a bisection window of over hundred
thousand historic commits. (!)
This is a major fail on the side of binutils and binutils needs to turn
this show-stopper build failure into a warning ASAP. ]
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <kees.cook@canonical.com>
LKML-Reference: <1299877178-26063-1-git-send-email-heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we have a guest that asked for:
memory=1024
maxmem=2048
Which means we want 1GB now, and create pagetables so that we can expand
up to 2GB, we would have this E820 layout:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 0000000080800000 (usable)
Due to patch: "xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps."
we would mark the memory past the 1GB mark as unusuable resulting in:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable)
[ 0.000000] Xen: 0000000040000000 - 0000000080800000 (unusable)
which meant that we could not balloon up anymore. We could
balloon the guest down. The fix is to run the code introduced
by the above mentioned patch only for the initial domain.
We will have to revisit this once we start introducing a modified
E820 for PCI passthrough so that we can utilize the P2M identity code.
We also fix an overflow by having UL instead of ULL on 32-bit machines.
[v2: Ian pointed to the overflow issue]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Change futex_atomic_op_inuser and futex_atomic_cmpxchg_inatomic
prototypes to use u32 types for the futex as this is the data type the
futex core code uses all over the place.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Darren Hart <darren@dvhart.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110311025058.GD26122@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The cmpxchg_futex_value_locked API was funny in that it returned either
the original, user-exposed futex value OR an error code such as -EFAULT.
This was confusing at best, and could be a source of livelocks in places
that retry the cmpxchg_futex_value_locked after trying to fix the issue
by running fault_in_user_writeable().
This change makes the cmpxchg_futex_value_locked API more similar to the
get_futex_value_locked one, returning an error code and updating the
original value through a reference argument.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile]
Acked-by: Tony Luck <tony.luck@intel.com> [ia64]
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michal Simek <monstr@monstr.eu> [microblaze]
Acked-by: David Howells <dhowells@redhat.com> [frv]
Cc: Darren Hart <darren@dvhart.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110311024851.GC26122@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch moves some functions and variables into init
sections, makes a function static and removes some lines of
cruft.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <1299826956-8607-2-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The extra tsc_sync.o goal definition is superflous.
CONFIG_X86_64_SMP depends on CONFIG_SMP
and tsc_sync.o is already in the definition of CONFIG_SMP.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <1299826956-8607-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, UV: Initialize the broadcast assist unit base destination node id properly
x86, numa: Fix numa_emulation code with memory-less node0
x86, build: Make sure mkpiggy fails on read error
The function name does not distinguish it from xen_allocate_pirq_msi
(which operates on domU and pvhvm domains rather than dom0).
Hoist domain 0 specific functionality up into the only caller leaving
functionality common to all guest types in xen_bind_pirq_msi_to_irq.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Makes the tail end of this function look even more like
xen_bind_pirq_msi_to_irq.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Split the binding aspect of xen_allocate_pirq_msi out into a new
xen_bind_pirq_to_irq function.
In xen_hvm_setup_msi_irq when allocating a pirq write the MSI message
to signal the PIRQ as soon as the pirq is obtained. There is no way to
free the pirq back so if the subsequent binding to an IRQ fails we
want to ensure that we will reuse the PIRQ next time rather than leak
it.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
apic_register_gsi_xen_hvm is a tiny wrapper around
xen_hvm_register_pirq.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
consistent with other similar functions.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All callers pass this flag so it is pointless.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fixes:
CC arch/x86/pci/xen.o
arch/x86/pci/xen.c:183: warning: 'xen_initdom_setup_msi_irqs' defined but not used
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* stable/pcifront-fixes:
pci/xen: When free-ing MSI-X/MSI irq->desc also use generic code.
pci/xen: Cleanup: convert int** to int[]
pci/xen: Use xen_allocate_pirq_msi instead of xen_allocate_pirq
xen-pcifront: Sanity check the MSI/MSI-X values
xen-pcifront: don't use flush_scheduled_work()
* stable/irq.rework:
xen/irq: Cleanup up the pirq_to_irq for DomU PV PCI passthrough guests as well.
xen: Use IRQF_FORCE_RESUME
xen/timer: Missing IRQF_NO_SUSPEND in timer code broke suspend.
xen: Fix compile error introduced by "switch to new irq_chip functions"
xen: Switch to new irq_chip functions
xen: Remove stale irq_chip.end
xen: events: do not free legacy IRQs
xen: events: allocate GSIs and dynamic IRQs from separate IRQ ranges.
xen: events: add xen_allocate_irq_{dynamic, gsi} and xen_free_irq
xen:events: move find_unbound_irq inside CONFIG_PCI_MSI
xen: handled remapped IRQs when enabling a pcifront PCI device.
genirq: Add IRQF_FORCE_RESUME
Currently the index to the ret_stack is updated and the real return address
is saved in the ret_stack. Then we call the trace function. The trace
function could decide that it doesn't want to trace this function
(ex. set_graph_function does not match) and it will return 0 which means
not to trace this call.
The normal function graph tracer has this code:
if (!(trace->depth || ftrace_graph_addr(trace->func)) ||
ftrace_graph_ignore_irqs())
return 0;
What this states is, if the trace depth (which is curr_ret_stack)
is zero (top of nested functions) then test if we want to trace this
function. If this function is not to be traced, then return 0 and
the rest of the function graph tracer logic will not trace this function.
The problem arises when an interrupt comes in after we updated the
curr_ret_stack. The next function that gets called will have a trace->depth
of 1. Which fools this trace code into thinking that we are in a nested
function, and that we should trace. This causes interrupts to be traced
when they should not be.
The solution is to trace the function first and then update the ret_stack.
Reported-by: zhiping zhong <xzhong86@163.com>
Reported-by: wu zhangjin <wuzhangjin@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-8-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's forbidden to take the page_table_lock with the irq disabled
or if there's contention the IPIs (for tlb flushes) sent with
the page_table_lock held will never run leading to a deadlock.
Nobody takes the pgd_lock from irq context so the _irqsave can be
removed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mm_fault_error() should not execute oom-killer, if page fault
occurs in kernel space. E.g. in copy_from_user()/copy_to_user().
This would happen if we find ourselves in OOM on a
copy_to_user(), or a copy_from_user() which faults.
Without this patch, the kernels hangs up in copy_from_user(),
because OOM killer sends SIG_KILL to current process, but it
can't handle a signal while in syscall, then the kernel returns
to copy_from_user(), reexcute current command and provokes
page_fault again.
With this patch the kernel return -EFAULT from copy_from_user().
The code, which checks that page fault occurred in kernel space,
has been copied from do_sigbus().
This situation is handled by the same way on powerpc, xtensa,
tile, ...
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Return 0 on failure. This will cause the initialization of the driver
to fail and prevent the driver from loading if the BIOS cannot handle
the PCC interface command to "get frequency". Otherwise, the driver
will load and display a very high value like "4294967274" (which is
actually -EINVAL) for frequency:
# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
4294967274
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
CC: stable@kernel.org
Signed-off-by: Dave Jones <davej@redhat.com>
Due to commit 781c5a67f1 it is
likely that the number of areas to scan for BIOS corruption is 0
-- especially when the first 64K is already reserved
(X86_RESERVE_LOW is 64K by default).
If that's the case then don't set up the scan.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: <stable@kernel.org>
LKML-Reference: <20110225202838.2229.71011.sendpatchset@nchumbalkar.americas.hpqcorp.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The BAU's initialization of the broadcast description header is
lacking the coherence domain (high bits) in the nasid. This
causes a catastrophic system failure when running on a system
with multiple coherence domains.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
LKML-Reference: <E1PxKBB-0005F0-3U@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This isn't being referenced anywhere, and the selects done from
it can be easily done together with all the other X86 ones.
v2: Also adjust UML's Kconfig.x86.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D7603DA02000078000351C1@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
New binutils version 2.21.0.20110302-1 started checking that the symbol
parameter to the .size directive matches the entry name's
symbol parameter, unearthing two mismatches:
AS arch/x86/kernel/acpi/wakeup_rm.o
arch/x86/kernel/acpi/wakeup_rm.S: Assembler messages:
arch/x86/kernel/acpi/wakeup_rm.S:12: Error: .size expression with symbol `wakeup_code_start' does not evaluate to a constant
arch/x86/kernel/entry_32.S: Assembler messages:
arch/x86/kernel/entry_32.S:1421: Error: .size expression with
symbol `apf_page_fault' does not evaluate to a constant
The problem was discovered while using Debian's binutils
(2.21.0.20110302-1) and experimenting with binutils from
upstream.
Thanks Alexander and H.J. for the vital help.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <1299620364-21644-1-git-send-email-sedat.dilek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
You can crash the kernel (with root/admin privileges) using kprobe tracer by running:
echo "p system_call_after_swapgs" > ./kprobe_events
echo 1 > ./events/kprobes/enable
The reason is that at the system_call_after_swapgs label, the
kernel stack is not set up. If optimized kprobes are enabled,
the user space stack is being used in this case (see optimized
kprobe template) and this might result in a crash.
There are several places like this over the entry code
(entry_$BIT). As it seems there's no any reasonable/maintainable
way to disable only those places where the stack is not ready, I
switched off the whole entry code from kprobe optimizing.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: acme@redhat.com
Cc: fweisbec@gmail.com
Cc: ananth@in.ibm.com
Cc: davem@davemloft.net
Cc: a.p.zijlstra@chello.nl
Cc: eric.dumazet@gmail.com
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <1298298313-5980-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Put x86 entry code into a separate link section: .entry.text.
Separating the entry text section seems to have performance
benefits - caused by more efficient instruction cache usage.
Running hackbench with perf stat --repeat showed that the change
compresses the icache footprint. The icache load miss rate went
down by about 15%:
before patch:
19417627 L1-icache-load-misses ( +- 0.147% )
after patch:
16490788 L1-icache-load-misses ( +- 0.180% )
The motivation of the patch was to fix a particular kprobes
bug that relates to the entry text section, the performance
advantage was discovered accidentally.
Whole perf output follows:
- results for current tip tree:
Performance counter stats for './hackbench/hackbench 10' (500 runs):
19417627 L1-icache-load-misses ( +- 0.147% )
2676914223 instructions # 0.497 IPC ( +- 0.079% )
5389516026 cycles ( +- 0.144% )
0.206267711 seconds time elapsed ( +- 0.138% )
- results for current tip tree with the patch applied:
Performance counter stats for './hackbench/hackbench 10' (500 runs):
16490788 L1-icache-load-misses ( +- 0.180% )
2717734941 instructions # 0.502 IPC ( +- 0.079% )
5414756975 cycles ( +- 0.148% )
0.206747566 seconds time elapsed ( +- 0.137% )
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: masami.hiramatsu.pt@hitachi.com
Cc: ananth@in.ibm.com
Cc: davem@davemloft.net
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a keyctl op (KEYCTL_INSTANTIATE_IOV) that is like KEYCTL_INSTANTIATE, but
takes an iovec array and concatenates the data in-kernel into one buffer.
Since the KEYCTL_INSTANTIATE copies the data anyway, this isn't too much of a
problem.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
I'm sure it was a mere oversight that the CONFIG_ prefixes are
missing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Dave Jones <davej@redhat.com>
LKML-Reference: <4D7118D30200007800034F79@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-5-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Undetermined entries in emu_nid_to_phys[] are filled with zero
assuming that physical node 0 is always online; however, this might
not be true depending on hardware configuration. Find a physical node
which is actually online and use it instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
This crash happens on a system that does not have RAM on node0.
When numa_emulation is compiled in, and:
1. we boot the system without numa=fake...
2. or we boot the system with numa=fake=128 to make emulation fail
we will get:
[ 0.076025] ------------[ cut here ]------------
[ 0.080004] kernel BUG at arch/x86/mm/numa_64.c:788!
[ 0.080004] invalid opcode: 0000 [#1] SMP
[...]
need to use early_cpu_to_node() directly, because cpu_to_apicid
and apicid_to_node will return node0 that is not onlined.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
LKML-Reference: <4D6ECF72.5010308@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch cleans initmem_init() so that it is more readable and doesn't
use an unnecessary array of function pointers to convolute the flow of
the code. It also makes it obvious that dummy_numa_init() will always
succeed (and documents that requirement) so that the existing BUG() is
never actually reached.
No functional change.
-tj: Updated comment for dummy_numa_init() slightly.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
On Intel Nehalem and Westmere CPUs the generic perf LLC-* events count the
L2 caches, not the real L3 LLC - this was inconsistent with behavior on
other CPUs.
Fixing this requires the use of the special OFFCORE_RESPONSE
events which need a separate mask register.
This has been implemented by the previous patch, now use this infrastructure
to set correct events for the LLC-* on Nehalem and Westmere.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-3-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change logs against Andi's original version:
- Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
- Fixed a major event scheduling issue. There cannot be a ref++ on an
event that has already done ref++ once and without calling
put_constraint() in between. (Stephane Eranian)
- Use thread_cpumask for percore allocation. (Lin Ming)
- Use MSR names in the extra reg lists. (Lin Ming)
- Remove redundant "c = NULL" in intel_percore_constraints
- Fix comment of perf_event_attr::config1
Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
that can be used to monitor any offcore accesses from a core.
This is a very useful event for various tunings, and it's
also needed to implement the generic LLC-* events correctly.
Unfortunately this event requires programming a mask in a separate
register. And worse this separate register is per core, not per
CPU thread.
This patch:
- Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
The extra parameters are passed by user space in the
perf_event_attr::config1 field.
- Adds support to the Intel perf_event core to schedule per
core resources. This adds fairly generic infrastructure that
can be also used for other per core resources.
The basic code has is patterned after the similar AMD northbridge
constraints code.
Thanks to Stephane Eranian who pointed out some problems
in the original version and suggested improvements.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch updates PEBS event constraints for Intel Atom, Nehalem, Westmere.
This patch also reorganizes the PEBS format/constraint detection code. It is
now based on processor model and not PEBS format. Two processors may use the
same PEBS format without have the same list of PEBS events.
In this second version, we simplified the initialization of the PEBS
constraints by leveraging the existing switch() statement in perf_event_intel.c.
We also renamed the constraint tables to be more consistent with regular
constraints.
In this 3rd version, we drop BR_INST_RETIRED.MISPRED from Intel Atom as it does
not seem to work. Use MISPREDICTED_BRANCH_RETIRED instead. Also add FP_ASSIST.*
o both Intel Nehalem and Westmere. I misssed those in the earlier patches.
Events were tested using libpfm4 perf_examples.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d6e6b02.815bdf0a.637b.07a7@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch reverts NUMA affine page table allocation added by commit
1411e0ec31 (x86-64, numa: Put pgtable to local node memory).
The commit made an undocumented change where the kernel linear mapping
strictly follows intersection of e820 memory map and NUMA
configuration. If the physical memory configuration has holes or NUMA
nodes are not properly aligned, this leads to using unnecessarily
smaller mapping size which leads to increased TLB pressure. For
details,
http://thread.gmane.org/gmane.linux.kernel/1104672
Patches to fix the problem have been proposed but the underlying code
needs more cleanup and the approach itself seems a bit heavy handed
and it has been determined to revert the feature for now and come back
to it in the next developement cycle.
http://thread.gmane.org/gmane.linux.kernel/1105959
As init_memory_mapping_high() callsites have been consolidated since
the commit, reverting is done manually. Also, the RED-PEN comment in
arch/x86/mm/init.c is not restored as the problem no longer exists
with memblock based top-down early memory allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
The patches missed an indirect use of IRQF_NO_SUSPEND pulled in via
IRQF_TIMER. The following patch fixes the issue.
With this fixlet PV guest migration works just fine. I also booted the
entire series as a dom0 kernel and it appeared fine.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This happens to not be an issue currently because we take pains to try
to ensure that the GSI-IRQ mapping is 1-1 in a PV guest and that
regular event channels do not clash. However a subsequent patch is
going to break this 1-1 mapping.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
With this patch, we diligently set regions that will be used by the
balloon driver to be INVALID_P2M_ENTRY and under the ownership
of the balloon driver. We are OK using the __set_phys_to_machine
as we do not expect to be allocating any P2M middle or entries pages.
The set_phys_to_machine has the side-effect of potentially allocating
new pages and we do not want that at this stage.
We can do this because xen_build_mfn_list_list will have already
allocated all such pages up to xen_max_p2m_pfn.
We also move the check for auto translated physmap down the
stack so it is present in __set_phys_to_machine.
[v2: Rebased with mmu->p2m code split]
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make functions used strictly in bool context return bool. Also,
fixup used types and comments, and make a local function static,
while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Borislav Petkov <bp@amd64.org>
LKML-Reference: <20110303115932.GA8603@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add extra parentheses around a couple of definitions introduced
by "x86: Cleanup vector usage" and used in assembly macro
arguments, and remove spaces. Without that old (2.16.1) gas
would see more macro arguments than were actually specified.
Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <4D6F81B10200007800034B0B@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
of/promtree: allow DT device matching by fixing 'name' brokenness (v5)
x86: OLPC: have prom_early_alloc BUG rather than return NULL
of/flattree: Drop an uninteresting message to pr_debug level
of: Add missing of_address.h to xilinx ehci driver
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] p4-clockmod: print EST-capable warning message only once
[CPUFREQ] fix BUG on cpufreq policy init failure
[CPUFREQ] Fix another notifier leak in powernow-k8.
[CPUFREQ] Missing "unregister_cpu_notifier" in powernow-k8.c
..similar to what sparc's prom_early_alloc does.
Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Handling of out-of-bounds distances and allocation failure can use
better documentation. Add it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
NUMA distance table handling has the following problems.
* numa_reset_distance() uses numa_distance * sizeof(numa_distance[0])
as the table size when it should be using the square of
numa_distance.
* The same size miscalculation when allocation space for phys_dist in
numa_emulation().
* In numa_emulation(), phys_dist must be reserved; otherwise, the new
emulated distance table may overlap it.
Fix them and, while at it, take numa_distance_cnt resetting in
numa_reset_distance() out of the if block to simplify the code a bit.
David Rientjes reported incorrect handling of distance table during
emulation.
-tj: Edited out numa_alloc_distance() related changes which weren't
necessary and rewrote patch description.
-v2: Ingo was unhappy with 80-column limit induced linebreaks. Let
lines run over 80-column.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: David Rientjes <rientjes@google.com>
This patch adds basic SandyBridge support, including hardware
cache events and PEBS events support.
It has been tested on SandyBridge CPUs with perf stat and also
with PEBS based profiling - both work fine.
The patch does not affect other models.
v2 -> v3:
- fix PEBS event 0xd0 with right umask combinations
- move snb pebs constraint assignment to intel_pmu_init
v1 -> v2:
- add more raw and PEBS events constraints
- use offcore events for LLC-* cache events
- remove the call to Nehalem workaround enable_all function
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1299072424.2175.24.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
'simple' would have required specifying current frame address
and return address location manually, but that's obviously not
the case (and not necessary) here.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6D1082020000780003454C@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ensure build doesn't silently continue despite read failure,
addressing a warning due to the unchecked call.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
LKML-Reference: <AANLkTimxxTMU3=4ry-_zbY6v1xiDi+hW9y1RegTr8vLK@mail.gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Do the notifier registration later, so we don't have to worry
about freeing it if we fail the msr allocation.
Signed-off-by: Dave Jones <davej@redhat.com>
It appears that when powernow-k8 finds that
No compatible ACPI _PSS objects found.
and suggests
Try again with latest BIOS.
it fails the module load, but does not unregister the cpu_notifier that was
registered in powernowk8_init
This ends up leaving freed memory on the cpu notifier list for some other
poor module (e.g. md/raid5) to come along and trip over.
The following might be a partial fix, but I suspect there is probably other
clean-up that is needed.
( https://bugzilla.novell.com/show_bug.cgi?id=655215 has full dmesg traces).
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Some of the items removed were apparently never used, others
simply didn't get removed with their last user.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6BD3A002000078000341F1@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cleaning up and shortening code...
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BD35002000078000341DA@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These weren't part of the initial commit of this code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BCDFF02000078000341B0@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A customer of ours, complained that when setting the reset
vector back to 0, it trashed other data and hung their box.
They noticed when only 4 bytes were set to 0 instead of 8,
everything worked correctly.
Mathew pointed out:
|
| We're supposed to be resetting trampoline_phys_low and
| trampoline_phys_high here, which are two 16-bit values.
| Writing 64 bits is definitely going to overwrite space
| that we're not supposed to be touching.
|
So limit the area modified to u32.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Cc: <stable@kernel.org>
LKML-Reference: <1297139100-424-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Support this_cpu_cmpxchg_double() using the cmpxchg16b and cmpxchg8b
instructions.
-tj: s/percpu_cmpxchg16b/percpu_cmpxchg16b_double/ for consistency and
other cosmetic changes.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
PAGE_SIZE_asm, PAGE_SHIFT_asm, THREAD_SIZE_asm can be safely removed from
asm-offsets.c, and be replaced by their non-'_asm' counterparts in the code
that uses them, since the _AC macro defined in include/linux/const.h makes
PAGE_SIZE/PAGE_SHIFT/THREAD_SIZE work with as.
Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
LKML-Reference: <1298666774-17646-2-git-send-email-psomas@cslab.ece.ntua.gr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems
x86/mrst: Fix apb timer rating when lapic timer is used
x86: Fix reboot problem on VersaLogic Menlow boards
xen_pre_device_suspend is unused on ia64.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Rename old interface to sched_op_compat and rename sched_op_new to
simply sched_op.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Take the opportunity to comment on the semantics of the PV guest
suspend hypercall arguments.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Initialize PV spinlocks on boot CPU right after native_smp_prepare_cpus
(that switch to APIC mode and initialize APIC routing); on secondary
CPUs on CPU_UP_PREPARE.
Enable the usage of event channels to send and receive IPIs when
running as a PV on HVM guest.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Up to now we force enable the local apic in the devicetree setup
uncoditionally and set smp_found_config unconditionally to 1 when a
devicetree blob is available. This breaks, when local apic is disabled
in the Kconfig.
Make it consistent by initializing device tree explicitely before
smp_get_config() so a non lapic configuration could be used as well.
To be functional that would require to implement PIT as an interrupt
host, but the only user of this code until now is ce4100 which
requires apics to be available. So we leave this up to those who need
it.
Tested-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
numa_distance should be sized like the SLIT, an NxN matrix where N is
the highest node id + 1. This patch fixes the calculation to avoid
overflowing the array on the subsequent iteration.
-tj: The original patch used last index to calculate size. Yinghai
pointed out it should be incremented so it is the number of
elements instead of the last index to calculate the size of the
table. Updated accordingly.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
On some SB800 systems polarity for IOAPIC pin2 is wrongly
specified as low active by BIOS. This caused system hangs after
resume from S3 when HPET was used in one-shot mode on such
systems because a timer interrupt was missed (HPET signal is
high active).
For more details see:
http://marc.info/?l=linux-kernel&m=129623757413868
Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 37.x, 32.x
LKML-Reference: <20110224145346.GD3658@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
e820_table_{start|end|top}, which are used to buffer page table
allocation during early boot, are now derived from memblock and don't
have much to do with e820. Change the names so that they reflect what
they're used for.
This patch doesn't introduce any behavior change.
-v2: Ingo found that earlier patch "x86: Use early pre-allocated page
table buffer top-down" caused crash on 32bit and needed to be
dropped. This patch was updated to reflect the change.
-tj: Updated commit description.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Both OLPC and CE4100 activate CONFIG_OF. OLPC uses PROMTREE while CE
uses FLATTREE. Compiling for OLPC only breaks due to missing flat tree
functions and variables.
Use proper wrappers and provide an empty x86_flattree_get_config()
inline so OF=y FLATTREE=n builds and works.
[ tglx: Make it work with HPET_TIMER=n and make a function static ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Need to adjust the clockevent device rating for the structure
that will be registered with clockevent system instead of the
temporary structure.
Without this fix, APB timer rating will be higher than LAPIC
timer such that it can not be released later to be used as the
broadcast timer.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
LKML-Reference: <1298506046-439-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows to load the OF driver based informations from the device
tree. Systems without BIOS may need to perform some initialization.
PowerPC creates a PNP device from the OF information and performs this
kind of initialization in their private PCI quirk. This looks more
generic.
This patch also avoids registering the platform RTC driver on X86 if
we have a device tree blob. Otherwise we would setup the device based
on the hardcoded information in arch/x86 rather than the device tree
based one.
[ tglx: Changed "int of_have_populated_dt()" to bool as recommended by
Grant ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: rtc-linux@googlegroups.com
Cc: Alessandro Zummo <a.zummo@towertech.it>
LKML-Reference: <1298405266-1624-12-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use device tree information to setup IO_APIC configuration, interrupt
routing, HPET and everything else which cannot be enumerated by other
means.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-11-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ioapic_xlate provides a translation from the information in device tree
to ioapic related informations. This includes
- obtaining hw irq which is the vector number "=> pin number + gsi"
- obtaining type (level/edge/..)
- programming this information into ioapic
ioapic_add_ofnode adds an irq_domain based on informations from the device
tree. This information (irq_domain) is required in order to map a device to
its proper interrupt controller.
[ tglx: Adapted to the io_apic changes, which let us move that whole code
to devicetree.c ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-10-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For now we probe these busses and we change this to board dependent
probes once we have to.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-9-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
x86_of_pci_init() does two things:
- it provides a generic irq enable and disable function. enable queries
the device tree for the interrupt information, calls ->xlate on the
irq host and updates the pci->irq information for the device.
- it walks through PCI bus(es) in the device tree and adds its children
(device) nodes to appropriate pci_dev nodes in kernel. So the dtb
node information is available at probe time of the PCI device.
Adding a PCI bus based on the information in the device tree is
currently not supported. Right now direct access via ioports is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-8-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>