Beginning at commit d52d3997f8 ("ipv6: Create percpu rt6_info"), the
following INFO splat is logged:
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-rc7-next-20150612 #1 Not tainted
-------------------------------
kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
3 locks held by systemd/1:
#0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
#1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
#2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
stack backtrace:
CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
Call Trace:
dump_stack+0x4c/0x6e
lockdep_rcu_suspicious+0xe7/0x120
___might_sleep+0x1d5/0x1f0
__might_sleep+0x4d/0x90
kmem_cache_alloc+0x47/0x250
create_object+0x39/0x2e0
kmemleak_alloc_percpu+0x61/0xe0
pcpu_alloc+0x370/0x630
Additional backtrace lines are truncated. In addition, the above splat
is followed by several "BUG: sleeping function called from invalid
context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
these are the clue to the fix. Routine kmemleak_alloc_percpu() always
uses GFP_KERNEL for its allocations, whereas it should follow the gfp
from its callers.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unit_map[] array has "nr_cpu_ids" number of elements. It's
allocated a few lines earlier in the function. So this test should be
>= instead of >.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When @gfp is specified, the percpu allocator is interested in whether
it contains all of GFP_KERNEL or not. If it does, the normal
allocation path is taken; otherwise, the atomic allocation path.
Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp
contains any part of GFP_KERNEL.
Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of
"!(gfp & GFP_KERNEL)" to decide whether the allocation should be
atomic or not.
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 3189eddbca ("percpu: free percpu allocation info for
uniprocessor system").
The commit causes a hang with a crisv32 image. This may be an architecture
problem, but at least for now the revert is necessary to be able to boot a
crisv32 image.
Cc: Tejun Heo <tj@kernel.org>
Cc: Honggang Li <enjoymindful@gmail.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3189eddbca ("percpu: free percpu allocation info for uniprocessor system")
Cc: stable@vger.kernel.org # Please don't apply 3189eddbca
While updating locking, b38d08f318 ("percpu: restructure locking")
broke pcpu_create_chunk() creation path in pcpu_alloc(). It returns
without releasing pcpu_alloc_mutex. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
The percpu allocator now supports atomic allocations by only
allocating from already populated areas but the mechanism to ensure
that there's adequate amount of populated areas was missing.
This patch expands pcpu_balance_work so that in addition to freeing
excess free chunks it also populates chunks to maintain an adequate
level of populated areas. pcpu_alloc() schedules pcpu_balance_work if
the amount of free populated areas is too low or after an atomic
allocation failure.
* PERPCU_DYNAMIC_RESERVE is increased by two pages to account for
PCPU_EMPTY_POP_PAGES_LOW.
* pcpu_async_enabled is added to gate both async jobs -
chunk->map_extend_work and pcpu_balance_work - so that we don't end
up scheduling them while the needed subsystems aren't up yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
pcpu_reclaim_work will also be used to populate chunks asynchronously.
Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is
renamed to pcpu_balance_workfn() and some of its local variables are
renamed too.
This is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
pcpu_nr_empty_pop_pages counts the number of empty populated pages
across all chunks and chunk->nr_populated counts the number of
populated pages in a chunk. Both will be used to implement pre/async
population for atomic allocations.
pcpu_chunk_[de]populated() are added to update chunk->populated,
chunk->nr_populated and pcpu_nr_empty_pop_pages together. All
successful chunk [de]populations should be followed by the
corresponding pcpu_chunk_[de]populated() calls.
Signed-off-by: Tejun Heo <tj@kernel.org>
An allocation attempt may require extending chunk->map array which
requires GFP_KERNEL context which isn't available for atomic
allocations. This patch ensures that chunk->map array usually keeps
some amount of available space by directly allocating buffer space
during GFP_KERNEL allocations and scheduling async extension during
atomic ones. This should make atomic allocation failures from map
space exhaustion rare.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that pcpu_alloc_area() can allocate only from populated areas,
it's easy to add atomic allocation support to [__]alloc_percpu().
Update pcpu_alloc() so that it accepts @gfp and skips all the blocking
operations and allocates only from the populated areas if @gfp doesn't
contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp()
are added.
While this means that atomic allocations are possible, this isn't
complete yet as there's no mechanism to ensure that certain amount of
populated areas is kept available and atomic allocations may keep
failing under certain conditions.
Signed-off-by: Tejun Heo <tj@kernel.org>
The next patch will conditionalize the population block in
pcpu_alloc() which will end up making a rather large indentation
change obfuscating the actual logic change. This patch puts the block
under "if (true)" so that the next patch can avoid indentation
changes. The defintions of the local variables which are used only in
the block are moved into the block.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Update pcpu_alloc_area() so that it can skip unpopulated areas if the
new parameter @pop_only is true. This is implemented by a new
function, pcpu_fit_in_area(), which determines the amount of head
padding considering the alignment and populated state.
@pop_only is currently always false but this will be used to implement
atomic allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
At first, the percpu allocator required a sleepable context for both
alloc and free paths and used pcpu_alloc_mutex to protect everything.
Later, pcpu_lock was introduced to protect the index data structure so
that the free path can be invoked from atomic contexts. The
conversion only updated what's necessary and left most of the
allocation path under pcpu_alloc_mutex.
The percpu allocator is planned to add support for atomic allocation
and this patch restructures locking so that the coverage of
pcpu_alloc_mutex is further reduced.
* pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new
chunk and populating the allocated area. Everything else is now
protected soley by pcpu_lock.
After this change, multiple instances of pcpu_extend_area_map() may
race but the function already implements sufficient synchronization
using pcpu_lock.
This also allows multiple allocators to arrive at new chunk
creation. To avoid creating multiple empty chunks back-to-back, a
new chunk is created iff there is no other empty chunk after
grabbing pcpu_alloc_mutex.
* pcpu_lock is now held while modifying chunk->populated bitmap.
After this, all data structures are protected by pcpu_lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, pcpu_[de]populate_chunk() were called with the range which
may contain multiple target regions in it and
pcpu_[de]populate_chunk() iterated over the regions. This has the
benefit of batching up cache flushes for all the regions; however,
we're planning to add more bookkeeping logic around [de]population to
support atomic allocations and this delegation of iterations gets in
the way.
This patch moves the region iterations out of
pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
pcpu_reclaim() - so that we can later add logic to track more states
around them. This change may make cache and tlb flushes more frequent
but multi-region [de]populations are rare anyway and if this actually
becomes a problem, it's not difficult to factor out cache flushes as
separate callbacks which are directly invoked from percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
percpu-vm and percpu-km implement separate versions of
pcpu_[de]populate_chunk() and some part which is or should be common
are currently in the specific implementations. Make the following
changes.
* Allocate area clearing is moved from the pcpu_populate_chunk()
implementations to pcpu_alloc(). This makes percpu-km's version
noop.
* Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
to their respective callers so that they are applied to percpu-km
too. This doesn't make any meaningful difference as both functions
are noop for percpu-km; however, this is more consistent and will
help implementing atomic allocation support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, only SMP system free the percpu allocation info.
Uniprocessor system should free it too. For example, one x86 UML
virtual machine with 256MB memory, UML kernel wastes one page memory.
Signed-off-by: Honggang Li <enjoymindful@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)
It hardly could be ever bigger than PAGE_SIZE even for large-scale machine,
but for consistency with its couterpart pcpu_mem_zalloc(),
use pcpu_mem_free() instead.
Commit b4916cb17c ("percpu: make pcpu_free_chunk() use
pcpu_mem_free() instead of kfree()") addressed this problem, but
missed this one.
tj: commit message updated
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 099a19d91c ("percpu: allow limited allocation before slab is online)
Cc: stable@vger.kernel.org
During pcpu_alloc_area(), we might merge the current head with the
previous block. Since we have calculated the max_contig using the
size of previous block before we skip it, and now we update the size
of previous block, so we should renew the max_contig.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
723ad1d90b ("percpu: store offsets instead of lengths in ->map[]")
updated percpu area allocator to use the lowest bit, instead of sign,
to signify whether the area is occupied and forced min align to 2;
unfortunately, it forgot to force the allocation size to be even
causing malfunctions for the very rare odd-sized allocations.
Always force the allocations to be even sized.
tj: Wrote patch description.
Original-patch-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
If we know that first N areas are all in use, we can obviously skip
them when searching for a free one. And that kind of hint is very
easy to maintain.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Current code keeps +-length for each area in chunk->map[]. It has
several unpleasant consequences:
* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
* freeing needs to find the array entry refering to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in. Note that offsets are monotonous, so simple
binary search would do here.
New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>. The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.
In other words, where the old variant would have
4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
0, 5, 13, 16, 21, 32, 133
This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do. Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
... and simplify the results a bit. Makes the next step easier
to deal with - we will be changing the data representation for
chunk->map[] and it's easier to do if the code in question is
not split between pcpu_alloc_area() and pcpu_split_block().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge first patch-bomb from Andrew Morton:
- a couple of misc things
- inotify/fsnotify work from Jan
- ocfs2 updates (partial)
- about half of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm/migrate: remove unused function, fail_migrate_page()
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
mm/migrate: correct failure handling if !hugepage_migration_support()
mm/migrate: add comment about permanent failure path
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
mm: compaction: reset scanner positions immediately when they meet
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
mm: compaction: detect when scanners meet in isolate_freepages
mm: compaction: reset cached scanner pfn's before reading them
mm: compaction: encapsulate defer reset logic
mm: compaction: trace compaction begin and end
memcg, oom: lock mem_cgroup_print_oom_info
sched: add tracepoints related to NUMA task migration
mm: numa: do not automatically migrate KSM pages
mm: numa: trace tasks that fail migration due to rate limiting
mm: numa: limit scope of lock for NUMA migrate rate limiting
mm: numa: make NUMA-migrate related functions static
lib/show_mem.c: show num_poisoned_pages when oom
mm/hwpoison: add '#' to hwpoison_inject
mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
...
Switch to memblock interfaces for early memory allocator instead of
bootmem allocator. No functional change in beahvior than what it is in
current code from bootmem users points of view.
Archs already converted to NO_BOOTMEM now directly use memblock
interfaces instead of bootmem wrappers build on top of memblock. And
the archs which still uses bootmem, these new apis just fallback to
exiting bootmem APIs.
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmalloc already gives a useful macro to calculate the total vmalloc
size. Use it.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
If memory allocation of in pcpu_embed_first_chunk() fails, the
allocated memory is not released correctly. In the release loop also
the non-allocated elements are released which leads to the following
kernel BUG on systems with very little memory:
[ 0.000000] kernel BUG at mm/bootmem.c:307!
[ 0.000000] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 #22
[ 0.000000] task: 0000000000a20ae0 ti: 0000000000a08000 task.ti: 0000000000a08000
[ 0.000000] Krnl PSW : 0400000180000000 0000000000abda7a (__free+0x116/0x154)
[ 0.000000] R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 EA:3
...
[ 0.000000] [<0000000000abdce2>] mark_bootmem_node+0xde/0xf0
[ 0.000000] [<0000000000abdd9c>] mark_bootmem+0xa8/0x118
[ 0.000000] [<0000000000abcbba>] pcpu_embed_first_chunk+0xe7a/0xf0c
[ 0.000000] [<0000000000abcc96>] setup_per_cpu_areas+0x4a/0x28c
To fix the problem now only allocated elements are released. This then
leads to the correct kernel panic:
[ 0.000000] Kernel panic - not syncing: Failed to initialize percpu areas.
...
[ 0.000000] Call Trace:
[ 0.000000] ([<000000000011307e>] show_trace+0x132/0x150)
[ 0.000000] [<0000000000113160>] show_stack+0xc4/0xd4
[ 0.000000] [<00000000007127dc>] dump_stack+0x74/0xd8
[ 0.000000] [<00000000007123fe>] panic+0xea/0x264
[ 0.000000] [<0000000000b14814>] setup_per_cpu_areas+0x5c/0x28c
tj: Flipped if conditional so that it doesn't need "continue".
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
commit 099a19d9('allow limited allocation before slab is online') made
pcpu_alloc_chunk() use pcpu_mem_zalloc() but forgot to update
pcpu_free_chunk() accordingly. This doesn't cause any immediate
problema, but fix it for consistency.
tj: commit message updated
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Kmemleak tracks the percpu allocations via a specific API and the
originally allocated areas must be removed from kmemleak (via
kmemleak_free). The code was already doing this for SMP systems.
Reported-by: Sami Liedes <sami.liedes@iki.fi>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
pcpu_embed_first_chunk() allocates memory for each node, copies percpu
data and frees unused portions of it before proceeding to the next
group. This assumes that allocations for different nodes doesn't
overlap; however, depending on memory topology, the bootmem allocator
may end up allocating memory from a different node than the requested
one which may overlap with the portion freed from one of the previous
percpu areas. This leads to percpu groups for different nodes
overlapping which is a serious bug.
This patch separates out copy & partial free from the allocation loop
such that all allocations are complete before partial frees happen.
This also fixes overlapping frees which could happen on allocation
failure path - out_free_areas path frees whole groups but the groups
could have portions freed at that point.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru>
Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru>
LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
pcpu_dump_alloc_info() was printing continued lines without KERN_CONT.
Use it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kay Sievers <kay.sievers@vrfy.org>
Main features:
- Handle percpu memory allocations (only scanning them, not actually
reporting).
- Memory hotplug support.
Usability improvements:
- Show the origin of early allocations.
- Report previously found leaks even if kmemleak has been disabled by
some error.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAABAgAGBQJPDCI0AAoJEGvWsS0AyF7x+MUQALEQTnREqBgpqa+95Wk8WaEB
F/00mbwRpLVKl1jsfCn4wxFPUGuXS/oaxYztDSTP8BrEzZ5E0Kq+Ejsby9yPLs5r
9nwsoRrxBerUjFHqXjx2xrTkAZQomLesNw5ZkaKFVgBzNo7O63Co4TGuP5J8s03G
7hyewcZvbmzkX1SpqMvPItdUTpK+vwABBHGvYta6NS89Bt9GuexC/NS3o2qy2q6c
2BXhUXSJyYsalxvsYYw+hNOyVWrFJ/TWJKsksg9ANxzcbkLKUat9IpvcR3CTRUpu
L/72GXGCDyMw3YgXs8MBlOk3KXRcobISYCVMsDuVz6tITP7RHCB6rG/Hg55YWxeS
1N2P0kMFkDGVui4pzPZZENUH1QfuwoZ5RpgJ2OCaVnfguLOgGM9k665KT9OScWeC
tpxoS82jGd5RezrgF30yvpLz2CivvjRiEpIXL8o47pg/kESgY1PFnDwTW8imoikt
dTQFZXYeFzjcHkN1YNUXgjNfh+CqCkUXLQ5k+8vQ+9TFWh21thwuzg5AGcK28xTc
6mGzSsJzx2w7IKTCjZ3BGN+IXt/KpC4iKyIEFeNsgy9Z8gU0I0GaMVixQtZFxeEt
asqNBaQGngJ86BeO1bjRB/YKO+F+ZIchJiGN4PNgtc4BGz45LGfKOfRjlku4rmsZ
8OJRqGx5qZykxYhNSHXq
=5lb1
-----END PGP SIGNATURE-----
Merge tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux
Kmemleak patches
Main features:
- Handle percpu memory allocations (only scanning them, not actually
reporting).
- Memory hotplug support.
Usability improvements:
- Show the origin of early allocations.
- Report previously found leaks even if kmemleak has been disabled by
some error.
* tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
kmemleak: Add support for memory hotplug
kmemleak: Handle percpu memory allocation
kmemleak: Report previously found leaks even after an error
kmemleak: When the early log buffer is exceeded, report the actual number
kmemleak: Show where early_log issues come from
per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
case to the page boundary, which is bogus for any non-page-aligned
address.
This affects the only in-tree user of this function - sysfs handler
for per-cpu 'crash_notes' physical address. The trouble is that the
crash_notes per-cpu variable is not page-aligned:
crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
CPU 0: 3711f000
CPU 1: 37129000
CPU 2: 37133000
CPU 3: 3713d000
So, the per-cpu addresses are:
crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4
However, /sys/devices/system/cpu/cpu*/crash_notes says:
/sys/devices/system/cpu/cpu0/crash_notes: 36b57000
/sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
/sys/devices/system/cpu/cpu2/crash_notes: 36b43000
/sys/devices/system/cpu/cpu3/crash_notes: 36b39000
As you can see, all values are rounded down to a page
boundary. Consequently, this is where kexec sets up the NOTE segments,
and thus where the secondary kernel is looking for them. However, when
the first kernel crashes, it saves the notes to the unaligned
addresses, where they are not found.
Fix it by adding offset_in_page() to the translated page address.
-tj: Combined Eugene's and Petr's commit messages.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Cc: stable@kernel.org
This patch adds kmemleak callbacks from the percpu allocator, reducing a
number of false positives caused by kmemleak not scanning such memory
blocks. The percpu chunks are never reported as leaks because of current
kmemleak limitations with the __percpu pointer not pointing directly to
the actual chunks.
Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Add comments about current per_cpu_ptr_to_phys implementation to
explain why the logic is more complicated than necessary.
-tj: relocated comment into kerneldoc comment
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Percpu allocator recorded the cpus which map to the first and last
units in pcpu_first/last_unit_cpu respectively and used them to
determine the address range of a chunk - e.g. it assumed that the
first unit has the lowest address in a chunk while the last unit has
the highest address.
This simply isn't true. Groups in a chunk can have arbitrary positive
or negative offsets from the previous one and there is no guarantee
that the first unit occupies the lowest offset while the last one the
highest.
Fix it by actually comparing unit offsets to determine cpus occupying
the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu
to pcpu_low/high_unit_cpu to avoid confusion.
The chunk address range is used to flush cache on vmalloc area
map/unmap and decide whether a given address is in the first chunk by
per_cpu_ptr_to_phys() and the bug was discovered by invalid
per_cpu_ptr_to_phys() translation for crash_note.
Kudos to Dave Young for tracking down the problem.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
Reported-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
LKML-Reference: <4EC21F67.10905@redhat.com>
Cc: stable @kernel.org
Currently pcpu_mem_alloc() is implemented always return zeroed memory.
So rename it to make user like pcpu_get_pages_and_bitmap() know don't
reinit it.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Unify input section names
percpu: Avoid extra NOP in percpu_cmpxchg16b_double
percpu: Cast away printk format warning
percpu: Always align percpu output section to PAGE_SIZE
Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
On 32-bit systems which don't happen to implicitly define or cast
VMALLOC_START and/or VMALLOC_END to long in their arch headers, the
printk in the percpu code will cause a warning to be emitted:
mm/percpu.c: In function 'pcpu_embed_first_chunk':
mm/percpu.c:1648: warning: format '%lx' expects type 'long unsigned int',
but argument 3 has type 'unsigned int'
So add an explicit cast to unsigned long here.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an
address is in the vmalloc() region or not. This is incorrect on NOMMU as
there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()).
The correct way to do this is to use is_vmalloc_addr(). This encapsulates the
vmalloc() region test in MMU mode and just returns 0 in NOMMU mode.
On FRV in NOMMU mode, the percpu compilation fails without this patch:
mm/percpu.c: In function 'per_cpu_ptr_to_phys':
mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function)
mm/percpu.c:1011: error: (Each undeclared identifier is reported only once
mm/percpu.c:1011: error: for each function it appears in.)
mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function)
mm/percpu.c:1018: warning: control reaches end of non-void function
Signed-off-by: David Howells <dhowells@redhat.com>
Percpu allocator honors alignment request upto PAGE_SIZE and both the
percpu addresses in the percpu address space and the translated kernel
addresses should be aligned accordingly. The calculation of the
former depends on the alignment of percpu output section in the kernel
image.
The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section and the latter takes @align parameter.
Several architectures are using @align smaller than PAGE_SIZE breaking
percpu memory alignment.
This patch removes @align parameter from PERCPU(), renames it to
PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it,
add PCPU_SETUP_BUG_ON() checks such that alignment problems are
reliably detected and remove percpu alignment comment recently added
in workqueue.c as the condition would trigger BUG way before reaching
there.
For um, this patch raises the alignment of percpu area. As the area
is in .init, there shouldn't be any noticeable difference.
This problem was discovered by David Howells while debugging boot
failure on mn10300.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
gameport: use this_cpu_read instead of lookup
x86: udelay: Use this_cpu_read to avoid address calculation
x86: Use this_cpu_inc_return for nmi counter
x86: Replace uses of current_cpu_data with this_cpu ops
x86: Use this_cpu_ops to optimize code
vmstat: User per cpu atomics to avoid interrupt disable / enable
irq_work: Use per cpu atomics instead of regular atomics
cpuops: Use cmpxchg for xchg to avoid lock semantics
x86: this_cpu_cmpxchg and this_cpu_xchg operations
percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
percpu,x86: relocate this_cpu_add_return() and friends
connector: Use this_cpu operations
xen: Use this_cpu_inc_return
taskstats: Use this_cpu_ops
random: Use this_cpu_inc_return
fs: Use this_cpu_inc_return in buffer.c
highmem: Use this_cpu_xx_return() operations
vmstat: Use this_cpu_inc_return for vm statistics
x86: Support for this_cpu_add, sub, dec, inc_return
percpu: Generic support for this_cpu_add, sub, dec, inc_return
...
Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
Now that percpu allocator is mostly stable, there is no reason to
print alloc information with KERN_INFO and clutter the boot messages.
Switch it to KERN_DEBUG.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>