For now, all users of ratelimit_state allocates it statically, so
DEFINE_RATELIMIT_STATE() is enough. But, I want to use ratelimit_state
for fs, i.e. per super_block to suppress too many error reports.
So, this adds ratelimit_state_init() to initialize ratelimite_state
which is dynamically allocated, instead of opencoding.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ratelimit_state initialization of printk_ratelimited() seems broken. This
fixes it by using DEFINE_RATELIMIT_STATE() to initialize spinlock
properly.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For misc devices, inode->i_cdev doesn't point to the device drivers own
data. Link between file operations and device driver internal data is
lost. Pass pointer to misc device struct via file private data for driver
open function use.
Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
32-bit Sparc used to only allow usage of 24-bit of it's atomic_t type.
This was corrected with linux 2.6.3 when Keith M Wesolowski changed the
implementation to use the parisc approach of having an array of spinlocks
to protect the atomic_t.
These warnings were also removed from the sparc implementation when the
new implementation was merged in BKrev:402e4949VThdc6D3iaosSFUgabMfvw, but
the warning still remained in some other places without any 24-bit-only
atomic_t implementation inside the kernel.
We should remove these warnings to allow users to rely on the full 32-bit
range of atomic_t.
Signed-off-by: Peter Fritzsche <peter.fritzsche@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quote from Nick piggin's about btrfs patch
- http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg04472.html.
"add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions."
Let's use it instead of private pagevec in ntfs, too.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cached_page and lru_pvec have not been used. Let's remove the arguments.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-4.3.3 produces the warning:
"format not a string literal and no format arguments"
Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <cel@citi.umich.edu>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Handle out-of-range indices before reading what they refer to. And don't
access the one-past-the-end element of the array either.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roel Kluin <roel.kluin@gmail.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.
- Make SHRT_MIN of type s16, not int, for consistency.
[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/char/hangcheck-timer.c is doubly broken. When the overflown value
of TIMER_FREQ is abnormally low, it spams the syslog with KERN_CRIT
messages "Hangcheck: hangcheck value past margin!" But whether it happens
or not depends on HZ and lpj in a complex way. People have hit it
occasionally as far as google search can tell.
First, the following line overflows unsigned long:
# define TIMER_FREQ (HZ*loops_per_jiffy)
Second, and more importantly, loops_per_jiffy has little to do with the
con= version from the the time scale of get_cycles() (aka rdtsc) to the
time scale of jiffies.
The attached patch resolves both of the problems.
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Jan Glauber <jan.glauber@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux does not define __BYTE_ORDER in its endian header files which makes
some header files bend backwards to get at the current endian. Lets
#define __BYTE_ORDER in big_endian.h/litte_endian.h to make it easier for
header files that are used in user space too.
In userspace the convention is that
1. _both_ __LITTLE_ENDIAN and __BIG_ENDIAN are defined,
2. you have to test for e.g. __BYTE_ORDER == __BIG_ENDIAN.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __must_check to error pointer handlers to have the compiler warn about
mistakes like:
if (err)
ERR_PTR(err);
It found two bugs:
Mar 12 Nikula Jani [PATCH] enclosure: fix error path - actually return ERR_PTR() on error
Mar 12 Nikula Jani [PATCH] sunrpc: fix error path - actually return ERR_PTR() on error
Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the menu governor uses the (corrected) next timer as key item
for predicting the idle duration.
It turns out that there are specific cases where this breaks down: There
are cases where we have a very repetitive pattern of idle durations, where
the idle period is pretty much the same, for reasons completely unrelated
to the next timer event. Examples of such repeating patterns are network
loads with irq mitigation, the mouse moving but in theory also the wifi
beacons.
This patch adds a relatively simple detector for such repeating patterns,
where the standard deviation of the last 8 idle periods is compared to a
threshold.
With this extra predictor in place, measurements show that the DECAY
factor can now be increased (the decaying average will now decay slower)
to get an even more stable result.
[arjan@infradead.org: fix bug identified by Frank]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures that handle DMA-non-coherent memory need to set
ARCH_KMALLOC_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the
buffer doesn't share a cache with the others.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
asm-generic/atomic.h has been derived from the mn10300 implementation.
Remove the now duplicated mn10300 implementation by including the generic
version instead.
This adds cmpxchg_local() and cmpxchg64_local() for free to the
architecture, as they are implemented in asm-generic/atomic.h.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Fritzsche <peter.fritzsche@gmx.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Keith M Wesolowski <wesolows@foobazco.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8b505ca8e2 ("serial: 68328serial.c:
remove BAUD_TABLE_SIZE macro") misses one use of BAUD_TABLE_SIZE. So the
resulting 68328serial.c does not compile:
drivers/serial/68328serial.c: In function `m68328_console_setup':
drivers/serial/68328serial.c:1439: error: `BAUD_TABLE_SIZE' undeclared (first use in this function)
drivers/serial/68328serial.c:1439: error: (Each undeclared identifier is reported only once
drivers/serial/68328serial.c:1439: error: for each function it appears in.)
Fix that last use of it.
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Cc: Thiago Farina <tfransosi@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures that handle DMA-non-coherent memory need to set
ARCH_KMALLOC_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the
buffer doesn't share a cache with the others.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extend gdbstub to support more features of gdb remote protocol to keep
gdb-7 and emacs gud mode happy:
(*) The D command. Detach debugger.
(*) The H command. Handle setting the target thread by ignoring it.
(*) The qAttached command. Indicate we 'attached' to an existing process.
(*) The qC command. Indicate that the current thread ID is 0.
(*) The qOffsets command. Indicate that no relocation has been done.
(*) The qSymbol:: command. Indicate that we're not interested in looking up
any symbol addresses.
(*) The qSupported command. Indicate the maximum packet size and the fact
that reverse step and continue aren't supported.
(*) The vCont? command. Indicate that we don't support any of its variants.
Also make it possible to trace the commands and replies without tracing
the individual character I/O.
[akpm@linux-foundation.org: make gdbstub_handle_query() static]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This ensures that platforms with lowmem PAs above 32 bits work correctly
by avoiding truncating the PA during a left shift.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Barry Song <21cnbao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add global mutex zonelists_mutex to fix the possible race:
CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();
In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).
Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).
[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:
1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().
2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()
Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
[<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
[<ffffffff81128407>] __page_cache_alloc+0x67/0x70
[<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
[<ffffffff81132751>] ra_submit+0x21/0x30
[<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
[<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
[<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
[<ffffffff81266cfa>] nfs_file_read+0xca/0x130
[<ffffffff8117b20a>] do_sync_read+0xfa/0x140
[<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
[<ffffffff8117c151>] sys_read+0x51/0x80
[<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
RSP <ffff88000d1e78a8>
---[ end trace 4bda28328b9990db ]
[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No behavior change here.
Move some of setup_per_cpu_pageset() code into a new function
setup_zone_pageset() that will be useful for memory hotplug.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Got this while compiling for ARM/SA1100:
mm/sparse.c: In function '__section_nr':
mm/sparse.c:135: warning: 'root' is used uninitialized in this function
This patch follows Russell King's suggestion for a new calculation for
NR_SECTION_ROOTS. Thanks also to Sergei Shtylyov for pointing out the
existence of the macro DIV_ROUND_UP.
Atsushi Nemoto observed:
: This fix doesn't just silence the warning - it fixes a real problem.
:
: Without this fix, mem_section[] might have 0 size so mem_section[0]
: will share other variable area. For example, I got:
:
: c030c700 b __warned.16478
: c030c700 B mem_section
: c030c701 b __warned.16483
:
: This might cause very strange behavior. Your patch actually fixes it.
Signed-off-by: Marcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br>
Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Sergei Shtylyov <sshtylyov@mvista.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In f4112de6b6 ("mm: introduce
debug_kmap_atomic") I said that debug_kmap_atomic() needs
CONFIG_TRACE_IRQFLAGS_SUPPORT.
It was wrong. (I thought irqs_disabled() is only available when the
architecture has CONFIG_TRACE_IRQFLAGS_SUPPORT)
Remove the #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT check to enable
kmap_atomic() debugging for the architectures which do not have
CONFIG_TRACE_IRQFLAGS_SUPPORT.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add parenthesis in a define. This doesn't change functionality.
checkpatch errors:
1) white space fixes
2) add spaces after comas
Signed-off-by: matt mooney <mfm@muteddisk.com>
Cc: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix minor spelling errors in a few comments; no code changes.
Signed-off-by: matt mooney <mfm@muteddisk.com>
Cc: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable users to online CPUs even if the CPUs belongs to a numa node which
doesn't have onlined local memory.
The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either
in system boot/init period, or at the time of local memory online. For a
numa node without onlined local memory, its zonelists are not initialized
at present. As a result, any memory allocation operations executed by
CPUs within this node will fail. In fact, an out-of-memory error is
triggered when attempt to online CPUs before memory comes to online.
This patch tries to create zonelists for such numa nodes, so that the
memory allocation for this node can be fallback'ed to other nodes.
[akpm@linux-foundation.org: remove unneeded export]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: minskey guo<chaohong.guo@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, we have global isolation vs. memory control group isolation, do
not allow the reclaim entry function to set an arbitrary page isolation
callback, we do not need that flexibility.
And since we already pass around the group descriptor for the memory
control group isolation case, just use it to decide which one of the two
isolator functions to use.
The decisions can be merged into nearby branches, so no extra cost there.
In fact, we save the indirect calls.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This scan control is abused to communicate a return value from
shrink_zones(). Write this idiomatically and remove the knob.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When wb_writeback() hasn't written anything it will re-acquire the inode
lock before calling inode_wait_for_writeback.
This change tests the sync bit first so that is doesn't need to drop &
re-acquire the lock if the inode became available while wb_writeback() was
waiting to get the lock.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If vmscan is under lumpy reclaim mode, it have to ignore referenced bit
for making contenious free pages. but current page_check_references()
doesn't.
Fix it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned. In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2. making our scan more smoothly. Without this, if percent[x] is
underflow, shrink_zone() doesn't scan any pages and suddenly it scans
all pages when priority is zero. With this, even priority isn't zero,
shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision. For system with a lot of memory, this might slightly changes
behavior. For example, in a sequential file read workload, without the
patch, we don't swap any anon pages. With it, if anon memory size is
bigger than 16G, we will see one anon page swapped. The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap is assumed to be 1024
which is common in this workload. So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use mm->task_size instead of TASK_SIZE to ensure that the entire user
address space is migrated. mm->task_size is independent of the calling
task context. TASK SIZE may be dependant on the address space size of the
calling process. Usage of TASK_SIZE can lead to partial address space
migration if the calling process was 32 bit and the migrating process was
64 bit.
Here is the test script used on 64 system with a 32 bit echo process:
mount -t cgroup none /cgroup -o cpuset
cd /cgroup
mkdir 0
echo 1 > 0/cpuset.cpus
echo 0 > 0/cpuset.mems
echo 1 > 0/cpuset.memory_migrate
mkdir 1
echo 1 > 1/cpuset.cpus
echo 1 > 1/cpuset.mems
echo 1 > 1/cpuset.memory_migrate
echo $$ > 0/tasks
64_bit_process &
pid=$!
echo $pid > 1/tasks # This does not migrate all process pages without
# this patch. If 64 bit echo is used or this patch is
# applied, then the full address space of $pid is
# migrated.
To check memory migration, I watched:
grep MemUsed /sys/devices/system/node/node*/meminfo
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons as to why
o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met
In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6. If compaction succeeds, the defer
counters are reset again.
The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone. To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache. This would impact the fast-paths and is not justified at
this time.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation. If the index is below 500, memory will not
be compacted. This choice is arbitrary and not based on data. To help
optimise the system and set a sensible default for this value, this patch
adds a sysctl extfrag_threshold. The kernel will only compact memory if
the fragmentation index is above the extfrag_threshold.
[randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation. With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed. Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap. If compaction fails to free a page of a suitable size, then
reclaim will still occur.
Direct compaction returns as soon as possible. As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.
[akpm@linux-foundation.org: Fix build errors]
[aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a per-node sysfs file called compact. When the file is written to,
each zone in that node is compacted. The intention that this would be
used by something like a job scheduler in a batch system before a job
starts so that the job can allocate the maximum number of hugepages
without significant start-up cost.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
written to the file, all zones are compacted. The expected user of such a
trigger is a job scheduler that prepares the system before the target
application runs.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.
[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fragmentation fragmentation index, is only meaningful if an allocation
would fail and indicates what the failure is due to. A value of -1 such
as in many of the examples above states that the allocation would succeed.
If it would fail, the value is between 0 and 1. A value tending towards
0 implies the allocation failed due to a lack of memory. A value tending
towards 1 implies it failed due to external fragmentation.
For the most part, the huge page size will be the size of interest but not
necessarily so it is exported on a per-order and per-zo basis via
/sys/kernel/debug/extfrag/extfrag_index
> cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unusable free space index measures how much of the available free
memory cannot be used to satisfy an allocation of a given size and is a
value between 0 and 1. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. For
the most part, the huge page size will be the size of interest but not
necessarily so it is exported on a per-order and per-zone basis via
/sys/kernel/debug/extfrag/unusable_index.
> cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
[akpm@linux-foundation.org: Fix allnoconfig]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up. For example
kernel BUG at include/linux/swapops.h:105!
invalid opcode: 0000 [#1] PREEMPT SMP
...
Call Trace:
[<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[<ffffffff813099b5>] page_fault+0x25/0x30
[<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[<ffffffff8111329b>] search_binary_handler+0x173/0x313
[<ffffffff81114896>] do_execve+0x219/0x30a
[<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[<ffffffff8100320a>] stub_execve+0x6a/0xc0
RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
There is a race between shift_arg_pages and migration that triggers this
bug. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed
before migration completes, the migration PTE may not be found leading to
a BUG when the stack is faulted.
This patch causes pages within the temporary stack during exec to be
skipped by migration. It does this by marking the VMA covering the
temporary stack with an otherwise impossible combination of VMA flags.
These flags are cleared when the temporary stack is moved to its final
location.
[kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA so it makes sense.
As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.
[akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies page swap caches and allows
them to be migrated but ensures that no attempt to made to remap the pages
would would potentially try to access an already freed anon_vma.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>