Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress. This made no sense as the failure
to make progress could be completely independent of IO. It was later
replaced by wait_iff_congested() and removed entirely by commit 258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().
This is problematic. If kswapd encounters many pages under writeback
and it continues to scan until it reaches the high watermark then it
will quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.
The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer
was unable to write to the underlying BDI. kswapd bypasses the BDI
congestion as it sets PF_SWAPWRITE but even if this was taken into
account then it would cause direct reclaimers to stall on writeback
which is not desirable.
This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
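The flag handling described above can be sketched in user space as follows. This is a minimal model, not the kernel code: the struct, the flag constant and both function names are hypothetical, and the threshold for "too many" (every isolated page in a batch was under writeback) is an assumption, since the text does not state one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the zone flag described in the message. */
#define ZONE_WRITEBACK_MODEL (1UL << 0)

struct zone_model {
	unsigned long flags;
};

/*
 * Called after a batch of nr_taken pages has been isolated and scanned.
 * Assumption: "too many" means the whole batch was under writeback.
 */
static void note_writeback(struct zone_model *zone,
			   unsigned long nr_taken,
			   unsigned long nr_writeback)
{
	assert(nr_writeback <= nr_taken);
	if (nr_writeback && nr_writeback == nr_taken)
		zone->flags |= ZONE_WRITEBACK_MODEL;
}

/*
 * In this sketch only kswapd stalls, and only when the zone is flagged
 * and the page is both PageReclaim and still under writeback.
 */
static bool kswapd_should_wait(const struct zone_model *zone, bool is_kswapd,
			       bool page_reclaim, bool page_writeback)
{
	return is_kswapd && page_reclaim && page_writeback &&
	       (zone->flags & ZONE_WRITEBACK_MODEL) != 0;
}
```

A partially written-back batch leaves the flag clear, so kswapd keeps scanning; once a whole batch is under writeback, the next PageReclaim page under writeback makes kswapd wait.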
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU
and the zone watermark which is no indication as to whether kswapd
should write pages or not.
This patch tracks if an excessive number of unqueued dirty pages are
being encountered at the end of the LRU. If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.
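A rough user-space sketch of the tracking described above. The flag name, the struct and the helpers are hypothetical, and "excessive" is assumed to mean that every isolated page in a batch was dirty and not yet queued for IO; the text does not give the exact threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag: dirty pages are reaching the tail of the LRU. */
#define ZONE_TAIL_LRU_DIRTY_MODEL (1UL << 0)

struct zone_model {
	unsigned long flags;
};

/* Assumption: flag the zone when the whole batch was unqueued dirty. */
static void note_unqueued_dirty(struct zone_model *zone,
				unsigned long nr_taken,
				unsigned long nr_unqueued_dirty)
{
	assert(nr_unqueued_dirty <= nr_taken);
	if (nr_unqueued_dirty && nr_unqueued_dirty == nr_taken)
		zone->flags |= ZONE_TAIL_LRU_DIRTY_MODEL;
}

/* kswapd decides whether to write pages from the flag, not the priority. */
static bool kswapd_may_writepage(const struct zone_model *zone)
{
	return (zone->flags & ZONE_TAIL_LRU_DIRTY_MODEL) != 0;
}

/* Once the zone is balanced the flag is dropped and writing stops. */
static void zone_balanced(struct zone_model *zone)
{
	zone->flags &= ~ZONE_TAIL_LRU_DIRTY_MODEL;
}
```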
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
zone_*() functions.
Change node_end_pfn() to be a wrapper of pgdat_end_pfn().
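For illustration, the helpers could look roughly like this in a user-space model. The reduced struct and the node_data[] array are stand-ins for pg_data_t and NODE_DATA(), and the exact definition of "empty" is an assumption.

```c
#include <stdbool.h>

/* Reduced stand-in for pg_data_t with only the two fields used here. */
struct pgdat_model {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

#define MAX_NODES_MODEL 4

/* Stand-in for NODE_DATA(nid). */
static struct pgdat_model node_data[MAX_NODES_MODEL];

/* One past the last page frame number spanned by the node. */
static unsigned long pgdat_end_pfn(const struct pgdat_model *pgdat)
{
	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
}

/* Assumed semantics: a node with no start pfn and no span is empty. */
static bool pgdat_is_empty(const struct pgdat_model *pgdat)
{
	return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
}

/* node_end_pfn() becomes a thin wrapper around pgdat_end_pfn(). */
static unsigned long node_end_pfn(int nid)
{
	return pgdat_end_pfn(&node_data[nid]);
}
```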
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factoring out these 2 checks makes it more clear what we are actually
checking for.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.
This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.
Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the synchronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.
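A small sketch of the two helpers and of the "read the end once" pattern mentioned for compaction.c. The struct is a reduced stand-in for struct zone, and count_blocks() is a hypothetical caller added only to show the usage.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for struct zone. */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* One past the last page frame number spanned by the zone. */
static unsigned long zone_end_pfn(const struct zone_model *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

/* True if pfn lies in [zone_start_pfn, zone_end_pfn). */
static bool zone_spans_pfn(const struct zone_model *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

/*
 * Hypothetical caller: the end pfn is read once into a local so the
 * loop sees one consistent value even if the zone were resized.
 */
static unsigned long count_blocks(const struct zone_model *zone,
				  unsigned long block_pages)
{
	unsigned long end_pfn = zone_end_pfn(zone);
	unsigned long pfn, nr = 0;

	assert(block_pages > 0);
	for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn += block_pages)
		nr++;
	return nr;
}
```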
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header. This
patch is necessary because otherwise there would be a circular
dependency between mm_types.h and mm.h.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions test MIGRATE_ISOLATE, and some of those are hot paths, but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.
CMA, memory-hotplug and memory-failure), which is not a common config
option. So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.
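One way to compile the check out is a helper that is constant false when the option is off. The sketch below assumes a helper named is_migrate_isolate() and a reduced migratetype enum; both are illustrative rather than a copy of the kernel headers.

```c
#include <stdbool.h>

/* Reduced, illustrative migratetype list. */
enum migratetype_model {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,	/* only exists when isolation is built in */
#endif
	MIGRATE_TYPES
};

#ifdef CONFIG_MEMORY_ISOLATION
static inline bool is_migrate_isolate(int migratetype)
{
	return migratetype == MIGRATE_ISOLATE;
}
#else
/* Constant false lets the compiler drop the guarded code in hot paths. */
static inline bool is_migrate_isolate(int migratetype)
{
	(void)migratetype;
	return false;
}
#endif
```

Callers then write `if (is_migrate_isolate(mt))` unconditionally and pay nothing when CONFIG_MEMORY_ISOLATION is not set.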
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 702d1a6e07 ("memory-hotplug: fix kswapd looping forever
problem") added an isolated pageblocks counter (nr_pageblock_isolate in
struct zone) and used it to adjust free pages counter in
zone_watermark_ok_safe() to prevent the kswapd looping forever problem.
Then later, commit 2139cbe627 ("cma: fix counting of isolated pages")
fixed accounting of isolated pages in the global free pages counter. It
made the previous zone_watermark_ok_safe() fix unnecessary and
potentially harmful (because isolated pages may now be accounted twice,
making the free pages counter incorrect).
This patch removes the special isolated pageblocks counter altogether
which fixes zone_watermark_ok_safe() free pages check.
Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests show that balancenuma is
incapable of converging for the workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in reports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble for memory hotplug.
	spanned_pages - absent_pages - memmap_pages - dma_reserve
During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages may
have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.
For more discussions about the issue, please refer to:
	http://lkml.org/lkml/2012/11/5/866
	https://patchwork.kernel.org/patch/1346751/
This patchset introduces a new field named "managed_pages" in
struct zone, which counts "pages managed by the buddy system", and reverts
zone->present_pages to count "physical pages existing in a zone", which
also keeps it consistent with pgdat->node_present_pages.
We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.
For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().
The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.
This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
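The initial value described above can be worked through with a small user-space sketch. The struct and function name are hypothetical, and clamping to zero when the reservations exceed the zone size is an assumption made here to avoid unsigned underflow.

```c
#include <stdbool.h>

/* Hypothetical container for the sizes named in the text. */
struct zone_sizes {
	unsigned long spanned_pages;
	unsigned long absent_pages;
	unsigned long memmap_pages;
	unsigned long dma_reserve;
};

/*
 * Initial managed_pages as set at zone init time:
 *   highmem:    spanned_pages - absent_pages
 *   DMA/normal: spanned_pages - absent_pages - memmap_pages - dma_reserve
 * Assumption: clamp to 0 instead of underflowing.
 */
static unsigned long initial_managed_pages(const struct zone_sizes *z,
					   bool is_highmem)
{
	unsigned long pages = z->spanned_pages - z->absent_pages;
	unsigned long reserved = z->memmap_pages + z->dma_reserve;

	if (is_highmem)
		return pages;
	return pages > reserved ? pages - reserved : 0;
}
```

For a zone spanning 1000 pages with 100 absent, 20 memmap pages and a dma_reserve of 5, the initial value is 875 for a DMA/normal zone and 900 for a highmem zone.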
[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits 2139cbe627 ("cma: fix counting of isolated pages") and
d95ea5d18e ("cma: fix watermark checking") introduced a reliable
method of free page accounting when memory is being allocated from CMA
regions, so the workaround introduced earlier by commit 49f223a9cd
("mm: trigger page reclaim in alloc_contig_range() to stabilise
watermarks") can be finally removed.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec. (On
many configurations, calculating zone from page is slower.)
But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded. Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
IP: __mod_zone_page_state+0x9/0x60
Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
Call Trace:
__pagevec_lru_add_fn+0xdf/0x140
pagevec_lru_move_fn+0xb1/0x100
__pagevec_lru_add+0x1c/0x30
lru_add_drain_cpu+0xa3/0x130
lru_add_drain+0x2f/0x40
...
The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone. The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.
So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.
Ah, there was one exception. For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too. In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.
I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.
Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.
This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running. If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.
[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.
To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greater than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.
What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.
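A user-space sketch of the per-node mask, assuming four nodes, a plain bitmask in place of nodemask_t, and an illustrative threshold of 30. Treating a distance equal to the threshold as "within" is also an assumption, and the function names are used here only for illustration.

```c
#include <stdbool.h>

#define NR_NODES 4
#define RECLAIM_DISTANCE_MODEL 30	/* assumed threshold for illustration */

/* Bit j of reclaim_nodes[i] is set if node j is close enough to node i. */
static unsigned int reclaim_nodes[NR_NODES];

/* Build each node's mask from a node distance table. */
static void init_reclaim_nodes(const int (*distance)[NR_NODES])
{
	int i, j;

	for (i = 0; i < NR_NODES; i++) {
		reclaim_nodes[i] = 0;
		for (j = 0; j < NR_NODES; j++)
			if (distance[i][j] <= RECLAIM_DISTANCE_MODEL)
				reclaim_nodes[i] |= 1u << j;
	}
}

/* During zone iteration: reclaim only from zones on nearby nodes. */
static bool zone_allows_reclaim(int local_node, int zone_node)
{
	return (reclaim_nodes[local_node] & (1u << zone_node)) != 0;
}
```

With two pairs of close nodes (distance 20 within a pair, 40 across pairs), node 0 reclaims from zones on nodes 0 and 1 and skips zones on nodes 2 and 3.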
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and are basically a big hammer.
Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.
1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.
2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan. This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.
The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.
Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.
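The skip-bit behaviour can be modelled in user space as below. Everything here is illustrative: one bool per pageblock stands in for the PG_migrate_skip bit, and isolate_even_blocks() is a made-up scanner result used only to drive the example.

```c
#include <stdbool.h>
#include <string.h>

#define NR_PAGEBLOCKS 8

/* One skip hint per pageblock, standing in for the PG_migrate_skip bit. */
struct zone_model {
	bool migrate_skip[NR_PAGEBLOCKS];
};

/* Hypothetical callback: how many pages were isolated from a block. */
typedef unsigned long (*isolate_fn)(unsigned int block);

/* Made-up scanner result: only even blocks yield pages. */
static unsigned long isolate_even_blocks(unsigned int block)
{
	return (block % 2 == 0) ? 4 : 0;
}

/*
 * Scan the zone and return the number of pageblocks actually scanned.
 * Blocks that yield nothing are marked so later scans skip them, unless
 * the caller ignores the hints (as CMA is said to do).
 */
static unsigned int scan_zone(struct zone_model *zone, isolate_fn isolate,
			      bool ignore_skip_hint)
{
	unsigned int block, scanned = 0;

	for (block = 0; block < NR_PAGEBLOCKS; block++) {
		if (!ignore_skip_hint && zone->migrate_skip[block])
			continue;
		scanned++;
		if (isolate(block) == 0)
			zone->migrate_skip[block] = true;
	}
	return scanned;
}

/* Discard the cached information so every block is scanned again. */
static void reset_skip_hints(struct zone_model *zone)
{
	memset(zone->migrate_skip, 0, sizeof(zone->migrate_skip));
}
```

The first pass scans all eight blocks and marks the four that yielded nothing; the second pass scans only the remaining four until the hints are reset.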
The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch
 o CMA always ignores the information
 o If the migrate and free scanner meet then the cached information will
   be discarded if it's at least 5 seconds since the last time the cache
   was discarded
 o If there are a large number of allocation failures, discard the cache.
The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try to allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 7db8889ab0 ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages"). These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand. A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).
[akpm@linux-foundation.org: use conventional migratetype naming]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
min_free_kbytes in advance which is a bit fragile.
This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother but
the system will keep running.
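A rough sketch of the throttling check, under stated assumptions: the PF_MEMALLOC reserve is approximated by the sum of the zones' min watermarks, and "half the reserves in use" is read as free pages having dropped to half of that sum or below. The struct and function names are used here for illustration only.

```c
#include <stdbool.h>

/* Reduced stand-in for the per-zone data the check needs. */
struct zone_model {
	unsigned long min_wmark;	/* min watermark, in pages */
	unsigned long nr_free;		/* free pages in the zone */
};

/* True while more than half of the assumed reserve is still free. */
static bool pfmemalloc_watermark_ok(const struct zone_model *zones,
				    int nr_zones)
{
	unsigned long reserve = 0, free_pages = 0;
	int i;

	for (i = 0; i < nr_zones; i++) {
		reserve += zones[i].min_wmark;
		free_pages += zones[i].nr_free;
	}
	return free_pages > reserve / 2;
}

/* Direct reclaimers are throttled once the check above fails. */
static bool should_throttle_direct_reclaim(const struct zone_model *zones,
					   int nr_zones)
{
	return !pfmemalloc_watermark_ok(zones, nr_zones);
}
```

Raising min_free_kbytes raises the min watermarks, which moves the point at which throttling starts.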
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotplug offlining happens on zone A, it starts to mark freed pages as
MIGRATE_ISOLATE type in buddy to prevent further allocation.
(MIGRATE_ISOLATE is a very ironic type because the pages are apparently in
buddy but we can't allocate them.)
When a memory shortage happens during hotplug offlining, the current task
starts to reclaim, then wakes up kswapd. Kswapd checks the watermark, then
goes to sleep because the current zone_watermark_ok_safe() doesn't consider
the MIGRATE_ISOLATE freed page count. The current task continues to reclaim
in the direct reclaim path without kswapd's help. The problem is that
zone->all_unreclaimable is set only by kswapd, so the current task would
loop forever as shown below.
__alloc_pages_slowpath
restart:
	wake_all_kswapd
rebalance:
	__alloc_pages_direct_reclaim
		do_try_to_free_pages
			if global_reclaim && !all_unreclaimable
				return 1; /* It means we did did_some_progress */
	skip __alloc_pages_may_oom
	should_alloc_retry
		goto rebalance;
If we apply KOSAKI's patch[1], which doesn't depend on kswapd for
setting zone->all_unreclaimable, we can solve this problem by killing some
task in the direct reclaim path. But it still doesn't wake up kswapd. It
could still be a problem if another subsystem needs a GFP_ATOMIC request. So
kswapd should consider MIGRATE_ISOLATE when it calculates free pages BEFORE
going to sleep.
This patch counts the number of MIGRATE_ISOLATE pageblocks, and
zone_watermark_ok_safe() will consider it if the system has such blocks
(fortunately, this is very rare, so the overhead is not a problem, and
kswapd is never a hot path).
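The adjustment can be sketched as follows. This is a simplified user-space model: the pageblock size of 512 pages is an assumption, the watermark test is reduced to a single comparison, and treating every page of an isolated pageblock as free makes the correction an upper bound rather than an exact count.

```c
#include <stdbool.h>

#define PAGEBLOCK_NR_PAGES_MODEL 512UL	/* assumed pageblock size */

/* Reduced stand-in for struct zone. */
struct zone_model {
	unsigned long nr_free_pages;		/* includes isolated free pages */
	unsigned long nr_pageblock_isolate;	/* isolated pageblocks */
};

/* Upper bound on free pages that sit in MIGRATE_ISOLATE pageblocks. */
static unsigned long nr_zone_isolate_freepages(const struct zone_model *zone)
{
	return zone->nr_pageblock_isolate * PAGEBLOCK_NR_PAGES_MODEL;
}

/* Simplified watermark check that discounts isolated free pages first. */
static bool zone_watermark_ok_safe_model(const struct zone_model *zone,
					 unsigned long mark)
{
	unsigned long free_pages = zone->nr_free_pages;
	unsigned long isolated = nr_zone_isolate_freepages(zone);

	free_pages = free_pages > isolated ? free_pages - isolated : 0;
	return free_pages > mark;
}
```

With 2000 free pages and 3 isolated pageblocks, only 464 pages count as allocatable, so a watermark of 500 is no longer met and kswapd keeps working instead of going to sleep.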
Copy/modify from Mel's quote
"
Ideal solution would be "allocating" the pageblock.
It would keep the free space accounting as it is but historically,
memory hotplug didn't allocate pages because it would be difficult to
detect if a pageblock was isolated or if part of some balloon.
Allocating just full pageblocks would work around this, However,
it would play very badly with CMA.
"
[1] http://lkml.org/lkml/2012/6/14/74
[akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
[akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);
		build_zonelists(pgdat);
		build_zonelist_cache(pgdat);
	}
We expect a zonelist to be created for the new pgdat, but that does not
happen. So add a new parameter "pgdat" to build_all_zonelists() so that it
builds zonelists for the new pgdat too.
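The idea can be sketched as follows, assuming a toy model of the node
table. The names nodes[], struct node_model and build_all_zonelists_model()
are hypothetical; the sketch only shows why an explicit pointer to the new,
still offline node is needed.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4

/* Hypothetical, simplified stand-in for pg_data_t. */
struct node_model {
	bool online;
	bool zonelist_built;
};

static struct node_model nodes[MAX_NODES];

/*
 * Build zonelists for every online node and, if 'extra' is non-NULL,
 * also for that node even though it is not marked online yet.
 */
static void build_all_zonelists_model(struct node_model *extra)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (nodes[nid].online)
			nodes[nid].zonelist_built = true;
	}

	/* The hot-added node is skipped by the loop above. */
	if (extra && !extra->online)
		extra->zonelist_built = true;
}
```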
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0ee332c145 ("memblock: Kill early_node_map[]") wanted to replace
CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but
ended up replacing one occurrence with a reference to the non-existent
symbol CONFIG_HAVE_MEMBLOCK_NODE.
The resulting omission of code would probably have been causing problems
to 32-bit machines with memory hotplug.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially; in that
case we restart from the end of the zone and try once more.
Forced full (order == -1) compactions are left alone.
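A rough model of the resume-and-restart behaviour described above. The
structures, field names and helpers here are hypothetical simplifications,
not the kernel's code: the free scanner resumes from a cached position for
order > 0, restarts once from the zone end if it meets the migrate scanner,
and a forced full compaction always starts from the zone end.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for struct zone and struct compact_control. */
struct zone_model {
	unsigned long start_pfn;
	unsigned long end_pfn;
	unsigned long cached_free_pfn; /* where the free scanner stopped last time */
};

struct compact_model {
	int order;                 /* -1 means a forced full compaction */
	unsigned long migrate_pfn; /* migrate scanner, moves up from the zone start */
	unsigned long free_pfn;    /* free scanner, moves down from the zone end */
	bool wrapped;              /* already restarted from the zone end once */
};

static void compact_start(const struct zone_model *z, struct compact_model *cc)
{
	cc->migrate_pfn = z->start_pfn;
	cc->wrapped = false;
	/* Only order > 0 compaction resumes from the remembered position. */
	cc->free_pfn = cc->order > 0 ? z->cached_free_pfn : z->end_pfn;
}

/* Remember how far down the free scanner got. */
static void compact_note_free_pfn(struct zone_model *z,
				  const struct compact_model *cc)
{
	if (cc->order > 0 && cc->free_pfn < z->cached_free_pfn)
		z->cached_free_pfn = cc->free_pfn;
}

/* True when the two scanners have met and no restart is left to try. */
static bool compact_finished_model(struct zone_model *z,
				   struct compact_model *cc)
{
	if (cc->free_pfn > cc->migrate_pfn)
		return false; /* scanners have not met yet */

	if (cc->order > 0 && !cc->wrapped) {
		/* Started from a cached position: retry once from the zone end. */
		cc->free_pfn = z->end_pfn;
		z->cached_free_pfn = z->end_pfn;
		cc->wrapped = true;
		return false;
	}
	return true;
}
```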
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
Conflicts:
include/linux/mmzone.h
Synced with Linus' tree so that trivial patch can be applied
on top of up-to-date code properly.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
This is the first stage of struct mem_cgroup_zone removal. Further
patches replace struct mem_cgroup_zone with a pointer to struct lruvec.
If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().
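To illustrate the container_of() case: when the lruvec is embedded in the
zone, the zone can be recovered from the lruvec pointer by subtracting the
member offset. The cut-down structures below are hypothetical and the macro
is a userspace equivalent of the kernel's container_of().

```c
#include <stddef.h>

/* Hypothetical, cut-down versions of the kernel structures. */
struct lruvec {
	int nr_lists; /* placeholder; the real struct holds the LRU list heads */
};

struct zone {
	const char *name;
	struct lruvec lruvec; /* embedded directly in the zone */
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map an embedded lruvec back to the zone that contains it. */
static struct zone *lruvec_zone(struct lruvec *lruvec)
{
	return container_of(lruvec, struct zone, lruvec);
}
```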
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone when
memcg is disabled but per-memcg per-zone when it's enabled.
We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats, the one
which get_scan_count() will actually use.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
completely remove the anon/file and active/inactive lru type filters from
__isolate_lru_page(), because isolation for 0-order reclaim always
isolates pages from the right lru list. Page isolation for lumpy
shrink_inactive_list() or memory compaction is allowed to isolate
pages from all evictable lru lists anyway.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
"These patches contain two major updates for DMA mapping subsystem
(mainly for ARM architecture). First one is Contiguous Memory
Allocator (CMA) which makes it possible for device drivers to allocate
big contiguous chunks of memory after the system has booted.
The main difference from the similar frameworks is the fact that CMA
allows to transparently reuse the memory region reserved for the big
chunk allocation as a system memory, so no memory is wasted when no
big chunk is allocated. Once the alloc request is issued, the
framework migrates system pages to create space for the required big
chunk of physically contiguous memory.
For more information one can refer to nice LWN articles:
- 'A reworked contiguous memory allocator':
http://lwn.net/Articles/447405/
- 'CMA and ARM':
http://lwn.net/Articles/450286/
- 'A deep dive into CMA':
http://lwn.net/Articles/486301/
- and the following thread with the patches and links to all previous
versions:
https://lkml.org/lkml/2012/4/3/204
The main client for this new framework is ARM DMA-mapping subsystem.
The second part provides a complete redesign in ARM DMA-mapping
subsystem. The core implementation has been changed to use common
struct dma_map_ops based infrastructure with the recent updates for
new dma attributes merged in v3.4-rc2. This makes it possible to use more
than one implementation of dma-mapping calls and change/select them on a
per struct device basis. The first client of this new infrastructure is
the dmabounce implementation, which has been completely cut out of the
core, common code.
The last patch of this redesign update introduces a new, experimental
implementation of dma-mapping calls on top of the generic IOMMU framework.
This lets an ARM sub-platform transparently use an IOMMU for DMA-mapping
calls if it provides the required IOMMU hardware.
For more information please refer to the following thread:
http://www.spinics.net/lists/arm-kernel/msg175729.html
The last patch merges changes from both updates and provides a
resolution for the conflicts which cannot be avoided when patches have
been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."
Acked by Andrew Morton <akpm@linux-foundation.org>:
"Yup, this one please. It's had much work, plenty of review and I
think even Russell is happy with it."
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
ARM: dma-mapping: use PMD size for section unmap
cma: fix migration mode
ARM: integrate CMA with DMA-mapping subsystem
X86: integrate CMA with DMA-mapping subsystem
drivers: add Contiguous Memory Allocator
mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
mm: extract reclaim code from __alloc_pages_direct_reclaim()
mm: Serialize access to min_free_kbytes
mm: page_isolation: MIGRATE_CMA isolation functions added
mm: mmzone: MIGRATE_CMA migration type added
mm: page_alloc: change fallbacks array handling
mm: page_alloc: introduce alloc_contig_range()
mm: compaction: export some of the functions
mm: compaction: introduce isolate_freepages_range()
mm: compaction: introduce map_pages()
mm: compaction: introduce isolate_migratepages_range()
mm: page_alloc: remove trailing whitespace
ARM: dma-mapping: add support for IOMMU mapper
ARM: dma-mapping: use alloc, mmap, free from dma_ops
ARM: dma-mapping: remove redundant code and do the cleanup
...
Conflicts:
arch/x86/include/asm/dma-mapping.h
alloc_contig_range() performs memory allocation, so it should also keep
the memory watermarks at the correct level. This commit adds a call to
*_slowpath style reclaim to grab enough pages to make sure that the final
collection of contiguous pages from the freelists will not starve the
system.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.
This guarantees (to some degree) that a page in a MIGRATE_CMA
pageblock can always be migrated somewhere else (unless there's no
memory left in the system).
It is designed to be used for allocating big chunks (e.g. 10MiB)
of physically contiguous memory. Once a driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.
To minimise the number of migrations, the MIGRATE_CMA migration
type is the last type tried when the page allocator falls back to
other migration types.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone, even if
we only need compaction for an order-2 allocation, e.g. for jumbo frames
networking.
The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.
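A simplified model of per-order deferral, under stated assumptions: the
struct, the function names and the MAX_DEFER_SHIFT cap are illustrative and
not taken from this patch. It shows how remembering the lowest failing
order lets smaller orders bypass the back-off entirely.

```c
#include <stdbool.h>

#define MAX_DEFER_SHIFT 6 /* assumed cap on the back-off, for illustration */

/* Hypothetical stand-in for the per-zone compaction state. */
struct zone_model {
	unsigned int considered;  /* attempts seen since the last failure */
	unsigned int defer_shift; /* back-off window is 1 << defer_shift */
	int order_failed;         /* lowest order at which compaction failed */
};

/* Record a failure at 'order' and widen the back-off window. */
static void defer_compaction_model(struct zone_model *z, int order)
{
	z->considered = 0;
	if (z->defer_shift < MAX_DEFER_SHIFT)
		z->defer_shift++;
	if (order < z->order_failed)
		z->order_failed = order;
}

/* Record a success: reset the back-off and raise the failing order. */
static void compaction_succeeded_model(struct zone_model *z, int order)
{
	z->considered = 0;
	z->defer_shift = 0;
	if (order >= z->order_failed)
		z->order_failed = order + 1;
}

/* Should compaction for 'order' be skipped this time? */
static bool compaction_deferred_model(struct zone_model *z, int order)
{
	unsigned int limit = 1U << z->defer_shift;

	/* Orders below the failing order are never deferred. */
	if (order < z->order_failed)
		return false;

	if (++z->considered > limit)
		z->considered = limit;
	return z->considered < limit;
}
```

In this model a failed order-9 attempt leaves order-2 requests untouched,
which is the behaviour the fix is after.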
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
it was meaningless to pick the page and re-add it to the LRU list. This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.
This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.
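One way to picture the filter is the sketch below. It is an assumption
about what "no possibility of migrating" means in practice (pages under
writeback, and dirty pages whose mapping cannot migrate them without
blocking); the struct, its fields and the helper name are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical summary of the page state the filter looks at. */
struct page_model {
	bool writeback;           /* IO in flight; a caller could only block */
	bool dirty;
	bool mapping_can_migrate; /* assumed: mapping can migrate dirty pages */
};

/* Is it worth taking this page off the LRU for non-blocking migration? */
static bool worth_isolating_for_async_migration(const struct page_model *p)
{
	/* Nothing useful can be done with a page under writeback. */
	if (p->writeback)
		return false;

	/* A dirty page is only useful if its mapping can migrate it. */
	if (p->dirty && !p->mapping_can_migrate)
		return false;

	return true;
}
```

Pages rejected here stay where they are on the LRU, so their position in
the list is not disturbed by a pointless isolate and putback cycle.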
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having a unified structure with an LRU list set for both global zones and
per-memcg zones keeps the code that deals with LRU lists simple, as it
does not need to care about the container itself.
Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |                        |
    +----------+                       --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
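The arithmetic can be sketched as follows, assuming a toy zone model. The
struct and helper names are hypothetical; the per-zone reserve is taken as
the largest lowmem reserve plus the high watermark, and the sum of those
reserves is subtracted from the global free plus reclaimable total.

```c
#define MAX_ZONES 3

/* Hypothetical, simplified stand-in for struct zone. */
struct zone_model {
	unsigned long free_pages;
	unsigned long reclaimable_pages; /* pages reclaim could free */
	unsigned long high_wmark;        /* kswapd tries to keep this free */
	unsigned long lowmem_reserve[MAX_ZONES];
};

/* Pages in this zone that should not be counted as dirtyable. */
static unsigned long zone_dirty_reserve(const struct zone_model *z)
{
	unsigned long max = 0;

	for (int i = 0; i < MAX_ZONES; i++) {
		if (z->lowmem_reserve[i] > max)
			max = z->lowmem_reserve[i];
	}
	return max + z->high_wmark;
}

/* Global dirtyable pages with the summed per-zone reserves removed. */
static unsigned long global_dirtyable_pages(const struct zone_model *zones,
					    int nr_zones)
{
	unsigned long total = 0, reserve = 0;

	for (int i = 0; i < nr_zones; i++) {
		total += zones[i].free_pages + zones[i].reclaimable_pages;
		reserve += zone_dirty_reserve(&zones[i]);
	}
	return total > reserve ? total - reserve : 0;
}
```

For example, two zones with 500 and 1000 free plus reclaimable pages and
reserves of 110 and 80 pages give 1310 dirtyable pages instead of 1500.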
[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
there's no user of early_node_map[] left. Kill early_node_map[] and
replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
as page_alloc.c would no longer host an alternative implementation.
This change is ultimately a one-to-one mapping and shouldn't cause any
observable difference; however, after the recent changes, there are
some functions which now would fit memblock.c better than page_alloc.c
and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
doesn't make much sense on some of them. Further cleanups for
functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.
-v2: Fix compile bug introduced by mis-spelling
CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
mmzone.h. Reported by Stephen Rothwell.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle. This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned. This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd. Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low). That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied. In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particular zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings. The command line for fs_mark when
booted with 512M looked something like
./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM. For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores. The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available. Swap is on a
separate disk. Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
512M1P-xfs Elapsed Time fsmark 51.41 48.29
512M1P-xfs Elapsed Time simple-wb 114.09 108.61
512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
512M1P-xfs Kswapd efficiency fsmark 62% 63%
512M1P-xfs Kswapd efficiency simple-wb 56% 61%
512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
512M-xfs Elapsed Time fsmark 56.08 48.90
512M-xfs Elapsed Time simple-wb 112.22 98.13
512M-xfs Elapsed Time mmap-strm 219.15 196.67
512M-xfs Kswapd efficiency fsmark 54% 56%
512M-xfs Kswapd efficiency simple-wb 54% 55%
512M-xfs Kswapd efficiency mmap-strm 45% 44%
512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
512M-4X-xfs Elapsed Time fsmark 63.26 55.88
512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
512M-4X-xfs Kswapd efficiency fsmark 49% 50%
512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
512M-16X-xfs Elapsed Time fsmark 67.47 58.25
512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
512M-16X-xfs Kswapd efficiency fsmark 45% 46%
512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely. The time to completion for all tests is improved which is an
important result. In general, kswapd efficiency is not affected by
skipping dirty pages.
1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
1024M1P-xfs Elapsed Time fsmark 84.14 80.41
1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
1024M1P-xfs Kswapd efficiency fsmark 69% 75%
1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
1024M-xfs Elapsed Time fsmark 94.59 91.00
1024M-xfs Elapsed Time simple-wb 229.84 195.08
1024M-xfs Elapsed Time mmap-strm 405.38 440.29
1024M-xfs Kswapd efficiency fsmark 79% 71%
1024M-xfs Kswapd efficiency simple-wb 74% 74%
1024M-xfs Kswapd efficiency mmap-strm 39% 42%
1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases.
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped. The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim. In other words, roughly the same number of pages were reclaimed
but swapping was slower. As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
4608M1P-xfs Elapsed Time fsmark 512.01 492.15
4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
4608M1P-xfs Kswapd efficiency fsmark 93% 86%
4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
4608M-xfs Elapsed Time fsmark 555.96 532.34
4608M-xfs Elapsed Time simple-wb 659.72 571.85
4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
4608M-xfs Kswapd efficiency fsmark 89% 91%
4608M-xfs Kswapd efficiency simple-wb 88% 82%
4608M-xfs Kswapd efficiency mmap-strm 48% 46%
4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well. For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced. KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable.
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher. However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar set of tests
using it as backing storage. A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
1024M-xfs Elapsed Time fsmark 2053.52 1641.03
1024M-xfs Elapsed Time simple-wb 1229.53 768.05
1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
1024M-xfs Kswapd efficiency fsmark 84% 85%
1024M-xfs Kswapd efficiency simple-wb 92% 81%
1024M-xfs Kswapd efficiency mmap-strm 60% 51%
1024M-xfs Avg wait ms fsmark 5404.53 4473.87
1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory. On the positive side, firefox launched
marginally faster with the patches applied. Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive. It was
also the case that "Avg wait ms" on the root filesystem was lower. I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems. First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex. Some
filesystems ignore write requests from direct reclaim as a result. The
second is that a single-page flush is inefficient in terms of IO. While
there is an expectation that the elevator will merge requests, this does
not always happen. Quoting Christoph Hellwig;
The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists. There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
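The decision described above can be sketched as a small helper. The enum,
struct and function names are hypothetical, and the is_kswapd flag stands
in for the "is current kswapd" check; this is a model of the policy, not
the patch's code.

```c
#include <stdbool.h>

/* What reclaim does with a dirty page it finds on the LRU. */
enum dirty_action {
	WRITE_PAGE,  /* issue writeback from reclaim */
	KEEP_ON_LRU, /* put the page back and leave it to the flushers */
};

/* Hypothetical summary of the page state that matters here. */
struct page_model {
	bool file_backed; /* false means anonymous, i.e. swap-backed */
};

/* Policy for a dirty page: only kswapd may write filesystem pages. */
static enum dirty_action dirty_page_action(const struct page_model *p,
					   bool is_kswapd)
{
	if (p->file_backed && !is_kswapd)
		return KEEP_ON_LRU;

	/* Anonymous pages have no flusher thread, so they are still written. */
	return WRITE_PAGE;
}
```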
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>