zonelists_mutex was introduced by commit 4eaf3f6439 ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed though
because both memory online and offline are fully serialized. New users
have grown since then.
Notably setup_per_zone_wmarks wants to prevent from races between memory
hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
(see cfd3da1e49 ("mm: Serialize access to min_free_kbytes"). Let's
add a private lock for that purpose. This will not prevent from seeing
halfway through memory hotplug operation but that shouldn't be a big
deal becuse memory hotplug will update watermarks explicitly so we will
eventually get a full picture. The lock just makes sure we won't race
when updating watermarks leading to weird results.
Also __build_all_zonelists manipulates global data so add a private lock
for it as well. This doesn't seem to be necessary today but it is more
robust to have a lock there.
While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.
Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Haicheng Li <haicheng.li@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
build_all_zonelists has been (ab)using stop_machine to make sure that
zonelists do not change while somebody is looking at them. This is is
just a gross hack because a) it complicates the context from which we
can call build_all_zonelists (see 3f906ba236 ("mm/memory-hotplug:
switch locking to a percpu rwsem")) and b) is is not really necessary
especially after "mm, page_alloc: simplify zonelist initialization" and
c) it doesn't really provide the protection it claims (see below).
Updates of the zonelists happen very seldom, basically only when a zone
becomes populated during memory online or when it loses all the memory
during offline. A racing iteration over zonelists could either miss a
zone or try to work on one zone twice. Both of these are something we
can live with occasionally because there will always be at least one
zone visible so we are not likely to fail allocation too easily for
example.
Please note that the original stop_machine approach doesn't really
provide a better exclusion because the iteration might be interrupted
half way (unless the whole iteration is preempt disabled which is not
the case in most cases) so the some zones could still be seen twice or a
zone missed.
I have run the pathological online/offline of the single memblock in the
movable zone while stressing the same small node with some memory
pressure.
Node 1, zone DMA
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 943, 943, 943)
Node 1, zone DMA32
pages free 227310
min 8294
low 10367
high 12440
spanned 262112
present 262112
managed 241436
protection: (0, 0, 0, 0)
Node 1, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 1024)
Node 1, zone Movable
pages free 32722
min 85
low 117
high 149
spanned 32768
present 32768
managed 32768
protection: (0, 0, 0, 0)
root@test1:/sys/devices/system/node/node1# while true
do
echo offline > memory34/state
echo online_movable > memory34/state
done
root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
and it survived without any unexpected behavior. While this is not
really a great testing coverage it should exercise the allocation path
quite a lot.
Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have
in each node we rely on the _zoneref to terminate initialized part of
the zonelist by a NULL zone. While this is functionally correct it is
quite suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.
We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.
While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns the
updated index for the next iteration. Let's rename the function to
build_zonerefs_node to better reflect its purpose and give it zoneref
array to update. The function doesn't the index anymore. It just
returns the number of added zones so that the caller can advance the
zonered array start for the next update.
This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.
Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists. So the additional call is redundant. Even though
hotadd_new_pgdat will only initialize zonelists of the new node this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.
Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
build_all_zonelists gets a zone parameter to initialize zone's pagesets.
There is only a single user which gives a non-NULL zone parameter and
that one doesn't really need the rest of the build_all_zonelists (see
commit 6dcd73d701 ("memory-hotplug: allocate zone's pcp before
onlining pages")).
Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly. This will also remove a pointless zonlists
rebuilding which is always good.
Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__build_all_zonelists reinitializes each online cpu local node for
CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
memory less nodes could gain some memory during memory hotplug and so
the local node should be changed for CPUs close to such a node. It
makes less sense to do that unconditionally for a newly creaded NUMA
node which is still offline and without any memory.
Let's also simplify the cpu loop and use for_each_online_cpu instead of
an explicit cpu_online check for all possible cpus.
Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
boot_pageset is a boot time hack which gets superseded by normal
pagesets later in the boot process. It makes zero sense to reinitialize
it again and again during memory hotplug.
Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanup zonelists initialization", v1.
This is aimed at cleaning up the zonelists initialization code we have
but the primary motivation was bug report [2] which got resolved but the
usage of stop_machine is just too ugly to live. Most patches are
straightforward but 3 of them need a special consideration.
Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept sysctl in place and warn into the log if somebody tries to
configure zone lists ordering. If somebody has a real usecase for it we
can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.
Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and updater which I _believe_
is acceptable as explained in the changelog. I hope I am not missing
anything.
Patch 8 then removes zonelists_mutex which is kind of ugly as well and
not really needed AFAICS but a care should be taken when double checking
my thinking.
This patch (of 9):
Supporting zone ordered zonelists costs us just a lot of code while the
usefulness is arguable if existent at all. Mel has already made node
ordering default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fallback to a
different NUMA node rather than consume precious lowmem zones.
This argument is, however, weaken by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means that
lowmem requests have to skip over all highmem pages on LRUs already and
so zone ordering doesn't save the reclaim time much. So the only
advantage of the zone ordering is under a light memory pressure when
highmem requests do not ever hit into lowmem zones and the lowmem
pressure doesn't need to reclaim.
Considering that 32b NUMA systems are rather suboptimal already and it
is generally advisable to use 64b kernel on such a HW I believe we
should rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if
somebody tries to set zone ordering either from kernel command line or
the sysctl.
[mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds document and kconfig for using of writeback feature.
Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables read IO from backing device. For the feature, it
implements two IO read functions to transfer data from backing storage.
One is asynchronous IO function and other is synchronous one.
A reason I need synchrnous IO is due to partial write which need to
complete read IO before the overwriting partial data.
We can make the partial IO's case asynchronous, too but at the moment, I
don't feel adding more complexity to support such rare use cases so want
to go with simple.
[xieyisheng1@huawei.com: read_from_bdev_async(): return 1 to avoid call page_endio() in zram_rw_page()]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables write IO to transfer data to backing device. For
that, it implements write_to_bdev function which creates new bio and
chaining with parent bio to make the parent bio asynchrnous.
For rw_page which don't have parent bio, it submit owned bio and handle
IO completion by zram_page_end_io.
Also, this patch defines new flag ZRAM_WB to mark written page for later
read IO.
[xieyisheng1@huawei.com: fix typo in comment]
Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For upcoming asynchronous IO like writeback, zram_rw_page should be
aware of that whether requested IO was completed or submitted
successfully, otherwise error.
For the goal, zram_bvec_rw has three return values.
-errno: returns error number
0: IO request is done synchronously
1: IO request is issued successfully.
Link: http://lkml.kernel.org/r/1498459987-24562-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With backing device, zram needs management of free space of backing
device.
This patch adds bitmap logic to manage free space which is very naive.
However, it would be simple enough as considering uncompressible pages's
frequenty in zram.
Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For writeback feature, user should set up backing device before the zram
working.
This patch enables the interface via /sys/block/zramX/backing_dev.
Currently, it supports block device only but it could be enhanced for
file as well.
Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_decompress_page naming is not proper because it doesn't decompress
if page was dedup hit or stored with compression.
Use more abstract term and consistent with write path function
__zram_bvec_write.
Link: http://lkml.kernel.org/r/1498459987-24562-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_compress does several things, compress, entry alloc and check
limitation. I did for just readbility but it hurts modulization.:(
So this patch removes zram_compress functions and inline it in
__zram_bvec_write for upcoming patches.
Link: http://lkml.kernel.org/r/1498459987-24562-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "writeback incompressible pages to storage", v1.
zRam is useful for memory saving with compressible pages but sometime,
workload can be changed and system has lots of incompressible pages
which is very harmful for zram.
This patch supports writeback feature of zram so admin can set up a
block device and with it, zram can save the memory via writing out the
incompressile pages once it found it's incompressible pages (1/4 comp
ratio) instead of keeping the page in memory.
[1-3] is just clean up and [4-8] is step by step feature enablement.
[4-8] is logically not bisectable(ie, logical unit separation)
although I tried to compiled out without breaking but I think it would
be better to review.
This patch (of 9):
__zram_bvec_write has some of duplicated logic for zram meta data
handling of same_page|compressed_page. This patch aims to clean it up
without behavior change.
[xieyisheng1@huawei.com: fix compr_data_size stat]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/1498459987-24562-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Juneho Choi <juno.choi@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically we have enforced that any kernel zone (e.g ZONE_NORMAL) has
to precede the Movable zone in the physical memory range. The purpose
of the movable zone is, however, not bound to any physical memory
restriction. It merely defines a class of migrateable and reclaimable
memory.
There are users (e.g. CMA) who might want to reserve specific physical
memory ranges for their own purpose. Moreover our pfn walkers have to
be prepared for zones overlapping in the physical range already because
we do support interleaving NUMA nodes and therefore zones can interleave
as well. This means we can allow each memory block to be associated
with a different zone.
Loosen the current onlining semantic and allow explicit onlining type on
any memblock. That means that online_{kernel,movable} will be allowed
regardless of the physical address of the memblock as long as it is
offline of course. This might result in moveble zone overlapping with
other kernel zones. Default onlining then becomes a bit tricky but
still sensible. echo online > memoryXY/state will online the given
block to
1) the default zone if the given range is outside of any zone
2) the enclosing zone if such a zone doesn't interleave with
any other zone
3) the default zone if more zones interleave for this range
where default zone is movable zone only if movable_node is enabled
otherwise it is a kernel zone.
Here is an example of the semantic with (movable_node is not present but
it work in an analogous way). We start with following memblocks, all of
them offline:
memory34/valid_zones:Normal Movable
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Normal Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal Movable
memory40/valid_zones:Normal Movable
memory41/valid_zones:Normal Movable
Now, we online block 34 in default mode and block 37 as movable
root@test1:/sys/devices/system/node/node1# echo online > memory34/state
root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal Movable
memory40/valid_zones:Normal Movable
memory41/valid_zones:Normal Movable
As we can see all other blocks can still be onlined both into Normal and
Movable zones and the Normal is default because the Movable zone spans
only block37 now.
root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Movable Normal
memory39/valid_zones:Movable Normal
memory40/valid_zones:Movable Normal
memory41/valid_zones:Movable
Now the default zone for blocks 37-41 has changed because movable zone
spans that range.
root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal
memory40/valid_zones:Movable Normal
memory41/valid_zones:Movable
Note that the block 39 now belongs to the zone Normal and so block38
falls into Normal by default as well.
For completness
root@test1:/sys/devices/system/node/node1# for i in memory[34]?
do
echo online > $i/state 2>/dev/null
done
memory34/valid_zones:Normal
memory35/valid_zones:Normal
memory36/valid_zones:Normal
memory37/valid_zones:Movable
memory38/valid_zones:Normal
memory39/valid_zones:Normal
memory40/valid_zones:Movable
memory41/valid_zones:Movable
Implementation wise the change is quite straightforward. We can get rid
of allow_online_pfn_range altogether. online_pages allows only offline
nodes already. The original default_zone_for_pfn will become
default_kernel_zone_for_pfn. New default_zone_for_pfn implements the
above semantic. zone_for_pfn_range is slightly reorganized to implement
kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
catch all default behavior.
Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kani Toshimitsu <toshi.kani@hpe.com>
Cc: <slaoub@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") we used to allow to change the
valid zone types of a memory block if it is adjacent to a different zone
type.
This fact was reflected in memoryNN/valid_zones by the ordering of
printed zones. The first one was default (echo online > memoryNN/state)
and the other one could be onlined explicitly by online_{movable,kernel}.
This behavior was removed by the said patch and as such the ordering was
not all that important. In most cases a kernel zone would be default
anyway. The only exception is movable_node handled by "mm,
memory_hotplug: support movable_node for hotpluggable nodes".
Let's reintroduce this behavior again because later patch will remove
the zone overlap restriction and so user will be allowed to online
kernel resp. movable block regardless of its placement. Original
behavior will then become significant again because it would be
non-trivial for users to see what is the default zone to online into.
Implementation is really simple. Pull out zone selection out of
move_pfn_range into zone_for_pfn_range helper and use it in
show_valid_zones to display the zone for default onlining and then both
kernel and movable if they are allowed. Default online zone is not
duplicated.
Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kani Toshimitsu <toshi.kani@hpe.com>
Cc: <slaoub@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9adb62a5df ("mm/hotplug: correctly setup fallback zonelists
when creating new pgdat") tries to build the correct zonelist for a
newly added node, while it is not necessary to rebuild it for already
exist nodes.
In build_zonelists(), it will iterate on nodes with memory. For a newly
added node, it will have memory until node_states_set_node() is called
in online_pages().
This patch avoids rebuilding the zonelists for already existing nodes.
build_zonelists_node() uses managed_zone(zone) checks, so it should not
include empty zones anyway. So effectively we avoid some pointless work
under stop_machine().
[akpm@linux-foundation.org: tweak comment text]
[akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_slab() allows us to report back the number of objects we
successfully scanned (out of the target shrinkctl->nr_to_scan). As
report the number of pages owned by each GEM object as a separate item
to the shrinker, we cannot precisely control the number of shrinker
objects we scan on each pass; and indeed may free more than requested.
If we fail to tell the shrinker about the number of objects we process,
it will continue to hold a grudge against us as any objects left
unscanned are added to the next reclaim -- and so we will keep on
"unfairly" shrinking our own slab in comparison to other slabs.
Link: http://lkml.kernel.org/r/20170822135325.9191-2-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some shrinkers may only be able to free a bunch of objects at a time,
and so free more than the requested nr_to_scan in one pass.
Whilst other shrinkers may find themselves even unable to scan as many
objects as they counted, and so underreport. Account for the extra
freed/scanned objects against the total number of objects we intend to
scan, otherwise we may end up penalising the slab far more than
intended. Similarly, we want to add the underperforming scan to the
deferred pass so that we try harder and harder in future passes.
Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an assertion similar to "fasttop" check in GNU C Library allocator
as a part of SLAB_FREELIST_HARDENED feature. An object added to a
singly linked freelist should not point to itself. That helps to detect
some double free errors (e.g. CVE-2017-2636) without slub_debug and
KASAN.
Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Paul E McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tycho Andersen <tycho@docker.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This SLUB free list pointer obfuscation code is modified from Brad
Spengler/PaX Team's code in the last public patch of grsecurity/PaX
based on my understanding of the code. Changes or omissions from the
original code are mine and don't reflect the original grsecurity/PaX
code.
This adds a per-cache random value to SLUB caches that is XORed with
their freelist pointer address and value. This adds nearly zero
overhead and frustrates the very common heap overflow exploitation
method of overwriting freelist pointers.
A recent example of the attack is written up here:
http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
and there is a section dedicated to the technique the book "A Guide to
Kernel Exploitation: Attacking the Core".
This is based on patches by Daniel Micay, and refactored to minimize the
use of #ifdef.
With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
run times:
before:
mean 10.11882499999999999995
variance .03320378329145728642
stdev .18221905304181911048
after:
mean 10.12654000000000000014
variance .04700556623115577889
stdev .21680767106160192064
The difference gets lost in the noise, but if the above is to be taken
literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.
Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Daniel Micay <danielmicay@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tycho Andersen <tycho@docker.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- free_kmem_cache_nodes() frees the cache node before nulling out a
reference to it
- init_kmem_cache_nodes() publishes the cache node before initializing
it
Neither of these matter at runtime because the cache nodes cannot be
looked up by any other thread. But it's neater and more consistent to
reorder these.
Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clean up some unused functions and parameters.
Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function is never called outside of fs/ocfs2/acl.c.
Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is code duplication between sec_name() and sech_name(). Simplify
sec_name() by re-using sech_name(). Also, move them up to remove the
forward declaration of sec_name().
Link: http://lkml.kernel.org/r/1502248721-22009-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dax_pmd_insert_mapping() contains the following code:
pfn_t pfn;
if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
goto fallback;
/* ... */
fallback:
trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
When the condition in the if statement fails, the function calls
trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.
This issue has been found while building the kernel with clang. The
compiler reported:
fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/dax.c:1310:60: note: uninitialized use occurs here
trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
^~~
Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
equivalent page offset mask.
Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a comment explaining how the user addresses provided to read(2) and
write(2) are validated in the DAX I/O path.
We call dax_copy_from_iter() or copy_to_iter() on these without calling
access_ok() first in the DAX code, and there was a concern that the user
might be able to read/write to arbitrary kernel addresses with this
path.
Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we no longer insert struct page pointers in DAX radix trees the
page cache code no longer needs to know anything about DAX exceptional
entries. Move all the DAX exceptional entry definitions from dax.h to
fs/dax.c.
Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we no longer insert struct page pointers in DAX radix trees we
can remove the special casing for DAX in page_cache_tree_insert().
This also allows us to make dax_wake_mapping_entry_waiter() local to
fs/dax.c, removing it from dax.h.
Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.
This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory. This is easily visible by looking at the overall
memory consumption of the system or by looking at /proc/[pid]/smaps:
7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 1048576 kB
Pss: 1048576 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 1048576 kB
Private_Dirty: 0 kB
Referenced: 1048576 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it. Here
are the average latencies of dax_load_hole() as measured by ftrace on
a random test box:
Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us
This was the average latency over 1 GiB of sequential reads done by
this simple fio script:
[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap
3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.
Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead. As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.
Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page. If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.
This solution also removes the extra memory consumption. Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:
7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Overall system memory consumption is similarly improved.
Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable. The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:
"To be able to use the common 4k zero page in DAX we need to have our
PTE fault path look more like our PMD fault path where a PTE entry
can be marked as dirty and writeable as it is first inserted rather
than waiting for a follow-up dax_pfn_mkwrite() =>
finish_mkwrite_fault() call.
Right now we can rely on having a dax_pfn_mkwrite() call because we
can distinguish between these two cases in do_wp_page():
case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage
This distinction is made by via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does
for DAX ptes. Instead of special casing the DAX + 4k zero page case
we will simplify our DAX PTE page fault sequence so that it matches
our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
We will instead use dax_iomap_fault() to handle write-protection
faults.
This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
'mkwrite' is set insert_pfn() will do the work that was previously
done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
needs to be moved lower in dax.c so the definition exists.
dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
made static to dax.c, so we need to move its definition above all its
callers.
Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree. This has three major
drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory.
2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.
This series solves these issues by following the lead of the DAX PMD
code and using a common 4k zero page instead. This reduces memory usage
and decreases latencies for some workloads, and it simplifies the DAX
code, removing over 100 lines in total.
This patch (of 5):
To be able to use the common 4k zero page in DAX we need to have our PTE
fault path look more like our PMD fault path where a PTE entry can be
marked as dirty and writeable as it is first inserted rather than
waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
call.
Right now we can rely on having a dax_pfn_mkwrite() call because we can
distinguish between these two cases in do_wp_page():
case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage
This distinction is made by via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does for
DAX ptes. Instead of special casing the DAX + 4k zero page case we will
simplify our DAX PTE page fault sequence so that it matches our DAX PMD
sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead
use dax_iomap_fault() to handle write-protection faults.
This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite'
is set insert_pfn() will do the work that was previously done by
wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a7be6e5a7f ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().
The parent_node() macro in METAG architecture is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/1501076076-1974-4-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Support ipv6 checksum offload in sunvnet driver, from Shannon
Nelson.
2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
Dumazet.
3) Allow generic XDP to work on virtual devices, from John Fastabend.
4) Add bpf device maps and XDP_REDIRECT, which can be used to build
arbitrary switching frameworks using XDP. From John Fastabend.
5) Remove UFO offloads from the tree, gave us little other than bugs.
6) Remove the IPSEC flow cache, from Florian Westphal.
7) Support ipv6 route offload in mlxsw driver.
8) Support VF representors in bnxt_en, from Sathya Perla.
9) Add support for forward error correction modes to ethtool, from
Vidya Sagar Ravipati.
10) Add time filter for packet scheduler action dumping, from Jamal Hadi
Salim.
11) Extend the zerocopy sendmsg() used by virtio and tap to regular
sockets via MSG_ZEROCOPY. From Willem de Bruijn.
12) Significantly rework value tracking in the BPF verifier, from Edward
Cree.
13) Add new jump instructions to eBPF, from Daniel Borkmann.
14) Rework rtnetlink plumbing so that operations can be run without
taking the RTNL semaphore. From Florian Westphal.
15) Support XDP in tap driver, from Jason Wang.
16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.
17) Add Huawei hinic ethernet driver.
18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
Delalande.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
i40e: avoid NVM acquire deadlock during NVM update
drivers: net: xgene: Remove return statement from void function
drivers: net: xgene: Configure tx/rx delay for ACPI
drivers: net: xgene: Read tx/rx delay for ACPI
rocker: fix kcalloc parameter order
rds: Fix non-atomic operation on shared flag variable
net: sched: don't use GFP_KERNEL under spin lock
vhost_net: correctly check tx avail during rx busy polling
net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
rxrpc: Make service connection lookup always check for retry
net: stmmac: Delete dead code for MDIO registration
gianfar: Fix Tx flow control deactivation
cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
cxgb4: Fix pause frame count in t4_get_port_stats
cxgb4: fix memory leak
tun: rename generic_xdp to skb_xdp
tun: reserve extra headroom only when XDP is set
net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
net: dsa: bcm_sf2: Advertise number of egress queues
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
UuF6eSWEC5O1UCRUk1A+
=LLY3
-----END PGP SIGNATURE-----
Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull writeback error handling updates from Jeff Layton:
"This pile continues the work from last cycle on better tracking
writeback errors. In v4.13 we added some basic errseq_t infrastructure
and converted a few filesystems to use it.
This set continues refining that infrastructure, adds documentation,
and converts most of the other filesystems to use it. The main
exception at this point is the NFS client"
* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
ecryptfs: convert to file_write_and_wait in ->fsync
mm: remove optimizations based on i_size in mapping writeback waits
fs: convert a pile of fsync routines to errseq_t based reporting
gfs2: convert to errseq_t based writeback error reporting for fsync
fs: convert sync_file_range to use errseq_t based error-tracking
mm: add file_fdatawait_range and file_write_and_wait
fuse: convert to errseq_t based error tracking for fsync
mm: consolidate dax / non-dax checks for writeback
Documentation: add some docs for errseq_t
errseq: rename __errseq_set to errseq_set
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrTzyAAoJEAAOaEEZVoIVj8wP/0sOxG+7vEEpe4uj2W52aq9T
Y39/ZLfRTLm9SqgH61lkN+IyUsvDx+IP1ws2LBhp0IDRD9m40wdILhHZRWXJcRW2
ApEfmXF+rxnZZ6725ixX9w4Ylab2ZeGmKbzaG4wIjxfddftewZkJvFQcb1LZDfWq
1N0SF4KWoWN6t26Du5CHmYSj/Sz6YGrWGhF22u3mNfkGL+MmuKbz+kB3W+0q2NUF
ZjkOIH9WcRiXgSlcHPBLre2EKHqHaNgb0s4Iofd3ZEe50v1NwY/vBMefxuwRdgKS
kpLhIKIYMawrHn2rpV0jm12qdgCYj+t2kbVIUBDn3unBP2zYA0e/oo5HNIrroVlk
Q6aGwmW0LN60rpd5qcRuNS1p1h2id2HpxEe98dsski6T8CVnj/nvu7EIxmWM02cf
g2HeOd7bnl3+uu7SwSTkOVb6G7Kjn+Xufiz/n11mK6fl2jvOyWZZmDqhhjWAYJ8r
t5mQVWJdEV12+6+A1WSv9DeS3TUgdYPCF8dzDtF+JVn3WEmxYHywH36Y3hKKz+BA
gFEhnHvlyaVvpXCr8Y5BqNSfEfvZe/YUnmVReHpgBU/U4GJ17iQYk/g2vfmPLmsN
IZ2OGCrDUc/LfdWc4llRyQBvlGT1KujaT0tbN7xnuWcS2qWdsfX4jDtDUH9E6pvK
TB6Sw4Ike0ixamG8N8q/
=VPMU
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"This pile just has a few file locking fixes from Ben Coddington. There
are a couple of cleanup patches + an attempt to bring sanity to the
l_pid value that is reported back to userland on an F_GETLK request.
After a few gyrations, he came up with a way for filesystems to
communicate to the VFS layer code whether the pid should be translated
according to the namespace or presented as-is to userland"
* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: restore a warn for leaked locks on close
fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
fs/locks: Use allocation rather than the stack in fcntl_getlk()
This set includes a bunch of minor code cleanups that
have accumulated, probably from code analyzers people
like to run. There is one nice fix that avoids some
socket leaks by switching to use sock_create_lite().
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrtP1AAoJEDgbc8f8gGmqH3YQALZouj0tzatxHfWlGcMHoufL
M2wmQragG4qOI1w9vNKmo6GctQ+Teqholnt2gRHputxNLzPUXPzNgAR1/O7O8741
TCB/fhR16KqMMP4bTa2GQ73WVFhohkh8xSvtcWkhCqC+Ti/qx2FN7LZ7Mxn6Muje
IC7E+Oy2Xr64lUb1CsfpXXel8vs+ujoMIAZiU4P/PgCzYX5FvaFWJ9VCwgYzfIuN
zj2O1txau5xW2fZmD5GRmgWY/g5wCPcPxwCdZacqrL7yNiU1wsrhYFds0AiGSPJC
D/wMX9a0GN28L+zW0eLEVI+lIk8f5Az+DOrw5UUFNwDd4ejDWaS0dtMNThYu6VvD
x6+JZhgZHcj3Df/s4PMZvPkCx+8ZeRGK9RK+jlkEVfO8aIE39gi6mC+EuTJmZe/m
PAB7O2OG0FTUPoY+t/5wKaz1g6qSHQ2fQZb8rAMoUFWwFJWXp3q7/tlZN4dlwIDI
2yp9UN09ug3tICcne/gvmJ5x8lVN3Eh6XHkbO1qedsv45SYKdOwPmvyp2XTZooJK
kg827Z+deRmvTfX3gzEsEO1caabtYDOrZ23RHJxqViNdZbMn3Tifc2pBZCIJHfmu
4My+Midl38Ch9SUx30ePwjyJ9+Ptsm7KiSOIvrtRZV/1bPEqRP03suxEZvmCA/z4
elMj1gKykj/GHXZ+cHlX
=lGzS
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a bunch of minor code cleanups that have
accumulated, probably from code analyzers people like to run. There is
one nice fix that avoids some socket leaks by switching to use
sock_create_lite()"
* tag 'dlm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: use sock_create_lite inside tcp_accept_from_sock
uapi linux/dlm_netlink.h: include linux/dlmconstants.h
dlm: avoid double-free on error path in dlm_device_{register,unregister}
dlm: constify kset_uevent_ops structure
dlm: print log message when cluster name is not set
dlm: Delete an unnecessary variable initialisation in dlm_ls_start()
dlm: Improve a size determination in two functions
dlm: Use kcalloc() in two functions
dlm: Use kmalloc_array() in make_member_array()
dlm: Delete an error message for a failed memory allocation in dlm_recover_waiters_pre()
dlm: Improve a size determination in dlm_recover_waiters_pre()
dlm: Use kcalloc() in dlm_scan_waiters()
dlm: Improve a size determination in table_seq_start()
dlm: Add spaces for better code readability
dlm: Replace six seq_puts() calls by seq_putc()
dlm: Make dismatch error message more clear
dlm: Fix kernel memory disclosure
miscellaneous bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlms4KwACgkQ8vlZVpUN
gaMsNwf9EDIaB7uMAFIRw8TzszbYKE3K3T412s18zce+kYea6wZPkQavWH/qMYgU
r8jmXfDi3KZJJBI8fFW4qh36qs/fJoOeQOD3BTHcczEJqaaOiahaxfSylTwezDfw
fIkEMCfBj6Vyldo0aKrtM4iU07Njj7QmYBtsiJo1kpyAuZ7wuoaiyizCLRb0fhLB
AnBOs2ur9fQvn954M03tJIKpxFgmbpofM7yMtJYpW9dHCCWe2G+sIdBy/W6vVTxt
sJIzUGyyEFs9Hr8U8THbuo5XgScFh+NEOkj/cf60t5ZYxwZzJvY7+Oj7BQwxEM1p
Efcjz2kRLU8qSYHOHxCar0D3MW+oNQ==
=E7gL
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Scalability improvements when allocating inodes, and some
miscellaneous bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid Y2038 overflow in recently_deleted()
ext4: fix fault handling when mounted with -o dax,ro
ext4: fix quota inconsistency during orphan cleanup for read-only mounts
ext4: fix incorrect quotaoff if the quota feature is enabled
ext4: remove useless test and assignment in strtohash functions
ext4: backward compatibility support for Lustre ea_inode implementation
ext4: remove timebomb in ext4_decode_extra_time()
ext4: use sizeof(*ptr)
ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets
ext4: reduce lock contention in __ext4_new_inode
ext4: cleanup goto next group
ext4: do not unnecessarily allocate buffer in recently_deleted()
- Write unmount record for a ro mount to avoid unnecessary log replay
- Clean up orphaned inodes when mounting fs readonly
- Resubmit inode log items when buffer writeback fails to avoid umount hang
- Fix log recovery corruption problems when log headers wrap around the end
- Avoid infinite loop searching for free inodes when inode counters are wrong
- Evict inodes involved with log redo so that we don't leak them later
- Fix a potential race between reclaim and inode cluster freeing
- Refactor the inode joining code w.r.t. transaction rolling & deferred ops
- Fix a bug where the log doesn't properly deal with dirty buffers that
are about to become ordered buffers
- Fix the extent swap code to deal with making dirty buffers ordered properly
- Consolidate page fault handlers
- Refactor the incore extent manipulation functions to use the iext
abstractions instead of directly modifying with extent data
- Disable crashy chattr +/-x until we fix it
- Don't allow us to set S_DAX for v2 inodes
- Various cleanups
- Clarify some documentation
- Fix a problem where fsync and a log commit race to send the disk a
flush command, resulting in a small window where power fail data loss
could occur
- Simplify some rmap operations in the fcollapse code
- Fix some use-after-free problems in async writeback
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZrEAQAAoJEPh/dxk0SrTrxKEP/3y8sLWdy4fUdPpVkwZteXwc
zGyYaLrmKRc5i6abBNtLCZoRJGfRdvVyPrhQ1q3mt8H//xuURgqgFFyjj3wAdsLf
sDejIHhdsc8/VcuLLtCW3rEYg58hJ89hW7d1InCP0tvqWmljh9svhzXebtwUvNNF
/2fHIUXUiAxLbgjv/N2i/smlLl0zdx6C2x1TlJmfwer0UMTAnlmbFWxCqmtUZwSl
QSuGgn1wo3dkId9aFoNwQmSCFeYcxQlpaInJEzUiVQOA4dbphXHO9Bsx0eOkpDuz
39waaX0fld8LEfIQGmUQ995UkAwfk/asjgDSApyXdkMayNWhi0KpRl1zXgCb8BbL
m7vYJhIfJ399+jbNPe1+htn3I16AmpvAai9MNJidFclWwqFEuQEnxZccdtTIAiRv
XuYiq9hN2NOwlwPUYfrZxfx34fdocRyHmGVs3i7P3/qPWd5Hx6+FpQTOngciS7MN
6xnM8PbnrLadw3ooMDEKgWsN805BQALiwzDRggoAXG1Pm2SqFnLD/dAR4c7R3nR8
vvYlfGHnd38aMlW73IALkkGJqZy/bHPFhrbvpjXyIG6SYwCjrWrO0chM0O8MCRrF
MIW3rM5hYIE8aCkpJ2mxvcQalmSAlSPVKlmgvSK4S1Sz4kcywxskNhch8uNkb5uy
WUHhrJz+wBjdjrDOU3aL
=jBdo
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
"Here are the changes for xfs for 4.14. Most of these are cleanups and
fixes for bad behavior, as we're mostly focusing on improving
reliablity this cycle (read: there's potentially a lot of stuff on the
horizon for 4.15 so better to spend a few weeks killing other bugs
now).
Summary:
- Write unmount record for a ro mount to avoid unnecessary log replay
- Clean up orphaned inodes when mounting fs readonly
- Resubmit inode log items when buffer writeback fails to avoid
umount hang
- Fix log recovery corruption problems when log headers wrap around
the end
- Avoid infinite loop searching for free inodes when inode counters
are wrong
- Evict inodes involved with log redo so that we don't leak them
later
- Fix a potential race between reclaim and inode cluster freeing
- Refactor the inode joining code w.r.t. transaction rolling &
deferred ops
- Fix a bug where the log doesn't properly deal with dirty buffers
that are about to become ordered buffers
- Fix the extent swap code to deal with making dirty buffers ordered
properly
- Consolidate page fault handlers
- Refactor the incore extent manipulation functions to use the iext
abstractions instead of directly modifying with extent data
- Disable crashy chattr +/-x until we fix it
- Don't allow us to set S_DAX for v2 inodes
- Various cleanups
- Clarify some documentation
- Fix a problem where fsync and a log commit race to send the disk a
flush command, resulting in a small window where power fail data
loss could occur
- Simplify some rmap operations in the fcollapse code
- Fix some use-after-free problems in async writeback"
* tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
xfs: use kmem_free to free return value of kmem_zalloc
xfs: open code end_buffer_async_write in xfs_finish_page_writeback
xfs: don't set v3 xflags for v2 inodes
xfs: fix compiler warnings
fsmap: fix documentation of FMR_OF_LAST
xfs: simplify the rmap code in xfs_bmse_merge
xfs: remove unused flags arg from xfs_file_iomap_begin_delay
xfs: fix incorrect log_flushed on fsync
xfs: disable per-inode DAX flag
xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
xfs: move some code around inside xfs_bmap_shift_extents
xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
xfs: add a xfs_iext_update_extent helper
xfs: consolidate the various page fault handlers
iomap: return VM_FAULT_* codes from iomap_page_mkwrite
xfs: relog dirty buffers during swapext bmbt owner change
...
because we held some back from the previous merge window until we
could get them perfected and well tested. We have a couple patch
sets, including my patch set for protecting glock gl_object and
Andreas Gruenbacher's patch set to fix the long-standing shrink-
slab hang, plus a bunch of assorted bugs and cleanups:
1. I fixed a bug whereby an IO error would lead to a double-brelse.
2. Andreas Gruenbacher made a minor cleanup to call his relatively
new function, gfs2_holder_initialized, rather than doing it
manually. This was just missed by a previous patch set.
3. Jan Kara fixed a bug whereby the SGID was being cleared when
inheriting ACLs.
4. Andreas found a bug and fixed it in his previous patch,
"Get rid of flush_delayed_work in gfs2_evict_inode". A call to
flush_delayed_work was deleted from *gfs2_inode_lookup and added
to gfs2_create_inode.
5. Wang Xibo found and fixed a list_add call in inode_go_lock
that specified the parameters in the wrong order.
6. Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
metadata reads that were accidentally missing them.
7 - 10. I submitted a 4-patch set to protect the glock gl_object
field. GFS2 was setting and checking gl_object with no locking
mechanism, so the value was occasionally stomped on, which caused
file system corruption.
11. I submitted a small cleanup to function gfs2_clear_rgrpd.
It was needlessly adding rgrp glocks to the lru list, then pulling
them back off immediately. The rgrp glocks don't use the lru list
anyway, so doing so was just a waste of time.
12. I submitted a patch that checks the GLOF_LRU flag on a glock
before trying to remove it from the lru_list. This avoids a lot
of unnecessary spin_lock contention.
13. I submitted a patch to delete GFS2's debugfs files only after
we evict all the glocks. Before this patch, GFS2 would delete the
debugfs files, and if unmount hung waiting for a glock, there was
no way to debug the problem. Now, if a hang occurs during umount,
we can examine the debugfs files to figure out why it's hung.
14. Andreas Gruenbacher submitted a patch to fix some trivial typos.
15 - 19. Andreas also submitted a five-part patch set to fix the
longstanding hang involving the slab shrinker: dlm requires
memory, calls the inode shrinker, which calls gfs2's evict, which
calls back into DLM before it can evict an inode.
20. Abhi Das submitted a patch to forcibly flush the active items
list to relieve memory pressure. This fixes a long-standing bug
whereby GFS2 was getting hung permanently in balance_dirty_pages.
21. Thomas Tai submitted a patch to fix a slab corruption problem
due to a residual pointer left in the lock_dlm lockstruct.
22. I submitted a patch to withdraw the file system if IO errors
are encountered while writing to the journals or statfs system
file which were previously not being sent back up. Before, some
IO errors were sometimes not be detected for several hours, and
at recovery time, the journal errors made journal replay
impossible.
23. Andreas has a patch to fix an annoying format-truncation compiler
warning so GFS2 compiles cleanly.
24. I have a patch that fixes a handful of sparse compiler warnings.
25. Andreas fixed up an useless gl_object warning caused by an
earlier patch.
26. Arvind Yadav added a patch to properly constify our rhashtable
params declare.
27. I added a patch to fix a regression caused by the non-recursive
delete and truncate patch that caused file system blocks to not
be properly freed.
28. Ernesto A. Fernández added a patch to fix a place where GFS2
would send back the wrong return code setting extended attributes.
29. Ernesto also added a patch to fix a case in which GFS2 was
improperly setting an inode's i_mode, potentially granting access
to the wrong users.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZrMC2AAoJENeLYdPf93o7PIQIAKY4hdC2pMM5tiiIHx5fPAAr
tjpVuFkDQzyEaTb9sArVLxEdva3ShKERQKoYq/VVxqbAEwPgXbzJFNNil1WTJi1t
J2gE4wE4G5x1+A7XDzCdPI8KAcF+yX63AaFYlVKyuZSq5w7njIRc1Vk+TFiIexxC
xb0nP0g9L6Zt114rE8kfi0/GLjTO9vOKM3XsJgG612I3/cs3RUx4gJ+nSUG0bYLA
qoBIXEJ3SFHw2Zr/LgHZ9QDHnlPVl3bjg03sRQaWZms7XbLegDBYsDSvS1HLZ300
gjTc0Dgz/6KwzDVJ7cZ/fPNYtIFY58tKs6aqqDTrCncsX9nPjcTAxYkBNWsFyZM=
=tXJ8
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got a whopping 29 GFS2 patches for this merge window, mainly
because we held some back from the previous merge window until we
could get them perfected and well tested. We have a couple patch sets,
including my patch set for protecting glock gl_object and Andreas
Gruenbacher's patch set to fix the long-standing shrink- slab hang,
plus a bunch of assorted bugs and cleanups.
Summary:
- I fixed a bug whereby an IO error would lead to a double-brelse.
- Andreas Gruenbacher made a minor cleanup to call his relatively new
function, gfs2_holder_initialized, rather than doing it manually.
This was just missed by a previous patch set.
- Jan Kara fixed a bug whereby the SGID was being cleared when
inheriting ACLs.
- Andreas found a bug and fixed it in his previous patch, "Get rid of
flush_delayed_work in gfs2_evict_inode". A call to
flush_delayed_work was deleted from *gfs2_inode_lookup and added to
gfs2_create_inode.
- Wang Xibo found and fixed a list_add call in inode_go_lock that
specified the parameters in the wrong order.
- Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
metadata reads that were accidentally missing them.
- I submitted a 4-patch set to protect the glock gl_object field.
GFS2 was setting and checking gl_object with no locking mechanism,
so the value was occasionally stomped on, which caused file system
corruption.
- I submitted a small cleanup to function gfs2_clear_rgrpd. It was
needlessly adding rgrp glocks to the lru list, then pulling them
back off immediately. The rgrp glocks don't use the lru list
anyway, so doing so was just a waste of time.
- I submitted a patch that checks the GLOF_LRU flag on a glock before
trying to remove it from the lru_list. This avoids a lot of
unnecessary spin_lock contention.
- I submitted a patch to delete GFS2's debugfs files only after we
evict all the glocks. Before this patch, GFS2 would delete the
debugfs files, and if unmount hung waiting for a glock, there was
no way to debug the problem. Now, if a hang occurs during umount,
we can examine the debugfs files to figure out why it's hung.
- Andreas Gruenbacher submitted a patch to fix some trivial typos.
- Andreas also submitted a five-part patch set to fix the
longstanding hang involving the slab shrinker: dlm requires memory,
calls the inode shrinker, which calls gfs2's evict, which calls
back into DLM before it can evict an inode.
- Abhi Das submitted a patch to forcibly flush the active items list
to relieve memory pressure. This fixes a long-standing bug whereby
GFS2 was getting hung permanently in balance_dirty_pages.
- Thomas Tai submitted a patch to fix a slab corruption problem due
to a residual pointer left in the lock_dlm lockstruct.
- I submitted a patch to withdraw the file system if IO errors are
encountered while writing to the journals or statfs system file
which were previously not being sent back up. Before, some IO
errors were sometimes not be detected for several hours, and at
recovery time, the journal errors made journal replay impossible.
- Andreas has a patch to fix an annoying format-truncation compiler
warning so GFS2 compiles cleanly.
- I have a patch that fixes a handful of sparse compiler warnings.
- Andreas fixed up an useless gl_object warning caused by an earlier
patch.
- Arvind Yadav added a patch to properly constify our rhashtable
params declare.
- I added a patch to fix a regression caused by the non-recursive
delete and truncate patch that caused file system blocks to not be
properly freed.
- Ernesto A. Fernández added a patch to fix a place where GFS2 would
send back the wrong return code setting extended attributes.
- Ernesto also added a patch to fix a case in which GFS2 was
improperly setting an inode's i_mode, potentially granting access
to the wrong users"
* tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (29 commits)
gfs2: preserve i_mode if __gfs2_set_acl() fails
gfs2: don't return ENODATA in __gfs2_xattr_set unless replacing
GFS2: Fix non-recursive truncate bug
gfs2: constify rhashtable_params
GFS2: Fix gl_object warnings
GFS2: Fix up some sparse warnings
gfs2: Silence gcc format-truncation warning
GFS2: Withdraw for IO errors writing to the journal or statfs
gfs2: fix slab corruption during mounting and umounting gfs file system
gfs2: forcibly flush ail to relieve memory pressure
gfs2: Clean up waiting on glocks
gfs2: Defer deleting inodes under memory pressure
gfs2: gfs2_evict_inode: Put glocks asynchronously
gfs2: Get rid of gfs2_set_nlink
gfs2: gfs2_glock_get: Wait on freeing glocks
gfs2: Fix trivial typos
GFS2: Delete debugfs files only after we evict the glocks
GFS2: Don't waste time locking lru_lock for non-lru glocks
GFS2: Don't bother trying to add rgrps to the lru list
GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode
...
Ville Syrjälä spotted that PGETBL_CTL was losing its enable bit upon a
reset. That was causing the display to show garbage on his 945gm. On my
i915gm the effect was far more severe; re-enabling the display following
the reset without PGETBL_CTL being enabled lead to an immediate hard
hang.
We do have a routine to re-enable PGETBL_CTL which is applicable to
gen2-4, although on gen4 it is documented that a graphics reset doesn't
alter the register (no such wording is given for gen3) and should be safe
to call to punch back in the enable bit. However, that leaves the question
of whether we need to completely re-initialise the register and the
rest of the GSM. For g33/pnv/gen4+, where we do have a configurable
page table, its contents do seem to be kept, and so we should be able to
recover without having to reinitialise the GTT from scratch (as prior to
g33, that register is configured by the BIOS and we leave alone except
for the enable bit).
This appears to have been broken by commit 5fbd0418ee ("drm/i915:
Re-enable GGTT earlier during resume on pre-gen6 platforms"), which
moved the intel_enable_gtt() from i915_gem_init_hw() (also used by
reset) to add it earlier during hw init and resume, missing the reset
path.
v2: Find the culprit, rearrange ggtt_enable to be before gem_init_hw to
match init/resume
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Fixes: 5fbd0418ee ("drm/i915: Re-enable GGTT earlier during resume on pre-gen6 platforms")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101852
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20170906111405.27110-1-chris@chris-wilson.co.uk
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
(cherry picked from commit 0db8c96120)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Add the missing __user to the urelocs cast to fix the following sparse
warning:
i915_gem_execbuffer.c:1541:47: warning: cast removes address space of expression
i915_gem_execbuffer.c:1541:62: warning: incorrect type in argument 2 (different address spaces)
i915_gem_execbuffer.c:1541:62: expected void const [noderef] <asn:1>*from
i915_gem_execbuffer.c:1541:62: got char *
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Fixes: 2889caa923 ("drm/i915: Eliminate lots of iterations over the execobjects array")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170901165434.24636-1-ville.syrjala@linux.intel.com
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> #irc
(cherry picked from commit 908a610557)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>