change_pte_range is called from task work context to mark PTEs for
receiving NUMA faulting hints. If the marked pages are dirty then
migration may fail. Some filesystems cannot migrate dirty pages without
blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
Even when they can, it can be a waste of cycles when the pages are
shared forcing higher scan rates. This patch avoids marking shared
dirty pages for hinting faults but also will skip a migration if the
page was dirtied after the scanner updated a clean page.
This is most noticeable running the NASA Parallel Benchmark when backed
by btrfs, the default root filesystem for some distributions, but also
noticeable when using XFS.
The following are results from a 4-socket machine running a 4.16-rc4
kernel with some scheduler patches that are pending for the next merge
window.
4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1
Time cg.D 459.07 ( 0.00%) 444.21 ( 3.24%)
Time ep.D 76.96 ( 0.00%) 77.69 ( -0.95%)
Time is.D 25.55 ( 0.00%) 27.85 ( -9.00%)
Time lu.D 601.58 ( 0.00%) 596.87 ( 0.78%)
Time mg.D 107.73 ( 0.00%) 108.22 ( -0.45%)
is.D regresses slightly in terms of absolute time but note that that
particular load varies quite a bit from run to run. The more relevant
observation is the total system CPU usage.
4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1
User 71471.91 70627.04
System 11078.96 8256.13
Elapsed 661.66 632.74
That is a substantial drop in system CPU usage and overall the workload
completes faster. The NUMA balancing statistics are also interesting
NUMA base PTE updates 111407972 139848884
NUMA huge PMD updates 206506 264869
NUMA page range updates 217139044 275461812
NUMA hint faults 4300924 3719784
NUMA hint local faults 3012539 3416618
NUMA hint local percent 70 91
NUMA pages migrated 1517487 1358420
While more PTEs are scanned due to changes in what faults are gathered,
it's clear that a far higher percentage of faults are local as the bulk
of the remote hits were dirty pages that, in this case with btrfs, had
no chance of migrating.
The following is a comparison when using XFS as that is a more realistic
filesystem choice for a data partition
4.16.0-rc4 4.16.0-rc4
schedtip-20180309 nodirty-v1r47
Time cg.D 485.28 ( 0.00%) 442.62 ( 8.79%)
Time ep.D 77.68 ( 0.00%) 77.54 ( 0.18%)
Time is.D 26.44 ( 0.00%) 24.79 ( 6.24%)
Time lu.D 597.46 ( 0.00%) 597.11 ( 0.06%)
Time mg.D 142.65 ( 0.00%) 105.83 ( 25.81%)
That is a reasonable gain on two relatively long-lived workloads. While
not presented, there is also a substantial drop in system CPu usage and
the NUMA balancing stats show similar improvements in locality as btrfs
did.
Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fix typos and syntaxes, thanks to Randy Dunlap for pointing them out
(they were all my faults).
Link: http://lkml.kernel.org/r/20180409151859.4713-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last fix was still wrong, as we need the inline dummy functions also
for the case that CONFIG_HMM is enabled but CONFIG_HMM_MIRROR is not:
kernel/fork.o: In function `__mmdrop':
fork.c:(.text+0x14f6): undefined reference to `hmm_mm_destroy'
This adds back the second copy of the dummy functions, hopefully
this time in the right place.
Link: http://lkml.kernel.org/r/20180404110236.804484-1-arnd@arndb.de
Fixes: 8900d06a277a ("mm/hmm: fix header file if/else/endif maze")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
actually uses the RCU protection. The only caller is
hmm_devmem_pages_create() which already grabs the mutex and does
superfluous rcu_read_lock/unlock() around the function.
This doesn't add anything and just adds to confusion. Remove the RCU
protection and open-code the radix tree lookup. If this needs to become
more sophisticated in the future, let's add them back when necessary.
Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
pfn shift value allowing them to define their own encoding for HMM pfn
that are fill inside the pfns array of the hmm_range struct. With this
device driver can get pfn that match their own private encoding out of HMM
without having to do any conversion.
[rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
[rcampbell@nvidia.com: clarify fault logic for device private memory]
Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes hmm_vma_fault() to not take a global write fault flag for a
range but instead rely on caller to populate HMM pfns array with proper
fault flag ie HMM_PFN_VALID if driver want read fault for that address or
HMM_PFN_VALID and HMM_PFN_WRITE for write.
Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
device private memory to be migrated back to system memory through page
fault.
This is more flexible API and it better reflects how device handles and
reports fault.
Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional change, just create one function to handle pmd and one to
handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move hmm_pfns_clear() closer to where it is used to make it clear it is
not use by page table walkers.
Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make naming consistent across code, DEVICE_PRIVATE is the name use outside
HMM code so use that one.
Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no point in differentiating between a range for which there is
not even a directory (and thus entries) and empty entry (pte_none() or
pmd_none() returns true).
Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Special vma (one with any of the VM_SPECIAL flags) can not be access by
device because there is no consistent model across device drivers on those
vma and their backing memory.
This patch directly use hmm_range struct for hmm_pfns_special() argument
as it is always affecting the whole vma and thus the whole range.
It also make behavior consistent after this patch both hmm_vma_fault() and
hmm_vma_get_pfns() returns -EINVAL when facing such vma. Previously
hmm_vma_fault() returned 0 and hmm_vma_get_pfns() return -EINVAL but both
were filling the HMM pfn array with special entry.
Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All device driver we care about are using 64bits page table entry. In
order to match this and to avoid useless define convert all HMM pfn to
directly use uint64_t. It is a first step on the road to allow driver to
directly use pfn value return by HMM (saving memory and CPU cycles use for
conversion between the two).
Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only peculiar architecture allow write without read thus assume that any
valid pfn do allow for read. Note we do not care for write only because
it does make sense with thing like atomic compare and exchange or any
other operations that allow you to get the memory value through them.
Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range struct
as parameter and were initializing that struct with others of their
parameters. Have caller of those function do this as they are likely to
already do and only pass this struct to both function this shorten
function signature and make it easier in the future to add new parameters
by simply adding them to the structure.
Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The private field of mm_walk struct point to an hmm_vma_walk struct and
not to the hmm_range struct desired. Fix to get proper struct pointer.
Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code was lost in translation at one point. This properly call
mmu_notifier_unregister_no_release() once last user is gone. This fix the
zombie mm_struct as without this patch we do not drop the refcount we have
on it.
Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hmm_mirror_register() registers a callback for when the CPU pagetable is
modified. Normally, the device driver will call hmm_mirror_unregister()
when the process using the device is finished. However, if the process
exits uncleanly, the struct_mm can be destroyed with no warning to the
device driver.
Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong. Because of
this after multiple include there was multiple definition of both
hmm_mm_init() and hmm_mm_destroy() leading to build failure if HMM was
enabled (CONFIG_HMM set).
Link: http://lkml.kernel.org/r/20180323005527.758-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the documentation for HMM to fix minor typos and phrasing to be a
bit more readable.
Link: http://lkml.kernel.org/r/20180323005527.758-2-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Stephen Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg reclaim may alter pgdat->flags based on the state of LRU lists in
cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
have powers to clear any of these bits. This might just never happen if
cgroup limits configured that way. So all direct reclaims will stall as
long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
it's reasonable to leave all decisions about node state to kswapd.
Why only kswapd? Why not allow to global direct reclaim change these
flags? It is because currently only kswapd can clear these flags. I'm
less worried about the case when PGDAT_CONGESTED falsely not set, and
more worried about the case when it falsely set. If direct reclaimer
sets PGDAT_CONGESTED, do we have guarantee that after the congestion
problem is sorted out, kswapd will be woken up and clear the flag? It
seems like there is no such guarantee. E.g. direct reclaimers may
eventually balance pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).
Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
now loses its congestion throttling mechanism. Add per-cgroup
congestion state and throttle cgroup2 reclaimers if memcg is in
congestion state.
Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
The problem could be easily demonstrated by creating heavy congestion in
one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of diry data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.
With the patch, cat don't waste time anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
[aryabinin@virtuozzo.com: congestion state should be per-node]
Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
[ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have separate LRU list for each memory cgroup. Memory reclaim
iterates over cgroups and calls shrink_inactive_list() every inactive
LRU list. Based on the state of a single LRU shrink_inactive_list() may
flag the whole node as dirty,congested or under writeback. This is
obviously wrong and hurtful. It's especially hurtful when we have
possibly small congested cgroup in system. Than *all* direct reclaims
waste time by sleeping in wait_iff_congested(). And the more memcgs in
the system we have the longer memory allocation stall is, because
wait_iff_congested() called on each lru-list scan.
Sum reclaim stats across all visited LRUs on node and flag node as
dirty, congested or under writeback based on that sum. Also call
congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.
This only fixes the problem for global reclaim case. Per-cgroup reclaim
may alter global pgdat flags too, which is wrong. But that is separate
issue and will be addressed in the next patch.
This change will not have any effect on a systems with all workload
concentrated in a single cgroup.
[aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only kswapd can have non-zero nr_immediate, and current_may_throttle()
is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.
Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update some comments that became stale since transiton from per-zone to
per-node reclaim.
Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).
So, the overcommit logic should treat it as free.
Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g. dentry external names.
If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.
The following program illustrates the approach. It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.
int main()
{
char buf[256];
unsigned long i;
struct stat statbuf;
buf[0] = '/';
for (i = 1; i < sizeof(buf); i++)
buf[i] = '_';
for (i = 0; 1; i++) {
sprintf(&buf[248], "%8lu", i);
stat(buf, &statbuf);
}
return 0;
}
This patch in combination with related indirectly reclaimable memory
patches closes this issue.
Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I received a report about suspicious growth of unreclaimable slabs on
some machines. I've found that it happens on machines with low memory
pressure, and these unreclaimable slabs are external names attached to
dentries.
External names are allocated using generic kmalloc() function, so they
are accounted as unreclaimable. But they are held by dentries, which
are reclaimable, and they will be reclaimed under the memory pressure.
In particular, this breaks MemAvailable calculation, as it doesn't take
unreclaimable slabs into account. This leads to a silly situation, when
a machine is almost idle, has no memory pressure and therefore has a big
dentry cache. And the resulting MemAvailable is too low to start a new
workload.
To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
used to track the amount of memory, consumed by external names. The
counter is increased in the dentry allocation path, if an external name
structure is allocated; and it's decreased in the dentry freeing path.
To reproduce the problem I've used the following Python script:
import os
for iter in range (0, 10000000):
try:
name = ("/some_long_name_%d" % iter) + "_" * 220
os.stat(name)
except Exception:
pass
Without this patch:
$ cat /proc/meminfo | grep MemAvailable
MemAvailable: 7811688 kB
$ python indirect.py
$ cat /proc/meminfo | grep MemAvailable
MemAvailable: 2753052 kB
With the patch:
$ cat /proc/meminfo | grep MemAvailable
MemAvailable: 7809516 kB
$ python indirect.py
$ cat /proc/meminfo | grep MemAvailable
MemAvailable: 7749144 kB
[guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
[guro@fb.com: fix indirectly reclaimable memory accounting]
Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust /proc/meminfo MemAvailable calculation by adding the amount of
indirectly reclaimable memory (rounded to the PAGE_SIZE).
Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "indirectly reclaimable memory", v2.
This patchset introduces the concept of indirectly reclaimable memory
and applies it to fix the issue of when a big number of dentries with
external names can significantly affect the MemAvailable value.
This patch (of 3):
Introduce a concept of indirectly reclaimable memory and adds the
corresponding memory counter and /proc/vmstat item.
Indirectly reclaimable memory is any sort of memory, used by the kernel
(except of reclaimable slabs), which is actually reclaimable, i.e. will
be released under memory pressure.
The counter is in bytes, as it's not always possible to count such
objects in pages. The name contains BYTES by analogy to
NR_KERNEL_STACK_KB.
Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are the main MIPS changes for 4.17. Rough overview:
(1) generic platform: Add support for Microsemi Ocelot SoCs
(2) crypto: Add CRC32 and CRC32C HW acceleration module
(3) Various cleanups and misc improvements
Miscellaneous:
- Hang more efficiently on halt/powerdown/restart
- pm-cps: Block system suspend when a JTAG probe is present
- Expand make help text for generic defconfigs
- Refactor handling of legacy defconfigs
- Determine the entry point from the ELF file header to fix microMIPS
for certain toolchains
- Introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code
Minor cleanups:
- DTS: boston/ci20: Unit name cleanups and correction
- kdump: Make the default for PHYSICAL_START always 64-bit
- Constify gpio_led in Alchemy, AR7, and TXX9
- Silence a couple of W=1 warnings
- Remove duplicate includes
Platform support:
ath79:
- Fix AR724X_PLL_REG_PCIE_CONFIG offset
BCM47xx:
- FIRMWARE: Use mac_pton() for MAC address parsing
- Add Luxul XAP1500/XWR1750 WiFi LEDs
- Use standard reset button for Luxul XWR-1750
BMIPS:
- Enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
- Add STB PM, wake-up timer, watchdog DT nodes
Generic platform:
- Add support for Microsemi Ocelot
- dt-bindings: Add vendor prefix for Microsemi Corporation
- dt-bindings: Add bindings for Microsemi SoCs
- Add ocelot SoC & PCB123 board DTS files
- MAINTAINERS: Add entry for Microsemi MIPS SoCs
- Enable crc32-mips on r6 configs
Octeon:
- Drop '.' after newlines in printk calls
ralink:
- pci-mt7621: Enable PCIe on MT7688
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEd80NauSabkiESfLYbAtpk944dnoFAlrL1tMACgkQbAtpk944
dnpkhg//UnKOmj4xiQf5ik4XjSDHo2dcUzQKPLq/grzpc9yC2I6bgJ32gcQnD/rw
x2f64DAjmbIJEY/hCGLEOH82/SqU5vF5gI+BtSKE6Ti28dyQH68pvyHRYIZBFx62
WRYAf50Spa/q/bOJYv4/eTqjEc7FPMaUll3kH5NMiBB3X5bIeq3w+x78YBKjAqPB
IJgHF/JnlpD8UvQ0RCl1fIXhkRYeNuTyiAseVI3ygAqbMFMGkTevePWl82D+pWuo
W9BFaliNt/9TNgq5/L91a/uhSuo6INRls1tFQT9cPa6n4GJxLpUtO7a3v4Xgaukj
TeXTkXZl4PEUfeTA3cVaXrENTN4sCwThkxvrKwefbUBiIa+/TDi/Wgxi3n+rKwi2
Vu4V36uxnO/vNU7Sgp/jHdRJyxw06GfRy55MvToFlk1S25Gwt6zW2O0o8nAHwCDW
zvoRfSkasElKwnM3zKR/mrdDn7qPulFfiyWyKuuY3bIXsaTnItO1JJ/rbgKuIXr7
lo0eVZ7GhuWfuLCEHZ99rWsGz0SjuQQIb4c6faNELp6gTwIpz9bbsOHRuajvfoSF
8jmPjUocbaRR0odyFBqsejwBmeebDoeIZljzD50ImSoYftEMgRoQ448mIm0EuWVU
KY+85MIeK5zwWXKd4VBexgLvPggGURaOEEXKisgvoE3NjCuO2nw=
=4vUc
-----END PGP SIGNATURE-----
Merge tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips
Pull MIPS updates from James Hogan:
"These are the main MIPS changes for 4.17. Rough overview:
(1) generic platform: Add support for Microsemi Ocelot SoCs
(2) crypto: Add CRC32 and CRC32C HW acceleration module
(3) Various cleanups and misc improvements
More detailed summary:
Miscellaneous:
- hang more efficiently on halt/powerdown/restart
- pm-cps: Block system suspend when a JTAG probe is present
- expand make help text for generic defconfigs
- refactor handling of legacy defconfigs
- determine the entry point from the ELF file header to fix microMIPS
for certain toolchains
- introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code
Minor cleanups:
- DTS: boston/ci20: Unit name cleanups and correction
- kdump: Make the default for PHYSICAL_START always 64-bit
- constify gpio_led in Alchemy, AR7, and TXX9
- silence a couple of W=1 warnings
- remove duplicate includes
Platform support:
Generic platform:
- add support for Microsemi Ocelot
- dt-bindings: Add vendor prefix for Microsemi Corporation
- dt-bindings: Add bindings for Microsemi SoCs
- add ocelot SoC & PCB123 board DTS files
- MAINTAINERS: Add entry for Microsemi MIPS SoCs
- enable crc32-mips on r6 configs
ath79:
- fix AR724X_PLL_REG_PCIE_CONFIG offset
BCM47xx:
- firmware: Use mac_pton() for MAC address parsing
- add Luxul XAP1500/XWR1750 WiFi LEDs
- use standard reset button for Luxul XWR-1750
BMIPS:
- enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
- add STB PM, wake-up timer, watchdog DT nodes
Octeon:
- drop '.' after newlines in printk calls
ralink:
- pci-mt7621: Enable PCIe on MT7688"
* tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (37 commits)
MIPS: BCM47XX: Use standard reset button for Luxul XWR-1750
MIPS: BCM47XX: Add Luxul XAP1500/XWR1750 WiFi LEDs
MIPS: Make the default for PHYSICAL_START always 64-bit
MIPS: Use the entry point from the ELF file header
MAINTAINERS: Add entry for Microsemi MIPS SoCs
MIPS: generic: Add support for Microsemi Ocelot
MIPS: mscc: Add ocelot PCB123 device tree
MIPS: mscc: Add ocelot dtsi
dt-bindings: mips: Add bindings for Microsemi SoCs
dt-bindings: Add vendor prefix for Microsemi Corporation
MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
MIPS: pci-mt7620: Enable PCIe on MT7688
MIPS: pm-cps: Block system suspend when a JTAG probe is present
MIPS: VDSO: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: BPF: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: Introduce isa-rev.h to define MIPS_ISA_REV
MIPS: Hang more efficiently on halt/powerdown/restart
FIRMWARE: bcm47xx_nvram: Replace mac address parsing
MIPS: BMIPS: Add Broadcom STB watchdog nodes
...
- Tom Zanussi's extended histogram work
This adds the synthetic events to have histograms from multiple event data
Adds triggers "onmatch" and "onmax" to call the synthetic events
Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with them)
And other various fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlrLoCAUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUks/QwAn/ky8WgfjcRdjKmBYuEwDedvm9iI
V9G5kpv5JMw5dLz4l1pS3tA3M9Lyuc5z3Shw92FTy36vdU1wxEjQgHa7viB1xk9x
KsiTyNjTsgrRd7GVHMy/8Be2RRiTRLaXKAsLCoj/c7QWzagV1P8XWlWK5mojYkh/
DrSXyg9Avkp30+sU1bvcLWnmmZUFqMxs+bWipD9uFc98USMMyeP25nrnhrj0gDTg
Q93cjXUuyVRC4lJ2YTW0GCSKhMKEw5f/ltEOT1hwScqYkCJj1EubKqS53R/9h21z
IPUrYcqLnMRu0j2ejR+UAy5Vsy3gJUrPMQb0F6hlu1DwbMd0d/9SGh1c+Sm+zorh
yftWTdCZsYrXkaOuB6V5M30X+KBwbWO0Xc9VCvgJ/IU5vMlgLSt5itTWbT/Fmfhb
ll5/RXP7zhSXRv5sdl/BP3/4dd6F8jpyKyaR2Rk2+XjBOGIq5mvqNGr4Vj9AzxW8
E0nvq7l7e0dbxZNM42gEm3cht1VUg7Zz0Y0+
=91oN
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New features:
- Tom Zanussi's extended histogram work.
This adds the synthetic events to have histograms from multiple
event data Adds triggers "onmatch" and "onmax" to call the
synthetic events Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with
them)
And other various fixes and clean ups"
* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
init: Have initcall_debug still work without CONFIG_TRACEPOINTS
init, tracing: Have printk come through the trace events for initcall_debug
init, tracing: instrument security and console initcall trace events
init, tracing: Add initcall trace events
tracing: Add rcu dereference annotation for test func that touches filter->prog
tracing: Add rcu dereference annotation for filter->prog
tracing: Fixup logic inversion on setting trace_global_clock defaults
tracing: Hide global trace clock from lockdep
ring-buffer: Add set/clear_current_oom_origin() during allocations
ring-buffer: Check if memory is available before allocation
lockdep: Add print_irqtrace_events() to __warn
vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
tracing: Uninitialized variable in create_tracing_map_fields()
tracing: Make sure variable string fields are NULL-terminated
tracing: Add action comparisons when testing matching hist triggers
tracing: Don't add flag strings when displaying variable references
tracing: Fix display of hist trigger expressions containing timestamps
ftrace: Drop a VLA in module_exists()
tracing: Mention trace_clock=global when warning about unstable clocks
tracing: Default to using trace_global_clock if sched_clock is unstable
...
* A rework of the filesytem-dax implementation provides for detection of
unmap operations (truncate / hole punch) colliding with in-progress
device-DMA. A fix for these collisions remains a work-in-progress
pending resolution of truncate latency and starvation regressions.
* The of_pmem driver expands the users of libnvdimm outside of x86 and
ACPI to describe an implementation of persistent memory on PowerPC with
Open Firmware / Device tree.
* Address Range Scrub (ARS) handling is completely rewritten to account for
the fact that ARS may run for 100s of seconds and there is no platform
defined way to cancel it. ARS will now no longer block namespace
initialization.
* The NVDIMM Namespace Label implementation is updated to handle label
areas as small as 1K, down from 128K.
* Miscellaneous cleanups and updates to unit test infrastructure.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
/NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
9eGJG1H+JspcbtLF9iHn
=acZQ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This cycle was was not something I ever want to repeat as there were
several late changes that have only now just settled.
Half of the branch up to commit d2c997c0f1 ("fs, dax: use
page->mapping to warn...") have been in -next for several releases.
The of_pmem driver and the address range scrub rework were late
arrivals, and the dax work was scaled back at the last moment.
The of_pmem driver missed a previous merge window due to an oversight.
A sense of obligation to rectify that miss is why it is included for
4.17. It has acks from PowerPC folks. Stephen reported a build failure
that only occurs when merging it with your latest tree, for now I have
fixed that up by disabling modular builds of of_pmem. A test merge
with your tree has received a build success report from the 0day robot
over 156 configs.
An initial version of the ARS rework was submitted before the merge
window. It is self contained to libnvdimm, a net code reduction, and
passing all unit tests.
The filesystem-dax changes are based on the wait_var_event()
functionality from tip/sched/core. However, late review feedback
showed that those changes regressed truncate performance to a large
degree. The branch was rewound to drop the truncate behavior change
and now only includes preparation patches and cleanups (with full acks
and reviews). The finalization of this dax-dma-vs-trnucate work will
need to wait for 4.18.
Summary:
- A rework of the filesytem-dax implementation provides for detection
of unmap operations (truncate / hole punch) colliding with
in-progress device-DMA. A fix for these collisions remains a
work-in-progress pending resolution of truncate latency and
starvation regressions.
- The of_pmem driver expands the users of libnvdimm outside of x86
and ACPI to describe an implementation of persistent memory on
PowerPC with Open Firmware / Device tree.
- Address Range Scrub (ARS) handling is completely rewritten to
account for the fact that ARS may run for 100s of seconds and there
is no platform defined way to cancel it. ARS will now no longer
block namespace initialization.
- The NVDIMM Namespace Label implementation is updated to handle
label areas as small as 1K, down from 128K.
- Miscellaneous cleanups and updates to unit test infrastructure"
* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
libnvdimm, of_pmem: workaround OF_NUMA=n build error
nfit, address-range-scrub: add module option to skip initial ars
nfit, address-range-scrub: rework and simplify ARS state machine
nfit, address-range-scrub: determine one platform max_ars value
powerpc/powernv: Create platform devs for nvdimm buses
doc/devicetree: Persistent memory region bindings
libnvdimm: Add device-tree based driver
libnvdimm: Add of_node to region and bus descriptors
libnvdimm, region: quiet region probe
libnvdimm, namespace: use a safe lookup for dimm device name
libnvdimm, dimm: fix dpa reservation vs uninitialized label area
libnvdimm, testing: update the default smart ctrl_temperature
libnvdimm, testing: Add emulation for smart injection commands
nfit, address-range-scrub: introduce nfit_spa->ars_state
libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
nfit, address-range-scrub: fix scrub in-progress reporting
dax, dm: allow device-mapper to operate without dax support
dax: introduce CONFIG_DAX_DRIVER
fs, dax: use page->mapping to warn if truncate collides with a busy page
ext2, dax: introduce ext2_dax_aops
...
Subsystem:
- Add tracepoints
- Rework of the RTC/nvmem API to allow drivers to discard struct nvmem_config
after registration
- New range API, drivers can now expose the useful range of the RTC
- New offset API the core is now able to add an offset to the RTC time,
modifying the supported range.
- Multiple rtc_time64_to_tm fixes
- Handle time_t overflow on 32 bit platforms in the core instead of letting
drivers do crazy things.
- remove rtc_control API
New driver:
- Intersil ISL12026
Drivers:
- Drivers exposing the RTC non volatile memory have been converted to use nvmem
- Removed useless time and date validation
- Removed an indirection pattern that was a cargo cult from ancient drivers
- Removed VLA usage
- Fixed a possible race condition in probe functions
- AB8540 support is dropped from ab8500
- pcf85363 now has alarm support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEXx9Viay1+e7J/aM4AyWl4gNJNJIFAlrL3s4ACgkQAyWl4gNJ
NJI+MxAAgc56UEXi5chqKpHE6GHL2aPWan9duHB6FmGrKWt/FJmpJ8UGylLFlvaW
dpRGvW0Dynz45UztegcrbHMU6+B2O0hboL4/GVZpYAkcNAcu7Lf0ULho2rsQSDmW
WJpemmdRxyQY1IkWmw7z7KAkMzhAfYZiVmWmVwMRZfMcKJ3DLEldfgRtkN+g0UdB
tayWQY3mS02ki16e2figsgwZRmUUhQslDfpKlesInXOzUMmLgVWhf1QxJSEUcfs0
AMp75vD2YvVJ/RHy/6BilQbqP9EVnaG4NHqJGFSOddazA7u+3HGubEFboI8NuPXb
2fCvfrNux7pgtQsBF9dnpCqWlukE7sF5aDyIUvjYnr0vUm2D/CwdXglGvQSQVWea
5GxPTWBdaCL0V7GD5OSZcfUGyz1TN/7NSUItdLSr9YK13dL+tqhcYYm5ytXJLzvO
Z4GyUEoCOMprMJ9j5KU/TXSjauDmPDl8YZ5B93lPcNOh7y+b/2r3umBaInyZrFzX
1WJ6FWtuhbRfEwuQtgQHBsobt9eTwZfo8C2y22HBo/fdWfBv45feiiHNXt3OVrA+
aws6pwfVf1H0UsvpQkXtlPthAx3sbOTndDKAUhntb/U/zA9c7fTrfUyVOQ1bArJy
p6tl/cmlpq68AjZB+0d3zUXQkQH4Syu+EqZrWItkE6XxZm+khDE=
=l9i8
-----END PGP SIGNATURE-----
Merge tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"This contains a few series that have been in preparation for a while
and that will help systems with RTCs that will fail in 2038, 2069 or
2100.
Subsystem:
- Add tracepoints
- Rework of the RTC/nvmem API to allow drivers to discard struct
nvmem_config after registration
- New range API, drivers can now expose the useful range of the RTC
- New offset API the core is now able to add an offset to the RTC
time, modifying the supported range.
- Multiple rtc_time64_to_tm fixes
- Handle time_t overflow on 32 bit platforms in the core instead of
letting drivers do crazy things.
- remove rtc_control API
New driver:
- Intersil ISL12026
Drivers:
- Drivers exposing the RTC non volatile memory have been converted to
use nvmem
- Removed useless time and date validation
- Removed an indirection pattern that was a cargo cult from ancient
drivers
- Removed VLA usage
- Fixed a possible race condition in probe functions
- AB8540 support is dropped from ab8500
- pcf85363 now has alarm support"
* tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (128 commits)
rtc: snvs: Fix usage of snvs_rtc_enable
rtc: mt7622: fix module autoloading for OF platform drivers
rtc: isl12022: use true and false for boolean values
rtc: ab8500: Drop AB8540 support
rtc: remove a warning during scripts/kernel-doc step
rtc: 88pm860x: remove artificial limitation
rtc: 88pm80x: remove artificial limitation
rtc: st-lpc: remove artificial limitation
rtc: mrst: remove artificial limitation
rtc: mv: remove artificial limitation
rtc: hctosys: Ensure system time doesn't overflow time_t
parisc: time: stop validating rtc_time in .read_time
rtc: pcf85063: fix clearing bits in pcf85063_start_clock
rtc: at91sam9: Set name of regmap_config
rtc: s5m: Remove VLA usage
rtc: s5m: Move enum from rtc.h to rtc-s5m.c
rtc: remove VLA usage
rtc: Add useful timestamp definitions
rtc: Add one offset seconds to expand RTC range
rtc: Factor out the RTC range validation into rtc_valid_range()
...
- make it possible to load radeonfb driver when offb driver is loaded first
(Mathieu Malaterre)
- fix memory leak in offb driver (Mathieu Malaterre)
- fix unaligned access in udlfb driver (Ladislav Michl)
- convert atmel_lcdfb driver to use GPIO descriptors (Ludovic Desroches)
- avoid mismatched prototypes in sisfb driver (Arnd Bergmann)
- remove VLA usage from viafb driver (Gustavo A. R. Silva)
- add missing help text to FB_I810_I2 config option (Ulf Magnusson)
- misc fixes (Gustavo A. R. Silva, Colin Ian King, Markus Elfring)
- remove dead code from s3c-fb driver for Exynos and S5PV210 platforms
- misc cleanups (Corentin Labbe, Ladislav Michl, Ulf Magnusson, Vladimir
Zapolskiy, Markus Elfring)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJazJIDAAoJEH4ztj+gR8ILdYAQAJiFieXsawycxk2k5IUcYv0Q
ZVFSk0Y6TjDh3uo8Evv10r863PxwVrl7ii/KjfEPZYBslvUnBzEKqOHkpwLkxk1Q
e8JspXqMjgfAUq9Pvjdd88SJMxMm0UetLan6qjMcyc7mRcISS/fCSYxBiwqcOn3o
Y4etjk4hF6bvL+TXMb4faeDDn/Q61113T9kgcc9sYIqJUOnaui/lInOFGjBtwraV
IEoOlxC5ahG+l+ae0YkayTdH14LZF5DbibLufHAbIQ1cFbPF9/Iox+OizDUrqF6s
aQjGuDGxkUokCZBC7uQOqijePSCJRNTI2gQsYXyOaolQ5zXjda7yUmmDrM1fPcJB
v/oPRy6w9EwL0JOs0BVR+qJxO5+dBIAre/WSms+oMay1KR/Lbuk5u3CE7clEkg39
ChlbsrYLOK+j1fvdcsHJjjDnWZoxF11u+J2cehdObS84Vz+B8/Ltnl7hN0mcJYzu
WPXaiu337GJtC89Iggfytqn5TsYzyB0bxzc/Ti6LuNX+ptogLp9BRIO3w839saph
9ogaoYO1g/2EpzfLzh7+7evSJN9t+N//cp85VPJBpd30x1KLffwh1uCJDLOFpT0u
suEVbSscPIiMDcPoLF7l6I6t4+vXmP11JGbgw8SkIruTjGDpuWfeIuoOh7NlTLjb
C+32XFVjrKjENrrLT6UW
=XqjJ
-----END PGP SIGNATURE-----
Merge tag 'fbdev-v4.17' of git://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
"There is nothing really major here, just a couple of small bugfixes,
improvements and cleanups:
- make it possible to load radeonfb driver when offb driver is loaded
first (Mathieu Malaterre)
- fix memory leak in offb driver (Mathieu Malaterre)
- fix unaligned access in udlfb driver (Ladislav Michl)
- convert atmel_lcdfb driver to use GPIO descriptors (Ludovic
Desroches)
- avoid mismatched prototypes in sisfb driver (Arnd Bergmann)
- remove VLA usage from viafb driver (Gustavo A. R. Silva)
- add missing help text to FB_I810_I2 config option (Ulf Magnusson)
- misc fixes (Gustavo A. R. Silva, Colin Ian King, Markus Elfring)
- remove dead code from s3c-fb driver for Exynos and S5PV210
platforms
- misc cleanups (Corentin Labbe, Ladislav Michl, Ulf Magnusson,
Vladimir Zapolskiy, Markus Elfring)"
* tag 'fbdev-v4.17' of git://github.com/bzolnier/linux: (32 commits)
video: fbdev: s3c-fb: remove dead platform code for Exynos and S5PV210 platforms
video: au1100fb: Delete an unnecessary variable initialisation in au1100fb_drv_probe()
video: au1100fb: Improve a size determination in au1100fb_drv_probe()
video: au1100fb: Delete an error message for a failed memory allocation in au1100fb_drv_probe()
video/console/sticore: Delete an error message for a failed memory allocation in sti_try_rom_generic()
video: ARM CLCD: Improve a size determination in clcdfb_probe()
video: ARM CLCD: Delete an error message for a failed memory allocation in clcdfb_probe()
video: matroxfb: Delete an error message for a failed memory allocation in matroxfb_crtc2_probe()
video: s3c-fb: Improve a size determination in s3c_fb_probe()
video: s3c-fb: Delete an error message for a failed memory allocation in s3c_fb_probe()
video: fsl-diu-fb: Delete an error message for a failed memory allocation in fsl_diu_init()
video: ssd1307fb: Improve a size determination in ssd1307fb_probe()
video: smscufx: Delete an error message for a failed memory allocation in ufx_realloc_framebuffer()
video: smscufx: Return an error code only as a constant in ufx_realloc_framebuffer()
video: smscufx: Less checks in ufx_usb_probe() after error detection
video: udlfb: Return an error code only as a constant in dlfb_realloc_framebuffer()
video/fbdev/stifb: Delete an error message for a failed memory allocation in stifb_init_fb()
video/fbdev/stifb: Return -ENOMEM after a failed kzalloc() in stifb_init_fb()
video: fbdev: aty128fb: use true and false for boolean values
fbdev: aty: fix missing indentation in if statement
...
The main purpose of this pull request is a fix for a regression
in the recent PCM OSS emulation code that may lead to RCU stall.
Since syzkaller hits this too often, I send the pull request now
with a minimal collection. Possibly another pull request may
follow before RC1.
The other fixes here are for USB-audio class 2 and 3 to improve
the parser for the clock descriptors. These are rather cleanups
but good for security, too.
Last but not least, another included fix is the trivial one to
remove superfluous WARN_ON() that annoyed syzbot.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAlrLiZ4OHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE8YshAAr0j8P7eUfHs6laTZhYaTzFmGBHpabL1rIw1U
QvcqbGDXAACSELud+wQe91nML2vKsv8LE18aCRqjA5ZSY5beHa0wXurAyQdxx4tb
LPZVUEUTeh1Vs1SCrVE4ZHtMZPcDiOVsm3Y4YfMcg3bqyNXTGDyVk7Co+p4dzGA0
XU4z/K93aXPWEG8zomYNdbyiubxz7Kwuo0gVGSuRNarNDLKZS0cdOoiYCvA7vits
HVTG5O/GOxFmKTYnrxByKclFkVH5PhUylIdvDwecxyIOVM6tmKwjCRG0q8DGi1ic
QKxjbnDvbMkJBj1N9aFJ4QbgVKFhZP4DJil9T874OjscWnzXc/9z1EKlNbHWdYH+
mQNtN94Z0YK0jFPdJZf/WePIcyb7XeTAmhKY8cNhkO12bcfzQbcef9I3dsISMCc3
o0z6RQdc9KWuawXvLNIYh+/O4seWXNG026qZhZcFRpoIQ4HhaJxxsSPz7aZE47ha
AmLRUiwmv0aDHgvGgnBQuwqa2qi0tre/OyW8ciiN9uY+hAC+P7mnD/wdEBYoSPqP
0i1zFe/ikxbJXL4vg7+SXTN6pZ8l2ZRQrldXmxEB7mf7WWBM3JoDH4QBRILkobA0
qc+0NF3oxpQFOJjwefeRr3dwaRvIrqYt2agY5UwvxhNYQmiIbwgBz8VsUMK1CYNZ
95u57q8=
=3GP4
-----END PGP SIGNATURE-----
Merge tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"The main purpose of this pull request is a fix for a regression in the
recent PCM OSS emulation code that may lead to RCU stall. Since
syzkaller hits this too often, I send the pull request now with a
minimal collection. Possibly another pull request may follow before
RC1.
The other fixes here are for USB-audio class 2 and 3 to improve the
parser for the clock descriptors. These are rather cleanups but good
for security, too.
Last but not least, another included fix is the trivial one to remove
superfluous WARN_ON() that annoyed syzbot"
* tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() error
ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
ALSA: usb-audio: Add sanity checks in UAC3 clock parsers
ALSA: usb-audio: More strict sanity checks for clock parsers
ALSA: usb-audio: Refactor clock finder helpers
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJay4MSAAoJEAhfPr2O5OEVtC4QAIT57PnaoQnY/5WJS44D0/2v
+YrMDVg8qE2kU2tOtVqCZTtlivxF+QUh2IkJXmXkq0cLQ4DBlND/Ftpb1fOl9lhb
Kvy/B0Cl3v/kIcsLNZ+QAXw8mRkoOumFrG4fuz9Javb+7J6xu/RGvMRohRTMZHeX
9aTbfhDeVtumvgiYyt/MQFLwzQuoq4FGNEimxTmnp0YYz0qC5f/Iqf2/IIHek+tz
YQcBOD8lwqGhTOe81zOktgyzjoV+aWXwkvbHTCnQ/1ieuSzYIQ0J07lUEA0j/2gC
k9YptubzeynKG1o00WN+BjjdzYiND3akuOHr7Vl8BPChQr2dMxukbWZiDJSqr4vh
yJhNpoHeUoYndzfbdIUd7P+smm2/NoK1sJLwtXGUip7isr/LEWu6eGr7M7DJIGEj
xuEGxXP7pZ5xCw6yLNcv7KLToSlul4MzAoK+q/+R9bYYOIKvChJCvYuNiPBjkkOS
PWaNk0uZFLqmd53E6XsG+BoXBeVemf6QI6BD1ivaRQKfA7y3vwzclQeqd8KfZ6r/
Gn/9iNFxLhI+2ljQaoaYccCfNkfeYQ+BGNW+RHgEWUopXDQXd5ROIwmWBOWdGpLl
eM0pD/tNENAvmzHA5mRnWVPo8pClOILEvz8LxUqaMJX3UaDqzKo+dCb4wb0Lzz2G
sSZhoKsNKt+7lIkg4FDk
=+PjT
-----END PGP SIGNATURE-----
Merge tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A series of media updates/fixes for 4.17.
There are two important core fix patches in this series:
- A regression fix on Kernel 4.16 with causes it to not work with
some input devices that depend on media core
- A fix at compat32 bits with causes it to OOPS on overlay, and
affects the Kernels where the CVE-2017-13166 was backported
The remaining ones are other random fixes at the documentation and on
drivers.
The biggest part of this series is a set of 18 patches for the Intel
atomisp driver. Currently, it produces hundreds of warnings/errors on
sparse/smatch, causing me to sometimes ignore new warnings on other
drivers that are not so broken. This driver is on really poor state,
even for staging standards: it has several layers of abstraction on
it, and it supports two different hardware. Selecting between them
require to add a define (there isn't even a Kconfig option for such
purpose). Just on this smatch cleanup, I could easily get rid of 8
"do-nothing" files. So, I'm seriously considering its removal from
upstream, if I don't see any real work on addressing the problems
there along this year"
* tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (48 commits)
media: v4l2-core: fix size of devnode_nums[] bitarray
media: v4l2-compat-ioctl32: don't oops on overlay
media: i2c: adv748x: afe: fix sparse warning
media: extended-controls.rst: transmitter -> receiver
media: staging: atomisp: stop duplicating input format types
media: staging: atomisp: get rid of an unused var
media: staging: atomisp: stop mixing enum types
media: staging: atomisp: get rid of some static warnings
media: staging: atomisp: use %p to print pointers
media: staging: atomisp: remove an useless check
media: staging: atomisp: avoid a warning if 32 bits build
media: staging: atomisp: don't access a NULL var
media: staging: atomisp: Get rid of *default.host.[ch]
media: staging: atomisp: get rid of an unused function
media: staging: atomisp: remove unused set_pd_base()
media: staging: atomisp: fix endianess issues
media: staging: atomisp: add a missing include
media: staging: atomisp: get rid of stupid statements
media: staging: atomisp: declare static vars as such
media: staging: atomisp: ia_css_output.host: don't use var before check
...
c6x depends on the macro '_BIG_ENDIAN' being defined or not
to correctly select or define endian-specific macros, structures
or pieces of code.
This macro is predefined by the compiler but sparse knows nothing
about it and thus may pre-process files differently from what
gcc would.
Fix this by passing '-D_BIG_ENDIAN' when compiling a big-endian
kernel, like GCC would have done.
To: Mark Salter <msalter@redhat.com>
To: Aurelien Jacquiot <a-jacquiot@ti.com>
CC: linux-c6x-dev@linux-c6x.org
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Mark Salter <msalter@redhat.com>
Fix build error reported by the 0day bot by including the header
file for that macro.
Fixes this build error: (should fix; not tested)
arch/c6x/platforms/plldata.c: In function 'c6472_setup_clocks':
arch/c6x/platforms/plldata.c:279:33: error: implicit declaration of function 'get_coreid'; did you mean 'get_order'? [-Werror=implicit-function-declaration]
c6x_core_clk.parent = &sysclks[get_coreid() + 1];
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: linux-c6x-dev@linux-c6x.org
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Salter <msalter@redhat.com>
KTHREAD_SIZE has never been used since it has been defined for c6x arch.
Let's remove this useless definition.
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Signed-off-by: Mark Salter <msalter@redhat.com>
Pull networking fixes from David Miller:
1) The sockmap code has to free socket memory on close if there is
corked data, from John Fastabend.
2) Tunnel names coming from userspace need to be length validated. From
Eric Dumazet.
3) arp_filter() has to take VRFs properly into account, from Miguel
Fadon Perlines.
4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.
5) Missing idr_remove() in u32_delete_key(), from Cong Wang.
6) More syzbot stuff. Several use of uninitialized value fixes all
over, from Eric Dumazet.
7) Do not leak kernel memory to userspace in sctp, also from Eric
Dumazet.
8) Discard frames from unused ports in DSA, from Andrew Lunn.
9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
Falcon.
10) Do not access dp83640 PHY registers prematurely after reset, from
Esben Haabendal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vhost-net: set packet weight of tx polling to 2 * vq size
net: thunderx: rework mac addresses list to u64 array
inetpeer: fix uninit-value in inet_getpeer
dp83640: Ensure against premature access to PHY registers after reset
devlink: convert occ_get op to separate registration
ARM: dts: ls1021a: Specify TBIPA register address
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
ibmvnic: Do not reset CRQ for Mobility driver resets
ibmvnic: Fix failover case for non-redundant configuration
ibmvnic: Fix reset scheduler error handling
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix DMA mapping mistakes
tipc: use the right skb in tipc_sk_fill_sock_diag()
sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
net: dsa: Discard frames from unused ports
sctp: do not leak kernel memory to user space
soreuseport: initialise timewait reuseport field
ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
dccp: initialize ireq->ir_mark
net: fix uninit-value in __hw_addr_add_ex()
...
Pull vfs namei updates from Al Viro:
- make lookup_one_len() safe with parent locked only shared(incoming
afs series wants that)
- fix of getname_kernel() regression from 2015 (-stable fodder, that
one).
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
getname_kernel() needs to make sure that ->name != ->iname in long case
make lookup_one_len() safe to use with directory locked shared
new helper: __lookup_slow()
merge common parts of lookup_one_len{,_unlocked} into common helper
+ Documentation cleanups
+ removal of unused code
+ cause some structs to be static
+ implement Orangefs vm_operations fault callout
+ eliminate two single-use functions and put their cleaned up code in line.
+ replace a vmalloc/memset instance with vzalloc
+ fix a race condition bug in wait code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJayk8pAAoJEM9EDqnrzg2+92IP/2AqMOUodqF+AfKrAebQIAEk
DvQefRC2eQXN2Ak8KYSO7oMdqfY7xeVSaKllWsSHdnktI5L9l6AOBa7uYQYF4OQt
sqT5Wj5sbnDS54gOfFt95sUHEQXs5wegTXBE15t4uxaLSpv22mPjzqlTzyZcK7IX
nhSIxEi+utXOW1WodT2xLgS08ac693KXciNNN14NKi5DJ+h6a8Z2kZBq/6ssC1oV
nW2tIfRVWNVk6cfJ+Wq12JksEZOPemzjv2zojv3PZwCcwObOyxvf16nxnLjdkJvw
obYWmR8ZcF3TItBbkxOTfanr+UAhZ55YVYDpitJmV9fpPM9eoPAR0gNWBgq45kaS
s27pvMYf5uvGRJHgEO6hBpmi1dExDDMLfH6Y4IN+zyHKjNfgvfQKwIkzrjlmG6q6
AgO3mbbMJfid2x7lkShuP1uYqUfv03fo2xW3zxWjjxo3tEbADSgzJQ5kjPixDvcB
0ru2SlRVqgYIFqBN8VgWkXKbfcpSEcDITmFEkBKg/SkA08ry4iLk744UqIPa5n76
5DYB0u99nHPMf2BBVsz6L/mUfVuHBYUBx1iII95PqJhVlj2Yi0sIj3AHUzQ4neKS
YWGMk4VQ+B5zCObSohCRv9l3ECUqugpiJzCuZjZZS5LUkj9RO5EJ1LDhOU/31xDL
OUTgZl7f5NykxwIHh3bD
=zJ1Z
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Fixes and cleanups:
- Documentation cleanups
- removal of unused code
- make some structs static
- implement Orangefs vm_operations fault callout
- eliminate two single-use functions and put their cleaned up code in
line.
- replace a vmalloc/memset instance with vzalloc
- fix a race condition bug in wait code"
* tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
Orangefs: documentation updates
orangefs: document package install and xfstests procedure
orangefs: remove unused code
orangefs: make several *_operations structs static
orangefs: implement vm_ops->fault
orangefs: open code short single-use functions
orangefs: replace vmalloc and memset with vzalloc
orangefs: bug fix for a race condition when getting a slot
Commit 0619f0f5e3 ("selinux: wrap selinuxfs state") triggers a BUG
when SELinux is runtime-disabled (i.e. systemd or equivalent disables
SELinux before initial policy load via /sys/fs/selinux/disable based on
/etc/selinux/config SELINUX=disabled).
This does not manifest if SELinux is disabled via kernel command line
argument or if SELinux is enabled (permissive or enforcing).
Before:
SELinux: Disabled at runtime.
BUG: Dentry 000000006d77e5c7{i=17,n=null} still in use (1) [unmount of selinuxfs selinuxfs]
After:
SELinux: Disabled at runtime.
Fixes: 0619f0f5e3 ("selinux: wrap selinuxfs state")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past invalid
privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as of now)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJay19UAAoJEL/70l94x66DGKYIAIu9PTHAEwaX0et15fPW5y2x
rrtS355lSAmMrPJ1nePRQ+rProD/1B0Kizj3/9O+B9OTKKRsorRYNa4CSu9neO2k
N3rdE46M1wHAPwuJPcYvh3iBVXtgbMayk1EK5aVoSXaMXEHh+PWZextkl+F+G853
kC27yDy30jj9pStwnEFSBszO9ua/URdKNKBATNx8WUP6d9U/dlfm5xv3Dc3WtKt2
UMGmog2wh0i7ecXo7hRkMK4R7OYP3ZxAexq5aa9BOPuFp+ZdzC/MVpN+jsjq2J/M
Zq6RNyA2HFyQeP0E9QgFsYS2BNOPeLZnT5Jg1z4jyiD32lAZ/iC51zwm4oNKcDM=
=bPlD
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past
invalid privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as
of now)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
kvm: x86: fix a prototype warning
kvm: selftests: add sync_regs_test
kvm: selftests: add API testing infrastructure
kvm: x86: fix a compile warning
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
KVM: X86: Introduce handle_ud()
KVM: vmx: unify adjacent #ifdefs
x86: kvm: hide the unused 'cpu' variable
KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
kvm: Add emulation for movups/movupd
KVM: VMX: raise internal error for exception during invalid protected mode state
KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
KVM: x86: Fix misleading comments on handling pending exceptions
KVM: x86: Rename interrupt.pending to interrupt.injected
KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
x86/kvm: use Enlightened VMCS when running on Hyper-V
x86/hyper-v: detect nested features
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
...
Commit 3c8ba0d61d ("kernel.h: Retain constant expression output for
max()/min()") rewrote our min/max macros to be very clever, but in the
meantime resurrected a variable name shadow issue that we had had
previously fixed in commit 589a9785ee ("min/max: remove sparse
warnings when they're nested").
That commit talks about the sparse warnings that this shadowing causes,
which we ignored as just a minor annoyance. But it turns out that the
sparse warning is the least of our problems. We actually have a real
bug due to the shadowing through the interaction with "min_not_zero()",
which ends up doing
min(__x, __y)
internally, and then the new declaration of "__x" and "__y" as new
variables in __cmp_once() results in a complete mess of an expression,
and "min_not_zero()" doesn't work at all.
For some odd reason, this only ever caused (reported) problems on s390,
even though it is a generic issue and most of the (obviously successful)
testing of the problematic commit had happened on other architectures.
Quoting Sebastian Ott:
"What happened is that the bio build by the partition detection code
was attempted to be split by the block layer because the block queue
had a max_sector setting of 0. blk_queue_max_hw_sectors uses
min_not_zero."
So re-introduce the use of __UNIQUE_ID() to make sure that the min/max
macros do not have these kinds of clashes.
[ That said, __UNIQUE_ID() itself has several issues that make it less
than wonderful.
In particular, the "uniqueness" has a fallback on the line number,
which means that it's not actually unique in more complex cases if you
don't build with gcc or clang (which have working unique counters that
aren't tied to line numbers).
That historical broken fallback also means that we have that pointless
"prefix" argument that doesn't actually make much sense _except_ for
the known-broken case. Oh well. ]
Fixes: 3c8ba0d61d ("kernel.h: Retain constant expression output for max()/min()")
Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ARM SA1100 updates from Russell King:
"We have support for arbitary MMIO registers providing platform GPIOs,
which allows us to abstract some of the SA11x0 CF support.
This set of updates makes that change"
* 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: sa1100/simpad: switch simpad CF to use gpiod APIs
ARM: sa1100/shannon: convert to generic CF sockets
ARM: sa1100/nanoengine: convert to generic CF sockets
ARM: sa1100/h3xxx: switch h3xxx PCMCIA to use gpiod APIs
ARM: sa1100/cerf: convert to generic CF sockets
ARM: sa1100/assabet: convert to generic CF sockets
ARM: sa1100: provide infrastructure to support generic CF sockets
pcmcia: sa1100: provide generic CF support