[ Upstream commit 5a817641f6 ]
The filesystem freezing code needs to transfer ownership of a rwsem
embedded in a percpu-rwsem from the task that does the freezing to
another one that does the thawing by calling percpu_rwsem_release()
after freezing and percpu_rwsem_acquire() before thawing.
However, the new rwsem debug code runs afoul with this scheme by warning
that the task that releases the rwsem isn't the one that acquires it,
as reported by Amir Goldstein:
DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
Call Trace:
percpu_up_write+0x1f/0x28
thaw_super_locked+0xdf/0x120
do_vfs_ioctl+0x270/0x5f1
ksys_ioctl+0x52/0x71
__x64_sys_ioctl+0x16/0x19
do_syscall_64+0x5d/0x167
entry_SYSCALL_64_after_hwframe+0x49/0xbe
To work properly with the rwsem debug code, we need to annotate that the
rwsem ownership is unknown during the tranfer period until a brave soul
comes forward to acquire the ownership. During that period, optimistic
spinning will be disabled.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Theodore Y. Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d7d760efad ]
There are use cases where a rwsem can be acquired by one task, but
released by another task. In thess cases, optimistic spinning may need
to be disabled. One example will be the filesystem freeze/thaw code
where the task that freezes the filesystem will acquire a write lock
on a rwsem and then un-owns it before returning to userspace. Later on,
another task will come along, acquire the ownership, thaw the filesystem
and release the rwsem.
Bit 0 of the owner field was used to designate that it is a reader
owned rwsem. It is now repurposed to mean that the owner of the rwsem
is not known. If only bit 0 is set, the rwsem is reader owned. If bit
0 and other bits are set, it is writer owned with an unknown owner.
One such value for the latter case is (-1L). So we can set owner to 1 for
reader-owned, -1 for writer-owned. The owner is unknown in both cases.
To handle transfer of rwsem ownership, the higher level code should
set the owner field to -1 to indicate a write-locked rwsem with unknown
owner. Optimistic spinning will be disabled in this case.
Once the higher level code figures who the new owner is, it can then
set the owner field accordingly.
Tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Theodore Y. Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 11dc13224c ]
When queuing on the qspinlock, the count field for the current CPU's head
node is incremented. This needn't be atomic because locking in e.g. IRQ
context is balanced and so an IRQ will return with node->count as it
found it.
However, the compiler could in theory reorder the initialisation of
node[idx] before the increment of the head node->count, causing an
IRQ to overwrite the initialised node and potentially corrupt the lock
state.
Avoid the potential for this harmful compiler reordering by placing a
barrier() between the increment of the head node->count and the subsequent
node initialisation.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2ce77d16db ]
Things can explode for locktorture if the user does combinations
of nwriters_stress=0 nreaders_stress=0. Fix this by not assuming
we always want to torture writer threads.
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4950276672 upstream.
Patch series "kmemcheck: kill kmemcheck", v2.
As discussed at LSF/MM, kill kmemcheck.
KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow). KASan is already upstream.
We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).
The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.
Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.
This patch (of 4):
Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c1e2f0eaf0 upstream.
Julia reported futex state corruption in the following scenario:
waiter waker stealer (prio > waiter)
futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
timeout=[N ms])
futex_wait_requeue_pi()
futex_wait_queue_me()
freezable_schedule()
<scheduled out>
futex(LOCK_PI, uaddr2)
futex(CMP_REQUEUE_PI, uaddr,
uaddr2, 1, 0)
/* requeues waiter to uaddr2 */
futex(UNLOCK_PI, uaddr2)
wake_futex_pi()
cmp_futex_value_locked(uaddr2, waiter)
wake_up_q()
<woken by waker>
<hrtimer_wakeup() fires,
clears sleeper->task>
futex(LOCK_PI, uaddr2)
__rt_mutex_start_proxy_lock()
try_to_take_rt_mutex() /* steals lock */
rt_mutex_set_owner(lock, stealer)
<preempted>
<scheduled in>
rt_mutex_wait_proxy_lock()
__rt_mutex_slowlock()
try_to_take_rt_mutex() /* fails, lock held by stealer */
if (timeout && !timeout->task)
return -ETIMEDOUT;
fixup_owner()
/* lock wasn't acquired, so,
fixup_pi_state_owner skipped */
return -ETIMEDOUT;
/* At this point, we've returned -ETIMEDOUT to userspace, but the
* futex word shows waiter to be the owner, and the pi_mutex has
* stealer as the owner */
futex_lock(LOCK_PI, uaddr2)
-> bails with EDEADLK, futex word says we're owner.
And suggested that what commit:
73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")
removes from fixup_owner() looks to be just what is needed. And indeed
it is -- I completely missed that requeue_pi could also result in this
case. So we need to restore that, except that subsequent patches, like
commit:
16ffa12d74 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")
changed all the locking rules. Even without that, the sequence:
- if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
- locked = 1;
- goto out;
- }
- raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
- owner = rt_mutex_owner(&q->pi_state->pi_mutex);
- if (!owner)
- owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
- raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
- ret = fixup_pi_state_owner(uaddr, q, owner);
already suggests there were races; otherwise we'd never have to look
at next_owner.
So instead of doing 3 consecutive wait_lock sections with who knows
what races, we do it all in a single section. Additionally, the usage
of pi_state->owner in fixup_owner() was only safe because only the
rt_mutex owner would modify it, which this additional case wrecks.
Luckily the values can only change away and not to the value we're
testing, this means we can do a speculative test and double check once
we have the wait_lock.
Fixes: 73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")
Reported-by: Julia Cartwright <julia@ni.com>
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Julia Cartwright <julia@ni.com>
Tested-by: Gratian Crisan <gratian.crisan@ni.com>
Cc: Darren Hart <dvhart@infradead.org>
Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is some complication between check_prevs_add() and
check_prev_add() wrt. saving stack traces. The problem is that we want
to be frugal with saving stack traces, since it consumes static
resources.
We'll only know in check_prev_add() if we need the trace, but we can
call into it multiple times. So we want to do on-demand and re-use.
A further complication is that check_prev_add() can drop graph_lock
and mess with our static resources.
In any case, the current state; after commit:
ce07a9415f ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
is that we'll assume the trace contains valid data once
check_prev_add() returns '2'. However, as noted by Josh, this is
false, check_prev_add() can return '2' before having saved a trace,
this then result in the possibility of using uninitialized data.
Testing, as reported by Wu, shows a NULL deref.
So simplify.
Since the graph_lock() thing is a debug path that hasn't
really been used in a long while, take it out back and avoid the
head-ache.
Further initialize the stack_trace to a known 'empty' state; as long
as nr_entries == 0, nothing should deref entries. We can then use the
'entries == NULL' test for a valid trace / on-demand saving.
Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ce07a9415f ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a spinner is present, there is a chance that the load of
rwsem_has_spinner() in rwsem_wake() can be reordered with
respect to decrement of rwsem count in __up_write() leading
to wakeup being missed:
spinning writer up_write caller
--------------- -----------------------
[S] osq_unlock() [L] osq
spin_lock(wait_lock)
sem->count=0xFFFFFFFF00000001
+0xFFFFFFFF00000000
count=sem->count
MB
sem->count=0xFFFFFFFE00000001
-0xFFFFFFFF00000001
spin_trylock(wait_lock)
return
rwsem_try_write_lock(count)
spin_unlock(wait_lock)
schedule()
Reordering of atomic_long_sub_return_release() in __up_write()
and rwsem_has_spinner() in rwsem_wake() can cause missing of
wakeup in up_write() context. In spinning writer, sem->count
and local variable count is 0XFFFFFFFE00000001. It would result
in rwsem_try_write_lock() failing to acquire rwsem and spinning
writer going to sleep in rwsem_down_write_failed().
The smp_rmb() will make sure that the spinner state is
consulted after sem->count is updated in up_write context.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: longman@redhat.com
Cc: parri.andrea@gmail.com
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation. As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.
The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory. So
this is rather misleading and hard to evaluate for any benefits.
I have checked some random users and none of them has added the flag
with a specific justification. I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring. This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.
I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL. Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.
I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.
This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
seems to be a heuristic without any measured advantage for most (if not
all) its current users. The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers. So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.
[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull locking updates from Ingo Molnar:
- Add 'cross-release' support to lockdep, which allows APIs like
completions, where it's not the 'owner' who releases the lock, to be
tracked. It's all activated automatically under
CONFIG_PROVE_LOCKING=y.
- Clean up (restructure) the x86 atomics op implementation to be more
readable, in preparation of KASAN annotations. (Dmitry Vyukov)
- Fix static keys (Paolo Bonzini)
- Add killable versions of down_read() et al (Kirill Tkhai)
- Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)
- Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)
- Remove smp_mb__before_spinlock() and convert its usages, introduce
smp_mb__after_spinlock() (Peter Zijlstra)
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
locking/lockdep/selftests: Fix mixed read-write ABBA tests
sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
smp: Avoid using two cache lines for struct call_single_data
locking/lockdep: Untangle xhlock history save/restore from task independence
locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
futex: Remove duplicated code and fix undefined behaviour
Documentation/locking/atomic: Finish the document...
locking/lockdep: Fix workqueue crossrelease annotation
workqueue/lockdep: 'Fix' flush_work() annotation
locking/lockdep/selftests: Add mixed read-write ABBA tests
mm, locking/barriers: Clarify tlb_flush_pending() barriers
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
locking/lockdep: Explicitly initialize wq_barrier::done::map
locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
locking/refcounts, x86/asm: Implement fast refcount overflow protection
locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
...
All the locking related cmpxchg's in the following functions are
replaced with the _acquire variants:
- pv_queued_spin_steal_lock()
- trylock_clear_pending()
This change should help performance on architectures that use LL/SC.
The cmpxchg in pv_kick_node() is replaced with a relaxed version
with explicit memory barrier to make sure that it is fully ordered
in the writing of next->lock and the reading of pn->state whether
the cmpxchg is a success or failure without affecting performance in
non-LL/SC architectures.
On a 2-socket 12-core 96-thread Power8 system with pvqspinlock
explicitly enabled, the performance of a locking microbenchmark
with and without this patch on a 4.13-rc4 kernel with Xinhui's PPC
qspinlock patch were as follows:
# of thread w/o patch with patch % Change
----------- --------- ---------- --------
8 5054.8 Mop/s 5209.4 Mop/s +3.1%
16 3985.0 Mop/s 4015.0 Mop/s +0.8%
32 2378.2 Mop/s 2396.0 Mop/s +0.7%
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1502741222-24360-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Where XHLOCK_{SOFT,HARD} are save/restore points in the xhlocks[] to
ensure the temporal IRQ events don't interact with task state, the
XHLOCK_PROC is a fundament different beast that just happens to share
the interface.
The purpose of XHLOCK_PROC is to annotate independent execution inside
one task. For example workqueues, each work should appear to run in its
own 'pristine' 'task'.
Remove XHLOCK_PROC in favour of its own interface to avoid confusion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: kernel-team@lge.com
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The new completion/crossrelease annotations interact unfavourable with
the extant flush_work()/flush_workqueue() annotations.
The problem is that when a single work class does:
wait_for_completion(&C)
and
complete(&C)
in different executions, we'll build dependencies like:
lock_map_acquire(W)
complete_acquire(C)
and
lock_map_acquire(W)
complete_release(C)
which results in the dependency chain: W->C->W, which lockdep thinks
spells deadlock, even though there is no deadlock potential since
works are ran concurrently.
One possibility would be to change the work 'lock' to recursive-read,
but that would mean hitting a lockdep limitation on recursive locks.
Also, unconditinoally switching to recursive-read here would fail to
detect the actual deadlock on single-threaded workqueues, which do
have a problem with this.
For now, forcefully disregard these locks for crossrelease.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore removes spin_unlock_wait() and related
definitions from core code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
As Boqun Feng pointed out, current->hist_id should be aligned with the
latest valid xhlock->hist_id so that hist_id_save[] storing current->hist_id
can be comparable with xhlock->hist_id. Fix it.
Additionally, the condition for overwrite-detection should be the
opposite. Fix the code and the comments as well.
<- direction to visit
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh (h: history)
^^ ^
|| start from here
|previous entry
current entry
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: linux-mm@kvack.org
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502694052-16085-3-git-send-email-byungchul.park@lge.com
[ Improve the comments some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No acquisition might be in progress on commit of a crosslock. Completion
operations enabling crossrelease are the case like:
CONTEXT X CONTEXT Y
--------- ---------
trigger completion context
complete AX
commit AX
wait_for_complete AX
acquire AX
wait
where AX is a crosslock.
When no acquisition is in progress, we should not perform commit because
the lock does not exist, which might cause incorrect memory access. So
we have to track the number of acquisitions of a crosslock to handle it.
Moreover, in case that more than one acquisition of a crosslock are
overlapped like:
CONTEXT W CONTEXT X CONTEXT Y CONTEXT Z
--------- --------- --------- ---------
acquire AX (gen_id: 1)
acquire A
acquire AX (gen_id: 10)
acquire B
commit AX
acquire C
commit AX
where A, B and C are typical locks and AX is a crosslock.
Current crossrelease code performs commits in Y and Z with gen_id = 10.
However, we can use gen_id = 1 to do it, since not only 'acquire AX in X'
but 'acquire AX in W' also depends on each acquisition in Y and Z until
their commits. So make it use gen_id = 1 instead of 10 on their commits,
which adds an additional dependency 'AX -> A' in the example above.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-8-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ring buffer can be overwritten by hardirq/softirq/work contexts.
That cases must be considered on rollback or commit. For example,
|<------ hist_lock ring buffer size ----->|
ppppppppppppiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
wrapped > iiiiiiiiiiiiiiiiiiiiiii....................
where 'p' represents an acquisition in process context,
'i' represents an acquisition in irq context.
On irq exit, crossrelease tries to rollback idx to original position,
but it should not because the entry already has been invalid by
overwriting 'i'. Avoid rollback or commit for entries overwritten.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-7-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep is a runtime locking correctness validator that detects and
reports a deadlock or its possibility by checking dependencies between
locks. It's useful since it does not report just an actual deadlock but
also the possibility of a deadlock that has not actually happened yet.
That enables problems to be fixed before they affect real systems.
However, this facility is only applicable to typical locks, such as
spinlocks and mutexes, which are normally released within the context in
which they were acquired. However, synchronization primitives like page
locks or completions, which are allowed to be released in any context,
also create dependencies and can cause a deadlock.
So lockdep should track these locks to do a better job. The 'crossrelease'
implementation makes these primitives also be tracked.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, a space for stack_trace is pinned in check_prev_add(), that
makes us not able to use external stack_trace. The simplest way to
achieve it is to pass an external stack_trace as an argument.
A more suitable solution is to pass a callback additionally along with
a stack_trace so that callers can decide the way to save or whether to
save. Actually crossrelease needs to do other than saving a stack_trace.
So pass a stack_trace and callback to handle it, to check_prev_add().
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-5-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Firstly, return 1 instead of 2 when 'prev -> next' dependency already
exists. Since the value 2 is not referenced anywhere, just return 1
indicating success in this case.
Secondly, return 2 instead of 1 when successfully added a lock_list
entry with saving stack_trace. With that, a caller can decide whether
to avoid redundant save_trace() on the caller site.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-4-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two boots + a make defconfig, the first didn't have the redundant bit
in, the second did:
lock-classes: 1168 1169 [max: 8191]
direct dependencies: 7688 5812 [max: 32768]
indirect dependencies: 25492 25937
all direct dependencies: 220113 217512
dependency chains: 9005 9008 [max: 65536]
dependency chain hlocks: 34450 34366 [max: 327680]
in-hardirq chains: 55 51
in-softirq chains: 371 378
in-process chains: 8579 8579
stack-trace entries: 108073 88474 [max: 524288]
combined max dependencies: 178738560 169094640
max locking depth: 15 15
max bfs queue depth: 320 329
cyclic checks: 9123 9190
redundant checks: 5046
redundant links: 1828
find-mask forwards checks: 2564 2599
find-mask backwards checks: 39521 39789
So it saves nearly 2k links and a fair chunk of stack-trace entries, but
as expected, makes no real difference on the indirect dependencies.
At the same time, you see the max BFS depth increase, which is also
expected, although it could easily be boot variance -- these numbers are
not entirely stable between boots.
The down side is that the cycles in the graph become larger and thus
the reports harder to read.
XXX: do we want this as a CONFIG variable, implied by LOCKDEP_SMALL?
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: iamjoonsoo.kim@lge.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Link: http://lkml.kernel.org/r/20170303091338.GH6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A while ago someone, and I cannot find the email just now, asked if we
could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
like we use for other things like workqueues etc. I think this should
be possible which allows reducing the 'irq' states and will reduce the
amount of __bfs() lookups we do.
Removing the 1 IRQ state results in 4 less __bfs() walks per
dependency, improving lockdep performance. And by moving this
annotation out of the lockdep code it becomes easier for the mm people
to extend.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: iamjoonsoo.kim@lge.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename __down_read() in __down_read_common() and teach it
to abort waiting in case of pending signals and killable
state argument passed.
Note, that we shouldn't wake anybody up in EINTR path, as:
We check for signal_pending_state() after (!waiter.task)
test and under spinlock. So, current task wasn't able to
be woken up. It may be in two cases: a writer is owner
of the sem, or a writer is a first waiter of the sem.
If a writer is owner of the sem, no one else may work
with it in parallel. It will wake somebody, when it
call up_write() or downgrade_write().
If a writer is the first waiter, it will be woken up,
when the last active reader releases the sem, and
sem->count became 0.
Also note, that set_current_state() may be moved down
to schedule() (after !waiter.task check), as all
assignments in this type of semaphore (including wake_up),
occur under spinlock, so we can't miss anything.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: avagin@virtuozzo.com
Cc: davem@davemloft.net
Cc: fenghua.yu@intel.com
Cc: gorcunov@virtuozzo.com
Cc: heiko.carstens@de.ibm.com
Cc: hpa@zytor.com
Cc: ink@jurassic.park.msu.ru
Cc: mattst88@gmail.com
Cc: rth@twiddle.net
Cc: schwidefsky@de.ibm.com
Cc: tony.luck@intel.com
Link: http://lkml.kernel.org/r/149789533283.9059.9829416940494747182.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix ordering of link creation between node->prev and prev->next in
osq_lock(). A case in which the status of optimistic spin queue is
CPU6->CPU2 in which CPU6 has acquired the lock.
tail
v
,-. <- ,-.
|6| |2|
`-' -> `-'
At this point if CPU0 comes in to acquire osq_lock, it will update the
tail count.
CPU2 CPU0
----------------------------------
tail
v
,-. <- ,-. ,-.
|6| |2| |0|
`-' -> `-' `-'
After tail count update if CPU2 starts to unqueue itself from
optimistic spin queue, it will find an updated tail count with CPU0 and
update CPU2 node->next to NULL in osq_wait_next().
unqueue-A
tail
v
,-. <- ,-. ,-.
|6| |2| |0|
`-' `-' `-'
unqueue-B
->tail != curr && !node->next
If reordering of following stores happen then prev->next where prev
being CPU2 would be updated to point to CPU0 node:
tail
v
,-. <- ,-. ,-.
|6| |2| |0|
`-' `-' -> `-'
osq_wait_next()
node->next <- 0
xchg(node->next, NULL)
tail
v
,-. <- ,-. ,-.
|6| |2| |0|
`-' `-' `-'
unqueue-C
At this point if next instruction
WRITE_ONCE(next->prev, prev);
in CPU2 path is committed before the update of CPU0 node->prev = prev then
CPU0 node->prev will point to CPU6 node.
tail
v----------. v
,-. <- ,-. ,-.
|6| |2| |0|
`-' `-' `-'
`----------^
At this point if CPU0 path's node->prev = prev is committed resulting
in change of CPU0 prev back to CPU2 node. CPU2 node->next is NULL
currently,
tail
v
,-. <- ,-. <- ,-.
|6| |2| |0|
`-' `-' `-'
`----------^
so if CPU0 gets into unqueue path of osq_lock it will keep spinning
in infinite loop as condition prev->next == node will never be true.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
[ Added pictures, rewrote comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sramana@codeaurora.org
Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This makes it possible to preserve basic futex support and compile out the
PI support when RT mutexes are not available.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1708010024190.5981@knanqh.ubzr
Pull locking fixlet from Ingo Molnar:
"Remove an unnecessary priority adjustment in the rtmutex code"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Remove unnecessary priority adjustment
We don't need to adjust priority before adding a new pi_waiter, the
priority only needs to be updated after pi_waiter change or task
priority change.
Steven Rostedt pointed out:
"Interesting, I did some git mining and this was added with the original
entry of the rtmutex.c (23f78d4a03). Looking at even that version, I
don't see the purpose of adjusting the task prio here. It is done
before anything changes in the task."
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1499926704-28841-1-git-send-email-alex.shi@linaro.org
[ Enhance the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull locking fixes from Thomas Gleixner:
- Fix the EINTR logic in rwsem-spinlock to avoid double locking by a
writer and a reader
- Add a missing include to qspinlocks
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/qspinlock: Explicitly include asm/prefetch.h
locking/rwsem-spinlock: Fix EINTR branch in __down_write_common()
Pull sparc updates from David Miller:
1) Queued spinlocks and rwlocks for sparc64, from Babu Moger.
2) Some const'ification from Arvind Yadav.
3) LDC/VIO driver infrastructure changes to facilitate future upcoming
drivers, from Jag Raman.
4) Initialize sched_clock() et al. early so that the initial printk
timestamps are all done while the implementation is available and
functioning. From Pavel Tatashin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: (38 commits)
sparc: kernel: pmc: make of_device_ids const.
sparc64: fix typo in property
sparc64: add port_id to VIO device metadata
sparc64: Enhance search for VIO device in MDESC
sparc64: enhance VIO device probing
sparc64: check if a client is allowed to register for MDESC notifications
sparc64: remove restriction on VIO device name size
sparc64: refactor code to obtain cfg_handle property from MDESC
sparc64: add MDESC node name property to VIO device metadata
sparc64: mdesc: use __GFP_REPEAT action modifier for VM allocation
sparc64: expand MDESC interface
sparc64: skip handshake for LDC channels in RAW mode
sparc64: specify the device class in VIO version info. packet
sparc64: ensure VIO operations are defined while being used
sparc: kernel: apc: make of_device_ids const
sparc/time: make of_device_ids const
sparc64: broken %tick frequency on spitfire cpus
sparc64: use prom interface to get %stick frequency
sparc64: optimize functions that access tick
sparc64: add hot-patched and inlined get_tick()
...
In architectures that use qspinlock, like x86, prefetch is loaded
indirectly via the asm/qspinlock.h include. On other architectures, like
OpenRISC, which may want to use asm-generic/qspinlock.h the built will
fail without the asm/prefetch.h include.
Fix this by including directly.
Signed-off-by: Stafford Horne <shorne@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170707195658.23840-1-shorne@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a writer could been woken up, the above branch
if (sem->count == 0)
break;
would have moved us to taking the sem. So, it's
not the time to wake a writer now, and only readers
are allowed now. Thus, 0 must be passed to __rwsem_do_wake().
Next, __rwsem_do_wake() wakes readers unconditionally.
But we mustn't do that if the sem is owned by writer
in the moment. Otherwise, writer and reader own the sem
the same time, which leads to memory corruption in
callers.
rwsem-xadd.c does not need that, as:
1) the similar check is made lockless there,
2) in __rwsem_mark_wake::try_reader_grant we test,
that sem is not owned by writer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Cassel <niklas.cassel@axis.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 17fcbd590d "locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y"
Link: http://lkml.kernel.org/r/149762063282.19811.9129615532201147826.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
around. Highlights include:
- Conversion of a bunch of security documentation into RST
- The conversion of the remaining DocBook templates by The Amazing
Mauro Machine. We can now drop the entire DocBook build chain.
- The usual collection of fixes and minor updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZWkGAAAoJEI3ONVYwIuV6rf0P/0B3JTiVPKS/WUx53+jzbAi4
1BN7dmmuMxE1bWpgdEq+ac4aKxm07iAojuntuMj0qz/ZB1WARcmvEqqzI5i4wfq9
5MrLduLkyuWfr4MOPseKJ2VK83p8nkMOiO7jmnBsilu7fE4nF+5YY9j4cVaArfMy
cCQvAGjQzvej2eiWMGUSLHn4QFKh00aD7cwKyBVsJ08b27C9xL0J2LQyCDZ4yDgf
37/MH3puEd3HX/4qAwLonIxT3xrIrrbDturqLU7OSKcWTtGZNrYyTFbwR3RQtqWd
H8YZVg2Uyhzg9MYhkbQ2E5dEjUP4mkegcp6/JTINH++OOPpTbdTJgirTx7VTkSf1
+kL8t7+Ayxd0FH3+77GJ5RMj8LUK6rj5cZfU5nClFQKWXP9UL3IelQ3Nl+SpdM8v
ZAbR2KjKgH9KS6+cbIhgFYlvY+JgPkOVruwbIAc7wXVM3ibk1sWoBOFEujcbueWh
yDpQv3l1UX0CKr3jnevJoW26LtEbGFtC7gSKZ+3btyeSBpWFGlii42KNycEGwUW0
ezlwryDVHzyTUiKllNmkdK4v73mvPsZHEjgmme4afKAIiUilmcUF4XcqD86hISFT
t+UJLA/zEU+0sJe26o2nK6GNJzmo4oCtVyxfhRe26Ojs1n80xlYgnZRfuIYdd31Z
nwLBnwDCHAOyX91WXp9G
=cVjZ
-----END PGP SIGNATURE-----
Merge tag 'docs-4.13' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"There has been a fair amount of activity in the docs tree this time
around. Highlights include:
- Conversion of a bunch of security documentation into RST
- The conversion of the remaining DocBook templates by The Amazing
Mauro Machine. We can now drop the entire DocBook build chain.
- The usual collection of fixes and minor updates"
* tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
scripts/kernel-doc: handle DECLARE_HASHTABLE
Documentation: atomic_ops.txt is core-api/atomic_ops.rst
Docs: clean up some DocBook loose ends
Make the main documentation title less Geocities
Docs: Use kernel-figure in vidioc-g-selection.rst
Docs: fix table problems in ras.rst
Docs: Fix breakage with Sphinx 1.5 and upper
Docs: Include the Latex "ifthen" package
doc/kokr/howto: Only send regression fixes after -rc1
docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
doc: Document suitability of IBM Verse for kernel development
Doc: fix a markup error in coding-style.rst
docs: driver-api: i2c: remove some outdated information
Documentation: DMA API: fix a typo in a function name
Docs: Insert missing space to separate link from text
doc/ko_KR/memory-barriers: Update control-dependencies example
Documentation, kbuild: fix typo "minimun" -> "minimum"
docs: Fix some formatting issues in request-key.rst
doc: ReSTify keys-trusted-encrypted.txt
doc: ReSTify keys-request-key.txt
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Add CONFIG_REFCOUNT_FULL=y to allow the disabling of the 'full'
(robustness checked) refcount_t implementation with slightly lower
runtime overhead. (Kees Cook)
The lighter weight variant is the default. The two variants use the
same API. Having this variant was a precondition by some
maintainers to merge refcount_t cleanups.
- Add lockdep support for rtmutexes (Peter Zijlstra)
- liblockdep fixes and improvements (Sasha Levin, Ben Hutchings)
- ... misc fixes and improvements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
locking/refcount: Remove the half-implemented refcount_sub() API
locking/refcount: Create unchecked atomic_t implementation
locking/rtmutex: Don't initialize lockdep when not required
locking/selftest: Add RT-mutex support
locking/selftest: Remove the bad unlock ordering test
rt_mutex: Add lockdep annotations
MAINTAINERS: Claim atomic*_t maintainership
locking/x86: Remove the unused atomic_inc_short() methd
tools/lib/lockdep: Remove private kernel headers
tools/lib/lockdep: Hide liblockdep output from test results
tools/lib/lockdep: Add dummy current_gfp_context()
tools/include: Add IS_ERR_OR_NULL to err.h
tools/lib/lockdep: Add empty __is_[module,kernel]_percpu_address
tools/lib/lockdep: Include err.h
tools/include: Add (mostly) empty include/linux/sched/mm.h
tools/lib/lockdep: Use LDFLAGS
tools/lib/lockdep: Remove double-quotes from soname
tools/lib/lockdep: Fix object file paths used in an out-of-tree build
tools/lib/lockdep: Fix compilation for 4.11
tools/lib/lockdep: Don't mix fd-based and stream IO
...
pi_mutex isn't supposed to be tracked by lockdep, but just
passing NULLs for name and key will cause lockdep to spew a
warning and die, which is not what we want it to do.
Skip lockdep initialization if the caller passed NULLs for
name and key, suggesting such initialization isn't desired.
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f5694788ad ("rt_mutex: Add lockdep annotations")
Link: http://lkml.kernel.org/r/20170618140548.4763-1-alexander.levin@verizon.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PROVE_RCU_REPEATEDLY Kconfig option was initially added due to
the volume of messages from PROVE_RCU: Doing just one per boot would
have required excessive numbers of boots to locate them all. However,
PROVE_RCU messages are now relatively rare, so there is no longer any
reason to need more than one such message per boot. This commit therefore
removes the PROVE_RCU_REPEATEDLY Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Commit a5dd63efda ("lockdep: Use "WARNING" tag on lockdep splats")
substituted pr_warn() for printk() in places called out by Dmitry Vyukov.
However, this resulted in an ugly mix of pr_warn() and printk(). This
commit therefore changes printk() to pr_warn() or pr_cont(), depending
on the absence or presence of KERN_CONT. This is done in all functions
that had printk() changed to pr_warn() by the aforementioned commit.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that (PI) futexes have their own private RT-mutex interface and
implementation we can easily add lockdep annotations to the existing
RT-mutex interface.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Saw these compile errors on SPARC when queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
kernel/locking/qrwlock.c: In function ‘queued_read_lock_slowpath’:
kernel/locking/qrwlock.c:89: error: implicit declaration of function ‘arch_spin_lock’
kernel/locking/qrwlock.c:102: error: implicit declaration of function ‘arch_spin_unlock’
make[4]: *** [kernel/locking/qrwlock.o] Error 1
Include spinlock.h in qrwlock.c to fix it.
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
commit:
cfafcd117d ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
The following trace shows the problem:
ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI
ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI
ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT
Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
which then returns with -ETIMEDOUT. That wrecks the lock state, because now
the owner isn't aware it acquired the lock and removes the pending robust
list entry.
If 2161 is killed, the robust list will not clear out this futex and the
subsequent acquire on this futex will then (correctly) result in -ESRCH
which is unexpected by glibc, triggers an internal assertion and dies.
Task 2161 Task 2165
rt_mutex_wait_proxy_lock()
timeout();
/* T2161 is still queued in the waiter list */
return -ETIMEDOUT;
futex_unlock_pi()
spin_lock(hb->lock);
rtmutex_unlock()
remove_rtmutex_waiter(T2161);
mark_lock_available();
/* Make the next waiter owner of the user space side */
futex_uval = 2161;
spin_unlock(hb->lock);
spin_lock(hb->lock);
rt_mutex_cleanup_proxy_lock()
if (rtmutex_owner() !== current)
...
return FAIL;
....
return -ETIMEOUT;
This means that rt_mutex_cleanup_proxy_lock() needs to call
try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
assigned by the waker. If the rtmutex is owned by some other task then this
call is harmless and just confirmes that the waiter is not able to acquire
it.
While there, fix what looks like a merge error which resulted in
rt_mutex_cleanup_proxy_lock() having two calls to
fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
Both should have one, since both potentially touch the waiter list.
Fixes: 38d589f2fd ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Bug-Spotted-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Mauro says:
This patch series convert the remaining DocBooks to ReST.
The first version was originally
send as 3 patch series:
[PATCH 00/36] Convert DocBook documents to ReST
[PATCH 0/5] Convert more books to ReST
[PATCH 00/13] Get rid of DocBook
The lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series gets merged.
It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.
I also added some patches here to add PDF output for all
existing ReST books.