Pull scheduler updates from Ingo Molnar:
- massive CPU hotplug rework (Thomas Gleixner)
- improve migration fairness (Peter Zijlstra)
- CPU load calculation updates/cleanups (Yuyang Du)
- cpufreq updates (Steve Muckle)
- nohz optimizations (Frederic Weisbecker)
- switch_mm() micro-optimization on x86 (Andy Lutomirski)
- ... lots of other enhancements, fixes and cleanups.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
ARM: Hide finish_arch_post_lock_switch() from modules
sched/core: Provide a tsk_nr_cpus_allowed() helper
sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
sched/fair: Correct unit of load_above_capacity
sched/fair: Clean up scale confusion
sched/nohz: Fix affine unpinned timers mess
sched/fair: Fix fairness issue on migration
sched/core: Kill sched_class::task_waking to clean up the migration logic
sched/fair: Prepare to fix fairness problems on migration
sched/fair: Move record_wakee()
sched/core: Fix comment typo in wake_q_add()
sched/core: Remove unused variable
sched: Make hrtick_notifier an explicit call
sched/fair: Make ilb_notifier an explicit call
sched/hotplug: Make activate() the last hotplug step
sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
sched/migration: Move CPU_ONLINE into scheduler state
sched/migration: Move calc_load_migrate() into CPU_DYING
sched/migration: Move prepare transition to SCHED_STARTING state
...
Pull support for killable rwsems from Ingo Molnar:
"This, by Michal Hocko, implements down_write_killable().
The main usecase will be to update mm_sem usage sites to use this new
API, to allow the mm-reaper introduced in commit aac4536355 ("mm,
oom: introduce oom reaper") to tear down oom victim address spaces
asynchronously with minimum latencies and without deadlock worries"
[ The vfs will want it too as the inode lock is changed from a mutex to
a rwsem due to the parallel lookup and readdir updates ]
* 'locking-rwsem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix comment on register clobbering
locking/rwsem: Fix down_write_killable()
locking/rwsem, x86: Add frame annotation for call_rwsem_down_write_failed_killable()
locking/rwsem: Provide down_write_killable()
locking/rwsem, x86: Provide __down_write_killable()
locking/rwsem, s390: Provide __down_write_killable()
locking/rwsem, ia64: Provide __down_write_killable()
locking/rwsem, alpha: Provide __down_write_killable()
locking/rwsem: Introduce basis for down_write_killable()
locking/rwsem, sparc: Drop superfluous arch specific implementation
locking/rwsem, sh: Drop superfluous arch specific implementation
locking/rwsem, xtensa: Drop superfluous arch specific implementation
locking/rwsem: Drop explicit memory barriers
locking/rwsem: Get rid of __down_write_nested()
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.
This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.
No objsize difference for !LOCKDEP kernels.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
lock_chain::base is used to store an index into the chain_hlocks[]
array, however that array contains more elements than can be indexed
using the u16.
Change the lock_chain structure to use a bitfield to encode the data
it needs and add BUILD_BUG_ON() assertions to check the fields are
wide enough.
Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of
that array; as that would wreck the collision detectoring.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all the architectures implement the necessary glue code
we can introduce down_write_killable(). The only difference wrt. regular
down_write() is that the slow path waits in TASK_KILLABLE state and the
interruption by the fatal signal is reported as -EINTR to the caller.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1460041951-22347-12-git-send-email-mhocko@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep is initialized at compile time now. Get rid of lockdep_init().
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Krinkin <krinkin.m.u@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: mm-commits@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike said:
: CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
: kernel with CONFIG_UBSAN_ALIGNMENT=y fails to load without even any error
: message.
:
: The problem is that ubsan callbacks use spinlocks and might be called
: before lockdep is initialized. Particularly this line in the
: reserve_ebda_region function causes problem:
:
: lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
:
: If i put lockdep_init() before reserve_ebda_region call in
: x86_64_start_reservations kernel loads well.
Fix this ordering issue permanently: change lockdep so that it uses hlists
for the hash tables. Unlike a list_head, an hlist_head is in its
initialized state when it is all-zeroes, so lockdep is ready for operation
immediately upon boot - lockdep_init() need not have run.
The patch will also save some memory.
Probably lockdep_init() and lockdep_initialized can be done away with now.
Suggested-by: Mike Krinkin <krinkin.m.u@gmail.com>
Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: mm-commits@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler updates from Thomas Gleixner:
"This series of scheduler updates depends on sched/core and timers/core
branches, which are already in your tree:
- Scheduler balancing overhaul to plug a hard to trigger race which
causes an oops in the balancer (Peter Zijlstra)
- Lockdep updates which are related to the balancing updates (Peter
Zijlstra)"
* 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched,lockdep: Employ lock pinning
lockdep: Implement lock pinning
lockdep: Simplify lock_release()
sched: Streamline the task migration locking a little
sched: Move code around
sched,dl: Fix sched class hopping CBS hole
sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
sched,dl: Remove return value from pull_dl_task()
sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
sched,rt: Remove return value from pull_rt_task()
sched: Allow balance callbacks for check_class_changed()
sched: Use replace normalize_task() with __sched_setscheduler()
sched: Replace post_schedule with a balance callback list
An apparent oversight left a hardcoded '4' in place when
LOCKSTAT_POINTS was introduced.
The contention_point[] and contending_point[] arrays in the
structs lock_class and lock_class_stats need to be the same
size for the loops in lock_stats() to be correct.
This patch allows LOCKSTAT_POINTS to be changed without
affecting the correctness of the code.
Signed-off-by: George Beshers <gbeshers@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a lockdep annotation that WARNs if you 'accidentially' unlock a
lock.
This is especially helpful for code with callbacks, where the upper
layer assumes a lock remains taken but a lower layer thinks it maybe
can drop and reacquire the lock.
By unwittingly breaking up the lock, races can be introduced.
Lock pinning is a lockdep annotation that helps with this, when you
lockdep_pin_lock() a held lock, any unlock without a
lockdep_unpin_lock() will produce a WARN. Think of this as a relative
of lockdep_assert_held(), except you don't only assert its held now,
but ensure it stays held until you release your assertion.
RFC: a possible alternative API would be something like:
int cookie = lockdep_pin_lock(&foo);
...
lockdep_unpin_lock(&foo, cookie);
Where we pick a random number for the pin_count; this makes it
impossible to sneak a lock break in without also passing the right
cookie along.
I've not done this because it ends up generating code for !LOCKDEP,
esp. if you need to pass the cookie around for some reason.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124743.906731065@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If an RCU read-side critical section occurs within an interrupt handler
or a softirq handler, it cannot have been preempted. Therefore, there is
a check in rcu_read_unlock_special() checking for this error. However,
when this check triggers, it lacks diagnostic information. This commit
therefore moves rcu_read_unlock()'s lockdep annotation to follow the
call to __rcu_read_unlock() and changes rcu_read_unlock_special()'s
WARN_ON_ONCE() to an lockdep_rcu_suspicious() in order to locate where
the offending RCU read-side critical section began. In addition, the
value of the ->rcu_read_unlock_special field is printed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull core locking updates from Ingo Molnar:
"The main updates in this cycle were:
- mutex MCS refactoring finishing touches: improve comments, refactor
and clean up code, reduce debug data structure footprint, etc.
- qrwlock finishing touches: remove old code, self-test updates.
- small rwsem optimization
- various smaller fixes/cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Revert qrwlock recusive stuff
locking/rwsem: Avoid double checking before try acquiring write lock
locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
locking/semaphore: Resolve some shadow warnings
locking/selftest: Support queued rwlock
locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
locking/Documentation: Update locking/mutex-design.txt disadvantages
locking/Documentation: Move locking related docs into Documentation/locking/
locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
locking/mutexes: Refactor optimistic spinning code
locking/mcs: Remove obsolete comment
locking/mutexes: Document quick lock release when unlocking
locking/mutexes: Standardize arguments in lock/unlock slowpaths
locking: Remove deprecated smp_mb__() barriers
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- changes related to No-CBs CPUs and NO_HZ_FULL
- RCU-tasks implementation
- torture-test updates
- miscellaneous fixes
- locktorture updates
- RCU documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
workqueue: Use cond_resched_rcu_qs macro
workqueue: Add quiescent state between work items
locktorture: Cleanup header usage
locktorture: Cannot hold read and write lock
locktorture: Fix __acquire annotation for spinlock irq
locktorture: Support rwlocks
rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
locktorture: Document boot/module parameters
rcutorture: Rename rcutorture_runnable parameter
locktorture: Add test scenario for rwsem_lock
locktorture: Add test scenario for mutex_lock
locktorture: Make torture scripting account for new _runnable name
locktorture: Introduce torture context
locktorture: Support rwsems
locktorture: Add infrastructure for torturing read locks
torture: Address race in module cleanup
locktorture: Make statistics generic
locktorture: Teach about lock debugging
locktorture: Support mutexes
locktorture: Add documentation
...
Commit f0bab73cb5 ("locking/lockdep: Restrict the use of recursive
read_lock() with qrwlock") changed lockdep to try and conform to the
qrwlock semantics which differ from the traditional rwlock semantics.
In particular qrwlock is fair outside of interrupt context, but in
interrupt context readers will ignore all fairness.
The problem modeling this is that read and write side have different
lock state (interrupts) semantics but we only have a single
representation of these. Therefore lockdep will get confused, thinking
the lock can cause interrupt lock inversions.
So revert it for now; the old rwlock semantics were already imperfectly
modeled and the qrwlock extra won't fit either.
If we want to properly fix this, I think we need to resurrect the work
by Gautham did a few years ago that split the read and write state of
locks:
http://lwn.net/Articles/332801/
FWIW the locking selftest that would've failed (and was reported by
Borislav earlier) is something like:
RL(X1); /* IRQ-ON */
LOCK(A);
UNLOCK(A);
RU(X1);
IRQ_ENTER();
RL(X1); /* IN-IRQ */
RU(X1);
IRQ_EXIT();
At which point it would report that because A is an IRQ-unsafe lock we
can suffer the following inversion:
CPU0 CPU1
lock(A)
lock(X1)
lock(A)
<IRQ>
lock(X1)
And this is 'wrong' because X1 can recurse (assuming the above lock are
in fact read-lock) but lockdep doesn't know about this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: ego@linux.vnet.ibm.com
Cc: bp@alien8.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20140930132600.GA7444@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An interface may need to assert a lock invariant and not flood the
system logs; add a lockdep helper macro equivalent to
lockdep_assert_held() which only WARNs once.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, the expedited grace-period primitives do get_online_cpus().
This greatly simplifies their implementation, but means that calls
to them holding locks that are acquired by CPU-hotplug notifiers (to
say nothing of calls to these primitives from CPU-hotplug notifiers)
can deadlock. But this is starting to become inconvenient, as can be
seen here: https://lkml.org/lkml/2014/8/5/754. The problem in this
case is that some developers need to acquire a mutex from a CPU-hotplug
notifier, but also need to hold it across a synchronize_rcu_expedited().
As noted above, this currently results in deadlock.
This commit avoids the deadlock and retains the simplicity by creating
a try_get_online_cpus(), which returns false if the get_online_cpus()
reference count could not immediately be incremented. If a call to
try_get_online_cpus() returns true, the expedited primitives operate as
before. If a call returns false, the expedited primitives fall back to
normal grace-period operations. This falling back of course results in
increased grace-period latency, but only during times when CPU hotplug
operations are actually in flight. The effect should therefore be
negligible during normal operation.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Tested-by: Lan Tianyu <tianyu.lan@intel.com>
Unlike the original unfair rwlock implementation, queued rwlock
will grant lock according to the chronological sequence of the lock
requests except when the lock requester is in the interrupt context.
Consequently, recursive read_lock calls will now hang the process if
there is a write_lock call somewhere in between the read_lock calls.
This patch updates the lockdep implementation to look for recursive
read_lock calls. A new read state (3) is used to mark those read_lock
call that cannot be recursively called except in the interrupt
context. The new read state does exhaust the 2 bits available in
held_lock:read bit field. The addition of any new read state in the
future may require a redesign of how all those bits are squeezed
together in the held_lock structure.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1407345722-61615-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 LTO changes from Peter Anvin:
"More infrastructure work in preparation for link-time optimization
(LTO). Most of these changes is to make sure symbols accessed from
assembly code are properly marked as visible so the linker doesn't
remove them.
My understanding is that the changes to support LTO are still not
upstream in binutils, but are on the way there. This patchset should
conclude the x86-specific changes, and remaining patches to actually
enable LTO will be fed through the Kbuild tree (other than keeping up
with changes to the x86 code base, of course), although not
necessarily in this merge window"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
Kbuild, lto: Handle basic LTO in modpost
Kbuild, lto: Disable LTO for asm-offsets.c
Kbuild, lto: Add a gcc-ld script to let run gcc as ld
Kbuild, lto: add ld-version and ld-ifversion macros
Kbuild, lto: Drop .number postfixes in modpost
Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
lto: Disable LTO for sys_ni
lto: Handle LTO common symbols in module loader
lto, workaround: Add workaround for initcall reordering
lto: Make asmlinkage __visible
x86, lto: Disable LTO for the x86 VDSO
initconst, x86: Fix initconst mistake in ts5500 code
initconst: Fix initconst mistake in dcdbas
asmlinkage: Make trace_hardirqs_on/off_caller visible
asmlinkage, x86: Fix 32bit memcpy for LTO
asmlinkage Make __stack_chk_failed and memcmp visible
asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
asmlinkage: Make main_extable_sort_needed visible
asmlinkage, mutex: Mark __visible
asmlinkage: Make trace_hardirq visible
...
lockdep_sys_exit can be called from assembler code, so make it
asmlinkage.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-5-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cosmetic. This doesn't really matter because a) device->mutex is
the only user of __lockdep_no_validate__ and b) this class should
be never reported as the source of problem, but if something goes
wrong "&dev->mutex" looks better than "&__lockdep_no_validate__"
as the name of the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140120182016.GA26512@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "int check" argument of lock_acquire() and held_lock->check are
misleading. This is actually a boolean: 2 means "true", everything
else is "false".
And there is no need to pass 1 or 0 to lock_acquire() depending on
CONFIG_PROVE_LOCKING, __lock_acquire() checks prove_locking at the
start and clears "check" if !CONFIG_PROVE_LOCKING.
Note: probably we can simply kill this member/arg. The only explicit
user of check => 0 is rcu_lock_acquire(), perhaps we can change it to
use lock_acquire(trylock =>, read => 2). __lockdep_no_validate means
check => 0 implicitly, but we can change validate_chain() to check
hlock->instance->key instead. Not to mention it would be nice to get
rid of lockdep_set_novalidate_class().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140120182006.GA26495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently seqlocks and seqcounts don't support lockdep.
After running across a seqcount related deadlock in the timekeeping
code, I used a less-refined and more focused variant of this patch
to narrow down the cause of the issue.
This is a first-pass attempt to properly enable lockdep functionality
on seqlocks and seqcounts.
Since seqcounts are used in the vdso gettimeofday code, I've provided
non-lockdep accessors for those needs.
I've also handled one case where there were nested seqlock writers
and there may be more edge cases.
Comments and feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In lockdep.h, the spinlock/mutex/rwsem/rwlock/lock_map acquire macros have
different definitions based on the value of CONFIG_PROVE_LOCKING. We have
separate ifdefs for each of these definitions, which seems redundant.
Introduce lock_acquire_{exclusive,shared,shared_recursive} helpers which
will have different definitions based on CONFIG_PROVE_LOCKING. Then all
other helper macros can be defined based on the above ones, which reduces
the amount of ifdefined code.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130708212350.6DD1931C15E@corp2gmr1-1.hot.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
I recently made the mistake of writing:
foo = lockdep_dereference_protected(..., lockdep_assert_held(...));
which is clearly bogus. If lockdep is disabled in the config this would
cause a compile failure, if it is enabled then it compiles and causes a
puzzling warning about dereferencing without the correct protection.
Wrap the macro in "do { ... } while (0)" to also fail compile for this
when lockdep is enabled.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit c9a4962881 ("nfsd:
make client_lock per net") compiling nfs4state.o without
CONFIG_LOCKDEP set, triggers this GCC warning:
fs/nfsd/nfs4state.c: In function ‘free_client’:
fs/nfsd/nfs4state.c:1051:19: warning: unused variable ‘nn’ [-Wunused-variable]
The cause of that warning is that lockdep_assert_held() compiles
away if CONFIG_LOCKDEP is not set. Silence this warning by using
the argument to lockdep_assert_held() as a nop if CONFIG_LOCKDEP
is not set.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Link: http://lkml.kernel.org/r/1359060797.1325.33.camel@x61.thuisdomein
Signed-off-by: Ingo Molnar <mingo@kernel.org>
--
include/linux/lockdep.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
down_write_nest_lock() provides a means to annotate locking scenario
where an outer lock is guaranteed to serialize the order nested locks
are being acquired.
This is analogoue to already existing mutex_lock_nest_lock() and
spin_lock_nest_lock().
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.
Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.
Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().
* Patch orginally from Peter. Hugh modified it a bit and wrote the
description.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
zap_locks() is used by printk() in a last ditch effort to get data
out, clearly we cannot trust lock state after this so make it disable
lock debugging.
Also don't treat printk recursion through lockdep as a normal
recursion bug but try hard to get the lockdep splat out.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kqxwmo4xz37e1s8w0xopvr0q@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Long ago, using TREE_RCU with PREEMPT would result in "scheduling
while atomic" diagnostics if you blocked in an RCU read-side critical
section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
this diagnostic. This commit therefore adds a replacement diagnostic
based on PROVE_RCU.
Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
used for things that have nothing to do with rcu_dereference(), rename
lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
argument that is a string indicating what is suspicious. This third
argument is passed in from a new third argument to rcu_lockdep_assert().
Update all calls to rcu_lockdep_assert() to add an informative third
argument.
Also, add a pair of rcu_lockdep_assert() calls from within
rcu_note_context_switch(), one complaining if a context switch occurs
in an RCU-bh read-side critical section and another complaining if a
context switch occurs in an RCU-sched read-side critical section.
These are present only if the PROVE_RCU kernel parameter is enabled.
Finally, fix some checkpatch whitespace complaints in lockdep.c.
Again, you must enable PROVE_RCU to see these new diagnostics. But you
are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In order to convert i_mmap_lock to a mutex we need a mutex equivalent to
spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation.
As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of
the locking pattern where an outer lock serializes the acquisition order
of nested locks. That is, if every time you lock multiple locks A, say A1
and A2 you first acquire N, the order of acquiring A1 and A2 is
irrelevant.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop
workqueue: relax lockdep annotation on flush_work()
During early boot, local IRQ is disabled until IRQ subsystem is
properly initialized. During this time, no one should enable
local IRQ and some operations which usually are not allowed with
IRQ disabled, e.g. operations which might sleep or require
communications with other processors, are allowed.
lockdep tracked this with early_boot_irqs_off/on() callbacks.
As other subsystems need this information too, move it to
init/main.c and make it generally available. While at it,
toggle the boolean to early_boot_irqs_disabled instead of
enabled so that it can be initialized with %false and %true
indicates the exceptional condition.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, the lockdep annotation in flush_work() requires exclusive
access on the workqueue the target work is queued on and triggers
warning if a work is trying to flush another work on the same
workqueue; however, this is no longer true as workqueues can now
execute multiple works concurrently.
This patch adds lock_map_acquire_read() and make process_one_work()
hold read access to the workqueue while executing a work and
start_flush_work() check for write access if concurrnecy level is one
or the workqueue has a rescuer (as only one execution resource - the
rescuer - is guaranteed to be available under memory pressure), and
read access if higher.
This better represents what's going on and removes spurious lockdep
warnings which are triggered by fake dependency chain created through
flush_work().
* Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
wq breaks forward progress guarantee under memory pressure.
Condition check accordingly updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (96 commits)
apic, x86: Use BIOS settings for IBS and MCE threshold interrupt LVT offsets
apic, x86: Check if EILVT APIC registers are available (AMD only)
x86: ioapic: Call free_irte only if interrupt remapping enabled
arm: Use ARCH_IRQ_INIT_FLAGS
genirq, ARM: Fix boot on ARM platforms
genirq: Fix CONFIG_GENIRQ_NO_DEPRECATED=y build
x86: Switch sparse_irq allocations to GFP_KERNEL
genirq: Switch sparse_irq allocator to GFP_KERNEL
genirq: Make sparse_lock a mutex
x86: lguest: Use new irq allocator
genirq: Remove the now unused sparse irq leftovers
genirq: Sanitize dynamic irq handling
genirq: Remove arch_init_chip_data()
x86: xen: Sanitise sparse_irq handling
x86: Use sane enumeration
x86: uv: Clean up the direct access to irq_desc
x86: Make io_apic.c local functions static
genirq: Remove irq_2_iommu
x86: Speed up the irq_remapped check in hot pathes
intr_remap: Simplify the code further
...
Fix up trivial conflicts in arch/x86/Kconfig
Current lockdep_map only caches one class with subclass == 0,
and looks up hash table of classes when subclass != 0.
It seems that this has no problem because the case of
subclass != 0 is rare. But locks of struct rq are
acquired with subclass == 1 when task migration is executed.
Task migration is high frequent event, so I modified lockdep
to cache subclasses.
I measured the score of perf bench sched messaging.
This patch has slightly but certain (order of milli seconds
or 10 milli seconds) effect when lots of tasks are running.
I'll show the result in the tail of this description.
NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
cached in the instances of lockdep_map.
I discussed with Peter Zijlstra in LinuxCon Japan about
this approach and he taught me that caching every subclasses(8)
is cleary waste of memory. So number of cached classes
should be configurable.
=== Score comparison of benchmarks ===
# "min" means best score, and "max" means worst score
for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done
before: min: 0.565000, max: 0.583000, avg: 0.572500
after: min: 0.559000, max: 0.568000, avg: 0.563300
# with more processes
for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done
before: min: 2.274000, max: 2.298000, avg: 2.286300
after: min: 2.242000, max: 2.270000, avg: 2.259700
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286269311-28336-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
early_init_irq_lock_class() is called way before anything touches the
irq descriptors. In case of SPARSE_IRQ=y this is a NOP operation
because the radix tree is empty at this point. For the SPARSE_IRQ=n
case it's sufficient to set the lock class in early_init_irq().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The conversion of device->sem to device->mutex resulted in lockdep
warnings. Create a novalidate class for now until the driver folks
come up with separate classes. That way we have at least the basic
mutex debugging coverage.
Add a checkpatch error so the usage is reserved for device->mutex.
[ tglx: checkpatch and compile fix for LOCKDEP=n ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.
Move lockdep extern declarations to linux/lockdep.h
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We still can apply DaveM's generation count optimization to
BFS, based on the following idea:
- before doing each BFS, increase the global generation id
by 1
- if one node in the graph has been visited, mark it as
visited by storing the current global generation id into
the node's dep_gen_id field
- so we can decide if one node has been visited already, by
comparing the node's dep_gen_id with the global generation id.
By applying DaveM's generation count optimization to current
implementation of BFS, we gain the following advantages:
- we save MAX_LOCKDEP_ENTRIES/8 bytes memory;
- we remove the bitmap_zero(bfs_accessed, MAX_LOCKDEP_ENTRIES);
in each BFS, which is very time-consuming since
MAX_LOCKDEP_ENTRIES may be very large.(16384UL)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "David S. Miller" <davem@davemloft.net>
LKML-Reference: <1248274089-6358-1-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
spin_lock_nest_lock() allows to take many instances of the same
class, this can easily lead to overflow of MAX_LOCK_DEPTH.
To avoid this overflow, we'll stop accounting instances but
start reference counting the class in the held_lock structure.
[ We could maintain a list of instances, if we'd move the hlock
stuff into __lock_acquired(), but that would require
significant modifications to the current code. ]
We restrict this mode to spin_lock_nest_lock() only, because it
degrades the lockdep quality due to lost of instance.
For lockstat this means we don't track lock statistics for any
but the first lock in the series.
Currently nesting is limited to 11 bits because that was the
spare space available in held_lock. This yields a 2048
instances maximium.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a lockdep helper to validate that we indeed are the owner
of a lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some cleanups of the lockdep code after the BFS series:
- Remove the last traces of the generation id
- Fixup comment style
- Move the bfs routines into lockdep.c
- Cleanup the bfs routines
[ tom.leiming@gmail.com: Fix crash ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-11-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently lockdep will print the 1st circle detected if it
exists when acquiring a new (next) lock.
This patch prints the shortest path from the next lock to be
acquired to the previous held lock if a circle is found.
The patch still uses the current method to check circle, and
once the circle is found, breadth-first search algorithem is
used to compute the shortest path from the next lock to the
previous lock in the forward lock dependency graph.
Printing the shortest path will shorten the dependency chain,
and make troubleshooting for possible circular locking easier.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-2-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some filesystems need to set lockdep map for i_mutex differently for
different directories. For example OCFS2 has system directories (for
orphan inode tracking and for gathering all system files like journal
or quota files into a single place) which have different locking
locking rules than standard directories. For a filesystem setting
lockdep map is naturaly done when the inode is read but we have to
modify unlock_new_inode() not to overwrite the lockdep map the filesystem
has set.
Acked-by: peterz@infradead.org
CC: mingo@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
SGI has observed that on large systems, interrupts are not serviced for a
long period of time when waiting for a rwlock. The following patch series
re-enables irqs while waiting for the lock, resembling the code which is
already there for spinlocks.
I only made the ia64 version, because the patch adds some overhead to the
fast path. I assume there is currently no demand to have this for other
architectures, because the systems are not so large. Of course, the
possibility to implement raw_{read|write}_lock_flags for any architecture
is still there.
This patch:
The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation
depending on the config options, so that IRQ's are re-enabled when
possible, but they remain disabled if CONFIG_LOCKDEP is set.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>