Граф коммитов

348 Коммитов

Автор SHA1 Сообщение Дата
Linus Torvalds 3f4d9925e9 Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rcu fix and x86 irq fix from Ingo Molnar:

 - Fix a bug that caused an RCU warning splat.

 - Two x86 irq related fixes: a hotplug crash fix and an ACPI IRQ
   registry fix.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Clear need_qs flag to prevent splat

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable()
  x86/irq: Fix regression caused by commit b568b8601f
2015-02-21 10:36:06 -08:00
Tejun Heo ad853b48cb rcu: use %*pb[l] to print bitmaps including cpumasks and nodemasks
printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:37 -08:00
Paul E. McKenney c0135d07b0 rcu: Clear need_qs flag to prevent splat
If the scheduling-clock interrupt sets the current tasks need_qs flag,
but if the current CPU passes through a quiescent state in the meantime,
then rcu_preempt_qs() will fail to clear the need_qs flag, which can fool
RCU into thinking that additional rcu_read_unlock_special() processing
is needed.  This commit therefore clears the need_qs flag before checking
for additional processing.

For this problem to occur, we need rcu_preempt_data.passed_quiesce equal
to true and current->rcu_read_unlock_special.b.need_qs also equal to true.
This condition can occur as follows:

1.	CPU 0 is aware of the current preemptible RCU grace period,
	but has not yet passed through a quiescent state.  Among other
	things, this means that rcu_preempt_data.passed_quiesce is false.

2.	Task A running on CPU 0 enters a preemptible RCU read-side
	critical section.

3.	CPU 0 takes a scheduling-clock interrupt, which notices the
	RCU read-side critical section and the need for a quiescent state,
	and thus sets current->rcu_read_unlock_special.b.need_qs to true.

4.	Task A is preempted, enters the scheduler, eventually invoking
	rcu_preempt_note_context_switch() which in turn invokes
	rcu_preempt_qs().

	Because rcu_preempt_data.passed_quiesce is false,
	control enters the body of the "if" statement, which sets
	rcu_preempt_data.passed_quiesce to true.

5.	At this point, CPU 0 takes an interrupt.  The interrupt
	handler contains an RCU read-side critical section, and
	the rcu_read_unlock() notes that current->rcu_read_unlock_special
	is nonzero, and thus invokes rcu_read_unlock_special().

6.	Once in rcu_read_unlock_special(), the fact that
	current->rcu_read_unlock_special.b.need_qs is true becomes
	apparent, so rcu_read_unlock_special() invokes rcu_preempt_qs().
	Recursively, given that we interrupted out of that same
	function in the preceding step.

7.	Because rcu_preempt_data.passed_quiesce is now true,
	rcu_preempt_qs() does nothing, and simply returns.

8.	Upon return to rcu_read_unlock_special(), it is noted that
	current->rcu_read_unlock_special is still nonzero (because
	the interrupted rcu_preempt_qs() had not yet gotten around
	to clearing current->rcu_read_unlock_special.b.need_qs).

9.	Execution proceeds to the WARN_ON_ONCE(), which notes that
	we are in an interrupt handler and thus duly splats.

The solution, as noted above, is to make rcu_read_unlock_special()
clear out current->rcu_read_unlock_special.b.need_qs after calling
rcu_preempt_qs().  The interrupted rcu_preempt_qs() will clear it again,
but this is harmless.  The worst that happens is that we clobber another
attempt to set this field, but this is not a problem because we just
got done reporting a quiescent state.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix embarrassing build bug noted by Sasha Levin. ]
Tested-by: Sasha Levin <sasha.levin@oracle.com>
2015-02-11 15:46:43 -08:00
Paul E. McKenney 78e691f4ae Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD
doc.2015.01.07a: Documentation updates.
fixes.2015.01.15a: Miscellaneous fixes.
preempt.2015.01.06a: Changes to handling of lists of preempted tasks.
srcu.2015.01.06a: SRCU updates.
stall.2015.01.16a: RCU CPU stall-warning updates and fixes.
torture.2015.01.11a: RCU torture-test updates and fixes.
2015-01-15 23:34:34 -08:00
Paul E. McKenney a94844b22a rcu: Optionally run grace-period kthreads at real-time priority
Recent testing has shown that under heavy load, running RCU's grace-period
kthreads at real-time priority can improve performance (according to 0day
test robot) and reduce the incidence of RCU CPU stall warnings.  However,
most systems do just fine with the default non-realtime priorities for
these kthreads, and it does not make sense to expose the entire user
base to any risk stemming from this change, given that this change is
of use only to a few users running extremely heavy workloads.

Therefore, this commit allows users to specify realtime priorities
for the grace-period kthreads, but leaves them running SCHED_OTHER
by default.  The realtime priority may be specified at build time
via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
rcutree.kthread_prio parameter.  Either way, 0 says to continue the
default SCHED_OTHER behavior and values from 1-99 specify that priority
of SCHED_FIFO behavior.  Note that a value of 0 is not permitted when
the RCU_BOOST Kconfig parameter is specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15 23:25:04 -08:00
Paul E. McKenney 917963d0b3 rcutorture: Check from beginning to end of grace period
Currently, rcutorture's Reader Batch checks measure from the end of
the previous grace period to the end of the current one.  This commit
tightens up these checks by measuring from the start and end of the same
grace period.  This involves adding rcu_batches_started() and friends
corresponding to the existing rcu_batches_completed() and friends.

We leave SRCU alone for the moment, as it does not yet have a way of
tracking both ends of its grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-10 19:08:02 -08:00
Paul E. McKenney 9733e4f0a9 rcu: Make _batches_completed() functions return unsigned long
Long ago, the various ->completed fields were of type long, but now are
unsigned long due to signed-integer-overflow concerns.  However, the
various _batches_completed() functions remained of type long, even though
their only purpose in life is to return the corresponding ->completed
field.  This patch cleans this up by changing these functions' return
types to unsigned long.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-10 19:07:56 -08:00
Paul E. McKenney e3663b1024 rcu: Handle gpnum/completed wrap while dyntick idle
Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap.  For
example, consider the following sequence of events:

o	CPU 1 encounters a quiescent state while waiting for grace period
	5 to complete, but then enters dyntick-idle mode.

o	While CPU 1 is in dyntick-idle mode, the grace-period counters
	wrap around so that the grace period number is now 4.

o	Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
	and grace period 5 begins.

o	The quiescent state that CPU 1 passed through during the old
	grace period 5 looks like it applies to the new grace period
	5.  Therefore, the new grace period 5 completes without CPU 1
	having passed through a quiescent state.

This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time.  At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems.  However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible.  It is therefore time to
close this race.

This commit therefore avoids this race condition by having the
quiescent-state forcing code detect when a CPU is falling too far
behind, and setting a new rcu_data field ->gpwrap when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:05:28 -08:00
Paul E. McKenney fc908ed33e rcu: Make RCU_CPU_STALL_INFO include number of fqs attempts
One way that an RCU CPU stall warning can happen is if the grace-period
kthread is not allowed to execute.  One proxy for this kthread's
forward progress is the number of force-quiescent-state (fqs) scans.
This commit therefore adds the number of fqs scans to the RCU CPU stall
warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:05:25 -08:00
Lai Jiangshan abaf3f9d27 rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid priority-inversion
The patch dfeb9765ce ("Allow post-unlock reference for rt_mutex")
ensured rcu-boost safe even the rt_mutex has post-unlock reference.

But rt_mutex allowing post-unlock reference is definitely a bug and it was
fixed by the commit 27e35715df ("rtmutex: Plug slow unlock race").
This fix made the previous patch (dfeb9765ce) useless.

And even worse, the priority-inversion introduced by the the previous
patch still exists.

rcu_read_unlock_special() {
	rt_mutex_unlock(&rnp->boost_mtx);
	/* Priority-Inversion:
	 * the current task had been deboosted and preempted as a low
	 * priority task immediately, it could wait long before reschedule in,
	 * and the rcu-booster also waits on this low priority task and sleeps.
	 * This priority-inversion makes rcu-booster can't work
	 * as expected.
	 */
	complete(&rnp->boost_completion);
}

Just revert the patch to avoid it.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:52 -08:00
Paul E. McKenney 5d0b024973 rcu: Don't bother affinitying rcub kthreads away from offline CPUs
When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
rcu_node structure are now offline, it affinities the corresponding
RCU-boost ("rcub") kthread away from those CPUs.  This is pointless
because the kthread cannot run on those offline CPUs in any case.
This commit therefore removes this unneeded code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:50 -08:00
Paul E. McKenney 3e9f5c70d8 rcu: Don't spawn rcub kthreads on root rcu_node structure
Now that offlining CPUs no longer moves leaf rcu_node structures'
->blkd_tasks lists to the root, there is no way for the root rcu_node
structure's ->blkd_task list to be nonempty, unless the root node is also
the sole leaf node.  This commit therefore refrains from creating an rcub
kthread for the root rcu_node structure unless it is also the sole leaf.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:47 -08:00
Paul E. McKenney 96e92021d4 rcu: Make use of rcu_preempt_has_tasks()
Given that there is now arcu_preempt_has_tasks() function that checks
to see if the ->blkd_tasks list is non-empty, this commit makes use of it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:46 -08:00
Paul E. McKenney d19fb8d1f3 rcu: Don't migrate blocked tasks even if all corresponding CPUs offline
When the last CPU associated with a given leaf rcu_node structure
goes offline, something must be done about the tasks queued on that
rcu_node structure.  Each of these tasks has been preempted on one of
the leaf rcu_node structure's CPUs while in an RCU read-side critical
section that it have not yet exited.  Handling these tasks is the job of
rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
structure to the root rcu_node structure.

Unfortunately, this migration has to be done one task at a time because
each tasks allegiance must be shifted from the original leaf rcu_node to
the root, so that future attempts to deal with these tasks will acquire
the root rcu_node structure's ->lock rather than that of the leaf.
Worse yet, this migration must be done with interrupts disabled, which
is not so good for realtime response, especially given that there is
no bound on the number of tasks on a given rcu_node structure's list.
(OK, OK, there is a bound, it is just that it is unreasonably large,
especially on 64-bit systems.)  This was not considered a problem back
when rcu_preempt_offline_tasks() was first written because realtime
systems were assumed not to do CPU-hotplug operations while real-time
applications were running.  This assumption has proved of dubious validity
given that people are starting to run multiple realtime applications
on a single SMP system and that it is common practice to offline then
online a CPU before starting its real-time application in order to clear
extraneous processing off of that CPU.  So we now need CPU hotplug
operations to avoid undue latencies.

This commit therefore avoids migrating these tasks, instead letting
them be dequeued one by one from the original leaf rcu_node structure
by rcu_read_unlock_special().  This means that the clearing of bits
from the upper-level rcu_node structures must be deferred until the
last such task has been dequeued, because otherwise subsequent grace
periods won't wait on them.  This commit has the beneficial side effect
of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
CONFIG_RCU_BOOST builds.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:44 -08:00
Paul E. McKenney b6a932d1d9 rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearing
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
blkd_tasks list becomes empty.  This is the final commit in preparation
for the rework of RCU priority boosting:  It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:43 -08:00
Paul E. McKenney 8af3a5e78c rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting.  This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:41 -08:00
Paul E. McKenney 74e871ac6c rcu: Rename "empty" to "empty_norm" in preparation for boost rework
This commit undertakes a simple variable renaming to make way for
some rework of RCU priority boosting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:40 -08:00
Paul E. McKenney b08ea27d95 rcu: Protect rcu_boost() lockless accesses with ACCESS_ONCE()
This commit prevents random compiler optimizations by applying
ACCESS_ONCE() to lockless accesses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:39 -08:00
Paul E. McKenney 41050a0096 rcu: Fix rcu_barrier() race that could result in too-short wait
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU.  However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another.  It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked.  Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:15 -08:00
Paul E. McKenney 9ea6c58856 Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD
cpu.2014.11.03a: Changes for per-CPU variables.
doc.2014.11.13a: Documentation updates.
fixes.2014.11.13a: Miscellaneous fixes.
signal.2014.10.29a: Signal changes.
rt.2014.10.29a: Real-time changes.
torture.2014.11.03a: torture-test changes.
2014-11-13 10:39:04 -08:00
Paul E. McKenney 8fa7845df5 rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
The "cpu" argument to rcu_cleanup_after_idle() is always the current
CPU, so drop it.  This moves the smp_processor_id() from the caller to
rcu_cleanup_after_idle(), saving argument-passing overhead.  Again,
the anticipated cross-CPU uses of these functions has been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:56 -08:00
Paul E. McKenney 198bbf8127 rcu: Remove "cpu" argument to rcu_prepare_for_idle()
The "cpu" argument to rcu_prepare_for_idle() is always the current
CPU, so drop it.  This in turn allows two of the uses of "cpu" in
this function to be replaced with a this_cpu_ptr() and the third by
smp_processor_id(), replacing that of the call to rcu_prepare_for_idle().
Again, the anticipated cross-CPU uses of these functions has been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:49 -08:00
Paul E. McKenney aa6da5140b rcu: Remove "cpu" argument to rcu_needs_cpu()
The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it.  This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr().  Again, the anticipated cross-CPU uses
of these functions has been replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:43 -08:00
Paul E. McKenney 38200cf247 rcu: Remove "cpu" argument to rcu_note_context_switch()
The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it.  This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions has been
replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:34 -08:00
Paul E. McKenney 86aea0e6e7 rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
Because rcu_preempt_check_callbacks()'s argument is guaranteed to
always be the current CPU, drop the argument and replace per_cpu()
with __this_cpu_read().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:26 -08:00
Christoph Lameter 28ced795cb rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)
For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp).  Remove the parameter if constant and calculate the
pointer in function.

This will have the advantage that it is obvious that the address are
all per cpu offsets and thus it will enable the use of this_cpu_ops in
the future.

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:19:26 -08:00
Paul E. McKenney bbe5d7a93a rcu: Fix for rcuo online-time-creation reorganization bug
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
contains checks for the case where CPUs are brought online out of
order, re-wiring the rcuo leader-follower relationships as needed.
Unfortunately, this rewiring was broken.  This apparently went undetected
due to the tendency of systems to bring CPUs online in order.  This commit
nevertheless fixes the rewiring.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:20:08 -07:00
Pranith Kumar 28f6569ab7 rcu: Remove redundant TREE_PREEMPT_RCU config option
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:20:05 -07:00
Clark Williams 21871d7eff rcu: Unify boost and kthread priorities
Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this
value for both the per-CPU kthreads (rcuc/N) and the rcu boosting
threads (rcub/n).

Also, create the module_parameter rcutree.kthread_prio to be used on
the kernel command line at boot to set a new value (rcutree.kthread_prio=N).

Signed-off-by: Clark Williams <clark.williams@gmail.com>
[ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:19:41 -07:00
Paul E. McKenney 61cfd0970e rcu: Move RCU_BOOST variable declarations, eliminating #ifdef
There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c.  This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-28 13:49:28 -07:00
Paul E. McKenney 0eafa46823 rcu: Remove CONFIG_RCU_CPU_STALL_VERBOSE
The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible
RCU's CPU stall warnings to dump out any preempted tasks that are blocking
the current RCU grace period.  This information is useful, and the default
has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years.  It is therefore
time for this commit to remove this Kconfig parameter, so that future
kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-28 13:48:13 -07:00
Paul E. McKenney d7e2993396 rcu: Make rcu_barrier() understand about missing rcuo kthreads
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online.  This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a4, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix.   The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks.  It is therefore required to wait even for those callbacks
that cannot possibly be invoked.  Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), It is tempting to report an error
in this case.  Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.

Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
2014-10-28 13:24:13 -07:00
Paul E. McKenney dd56af42bd rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
Currently, the expedited grace-period primitives do get_online_cpus().
This greatly simplifies their implementation, but means that calls
to them holding locks that are acquired by CPU-hotplug notifiers (to
say nothing of calls to these primitives from CPU-hotplug notifiers)
can deadlock.  But this is starting to become inconvenient, as can be
seen here: https://lkml.org/lkml/2014/8/5/754.  The problem in this
case is that some developers need to acquire a mutex from a CPU-hotplug
notifier, but also need to hold it across a synchronize_rcu_expedited().
As noted above, this currently results in deadlock.

This commit avoids the deadlock and retains the simplicity by creating
a try_get_online_cpus(), which returns false if the get_online_cpus()
reference count could not immediately be incremented.  If a call to
try_get_online_cpus() returns true, the expedited primitives operate as
before.  If a call returns false, the expedited primitives fall back to
normal grace-period operations.  This falling back of course results in
increased grace-period latency, but only during times when CPU hotplug
operations are actually in flight.  The effect should therefore be
negligible during normal operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Tested-by: Lan Tianyu <tianyu.lan@intel.com>
2014-09-18 16:22:27 -07:00
Paul E. McKenney 96b4672703 Merge branch 'rcu-tasks.2014.09.10a' into HEAD
rcu-tasks.2014.09.10a: Add RCU-tasks flavor of RCU.
2014-09-16 10:10:44 -07:00
Paul E. McKenney e98d06dd6c Merge branches 'doc.2014.09.07a', 'fixes.2014.09.10a', 'nocb-nohz.2014.09.16b' and 'torture.2014.09.07a' into HEAD
doc.2014.09.07a: Documentation updates.
fixes.2014.09.10a: Miscellaneous fixes.
nocb-nohz.2014.09.16b: No-CBs CPUs and NO_HZ_FULL updates.
torture.2014.09.07a: Torture-test updates.
2014-09-16 10:08:34 -07:00
Paul E. McKenney c847f14217 rcu: Avoid misordering in nocb_leader_wait()
The NOCB follower wakeup ordering depends on the store to the tail
pointer happening before the wakeup.  However, because atomic_long_add()
does not return a value, it does not provide ordering guarantees, and
the locking in wake_up() only guarantees that the store will happen
before the unlock, which might be too late.  Even though this is only a
theoretical issue, this commit adds a smp_mb__after_atomic() after the
final atomic_long_add() to provide the needed ordering guarantee.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 1772947bd0 rcu: Handle NOCB callbacks from irq-disabled idle code
If an RCU callback is queued on a no-CBs CPU from idle code with irqs
disabled, and if that CPU stays idle forever after, the callback will
never be invoked.  This commit therefore adds a check for this situation
in ____call_rcu_nocb(), invoking the RCU core solely for the purpose
of the ensuing return-to-idle transition.  (If the CPU doesn't return
to idle, the next scheduling-clock interrupt will fix things up.)

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 39953dfd40 rcu: Avoid misordering in __call_rcu_nocb_enqueue()
The NOCB leader wakeup ordering depends on the store to the header
happening before the check for the leader already being awake.  However,
because atomic_long_add() does not return a value, it does not provide
ordering guarantees, the incorrect comment in wake_nocb_leader()
notwithstanding.  This commit therefore adds a smp_mb__after_atomic()
after the final atomic_long_add() to provide the needed ordering
guarantee.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 663e131090 rcu: Don't track sysidle state if no nohz_full= CPUs
If there are no nohz_full= CPUs, then there is currently no reason to
track sysidle state.  This commit therefore short-circuits this state
tracking if !tick_nohz_full_enabled().

Note that these checks will need to be revisited if nohz_full= state
can ever be changed at runtime.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 417e8d2655 rcu: Eliminate redundant rcu_sysidle_state variable
Now that we have rcu_state_p, which references rcu_preempt_state for
TREE_PREEMPT_RCU and rcu_sched_state for TREE_RCU, we don't need a
separate rcu_sysidle_state variable.  This commit therefore eliminates
rcu_preempt_state in favor of rcu_state_p.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Pranith Kumar 22c2f66961 rcu: Check for have_rcu_nocb_mask instead of rcu_nocb_mask
If we configure a kernel with CONFIG_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_NONE=y and
CONFIG_CPUMASK_OFFSTACK=n and do not pass in a rcu_nocb= boot parameter, the
cpumask rcu_nocb_mask can be garbage instead of NULL.

Hence this commit replaces checks for rcu_nocb_mask == NULL with a check for
have_rcu_nocb_mask.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 35ce7f29a4 rcu: Create rcuo kthreads only for onlined CPUs
RCU currently uses for_each_possible_cpu() to spawn rcuo kthreads,
which can result in more rcuo kthreads than one would expect, for
example, derRichard reported 64 CPUs worth of rcuo kthreads on an
8-CPU image.  This commit therefore creates rcuo kthreads only for
those CPUs that actually come online.

This was reported by derRichard on the OFTC IRC network.

Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 9386c0b75d rcu: Rationalize kthread spawning
Currently, RCU spawns kthreads from several different early_initcall()
functions.  Although this has served RCU well for quite some time,
as more kthreads are added a more deterministic approach is required.
This commit therefore causes all of RCU's early-boot kthreads to be
spawned from a single early_initcall() function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar f4aa84ba24 rcu: Return false instead of 0 in rcu_nocb_adopt_orphan_cbs()
Return false instead of 0 in rcu_nocb_adopt_orphan_cbs() as this has
bool as return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar 4afc7e269b rcu: Use false for return in __call_rcu_nocb()
Return false instead of 0 in __call_rcu_nocb() as this has bool as
return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar 0a9e1e111b rcu: Use true/false for return in rcu_nocb_adopt_orphan_cbs()
Return true/false in rcu_nocb_adopt_orphan_cbs() instead of 0/1 as
this function has return type of bool.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar c271d3a957 rcu: Use true/false for return in __call_rcu_nocb()
Return true/false instead of 0/1 in __call_rcu_nocb() as this returns a
bool type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar 949cccdbe6 rcu: Check the return value of zalloc_cpumask_var()
This commit checks the return value of the zalloc_cpumask_var() used for
allocating cpumask for rcu_nocb_mask.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Paul E. McKenney f4579fc57c rcu: Fix attempt to avoid unsolicited offloading of callbacks
Commit b58cc46c5f (rcu: Don't offload callbacks unless specifically
requested) failed to adjust the callback lists of the CPUs that are
known to be no-CBs CPUs only because they are also nohz_full= CPUs.
This failure can result in callbacks that are posted during early boot
getting stranded on nxtlist for CPUs whose no-CBs property becomes
apparent late, and there can also be spurious warnings about offline
CPUs posting callbacks.

This commit fixes these problems by adding an early-boot rcu_init_nohz()
that properly initializes the no-CBs CPUs.

Note that kernels built with CONFIG_RCU_NOCB_CPU_ALL=y or with
CONFIG_RCU_NOCB_CPU=n do not exhibit this bug.  Neither do kernels
booted without the nohz_full= boot parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:07:59 -07:00
Paul E. McKenney 284a8c93af rcu: Per-CPU operation cleanups to rcu_*_qs() functions
The rcu_bh_qs(), rcu_preempt_qs(), and rcu_sched_qs() functions use
old-style per-CPU variable access and write to ->passed_quiesce even
if it is already set.  This commit therefore updates to use the new-style
per-CPU variable access functions and avoids the spurious writes.
This commit also eliminates the "cpu" argument to these functions because
they are always invoked on the indicated CPU.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:35 -07:00
Paul E. McKenney 1d082fd061 rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch()
The rcu_preempt_note_context_switch() function is on a scheduling fast
path, so it would be good to avoid disabling irqs.  The reason that irqs
are disabled is to synchronize process-level and irq-handler access to
the task_struct ->rcu_read_unlock_special bitmask.  This commit therefore
makes ->rcu_read_unlock_special instead be a union of bools with a short
allowing single-access checks in RCU's __rcu_read_unlock().  This results
in the process-level and irq-handler accesses being simple loads and
stores, so that irqs need no longer be disabled.  This commit therefore
removes the irq disabling from rcu_preempt_note_context_switch().

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:34 -07:00
Paul E. McKenney 176f8f7a52 rcu: Make TASKS_RCU handle nohz_full= CPUs
Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution.  There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state.  The grace period would therefore
extend indefinitely.  This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:30 -07:00
Paul E. McKenney bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Paul E. McKenney 73a860cd58 rcu: Replace flush_signals() with WARN_ON(signal_pending())
Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible.  Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:20 -07:00
Paul E. McKenney 9fdd3bc900 rcu: Break more call_rcu() deadlock involving scheduler and perf
Commit 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler
and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake
the rcuo kthread due to the queue being initially empty, but did not
do anything for the case where the queue was overflowing.  This commit
therefore also defers wakeup for the overflow case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:17 -07:00
Pranith Kumar d0bc90fd37 rcu: Return bool type for rcu_try_advance_all_cbs()
Return a bool type instead of 0 in rcu_try_advance_all_cbs().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:10 -07:00
Pranith Kumar bf33eb1aef rcu: Fix sparse warning about rcu_batches_completed_preempt() being non-static
fix sparse warning about rcu_batches_completed_preempt() being non-static by
marking it as static

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:08 -07:00
Pranith Kumar 4de376a1b1 rcu: Remove remaining read-modify-write ACCESS_ONCE() calls
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:07 -07:00
Pranith Kumar 11ed7f934c rcu: Make nocb leader kthreads process pending callbacks after spawning
The nocb callbacks generated before the nocb kthreads are spawned are
enqueued in the nocb queue for later processing. Commit fbce7497ee ("rcu:
Parallelize and economize NOCB kthread wakeups") introduced nocb leader kthreads
which checked the nocb_leader_wake flag to see if there were any such pending
callbacks. A case was reported in which newly spawned leader kthreads were not
processing the pending callbacks as this flag was not set, which led to a boot
hang.

The following commit ensures that the newly spawned nocb kthreads process the
pending callbacks by allowing the kthreads to run immediately after spawning
instead of waiting. This is done by inverting the logic of nocb_leader_wake
tests to nocb_leader_sleep which allows us to use the default initialization of
this flag to 0 to let the kthreads run.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Link: http://www.spinics.net/lists/kernel/msg1802899.html
[ paulmck: Backported to v3.17-rc2. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Amit Shah <amit.shah@redhat.com>
2014-08-28 05:59:59 -07:00
Paul E. McKenney 187497fa5e rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
If there isn't a nohz_full= kernel parameter specified, then
tick_nohz_full_mask can legitimately be NULL.  This can cause
problems when RCU's boot code tries to cpumask_or() this value into
rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
in doing the cpumask_or() in the first place because this will cause
RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
rcu_nocb_mask.

This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
and checks for !tick_nohz_full_running otherwise, this latter check
catching cases when there was no nohz_full= kernel parameter specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-16 10:44:46 -07:00
Paul E. McKenney 1823172ab5 Merge branches 'doc.2014.07.08a', 'fixes.2014.07.09a', 'maintainers.2014.07.08b', 'nocbs.2014.07.07a' and 'torture.2014.07.07a' into HEAD
doc.2014.07.08a: Documentation updates.
fixes.2014.07.09a: Miscellaneous fixes.
maintainers.2014.07.08b: Maintainership updates.
nocbs.2014.07.07a: Callback-offloading fixes.
torture.2014.07.07a: Torture-test updates.
2014-07-09 09:16:54 -07:00
Pranith Kumar b41d1b924d rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
following sparse warning:

kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:51 -07:00
Pranith Kumar 615e41c605 rcu: Fix a sparse warning in rcu_initiate_boost()
This commit annotates rcu_initiate_boost() fixes the following sparse
warning:

	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:45 -07:00
Paul E. McKenney c0f489d2c6 rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
Binding the grace-period kthreads to the timekeeping CPU resulted in
significant performance decreases for some workloads.  For more detail,
see:

https://lkml.org/lkml/2014/6/3/395 for benchmark numbers

https://lkml.org/lkml/2014/6/4/218 for CPU statistics

It turns out that it is necessary to bind the grace-period kthreads
to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
In other cases, it suffices to bind the grace-period kthreads to the
set of non-nohz_full CPUs.

This commit therefore creates a tick_nohz_not_full_mask that is the
complement of tick_nohz_full_mask, and then binds the grace-period
kthread to the set of CPUs indicated by this new mask, which covers
the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
case still binds the grace-period kthreads to the timekeeping CPU.
This commit also includes the tick_nohz_full_enabled() check suggested
by Frederic Weisbecker.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Created housekeeping_affine() and housekeeping_mask per
  fweisbec feedback. ]
2014-07-09 09:15:02 -07:00
Paul E. McKenney abaa93d9e1 rcu: Simplify priority boosting by putting rt_mutex in rcu_node
RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:01 -07:00
Paul E. McKenney dfeb9765ce rcu: Allow post-unlock reference for rt_mutex
The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee.  When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
to boost the next task that is blocking the current RCU grace period.

But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it.  But the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it.  This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.

This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:00 -07:00
Paul E. McKenney 4da117cfa7 rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
in code that is built only when CONFIG_NO_HZ_FULL.  This commit therefore
removes the redundant ACCESS_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:14:35 -07:00
Paul E. McKenney b58cc46c5f rcu: Don't offload callbacks unless specifically requested
Enabling NO_HZ_FULL currently has the side effect of enabling callback
offloading on all CPUs.  This results in lots of additional rcuo kthreads,
and can also increase context switching and wakeups, even in cases where
callback offloading is neither needed nor particularly desirable.  This
commit therefore enables callback offloading on a given CPU only if
specifically requested at build time or boot time, or if that CPU has
been specifically designated (again, either at build time or boot time)
as a nohz_full CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney fbce7497ee rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney 4a81e8328d rcu: Reduce overhead of cond_resched() checks for RCU
Commit ac1bea8578 (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls.  Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.

This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable.  However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs.  These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
  ktest build robot.  Also fixed smp_mb() comment as noted by
  Oleg Nesterov. ]

Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-23 11:19:32 -07:00
Linus Torvalds 776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Uma Sharma e534165bbf rcu: Variable name changed in tree_plugin.h and used in tree.c
The variable and struct both having the name "rcu_state" confuses
sparse in some situations, so this commit changes the variable to
"rcu_state_p" in order to avoid this confusion.  This also makes
things easier for human readers.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
[ paulmck: Changed the declaration and several additional uses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 11:41:04 -07:00
Christoph Lameter fa07a58f71 rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr()
__this_cpu_ptr is being phased out.

One special case is increment_cpu_stall_ticks().
A per cpu variable is incremented so use raw_cpu_inc().

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:35 -07:00
Paul E. McKenney becb41bfe0 rcu: Make large and small sysidle systems use same state machine
Currently, small systems move back into RCU_SYSIDLE_NOT from
RCU_SYSIDLE_SHORT and large systems do not.  This works because moving
aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness,
and on small systems, the performance impact should be negligible.  That
said, this difference does make RCU a bit more complex, and RCU does not
seem to be suffering from any lack of complexity.  This commit therefore
adjusts small-system operation to match that of large systems, so that
the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:24 -07:00
Paul E. McKenney 5057f55e54 rcu: Bind RCU grace-period kthreads if NO_HZ_FULL
Currently, RCU binds the grace-period kthreads to the timekeeping
CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y.  This means that these
kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and
CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on
random CPUs.  Given that we are trying to reduce the amount of manual
tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit
makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where
CONFIG_NO_HZ_FULL_SYSIDLE=n.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:19 -07:00
Andreea-Cristina Bernat a381d757d9 rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state()
This patch merges the function rcu_force_quiescent_state() with
rcu_sched_force_quiescent_state(), using the rcu_state pointer.  Firstly,
the rcu_sched_force_quiescent_state() function is deleted from the file
kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was
calling force_quiescent_state with the argument rcu_preempt_state pointer
was deleted as well.  The new function that combines the old ones uses
the rcu_state pointer and is located after rcu_batches_completed_bh()
in kernel/rcu/tree.c.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:07 -07:00
Andreea-Cristina Bernat 495aa969db rcu: Consolidate kfree_call_rcu() to use rcu_state pointer
kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU,
it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state.
This patch uses the rcu_state_pointer to combine the two definitions into one.
The resulting function is placed after the closing of the preprocessor
conditional CONFIG_TREE_PREEMPT_RCU.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:01 -07:00
Paul E. McKenney 48a7639ce8 rcu: Make callers awaken grace-period kthread
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread.  This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.

Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off.  This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.

However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs.  In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress.  Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling.  In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls.  Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.

In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.

This commit therefore makes this change.  The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed.  A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:44:07 -07:00
Paul E. McKenney 365187fbc0 rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUs
In the old days, the only source of requests for future grace periods
was NOCB CPUs.  This has changed: CPUs routinely post requests for
future grace periods in order to promote power efficiency and reduce
OS jitter with minimal impact on grace-period latency.  This commit
therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
instead of rcu_nocb_needs_gp().  The latter is no longer used, so is
now removed.  This commit also adds tracing for the irq_work_queue()
wakeup case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:43:32 -07:00
Liu Ping Fan 24342c963a rcu: Fix incorrect notes for code
Signed-off-by: Liu Ping Fan <kernelfans@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:43:19 -07:00
Peter Zijlstra 4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Paul E. McKenney 322efba5b6 Merge branches 'doc.2014.02.24a', 'fixes.2014.02.26a' and 'rt.2014.02.17b' into HEAD
doc.2014.02.24a: Documentation changes
fixes.2014.02.26a: Miscellaneous fixes
rt.2014.02.17b: Response-time-related changes
2014-02-26 06:36:09 -08:00
Paul E. McKenney f1f399d128 rcu: Optimize RCU_FAST_NO_HZ for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then no CPU will ever have RCU callbacks
because these callbacks will instead be handled by the rcuo kthreads.
However, the current version of RCU_FAST_NO_HZ nevertheless checks for RCU
callbacks.  This commit therefore creates static inline implementations
of rcu_prepare_for_idle() and rcu_cleanup_after_idle() that are no-ops
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:33 -08:00
Paul E. McKenney ffa83fb565 rcu: Optimize rcu_needs_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always
return false, however, the current version nevertheless checks
for RCU callbacks.  This commit therefore creates a static inline
implementation of rcu_needs_cpu() that unconditionally returns false
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:09 -08:00
Paul E. McKenney 2f33b512a5 rcu: Optimize rcu_is_nocb_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_is_nocb_cpu() will always
return true, however, the current version nevertheless checks
rcu_nocb_mask.  This commit therefore creates a static inline
implementation of rcu_is_nocb_cpu() that unconditionally returns
true when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:32:48 -08:00
Paul Bolle 52e2bb958a rcu: Disambiguate CONFIG_RCU_NOCB_CPUs
This commit fixes a grammar issue in the rcu_nohz_full_cpu() comment
header, so that it is clear that the plural is CPUs not Kconfig options.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:02:08 -08:00
Paul E. McKenney 87de1cfdc5 rcu: Stop tracking FSF's postal address
All of the RCU source files have the usual GPL header, which contains a
long-obsolete postal address for FSF.  To avoid the need to track the
FSF office's movements, this commit substitutes the URL where GPL may
be found.

Reported-by: Greg KH <gregkh@linuxfoundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:01:37 -08:00
Linus Torvalds a693c46e14 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 - add RCU torture scripts/tooling
 - static analysis improvements
 - update RCU documentation
 - miscellaneous fixes

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  rcu: Remove "extern" from function declarations in kernel/rcu/rcu.h
  rcu: Remove "extern" from function declarations in include/linux/*rcu*.h
  rcu/torture: Dynamically allocate SRCU output buffer to avoid overflow
  rcu: Don't activate RCU core on NO_HZ_FULL CPUs
  rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
  rcu: Add an RCU_INITIALIZER for global RCU-protected pointers
  rcu: Make rcu_assign_pointer's assignment volatile and type-safe
  bonding: Use RCU_INIT_POINTER() for better overhead and for sparse
  rcu: Add comment on evaluate-once properties of rcu_assign_pointer().
  rcu: Provide better diagnostics for blocking in RCU callback functions
  rcu: Improve SRCU's grace-period comments
  rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values
  rcu: Fix coccinelle warnings
  rcutorture: Stop tracking FSF's postal address
  rcutorture: Move checkarg to functions.sh
  rcutorture: Flag errors and warnings with color coding
  rcutorture: Record results from repeated runs of the same test scenario
  rcutorture: Test summary at end of run with less chattiness
  rcutorture: Update comment in kvm.sh listing typical RCU trace events
  rcutorture: Add tracing-enabled version of TREE08
  ...
2014-01-20 10:25:12 -08:00
Ingo Molnar bb799d3b98 Linux 3.13-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSrhGrAAoJEHm+PkMAQRiGsNoH/jIK3CsQ2lbW7yRLXmfgtbzz
 i2Kep6D4SDvmaLpLYOVC8xNYTiE8jtTbSXHomwP5wMZ63MQDhBfnEWsEWqeZ9+D9
 3Q46p0QWuoBgYu2VGkoxTfygkT6hhSpwWIi3SeImbY4fg57OHiUil/+YGhORM4Qc
 K4549OCTY3sIrgmWL77gzqjRUo+pQ4C73NKqZ3+5nlOmYBZC1yugk8mFwEpQkwhK
 4NRNU760Fo+XIht/bINqRiPMddzC15p0mxvJy3cDW8bZa1tFSS9SB7AQUULBbcHL
 +2dFlFOEb5SV1sNiNPrJ0W+h2qUh2e7kPB0F8epaBppgbwVdyQoC2u4uuLV2ZN0=
 =lI2r
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc4' into core/locking

Merge Linux 3.13-rc4, to refresh this rather old tree with the latest fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:27:08 +01:00
Ingo Molnar 73a7ac2808 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull v3.14 RCU updates from Paul E. McKenney.

The main changes:

  * Update RCU documentation.

  * Miscellaneous fixes.

  * Add RCU torture scripts.

  * Static-analysis improvements.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-16 11:43:41 +01:00
Paul E. McKenney 6303b9c87d rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods
RCU must ensure that there is the equivalent of a full memory
barrier between any memory access preceding grace period and any
memory access following that same grace period, regardless of
which CPU(s) happen to execute the two memory accesses.
Therefore, downgrading UNLOCK+LOCK to no longer imply a full
memory barrier requires some adjustments to RCU.

This commit therefore adds smp_mb__after_unlock_lock()
invocations as needed after the RCU lock acquisitions that need
to be part of a full-memory-barrier UNLOCK+LOCK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-16 11:36:16 +01:00
Paul E. McKenney a096932f0c rcu: Don't activate RCU core on NO_HZ_FULL CPUs
Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
if the RCU core needs anything from this CPU.  If so, RCU raises
RCU_SOFTIRQ to carry out any needed processing.

This approach has worked well historically, but it is undesirable on
NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
in userspace, so that scheduling-clock interrupts can be disabled while
there is only one runnable task on the CPU in question.  Unfortunately,
raising any softirq has the potential to wake up ksoftirqd, which would
provide the second runnable task on that CPU, preventing disabling of
scheduling-clock interrupts.

What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
relying on the grace-period kthreads' quiescent-state forcing to
do any needed RCU work on behalf of those CPUs.

This commit therefore refrains from raising RCU_SOFTIRQ on any
NO_HZ_FULL CPUs during any grace periods that have been in effect
for less than one second.  The one-second limit handles the case
where an inappropriate workload is running on a NO_HZ_FULL CPU
that features lots of scheduling-clock interrupts, but no idle
or userspace time.

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <bitbucket@online.de>
Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-12-12 12:34:15 -08:00
Lai Jiangshan 79a62f957e rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
After commit #10f39bb1b2c1 (rcu: protect __rcu_read_unlock() against
scheduler-using irq handlers), it is no longer possible to enter
the main body of rcu_read_lock_special() from an NMI, interrupt, or
softirq handler.  In theory, this implies that the check for "in_irq()
|| in_serving_softirq()" must always fail, so that in theory this check
could be removed entirely.

In practice, this commit wraps this condition with a WARN_ON_ONCE().
If this warning never triggers, then the condition will be removed
entirely.

[ paulmck: And one way of triggering the WARN_ON() is if a scheduling
  clock interrupt occurs in an RCU read-side critical section, setting
  RCU_READ_UNLOCK_NEED_QS, which is handled by rcu_read_unlock_special().
  Updated this commit to return if only that bit was set. ]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-12 12:34:00 -08:00
Paul E. McKenney 96d3fd0d31 rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-03 10:10:18 -08:00
Paul E. McKenney 78e4bc34e5 rcu: Fix and comment ordering around wait_event()
It is all too easy to forget that wait_event() does not necessarily
imply a full memory barrier.  The case where it does not is where the
condition transitions to true just as wait_event() starts execution.
This is actually a feature: The standard use of wait_event() involves
locking, in which case the locks provide the needed ordering (you hold a
lock across the wake_up() and acquire that same lock after wait_event()
returns).

Given that I did forget that wait_event() does not necessarily imply a
full memory barrier in one case, this commit fixes that case.  This commit
also adds comments calling out the placement of existing memory barriers
relied on by wait_event() calls.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-03 10:10:18 -08:00
Thomas Gleixner d689fe222a NOHZ: Check for nohz active instead of nohz enabled
RCU and the fine grained idle time accounting functions check
tick_nohz_enabled. But that variable is merily telling that NOHZ has
been enabled in the config and not been disabled on the command line.

But it does not tell anything about nohz being active. That's what all
this should check for.

Matthew reported, that the idle accounting on his old P1 machine
showed bogus values, when he enabled NOHZ in the config and did not
disable it on the kernel command line. The reason is that his machine
uses (refined) jiffies as a clocksource which explains why the "fine"
grained accounting went into lala land, because it depends on when the
system goes and leaves idle relative to the jiffies increment.

Provide a tick_nohz_active indicator and let RCU and the accounting
code use this instead of tick_nohz_enable.

Reported-and-tested-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: john.stultz@linaro.org
Cc: mwhitehe@redhat.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311132052240.30673@ionos.tec.linutronix.de
2013-11-19 14:59:50 +01:00
Peter Zijlstra 1696a8bee3 locking: Move the rtmutex code to kernel/locking/
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-p9ijt8div0hwldexwfm4nlhj@git.kernel.org
[ Fixed build failure in kernel/rcu/tree_plugin.h. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 09:23:59 +01:00
Paul E. McKenney 4102adab91 rcu: Move RCU-related source code to kernel/rcu directory
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2013-10-15 12:53:31 -07:00