From bedc1969150d480c462cdac320fa944b694a7162 Mon Sep 17 00:00:00 2001
From: Ding Tianhong
Date: Wed, 15 Jun 2016 15:27:36 +0800
Subject: [PATCH 01/22] rcu: Fix soft lockup for rcu_nocb_kthread

Carrying out the following steps results in a softlockup in the
RCU callback-offload (rcuo) kthreads:

1. Connect to ixgbevf, and set the speed to 10Gb/s.
2. Use ifconfig to bring the nic up and down repeatedly.

[ 317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[ 368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15]
[ 368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000
[ 368.106005] RIP: 0010:[] [] fib_table_lookup+0x14/0x390
[ 368.106005] RSP: 0018:ffff88061fc83ce8 EFLAGS: 00000286
[ 368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001
[ 368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00
[ 368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000
[ 368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58
[ 368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0
[ 368.106005] FS: 0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
[ 368.106005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0
[ 368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 368.106005] Stack:
[ 368.106005]  00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00
[ 368.106005]  ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146
[ 368.106005]  ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0
[ 368.106005] Call Trace:
[ 368.106005]
[ 368.106005]
[ 368.106005]  [] ip_route_input_noref+0x516/0xbd0
[ 368.106005]  [] ? skb_release_data+0xd6/0x110
[ 368.106005]  [] ? kfree_skb+0x3a/0xa0
[ 368.106005]  [] ip_rcv_finish+0x29f/0x350
[ 368.106005]  [] ip_rcv+0x234/0x380
[ 368.106005]  [] __netif_receive_skb_core+0x676/0x870
[ 368.106005]  [] __netif_receive_skb+0x18/0x60
[ 368.106005]  [] process_backlog+0xae/0x180
[ 368.106005]  [] net_rx_action+0x152/0x240
[ 368.106005]  [] __do_softirq+0xef/0x280
[ 368.106005]  [] call_softirq+0x1c/0x30
[ 368.106005]
[ 368.106005]
[ 368.106005]  [] do_softirq+0x65/0xa0
[ 368.106005]  [] local_bh_enable+0x94/0xa0
[ 368.106005]  [] rcu_nocb_kthread+0x232/0x370
[ 368.106005]  [] ? wake_up_bit+0x30/0x30
[ 368.106005]  [] ? rcu_start_gp+0x40/0x40
[ 368.106005]  [] kthread+0xcf/0xe0
[ 368.106005]  [] ? kthread_create_on_node+0x140/0x140
[ 368.106005]  [] ret_from_fork+0x58/0x90
[ 368.106005]  [] ? kthread_create_on_node+0x140/0x140
==================================cut here==============================

It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not relinquishing the
CPU while doing so.  This commit therefore adds a cond_resched_rcu_qs()
within the loop to allow other tasks to run.

Signed-off-by: Ding Tianhong
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ]
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_plugin.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0082fce402a0..85c5a883c6e3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2173,6 +2173,7 @@ static int rcu_nocb_kthread(void *arg)
             cl++;
             c++;
             local_bh_enable();
+            cond_resched_rcu_qs();
             list = next;
         }
         trace_rcu_batch_end(rdp->rsp->name, c, !!list, 0, 0, 1);
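The fix's pattern generalizes to any kthread that can process work
indefinitely: yield periodically, both to let other tasks run and to
give RCU a quiescent state to record.  A minimal sketch, with a
hypothetical demo_worker() standing in for the real callback-processing
loop:

    #include <linux/kthread.h>
    #include <linux/rcupdate.h>

    /* Hypothetical worker; illustrates the fix, not rcu_nocb_kthread(). */
    static int demo_worker(void *arg)
    {
        while (!kthread_should_stop()) {
            /* ... drain one batch of queued work here ... */

            /*
             * Without a voluntary yield, a steady stream of work
             * keeps this kthread on-CPU indefinitely, preventing
             * RCU quiescent states and eventually triggering a
             * soft-lockup splat like the one above.  This call
             * both reschedules and reports a quiescent state.
             */
            cond_resched_rcu_qs();
        }
        return 0;
    }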
From e1ef69217f68b8407245e9e353cf88cc2f9ebc18 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 20 Jun 2016 07:51:22 +0900
Subject: [PATCH 02/22] rcutorture: Remove outdated config option description

CONFIG_RCU_TORTURE_TEST_RUNNABLE was removed by commit 4e9a073f60367
("torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code"),
but the documentation was not updated accordingly.  This commit
therefore updates the documentation to reflect
CONFIG_RCU_TORTURE_TEST_RUNNABLE's removal and to add a description
for the alternative module parameter.

Signed-off-by: SeongJae Park
Signed-off-by: Paul E. McKenney
---
 Documentation/RCU/torture.txt | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 118e7c176ce7..278f6a9383b6 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -10,21 +10,6 @@ status messages via printk(), which can be examined via the dmesg
 command (perhaps grepping for "torture").  The test is started
 when the module is loaded, and stops when the module is unloaded.
 
-CONFIG_RCU_TORTURE_TEST_RUNNABLE
-
-It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will
-result in the tests being loaded into the base kernel.  In this case,
-the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify
-whether the RCU torture tests are to be started immediately during
-boot or whether the /proc/sys/kernel/rcutorture_runnable file is used
-to enable them.  This /proc file can be used to repeatedly pause and
-restart the tests, regardless of the initial state specified by the
-CONFIG_RCU_TORTURE_TEST_RUNNABLE config option.
-
-You will normally -not- want to start the RCU torture tests during boot
-(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing
-this can sometimes be useful in finding boot-time bugs.
-
 MODULE PARAMETERS

From ed2bec07fd1aa47f1c06be92c164c13c70fb7a45 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Tue, 9 Aug 2016 21:15:15 -0700
Subject: [PATCH 03/22] documentation: Record reason for rcu_head two-byte alignment

There is an assertion in __call_rcu() that checks only the bottom bit
of the rcu_head pointer, rather than the bottom two (as might be
expected for 32-bit systems) or the bottom three (as might be expected
for 64-bit systems).  This choice might be a bit surprising in these
days of ubiquitous 32-bit and 64-bit systems.

This commit therefore records the reason for this odd alignment check,
namely that m68k guarantees only two-byte alignment despite being a
32-bit architecture.

Signed-off-by: Paul E. McKenney
---
 .../RCU/Design/Requirements/Requirements.html | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index ece410f40436..a4d3838130e4 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2493,6 +2493,28 @@ or some future "lazy"
 variant of call_rcu() that might one day be created for
 energy-efficiency purposes.
 
+<p>
+That said, there are limits.
+RCU requires that the rcu_head structure be aligned to a
+two-byte boundary, and passing a misaligned rcu_head
+structure to one of the call_rcu() family of functions
+will result in a splat.
+It is therefore necessary to exercise caution when packing
+structures containing fields of type rcu_head.
+Why not a four-byte or even eight-byte alignment requirement?
+Because the m68k architecture provides only two-byte alignment,
+and thus acts as alignment's least common denominator.
+
+<p>
+The reason for reserving the bottom bit of pointers to
+rcu_head structures is to leave the door open to
+"lazy" callbacks whose invocations can safely be deferred.
+Deferring invocation could potentially have energy-efficiency
+benefits, but only if the rate of non-lazy callbacks decreases
+significantly for some important workload.
+In the meantime, reserving the bottom bit keeps this option open
+in case it one day becomes useful.
+
 Performance, Scalability, Response Time, and Reliability
 
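For code embedding rcu_head fields, the practical upshot is that any
even offset is legal, and nothing stricter may be assumed.  A sketch of
the usual embedding and of the constraint described above, using a
hypothetical struct foo and helpers (not part of the patch):

    #include <linux/bug.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Hypothetical RCU-protected structure. */
    struct foo {
        int data;
        struct rcu_head rh;     /* any even offset is legal */
    };

    static void foo_reclaim(struct rcu_head *rhp)
    {
        kfree(container_of(rhp, struct foo, rh));
    }

    static void foo_free(struct foo *fp)
    {
        /*
         * __call_rcu() checks only the bottom bit of this pointer:
         * two-byte alignment is all that m68k guarantees, so it is
         * all that RCU may demand, and the single spare bit stays
         * reserved for possible future "lazy" callbacks.
         */
        BUILD_BUG_ON(__alignof__(struct rcu_head) < 2);
        call_rcu(&fp->rh, foo_reclaim);
    }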
From f7b8eb847e35b18d3ec333774691a905bf16017f Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Fri, 24 Jun 2016 11:30:32 -0700
Subject: [PATCH 04/22] rcu: Consolidate expedited grace period machinery

The functions synchronize_rcu_expedited() and
synchronize_sched_expedited() have nearly identical code.  This commit
therefore consolidates this code into a new
_synchronize_rcu_expedited() function.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_exp.h | 62 ++++++++++++++++++++-----------------------
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d86ab6ec2c9..1549f456fb7b 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -516,6 +516,33 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
     mutex_unlock(&rsp->exp_wake_mutex);
 }
 
+/*
+ * Given an rcu_state pointer and a smp_call_function() handler, kick
+ * off the specified flavor of expedited grace period.
+ */
+static void _synchronize_rcu_expedited(struct rcu_state *rsp,
+                                       smp_call_func_t func)
+{
+    unsigned long s;
+
+    /* If expedited grace periods are prohibited, fall back to normal. */
+    if (rcu_gp_is_normal()) {
+        wait_rcu_gp(rsp->call);
+        return;
+    }
+
+    /* Take a snapshot of the sequence number. */
+    s = rcu_exp_gp_seq_snap(rsp);
+    if (exp_funnel_lock(rsp, s))
+        return; /* Someone else did our work for us. */
+
+    /* Initialize the rcu_node tree in preparation for the wait. */
+    sync_rcu_exp_select_cpus(rsp, func);
+
+    /* Wait and clean up, including waking everyone. */
+    rcu_exp_wait_wake(rsp, s);
+}
+
 /**
  * synchronize_sched_expedited - Brute-force RCU-sched grace period
  *
@@ -534,29 +561,13 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
  */
 void synchronize_sched_expedited(void)
 {
-    unsigned long s;
     struct rcu_state *rsp = &rcu_sched_state;
 
     /* If only one CPU, this is automatically a grace period. */
     if (rcu_blocking_is_gp())
         return;
 
-    /* If expedited grace periods are prohibited, fall back to normal. */
-    if (rcu_gp_is_normal()) {
-        wait_rcu_gp(call_rcu_sched);
-        return;
-    }
-
-    /* Take a snapshot of the sequence number. */
-    s = rcu_exp_gp_seq_snap(rsp);
-    if (exp_funnel_lock(rsp, s))
-        return; /* Someone else did our work for us. */
-
-    /* Initialize the rcu_node tree in preparation for the wait. */
-    sync_rcu_exp_select_cpus(rsp, sync_sched_exp_handler);
-
-    /* Wait and clean up, including waking everyone. */
-    rcu_exp_wait_wake(rsp, s);
+    _synchronize_rcu_expedited(rsp, sync_sched_exp_handler);
 }
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
@@ -620,23 +631,8 @@ static void sync_rcu_exp_handler(void *info)
 void synchronize_rcu_expedited(void)
 {
     struct rcu_state *rsp = rcu_state_p;
-    unsigned long s;
 
-    /* If expedited grace periods are prohibited, fall back to normal. */
-    if (rcu_gp_is_normal()) {
-        wait_rcu_gp(call_rcu);
-        return;
-    }
-
-    s = rcu_exp_gp_seq_snap(rsp);
-    if (exp_funnel_lock(rsp, s))
-        return; /* Someone else did our work for us. */
-
-    /* Initialize the rcu_node tree in preparation for the wait. */
-    sync_rcu_exp_select_cpus(rsp, sync_rcu_exp_handler);
-
-    /* Wait for ->blkd_tasks lists to drain, then wake everyone up. */
-    rcu_exp_wait_wake(rsp, s);
+    _synchronize_rcu_expedited(rsp, sync_rcu_exp_handler);
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
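Either consolidated wrapper provides the same guarantee: when it
returns, a full expedited grace period has elapsed.  A sketch of the
classic updater pattern that relies on this guarantee, using a
hypothetical struct item and remove_item() (not part of the patch):

    #include <linux/rculist.h>
    #include <linux/slab.h>

    /* Hypothetical element type. */
    struct item {
        struct list_head link;
        int payload;
    };

    static void remove_item(struct item *p)
    {
        list_del_rcu(&p->link);      /* readers may still hold p... */
        synchronize_rcu_expedited(); /* ...until this call returns... */
        kfree(p);                    /* ...after which none can. */
    }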
From 8b355e3bc1408be238ae4695fb6318ae502cae8e Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Wed, 29 Jun 2016 13:46:25 -0700
Subject: [PATCH 05/22] rcu: Drive expedited grace periods from workqueue

The current implementation of expedited grace periods has the user
task drive the grace period.  This works, but has downsides: (1) The
user task must awaken tasks piggybacking on this grace period, which
can result in latencies rivaling that of the grace period itself, and
(2) User tasks can receive signals, which interfere with RCU CPU stall
warnings.

This commit therefore uses workqueues to drive the grace periods, so
that the user task need not do the awakening.  A subsequent commit
will remove the now-unnecessary code allowing for signals.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree.h       |  1 +
 kernel/rcu/tree_exp.h   | 46 ++++++++++++++++++++++++++++++++++++-----
 kernel/rcu/tree_trace.c |  7 ++++---
 3 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index f714f873bf9d..e99a5234d9ed 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -400,6 +400,7 @@ struct rcu_data {
 #ifdef CONFIG_RCU_FAST_NO_HZ
     struct rcu_head oom_head;
 #endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
+    atomic_long_t exp_workdone0;    /* # done by workqueue. */
     atomic_long_t exp_workdone1;    /* # done by others #1. */
     atomic_long_t exp_workdone2;    /* # done by others #2. */
     atomic_long_t exp_workdone3;    /* # done by others #3. */

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 1549f456fb7b..97f5ffe42b58 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -500,7 +500,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
      * next GP, to proceed.
      */
     mutex_lock(&rsp->exp_wake_mutex);
-    mutex_unlock(&rsp->exp_mutex);
     rcu_for_each_node_breadth_first(rsp, rnp) {
         if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
@@ -516,6 +515,29 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
     mutex_unlock(&rsp->exp_wake_mutex);
 }
 
+/* Let the workqueue handler know what it is supposed to do. */
+struct rcu_exp_work {
+    smp_call_func_t rew_func;
+    struct rcu_state *rew_rsp;
+    unsigned long rew_s;
+    struct work_struct rew_work;
+};
+
+/*
+ * Work-queue handler to drive an expedited grace period forward.
+ */
+static void wait_rcu_exp_gp(struct work_struct *wp)
+{
+    struct rcu_exp_work *rewp;
+
+    /* Initialize the rcu_node tree in preparation for the wait. */
+    rewp = container_of(wp, struct rcu_exp_work, rew_work);
+    sync_rcu_exp_select_cpus(rewp->rew_rsp, rewp->rew_func);
+
+    /* Wait and clean up, including waking everyone. */
+    rcu_exp_wait_wake(rewp->rew_rsp, rewp->rew_s);
+}
+
 /*
  * Given an rcu_state pointer and a smp_call_function() handler, kick
  * off the specified flavor of expedited grace period.
@@ -523,6 +545,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
 static void _synchronize_rcu_expedited(struct rcu_state *rsp,
                                        smp_call_func_t func)
 {
+    struct rcu_data *rdp;
+    struct rcu_exp_work rew;
+    struct rcu_node *rnp;
     unsigned long s;
 
     /* If expedited grace periods are prohibited, fall back to normal. */
@@ -536,11 +561,22 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
     if (exp_funnel_lock(rsp, s))
         return; /* Someone else did our work for us. */
 
-    /* Initialize the rcu_node tree in preparation for the wait. */
-    sync_rcu_exp_select_cpus(rsp, func);
+    /* Marshall arguments and schedule the expedited grace period. */
+    rew.rew_func = func;
+    rew.rew_rsp = rsp;
+    rew.rew_s = s;
+    INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
+    schedule_work(&rew.rew_work);
 
-    /* Wait and clean up, including waking everyone. */
-    rcu_exp_wait_wake(rsp, s);
+    /* Wait for expedited grace period to complete. */
+    rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
+    rnp = rcu_get_root(rsp);
+    wait_event(rnp->exp_wq[(s >> 1) & 0x3],
+               sync_exp_work_done(rsp,
+                                  &rdp->exp_workdone0, s));
+
+    /* Let the next expedited grace period start. */
+    mutex_unlock(&rsp->exp_mutex);
 }

diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index 86782f9a4604..b1f28972872c 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,16 +185,17 @@ static int show_rcuexp(struct seq_file *m, void *v)
     int cpu;
     struct rcu_state *rsp = (struct rcu_state *)m->private;
     struct rcu_data *rdp;
-    unsigned long s1 = 0, s2 = 0, s3 = 0;
+    unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
 
     for_each_possible_cpu(cpu) {
         rdp = per_cpu_ptr(rsp->rda, cpu);
+        s0 += atomic_long_read(&rdp->exp_workdone0);
         s1 += atomic_long_read(&rdp->exp_workdone1);
         s2 += atomic_long_read(&rdp->exp_workdone2);
         s3 += atomic_long_read(&rdp->exp_workdone3);
     }
-    seq_printf(m, "s=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu enq=%d sc=%lu\n",
-               rsp->expedited_sequence, s1, s2, s3,
+    seq_printf(m, "s=%lu wd0=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu enq=%d sc=%lu\n",
+               rsp->expedited_sequence, s0, s1, s2, s3,
                atomic_long_read(&rsp->expedited_normal),
                atomic_read(&rsp->expedited_need_qs),
                rsp->expedited_sequence / 2);
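The key point above is that the work item lives on the requesting
task's stack, which is safe only because that task waits for the
handler to finish before returning.  The patch itself waits on RCU's
per-grace-period wait queues; a generic sketch of the same on-stack
idiom would more typically pair the work item with a completion (all
demo_* names are hypothetical):

    #include <linux/completion.h>
    #include <linux/workqueue.h>

    struct demo_work {
        int arg;
        int result;
        struct completion done;
        struct work_struct work;
    };

    static void demo_work_fn(struct work_struct *wp)
    {
        struct demo_work *dwp = container_of(wp, struct demo_work, work);

        dwp->result = dwp->arg * 2;  /* stand-in for the real processing */
        complete(&dwp->done);        /* wake the waiting task */
    }

    static int demo_run_in_workqueue(int arg)
    {
        struct demo_work dw = { .arg = arg };

        init_completion(&dw.done);
        INIT_WORK_ONSTACK(&dw.work, demo_work_fn);
        schedule_work(&dw.work);        /* a kworker runs demo_work_fn() */
        wait_for_completion(&dw.done);  /* dw must stay live until here */
        destroy_work_on_stack(&dw.work);
        return dw.result;
    }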
From 908d2c1fd156d414008e1b7e1fb5a7716e013231 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Wed, 29 Jun 2016 14:34:59 -0700
Subject: [PATCH 06/22] rcu: Stop disabling expedited RCU CPU stall warnings

Now that RCU expedited grace periods are always driven by a workqueue,
there is no need to account for signal reception, and thus no need to
disable expedited RCU CPU stall warnings due to signal reception.  This
commit therefore removes the signal-reception checks, leaving a
WARN_ON() to catch possible future bugs.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_exp.h | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 97f5ffe42b58..3a647eb96f23 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -427,12 +427,7 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
                        jiffies_stall);
         if (ret > 0 || sync_rcu_preempt_exp_done(rnp_root))
             return;
-        if (ret < 0) {
-            /* Hit a signal, disable CPU stall warnings. */
-            swait_event(rsp->expedited_wq,
-                        sync_rcu_preempt_exp_done(rnp_root));
-            return;
-        }
+        WARN_ON(ret < 0);  /* workqueues should not be signaled. */
         pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
                rsp->name);
         ndetected = 0;

From 24a6cff286030b98149ff10b968cba31280fcb7a Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Wed, 29 Jun 2016 14:49:29 -0700
Subject: [PATCH 07/22] rcu: Make expedited RCU CPU stall warnings respond to controls

The expedited RCU CPU stall warnings currently respond to neither the
panic_on_rcu_stall sysctl setting nor the
rcupdate.rcu_cpu_stall_suppress kernel boot parameter.  This commit
therefore updates the expedited code to respond to these two controls.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_exp.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 3a647eb96f23..f316683b18f1 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -428,6 +428,9 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
         if (ret > 0 || sync_rcu_preempt_exp_done(rnp_root))
             return;
         WARN_ON(ret < 0);  /* workqueues should not be signaled. */
+        if (rcu_cpu_stall_suppress)
+            continue;
+        panic_on_rcu_stall();
         pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
                rsp->name);
         ndetected = 0;

From 98834b83785e1388fa8672cf4f8de09974d15e86 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Wed, 29 Jun 2016 17:04:19 -0700
Subject: [PATCH 08/22] rcu: Exclude RCU-offline CPUs from expedited grace periods

The expedited RCU grace periods currently rely on a failure indication
from smp_call_function_single() to determine that a given CPU is
offline.  This works after a fashion, but is more contorted and less
precise than relying on RCU's internal state.  This commit therefore
takes a first step towards relying on internal state.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_exp.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index f316683b18f1..3bc4b3dda801 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -359,7 +359,8 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
             struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
 
             if (raw_smp_processor_id() == cpu ||
-                !(atomic_add_return(0, &rdtp->dynticks) & 0x1))
+                !(atomic_add_return(0, &rdtp->dynticks) & 0x1) ||
+                !(rnp->qsmaskinitnext & rdp->grpmask))
                 mask_ofl_test |= rdp->grpmask;
         }
         mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;

From 385c859f678e8ee6b0b122086f34e72a0e861cef Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 30 Jun 2016 12:16:11 -0700
Subject: [PATCH 09/22] rcu: Use RCU's online-CPU state for expedited IPI retry

This commit improves the accuracy of the interaction between CPU
hotplug operations and RCU's expedited grace periods by using RCU's
online-CPU state to determine when failed IPIs should be retried.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree_exp.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 3bc4b3dda801..24343eb87b58 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -385,17 +385,16 @@ retry_ipi:
             mask_ofl_ipi &= ~mask;
             continue;
         }
-        /* Failed, raced with offline. */
+        /* Failed, raced with CPU hotplug operation. */
         raw_spin_lock_irqsave_rcu_node(rnp, flags);
-        if (cpu_online(cpu) &&
+        if ((rnp->qsmaskinitnext & mask) &&
             (rnp->expmask & mask)) {
+            /* Online, so delay for a bit and try again. */
             raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
             schedule_timeout_uninterruptible(1);
-            if (cpu_online(cpu) &&
-                (rnp->expmask & mask))
-                goto retry_ipi;
-            raw_spin_lock_irqsave_rcu_node(rnp, flags);
+            goto retry_ipi;
         }
+        /* CPU really is offline, so we can ignore it. */
         if (!(rnp->expmask & mask))
             mask_ofl_ipi &= ~mask;
         raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
From 94d44776737266eccafee32b985fe31fd5e021ca Mon Sep 17 00:00:00 2001
From: Jisheng Zhang
Date: Wed, 22 Jun 2016 17:19:27 +0800
Subject: [PATCH 10/22] rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads

Commit abedf8e2419f ("rcu: Use simple wait queues where possible in
rcutree") converts Tree RCU's wait queues to simple wait queues, but
it incorrectly reverts commit 2aa792e6faf1 ("rcu: Use
rcu_gp_kthread_wake() to wake up grace period kthreads").  This can
result in redundant self-wakeups.

This commit therefore replaces the simple wait-queue wakeups with
rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.

Signed-off-by: Jisheng Zhang
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5d80925e7fc8..cc1779a7ec5f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2344,7 +2344,7 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
     WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
     WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
     raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
-    swake_up(&rsp->gp_wq); /* Memory barrier implied by swake_up() path. */
+    rcu_gp_kthread_wake(rsp);
 }
 
@@ -2970,7 +2970,7 @@ static void force_quiescent_state(struct rcu_state *rsp)
     }
     WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
     raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);
-    swake_up(&rsp->gp_wq); /* Memory barrier implied by swake_up() path. */
+    rcu_gp_kthread_wake(rsp);
 }
 
 /*
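The guard that rcu_gp_kthread_wake() restores has a simple shape: never
wake a kthread that has not been spawned, and never wake yourself.  A
simplified, hypothetical sketch of that shape (demo_state is
illustrative, not RCU's actual code):

    #include <linux/sched.h>
    #include <linux/swait.h>

    struct demo_state {
        struct task_struct *kthread;
        struct swait_queue_head wq;
    };

    static void demo_kthread_wake(struct demo_state *sp)
    {
        /*
         * The kthread may not be spawned yet during early boot, and
         * waking ourselves is pointless (we are already running), so
         * bail in both cases rather than issue a redundant wakeup.
         */
        if (!sp->kthread || sp->kthread == current)
            return;
        swake_up(&sp->wq);
    }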
From 379d9ecb3cc9d5d043216185904c00e54c736a96 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 30 Jun 2016 10:37:20 -0700
Subject: [PATCH 11/22] sched: Make wake_up_nohz_cpu() handle CPUs going offline

Both timers and hrtimers are maintained on the outgoing CPU until
CPU_DEAD time, at which point they are migrated to a surviving CPU.  If
a mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
will splat in native_smp_send_reschedule() when attempting to wake up
the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:

[ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
[ 7976.741595] Modules linked in:
[ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
[ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 7976.741595]  0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
[ 7976.741595]  0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
[ 7976.741595]  0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
[ 7976.741595] Call Trace:
[ 7976.741595]  [] dump_stack+0x67/0x99
[ 7976.741595]  [] __warn+0xcc/0xf0
[ 7976.741595]  [] warn_slowpath_null+0x18/0x20
[ 7976.741595]  [] native_smp_send_reschedule+0x39/0x40
[ 7976.741595]  [] wake_up_nohz_cpu+0x82/0x190
[ 7976.741595]  [] internal_add_timer+0x7a/0x80
[ 7976.741595]  [] mod_timer+0x187/0x2b0
[ 7976.741595]  [] rcu_torture_reader+0x33d/0x380
[ 7976.741595]  [] ? sched_torture_read_unlock+0x30/0x30
[ 7976.741595]  [] ? rcu_bh_torture_read_lock+0x80/0x80
[ 7976.741595]  [] kthread+0xdf/0x100
[ 7976.741595]  [] ret_from_fork+0x1f/0x40
[ 7976.741595]  [] ? kthread_create_on_node+0x200/0x200

However, in this case, the wakeup is redundant, because the timer
migration will reprogram timer hardware as needed.  Note that the fact
that preemption is disabled does not avoid the splat, as the offline
operation has already passed both the synchronize_sched() and the
stop_machine() that would be blocked by disabled preemption.

This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
to wake up offline CPUs.  It also adds a comment stating that the
caller must tolerate lost wakeups when the target CPU is going offline,
and suggesting the CPU_DEAD notifier as a recovery mechanism.

Signed-off-by: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Frederic Weisbecker
Cc: Thomas Gleixner
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe8e440..2a18856f00ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -580,6 +580,8 @@ static bool wake_up_full_nohz_cpu(int cpu)
      * If needed we can still optimize that later with an
      * empty IRQ.
      */
+    if (cpu_is_offline(cpu))
+        return true;  /* Don't try to wake offline CPUs. */
     if (tick_nohz_full_cpu(cpu)) {
         if (cpu != smp_processor_id() ||
             tick_nohz_tick_stopped())
@@ -590,6 +592,11 @@ static bool wake_up_full_nohz_cpu(int cpu)
     return false;
 }
 
+/*
+ * Wake up the specified CPU.  If the CPU is going offline, it is the
+ * caller's responsibility to deal with the lost wakeup, for example,
+ * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
+ */
 void wake_up_nohz_cpu(int cpu)
 {
     if (!wake_up_full_nohz_cpu(cpu))

From e77b7041258e11ba198951553d3acf1e371a9053 Mon Sep 17 00:00:00 2001
From: Paul Gortmaker
Date: Fri, 15 Jul 2016 12:19:41 -0400
Subject: [PATCH 12/22] rcu: Don't use modular infrastructure in non-modular code

The Kconfig currently controlling compilation of tree.c is:

init/Kconfig:config TREE_RCU
init/Kconfig:    bool

...and update.c and sync.c are "obj-y" meaning that none are ever
built as a module by anyone.

Since MODULE_ALIAS is a no-op for non-modular code, we can remove
them from these files.

We leave moduleparam.h behind since the files instantiate some boot
time configuration parameters with module_param() still.

Cc: "Paul E. McKenney"
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Signed-off-by: Paul Gortmaker
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree.c   | 2 --
 kernel/rcu/update.c | 3 +--
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc1779a7ec5f..e83446062f65 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -41,7 +41,6 @@
 #include
 #include
 #include
-#include <linux/module.h>
 #include
 #include
 #include
@@ -60,7 +59,6 @@
 #include "tree.h"
 #include "rcu.h"
 
-MODULE_ALIAS("rcutree");
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
 #endif

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index f0d8322bc3ec..f19271dce0a9 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -46,7 +46,7 @@
 #include
 #include
 #include
-#include <linux/module.h>
+#include <linux/moduleparam.h>
 #include
 #include
@@ -54,7 +54,6 @@
 #include "rcu.h"
 
-MODULE_ALIAS("rcupdate");
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
 #endif
From 3563a438f124cb0b8cfd350c86de2f26c63d8837 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 28 Jul 2016 09:39:11 -0700
Subject: [PATCH 13/22] rcu: Avoid redundant quiescent-state chasing

Currently, __note_gp_changes() checks to see if the CPU has slept
through multiple grace periods.  If it has, it resynchronizes that
CPU's view of the grace-period state, which includes whether or not
the current grace period needs a quiescent state from this CPU.  This
need (or lack thereof) must be recorded in two places,
rdp->cpu_no_qs.b.norm and rdp->core_needs_qs.  The former tells RCU's
context-switch code to go get a quiescent state and the latter says
that it needs to be reported.  The current code unconditionally sets
the former to true, but correctly sets the latter.

This does not result in failures, but it does unnecessarily increase
the amount of work done on average at context-switch time.  This
commit therefore correctly sets both fields.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/tree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e83446062f65..733902c33dd2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1846,6 +1846,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
                               struct rcu_data *rdp)
 {
     bool ret;
+    bool need_gp;
 
     /* Handle the ends of any preceding grace periods first. */
     if (rdp->completed == rnp->completed &&
@@ -1872,9 +1873,10 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
          */
         rdp->gpnum = rnp->gpnum;
         trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpustart"));
-        rdp->cpu_no_qs.b.norm = true;
+        need_gp = !!(rnp->qsmask & rdp->grpmask);
+        rdp->cpu_no_qs.b.norm = need_gp;
         rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr);
-        rdp->core_needs_qs = !!(rnp->qsmask & rdp->grpmask);
+        rdp->core_needs_qs = need_gp;
         zero_cpu_stall_ticks(rdp);
         WRITE_ONCE(rdp->gpwrap, false);
     }
From 7ec99de36f402618ae44147ac7fa9a07e4757a5f Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 30 Jun 2016 13:58:26 -0700
Subject: [PATCH 14/22] rcu: Provide exact CPU-online tracking for RCU

Up to now, RCU has assumed that the CPU-online process makes it from
CPU_UP_PREPARE to set_cpu_online() within one jiffy.  Given the recent
rise of virtualized environments, this assumption is very clearly
obsolete.  Failing to meet this deadline can result in RCU paying
attention to an incoming CPU for one jiffy, then ignoring it until the
grace period following the one in which that CPU sets itself online.
This situation might prove to be fatally disappointing to any RCU
read-side critical sections that had the misfortune to execute during
the time in which RCU was ignoring the slow-to-come-online CPU.

This commit therefore updates RCU's internal CPU state-tracking
information at notify_cpu_starting() time, thus providing RCU with an
exact transition of the CPU's state from offline to online.

Note that this means that incoming CPUs must not use RCU read-side
critical sections (other than those of SRCU) until
notify_cpu_starting() time.  Note also that the CPU_STARTING notifiers
-are- allowed to use RCU read-side critical sections.  (Of course,
CPU-hotplug notifiers are rapidly becoming obsolete, so you need to
act fast!)

If a given architecture or CPU family needs to use RCU read-side
critical sections earlier, the call to rcu_cpu_starting() from
notify_cpu_starting() will need to be architecture-specific, with
architectures that need early use being required to hand-place the
call to rcu_cpu_starting() at some point preceding the call to
notify_cpu_starting().

Signed-off-by: Paul E. McKenney
---
 include/linux/rcupdate.h |  1 +
 kernel/cpu.c             |  1 +
 kernel/rcu/tree.c        | 32 +++++++++++++++++++++++++++++---
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1aa62e1a761b..321f9ed552a9 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -334,6 +334,7 @@ void rcu_sched_qs(void);
 void rcu_bh_qs(void);
 void rcu_check_callbacks(int user);
 void rcu_report_dead(unsigned int cpu);
+void rcu_cpu_starting(unsigned int cpu);
 
 #ifndef CONFIG_TINY_RCU
 void rcu_end_inkernel_boot(void);

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 341bf80f80bd..9482ceb928e0 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -889,6 +889,7 @@ void notify_cpu_starting(unsigned int cpu)
     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
     enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
 
+    rcu_cpu_starting(cpu);  /* All CPU_STARTING notifiers can use RCU. */
     while (st->state < target) {
         struct cpuhp_step *step;

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5d80925e7fc8..d2973fb85e8c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3792,8 +3792,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
     rnp = rdp->mynode;
     mask = rdp->grpmask;
     raw_spin_lock_rcu_node(rnp);    /* irqs already disabled. */
-    rnp->qsmaskinitnext |= mask;
-    rnp->expmaskinitnext |= mask;
     if (!rdp->beenonline)
         WRITE_ONCE(rsp->ncpus, READ_ONCE(rsp->ncpus) + 1);
     rdp->beenonline = true;  /* We have now been online. */
@@ -3860,6 +3858,32 @@ int rcutree_dead_cpu(unsigned int cpu)
     return 0;
 }
 
+/*
+ * Mark the specified CPU as being online so that subsequent grace periods
+ * (both expedited and normal) will wait on it.  Note that this means that
+ * incoming CPUs are not allowed to use RCU read-side critical sections
+ * until this function is called.  Failing to observe this restriction
+ * will result in lockdep splats.
+ */
+void rcu_cpu_starting(unsigned int cpu)
+{
+    unsigned long flags;
+    unsigned long mask;
+    struct rcu_data *rdp;
+    struct rcu_node *rnp;
+    struct rcu_state *rsp;
+
+    for_each_rcu_flavor(rsp) {
+        rdp = this_cpu_ptr(rsp->rda);
+        rnp = rdp->mynode;
+        mask = rdp->grpmask;
+        raw_spin_lock_irqsave_rcu_node(rnp, flags);
+        rnp->qsmaskinitnext |= mask;
+        rnp->expmaskinitnext |= mask;
+        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+    }
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
@@ -4209,8 +4233,10 @@ void __init rcu_init(void)
      * or the scheduler are operational.
      */
     pm_notifier(rcu_pm_notify, 0);
-    for_each_online_cpu(cpu)
+    for_each_online_cpu(cpu) {
         rcutree_prepare_cpu(cpu);
+        rcu_cpu_starting(cpu);
+    }
 }
 
 #include "tree_exp.h"
From 0c6d4576c45736f829dc3390ac95181b2ed21bc7 Mon Sep 17 00:00:00 2001
From: Sebastian Andrzej Siewior
Date: Wed, 17 Aug 2016 14:21:04 +0200
Subject: [PATCH 15/22] cpu/hotplug: Get rid of CPU_STARTING reference

CPU_STARTING is scheduled for removal.  There is no use of it in
drivers and core code uses it only for compatibility with old-style
CPU-hotplug notifiers.  This commit therefore removes the CPU_STARTING
reference from an RCU-related comment.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Paul E. McKenney
---
 kernel/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9482ceb928e0..ff8bc3817dde 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -889,7 +889,7 @@ void notify_cpu_starting(unsigned int cpu)
     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
     enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
 
-    rcu_cpu_starting(cpu);  /* All CPU_STARTING notifiers can use RCU. */
+    rcu_cpu_starting(cpu);  /* Enables RCU usage on this CPU. */
     while (st->state < target) {
         struct cpuhp_step *step;

From 0ffd374b2207a1a0cba9f2dbcc799198482391d5 Mon Sep 17 00:00:00 2001
From: Sebastian Andrzej Siewior
Date: Thu, 18 Aug 2016 14:57:22 +0200
Subject: [PATCH 16/22] rcutorture: Convert to hotplug state machine

Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Cc: Josh Triplett
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/rcutorture.c | 52 +++++++++++------------------------------
 1 file changed, 14 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 971e2b138063..dc9814860645 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1362,12 +1362,12 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag)
          onoff_interval, onoff_holdoff);
 }
 
-static void rcutorture_booster_cleanup(int cpu)
+static int rcutorture_booster_cleanup(unsigned int cpu)
 {
     struct task_struct *t;
 
     if (boost_tasks[cpu] == NULL)
-        return;
+        return 0;
     mutex_lock(&boost_mutex);
     t = boost_tasks[cpu];
     boost_tasks[cpu] = NULL;
@@ -1375,9 +1375,10 @@
     mutex_unlock(&boost_mutex);
 
     /* This must be outside of the mutex, otherwise deadlock! */
     torture_stop_kthread(rcu_torture_boost, t);
+    return 0;
 }
 
-static int rcutorture_booster_init(int cpu)
+static int rcutorture_booster_init(unsigned int cpu)
 {
     int retval;
@@ -1577,28 +1578,7 @@ static void rcu_torture_barrier_cleanup(void)
     }
 }
 
-static int rcutorture_cpu_notify(struct notifier_block *self,
-                                 unsigned long action, void *hcpu)
-{
-    long cpu = (long)hcpu;
-
-    switch (action & ~CPU_TASKS_FROZEN) {
-    case CPU_ONLINE:
-    case CPU_DOWN_FAILED:
-        (void)rcutorture_booster_init(cpu);
-        break;
-    case CPU_DOWN_PREPARE:
-        rcutorture_booster_cleanup(cpu);
-        break;
-    default:
-        break;
-    }
-    return NOTIFY_OK;
-}
-
-static struct notifier_block rcutorture_cpu_nb = {
-    .notifier_call = rcutorture_cpu_notify,
-};
+static enum cpuhp_state rcutor_hp;
 
 static void
 rcu_torture_cleanup(void)
@@ -1638,11 +1618,8 @@ rcu_torture_cleanup(void)
     for (i = 0; i < ncbflooders; i++)
         torture_stop_kthread(rcu_torture_cbflood, cbflood_task[i]);
     if ((test_boost == 1 && cur_ops->can_boost) ||
-        test_boost == 2) {
-        unregister_cpu_notifier(&rcutorture_cpu_nb);
-        for_each_possible_cpu(i)
-            rcutorture_booster_cleanup(i);
-    }
+        test_boost == 2)
+        cpuhp_remove_state(rcutor_hp);
 
     /*
      * Wait for all RCU callbacks to fire, then do flavor-specific
@@ -1869,14 +1846,13 @@ rcu_torture_init(void)
         test_boost == 2) {
 
         boost_starttime = jiffies + test_boost_interval * HZ;
-        register_cpu_notifier(&rcutorture_cpu_nb);
-        for_each_possible_cpu(i) {
-            if (cpu_is_offline(i))
-                continue;  /* Heuristic: CPU can go offline. */
-            firsterr = rcutorture_booster_init(i);
-            if (firsterr)
-                goto unwind;
-        }
+
+        firsterr = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "RCU_TORTURE",
+                                     rcutorture_booster_init,
+                                     rcutorture_booster_cleanup);
+        if (firsterr < 0)
+            goto unwind;
+        rcutor_hp = firsterr;
     }
     firsterr = torture_shutdown_init(shutdown_secs, rcu_torture_cleanup);
     if (firsterr)
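The same conversion applies to any other user of the old notifiers.  A
hedged sketch of the dynamic-state idiom used above, with hypothetical
demo_* names: cpuhp_setup_state() with CPUHP_AP_ONLINE_DYN returns the
allocated state number, which must be saved for later teardown, and it
immediately invokes the online callback on every CPU already up:

    #include <linux/cpuhotplug.h>
    #include <linux/module.h>

    static enum cpuhp_state demo_hp_state;  /* hypothetical driver state */

    static int demo_cpu_online(unsigned int cpu)
    {
        /* Start per-CPU resources; also called for already-online CPUs. */
        return 0;
    }

    static int demo_cpu_down(unsigned int cpu)
    {
        /* Quiesce per-CPU resources before the CPU goes away. */
        return 0;
    }

    static int __init demo_init(void)
    {
        int ret;

        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "demo:online",
                                demo_cpu_online, demo_cpu_down);
        if (ret < 0)
            return ret;
        demo_hp_state = ret;  /* dynamic setup returns the slot number */
        return 0;
    }

    static void __exit demo_exit(void)
    {
        cpuhp_remove_state(demo_hp_state); /* runs teardown on online CPUs */
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");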
From 31257c3c8b7307f106d67345755d937cb5fb8bd4 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Sat, 18 Jun 2016 07:45:43 -0700
Subject: [PATCH 17/22] torture: Convert torture_shutdown() to hrtimer

Upcoming changes to the timer wheel introduce significant inaccuracy
and possibly also an ultimate limit on timeout duration.  This is a
problem for the current implementation of torture_shutdown() because
(1) shutdown times are user-specified, and can therefore be quite
long, and (2) the torture scripting will kill a test instance that
runs for more than a few minutes longer than scheduled.  This commit
therefore converts the torture_shutdown() timed waits to an hrtimer,
thus avoiding too-short torture test runs as well as death by
scripting.

Signed-off-by: Paul E. McKenney
Acked-by: Arnd Bergmann
---
 kernel/torture.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/torture.c b/kernel/torture.c
index 75961b3decfe..0d887eb62856 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -43,6 +43,7 @@
 #include
 #include
 #include
+#include <linux/ktime.h>
 #include
 #include
@@ -446,9 +447,8 @@
  * Variables for auto-shutdown.  This allows "lights out" torture runs
  * to be fully scripted.
  */
-static int shutdown_secs;        /* desired test duration in seconds. */
 static struct task_struct *shutdown_task;
-static unsigned long shutdown_time;    /* jiffies to system shutdown. */
+static ktime_t shutdown_time;        /* time to system shutdown. */
 static void (*torture_shutdown_hook)(void);
@@ -471,20 +471,20 @@ EXPORT_SYMBOL_GPL(torture_shutdown_absorb);
  */
 static int torture_shutdown(void *arg)
 {
-    long delta;
-    unsigned long jiffies_snap;
+    ktime_t ktime_snap;
 
     VERBOSE_TOROUT_STRING("torture_shutdown task started");
-    jiffies_snap = jiffies;
-    while (ULONG_CMP_LT(jiffies_snap, shutdown_time) &&
+    ktime_snap = ktime_get();
+    while (ktime_before(ktime_snap, shutdown_time) &&
            !torture_must_stop()) {
-        delta = shutdown_time - jiffies_snap;
         if (verbose)
             pr_alert("%s" TORTURE_FLAG
-                     "torture_shutdown task: %lu jiffies remaining\n",
-                     torture_type, delta);
-        schedule_timeout_interruptible(delta);
-        jiffies_snap = jiffies;
+                     "torture_shutdown task: %llu ms remaining\n",
+                     torture_type,
+                     ktime_ms_delta(shutdown_time, ktime_snap));
+        set_current_state(TASK_INTERRUPTIBLE);
+        schedule_hrtimeout(&shutdown_time, HRTIMER_MODE_ABS);
+        ktime_snap = ktime_get();
     }
     if (torture_must_stop()) {
         torture_kthread_stopping("torture_shutdown");
@@ -511,10 +511,9 @@ int torture_shutdown_init(int ssecs, void (*cleanup)(void))
 {
     int ret = 0;
 
-    shutdown_secs = ssecs;
     torture_shutdown_hook = cleanup;
-    if (shutdown_secs > 0) {
-        shutdown_time = jiffies + shutdown_secs * HZ;
+    if (ssecs > 0) {
+        shutdown_time = ktime_add(ktime_get(), ktime_set(ssecs, 0));
         ret = torture_create_kthread(torture_shutdown, NULL,
                                      shutdown_task);
     }
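The conversion's idiom is sleeping to an absolute deadline rather than
for a relative number of jiffies, so timer-wheel granularity can no
longer accumulate into large errors.  A minimal sketch, with a
hypothetical demo_sleep_until() helper:

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>
    #include <linux/sched.h>

    /* Hypothetical helper: sleep until an absolute deadline. */
    static void demo_sleep_until(ktime_t deadline)
    {
        while (ktime_before(ktime_get(), deadline)) {
            set_current_state(TASK_INTERRUPTIBLE);
            /* Absolute mode: unaffected by timer-wheel granularity. */
            schedule_hrtimeout(&deadline, HRTIMER_MODE_ABS);
        }
    }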
From 4ffa66992476c94d8b4d33b2c792d336a400ada2 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 30 Jun 2016 11:56:38 -0700
Subject: [PATCH 18/22] torture: Add task state to writer-task stall printk()s

This commit adds a dump of the scheduler state for stalled rcutorture
writer tasks.  This addition provides yet more debug for the
intermittent "failures to proceed", where grace periods move ahead but
the rcutorture writer tasks fail to do so.

Signed-off-by: Paul E. McKenney
---
 kernel/rcu/rcutorture.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 971e2b138063..f0f32f888ec5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1238,6 +1238,7 @@ rcu_torture_stats_print(void)
     long pipesummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 };
     long batchsummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 };
     static unsigned long rtcv_snap = ULONG_MAX;
+    struct task_struct *wtp;
 
     for_each_possible_cpu(cpu) {
         for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) {
@@ -1312,10 +1313,12 @@ rcu_torture_stats_print(void)
         rcutorture_get_gp_data(cur_ops->ttype,
                                &flags, &gpnum, &completed);
-        pr_alert("??? Writer stall state %s(%d) g%lu c%lu f%#x\n",
+        wtp = READ_ONCE(writer_task);
+        pr_alert("??? Writer stall state %s(%d) g%lu c%lu f%#x ->state %#lx\n",
                  rcu_torture_writer_state_getname(),
                  rcu_torture_writer_state,
-                 gpnum, completed, flags);
+                 gpnum, completed, flags,
+                 wtp == NULL ? ~0UL : wtp->state);
         show_rcu_gp_kthreads();
         rcu_ftrace_dump(DUMP_ALL);
     }

From 472213a675e21185416101a77102253f93713fa9 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sat, 13 Aug 2016 15:54:35 +0900
Subject: [PATCH 19/22] rcutorture: Print out barrier error as document says

Tests for rcu_barrier() were introduced by commit fae4b54f28f0 ("rcu:
Introduce rcutorture testing for rcu_barrier()").  That commit updated
the documentation to say that the "rtbe" field in rcutorture's dmesg
output indicates test failure, but it updated only the documentation,
not the code.  This commit therefore updates the code to match the
updated documentation.

Signed-off-by: SeongJae Park
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/rcutorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index f0f32f888ec5..ac29017623e5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1259,8 +1259,9 @@ rcu_torture_stats_print(void)
         atomic_read(&n_rcu_torture_alloc),
         atomic_read(&n_rcu_torture_alloc_fail),
         atomic_read(&n_rcu_torture_free));
-    pr_cont("rtmbe: %d rtbke: %ld rtbre: %ld ",
+    pr_cont("rtmbe: %d rtbe: %ld rtbke: %ld rtbre: %ld ",
         atomic_read(&n_rcu_torture_mberror),
+        n_rcu_torture_barrier_error,
         n_rcu_torture_boost_ktrerror,
         n_rcu_torture_boost_rterror);
     pr_cont("rtbf: %ld rtb: %ld nt: %ld ",
From a56fefa2605cf8e125ef09451487f30336128028 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sun, 21 Aug 2016 16:54:39 +0900
Subject: [PATCH 20/22] rcuperf: Consistently insert space between flag and message

A few rcuperf dmesg output messages have no space between the flag and
the start of the message.  In contrast, every other message
consistently supplies a single space.  This difference makes rcuperf
dmesg output hard to read and to mechanically parse.  This commit
therefore fixes this problem by modifying a pr_alert() call and the
PERFOUT_STRING() macro to supply that single space.

Signed-off-by: SeongJae Park
Signed-off-by: Paul E. McKenney
---
 kernel/rcu/rcuperf.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index d38ab08a3fe7..123ccbd22449 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -52,7 +52,7 @@ MODULE_AUTHOR("Paul E. McKenney ");
 
 #define PERF_FLAG "-perf:"
 #define PERFOUT_STRING(s) \
-    pr_alert("%s" PERF_FLAG s "\n", perf_type)
+    pr_alert("%s" PERF_FLAG " %s\n", perf_type, s)
 #define VERBOSE_PERFOUT_STRING(s) \
     do { if (verbose) pr_alert("%s" PERF_FLAG " %s\n", perf_type, s); } while (0)
 #define VERBOSE_PERFOUT_ERRSTRING(s) \
@@ -400,9 +400,8 @@ rcu_perf_writer(void *arg)
         sp.sched_priority = 0;
         sched_setscheduler_nocheck(current, SCHED_NORMAL, &sp);
-        pr_alert("%s" PERF_FLAG
-                 "rcu_perf_writer %ld has %d measurements\n",
-                 perf_type, me, MIN_MEAS);
+        pr_alert("%s%s rcu_perf_writer %ld has %d measurements\n",
+                 perf_type, PERF_FLAG, me, MIN_MEAS);
         if (atomic_inc_return(&n_rcu_perf_writer_finished) >=
             nrealwriters) {
             schedule_timeout_interruptible(10);

From 489bb3d252d41392ce52590e49f0ae8782fb016e Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Sun, 21 Aug 2016 16:54:40 +0900
Subject: [PATCH 21/22] torture: TOROUT_STRING(): Insert a space between flag and message

The TOROUT_STRING() macro does not insert a space between the flag
and the message.  In contrast, other similar torture-test dmesg
messages consistently supply a single space character.  This
difference makes the output hard to read and to mechanically parse.
This commit therefore adds a space character between flag and message
in TOROUT_STRING() output.

Signed-off-by: SeongJae Park
Signed-off-by: Paul E. McKenney
---
 include/linux/torture.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/torture.h b/include/linux/torture.h
index 6685a73736a2..a45702eb3e7b 100644
--- a/include/linux/torture.h
+++ b/include/linux/torture.h
@@ -43,7 +43,7 @@
 #define TORTURE_FLAG "-torture:"
 #define TOROUT_STRING(s) \
-    pr_alert("%s" TORTURE_FLAG s "\n", torture_type)
+    pr_alert("%s" TORTURE_FLAG " %s\n", torture_type, s)
 #define VERBOSE_TOROUT_STRING(s) \
     do { if (verbose) pr_alert("%s" TORTURE_FLAG " %s\n", torture_type, s); } while (0)
 #define VERBOSE_TOROUT_ERRSTRING(s) \

From 12adfd882c5f37548acaba4f043a158b3c54468b Mon Sep 17 00:00:00 2001
From: Chris Wilson
Date: Sat, 23 Jul 2016 19:27:50 +0100
Subject: [PATCH 22/22] list: Expand list_first_entry_or_null()

Due to the use of READ_ONCE() in list_empty() the compiler cannot
optimise !list_empty() ? list_first_entry() : NULL very well.  By
manually expanding list_first_entry_or_null() we can take advantage
of the READ_ONCE() to avoid the list element changing under the test
while the compiler can generate smaller code.

Signed-off-by: Chris Wilson
Cc: "Paul E. McKenney"
Cc: Andrew Morton
Cc: Dan Williams
Cc: Jan Kara
Cc: Josef Bacik
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney
---
 include/linux/list.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 5183138aa932..5809e9a2de5b 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -381,8 +381,11 @@ static inline void list_splice_tail_init(struct list_head *list,
  *
  * Note that if the list is empty, it returns NULL.
  */
-#define list_first_entry_or_null(ptr, type, member) \
-    (!list_empty(ptr) ? list_first_entry(ptr, type, member) : NULL)
+#define list_first_entry_or_null(ptr, type, member) ({ \
+    struct list_head *head__ = (ptr); \
+    struct list_head *pos__ = READ_ONCE(head__->next); \
+    pos__ != head__ ? list_entry(pos__, type, member) : NULL; \
+})
 
 /**
  * list_next_entry - get the next element in list