sched/cputime: Improve scalability by not accounting thread group tasks pending runtime

Commit: d670ec1317 ("posix-cpu-timers: Cure SMP wobbles") started accounting thread group tasks pending runtime in thread_group_cputime(). Another commit: 6e998916df ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") updated scheduler runtime statistics (call update_curr()) when reading task pending runtime. Those changes cause bad performance of SYS_times() and SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on larger systems with many CPUs. While we would like to have cpuclock monotonicity kept i.e. have problems fixed by above commits stay fixed, we also would like to have good performance. However when we notice that change from commit d670ec1317 is not longer needed to solve problem addressed by that commit, because of change from the second commit 6e998916df, we can get room for optimization. Since we update task while reading it's pending runtime in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values and on testcase from d670ec1317 process cpuclock will not be smaller than thread cpuclock. I tested the patch on testcases from commits d670ec1317, 6e998916df and some other cpuclock/cputimers testcases and did not found cpuclock monotonicity problems or other malfunction. This patch has the drawback that we will not provide thread group cputime up-to-date to the last moment. For example when arming cputime timer, we will arm it with possibly a bit outdated values and that timer will trigger earlier compared to behaviour without the patch. However that was the behaviour before d670ec1317 commit (kernel v3.1) so it's unlikely to affect applications. Patch improves related syscall performance, as measured by Giovanni's benchmarks described in commit: 6075620b05 ("sched/cputime: Mitigate performance regression in times()/clock_gettime()") The benchmark results are: SYS_clock_gettime(): threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch (pre-6e998916dfe3) 2 3.48 2.23 ( 35.68%) 3.06 ( 11.83%) 1.08 ( 68.81%) 5 3.33 2.83 ( 14.84%) 3.25 ( 2.40%) 0.71 ( 78.55%) 8 3.37 2.84 ( 15.80%) 3.26 ( 3.30%) 0.56 ( 83.49%) 12 3.32 3.09 ( 6.69%) 3.37 ( -1.60%) 0.42 ( 87.28%) 21 4.01 3.14 ( 21.70%) 3.90 ( 2.74%) 0.35 ( 91.35%) 30 3.63 3.28 ( 9.75%) 3.36 ( 7.41%) 0.28 ( 92.23%) 48 3.71 3.02 ( 18.69%) 3.11 ( 16.27%) 0.39 ( 89.39%) 79 3.75 2.88 ( 23.23%) 3.16 ( 15.74%) 0.46 ( 87.76%) 110 3.81 2.95 ( 22.62%) 3.25 ( 14.80%) 0.56 ( 85.41%) 128 3.88 3.05 ( 21.28%) 3.31 ( 14.76%) 0.62 ( 84.10%) SYS_times(): threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch (pre-6e998916dfe3) 2 3.65 2.27 ( 37.94%) 3.25 ( 11.03%) 1.62 ( 55.71%) 5 3.45 2.78 ( 19.34%) 3.17 ( 7.92%) 2.33 ( 32.28%) 8 3.52 2.79 ( 20.66%) 3.22 ( 8.69%) 2.06 ( 41.44%) 12 3.29 3.02 ( 8.33%) 3.36 ( -2.04%) 2.00 ( 39.18%) 21 4.07 3.10 ( 23.86%) 3.92 ( 3.78%) 2.07 ( 49.18%) 30 3.87 3.33 ( 13.80%) 3.40 ( 12.17%) 1.89 ( 51.12%) 48 3.79 2.96 ( 21.94%) 3.16 ( 16.61%) 1.69 ( 55.46%) 79 3.88 2.88 ( 25.82%) 3.28 ( 15.42%) 1.60 ( 58.81%) 110 3.90 2.98 ( 23.73%) 3.38 ( 13.35%) 1.73 ( 55.61%) 128 4.00 3.10 ( 22.40%) 3.38 ( 15.45%) 1.66 ( 58.52%) Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-17 11:30:44 +02:00 · 2016-08-17 11:30:44 +02:00 · a1eb1411b4
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@ -306,6 +306,26 @@ static inline cputime_t account_other_time(cputime_t max)
 	return accounted;
 }

+#ifdef CONFIG_64BIT
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	return t->se.sum_exec_runtime;
+}
+#else
+static u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	u64 ns;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(t, &rf);
+	ns = t->se.sum_exec_runtime;
+	task_rq_unlock(rq, t, &rf);
+
+	return ns;
+}
+#endif
+
 /*
 * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
 * tasks (sum on group iteration) belonging to @tsk's group.
@ -318,6 +338,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;

+	/*
+	 * Update current task runtime to account pending time since last
+	 * scheduler action or thread_group_cputime() call. This thread group
+	 * might have other running tasks on different CPUs, but updating
+	 * their runtime can affect syscall performance, so we skip account
+	 * those pending times and rely only on values updated on tick or
+	 * other scheduler action.
+	 */
+	if (same_thread_group(current, tsk))
+		(void) task_sched_runtime(current);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@ -332,7 +363,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;