sched: fix share (re)distribution

Fix the __aggregate_redistribute_shares() related lockup reported by
David S. Miller.

The problem this code tries to solve is 'accurately' calculating the
'fair' share of the group weight for each cpu. The current code falls
back to a global group rebalance in case the sched_domain's span it
looks at has no shares, but does have tasks.

The reason it gets stuck here is that it's inherently racy - if
someone steals the last task after we compute agg->rq_weight, but
before we rebalance, we'll never get out of the loop.

We could of course go fix that, but while looking at this issue I found
that this 'fallback' wasn't nearly as rare as I'd hoped it to be. In
fact it's quite common - and given that it walks the whole machine,
that's very bad.

The new approach is simple (why didn't I think of it before?): we set
the aggregate shares to the full task group weight, and each larger
sched domain that encounters an aggregate shares value larger than the
weight clips it (it already re-distributes anyway).

This nicely converges to the desired global picture where the sum of
all shares equals the task group weight.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-25 00:25:08 +02:00 · 2008-04-25 00:25:08 +02:00 · 3f5087a2ba
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1656,42 +1656,6 @@ void aggregate_group_weight(struct task_group *tg, struct sched_domain *sd)
 	aggregate(tg, sd)->task_weight = task_weight;
 }

-/*
- * Redistribute tg->shares amongst all tg->cfs_rq[]s.
- */
-static void __aggregate_redistribute_shares(struct task_group *tg)
-{
-	int i, max_cpu = smp_processor_id();
-	unsigned long rq_weight = 0;
-	unsigned long shares, max_shares = 0, shares_rem = tg->shares;
-
-	for_each_possible_cpu(i)
-		rq_weight += tg->cfs_rq[i]->load.weight;
-
-	for_each_possible_cpu(i) {
-		/*
-		 * divide shares proportional to the rq_weights.
-		 */
-		shares = tg->shares * tg->cfs_rq[i]->load.weight;
-		shares /= rq_weight + 1;
-
-		tg->cfs_rq[i]->shares = shares;
-
-		if (shares > max_shares) {
-			max_shares = shares;
-			max_cpu = i;
-		}
-		shares_rem -= shares;
-	}
-
-	/*
-	 * Ensure it all adds up to tg->shares; we can loose a few
-	 * due to rounding down when computing the per-cpu shares.
-	 */
-	if (shares_rem)
-		tg->cfs_rq[max_cpu]->shares += shares_rem;
-}
-
 /*
  * Compute the weight of this group on the given cpus.
  */
@@ -1701,18 +1665,11 @@ void aggregate_group_shares(struct task_group *tg, struct sched_domain *sd)
 	unsigned long shares = 0;
 	int i;

-again:
 	for_each_cpu_mask(i, sd->span)
 		shares += tg->cfs_rq[i]->shares;

-	/*
-	 * When the span doesn't have any shares assigned, but does have
-	 * tasks to run do a machine wide rebalance (should be rare).
-	 */
-	if (unlikely(!shares && aggregate(tg, sd)->rq_weight)) {
-		__aggregate_redistribute_shares(tg);
-		goto again;
-	}
+	if ((!shares && aggregate(tg, sd)->rq_weight) || shares > tg->shares)
+		shares = tg->shares;

 	aggregate(tg, sd)->shares = shares;
 }
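
The clipping rule added in the last hunk can be exercised outside the
kernel. What follows is a minimal userspace sketch, not kernel code: the
helper name span_shares() is hypothetical, a plain array stands in for
tg->cfs_rq[i]->shares over one span, and rq_weight and tg_shares are
passed in as arguments instead of being read from the aggregate and the
task group.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for aggregate_group_shares(): sum the per-cpu
 * shares of one span, then apply the clipping rule from the patch.
 *
 * cpu_shares: per-cpu shares of the span (assumed layout, illustration only)
 * rq_weight:  runqueue weight of the group within the span
 * tg_shares:  total shares of the task group
 */
static unsigned long span_shares(const unsigned long *cpu_shares, size_t ncpus,
				 unsigned long rq_weight,
				 unsigned long tg_shares)
{
	unsigned long shares = 0;
	size_t i;

	assert(cpu_shares != NULL || ncpus == 0);

	for (i = 0; i < ncpus; i++)
		shares += cpu_shares[i];

	/*
	 * Either the span has tasks but no shares yet, or it claims more
	 * than the whole group owns: in both cases use the full group
	 * weight and leave the correction to the larger domains.
	 */
	if ((!shares && rq_weight) || shares > tg_shares)
		shares = tg_shares;

	return shares;
}
```

With tg_shares = 1024, a span whose cpus hold {0, 0} shares but a
non-zero rq_weight yields 1024, a span holding {800, 800} is clipped
from 1600 to 1024, and a span holding {300, 200} passes through as 500.
There is no loop and no machine-wide walk, so the race that caused the
lockup cannot keep this path spinning.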