sched: remove old sched doc
Fabio Checconi noticed that Documentation/scheduler/sched-design.txt was a stale copy of the old scheduler. Remove it. Reported-by: Fabio Checconi <fabio@gandalf.sssup.it> Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit is contained in:
Родитель
2abdad0a4c
Коммит
733a0771df
|
@ -1,165 +0,0 @@
|
||||||
Goals, Design and Implementation of the
|
|
||||||
new ultra-scalable O(1) scheduler
|
|
||||||
|
|
||||||
|
|
||||||
This is an edited version of an email Ingo Molnar sent to
|
|
||||||
lkml on 4 Jan 2002. It describes the goals, design, and
|
|
||||||
implementation of Ingo's new ultra-scalable O(1) scheduler.
|
|
||||||
Last Updated: 18 April 2002.
|
|
||||||
|
|
||||||
|
|
||||||
Goal
|
|
||||||
====
|
|
||||||
|
|
||||||
The main goal of the new scheduler is to keep all the good things we know
|
|
||||||
and love about the current Linux scheduler:
|
|
||||||
|
|
||||||
- good interactive performance even during high load: if the user
|
|
||||||
types or clicks then the system must react instantly and must execute
|
|
||||||
the user tasks smoothly, even during considerable background load.
|
|
||||||
|
|
||||||
- good scheduling/wakeup performance with 1-2 runnable processes.
|
|
||||||
|
|
||||||
- fairness: no process should stay without any timeslice for any
|
|
||||||
unreasonable amount of time. No process should get an unjustly high
|
|
||||||
amount of CPU time.
|
|
||||||
|
|
||||||
- priorities: less important tasks can be started with lower priority,
|
|
||||||
more important tasks with higher priority.
|
|
||||||
|
|
||||||
- SMP efficiency: no CPU should stay idle if there is work to do.
|
|
||||||
|
|
||||||
- SMP affinity: processes which run on one CPU should stay affine to
|
|
||||||
that CPU. Processes should not bounce between CPUs too frequently.
|
|
||||||
|
|
||||||
- plus additional scheduler features: RT scheduling, CPU binding.
|
|
||||||
|
|
||||||
and the goal is also to add a few new things:
|
|
||||||
|
|
||||||
- fully O(1) scheduling. Are you tired of the recalculation loop
|
|
||||||
blowing the L1 cache away every now and then? Do you think the goodness
|
|
||||||
loop is taking a bit too long to finish if there are lots of runnable
|
|
||||||
processes? This new scheduler takes no prisoners: wakeup(), schedule(),
|
|
||||||
the timer interrupt are all O(1) algorithms. There is no recalculation
|
|
||||||
loop. There is no goodness loop either.
|
|
||||||
|
|
||||||
- 'perfect' SMP scalability. With the new scheduler there is no 'big'
|
|
||||||
runqueue_lock anymore - it's all per-CPU runqueues and locks - two
|
|
||||||
tasks on two separate CPUs can wake up, schedule and context-switch
|
|
||||||
completely in parallel, without any interlocking. All
|
|
||||||
scheduling-relevant data is structured for maximum scalability.
|
|
||||||
|
|
||||||
- better SMP affinity. The old scheduler has a particular weakness that
|
|
||||||
causes the random bouncing of tasks between CPUs if/when higher
|
|
||||||
priority/interactive tasks, this was observed and reported by many
|
|
||||||
people. The reason is that the timeslice recalculation loop first needs
|
|
||||||
every currently running task to consume its timeslice. But when this
|
|
||||||
happens on eg. an 8-way system, then this property starves an
|
|
||||||
increasing number of CPUs from executing any process. Once the last
|
|
||||||
task that has a timeslice left has finished using up that timeslice,
|
|
||||||
the recalculation loop is triggered and other CPUs can start executing
|
|
||||||
tasks again - after having idled around for a number of timer ticks.
|
|
||||||
The more CPUs, the worse this effect.
|
|
||||||
|
|
||||||
Furthermore, this same effect causes the bouncing effect as well:
|
|
||||||
whenever there is such a 'timeslice squeeze' of the global runqueue,
|
|
||||||
idle processors start executing tasks which are not affine to that CPU.
|
|
||||||
(because the affine tasks have finished off their timeslices already.)
|
|
||||||
|
|
||||||
The new scheduler solves this problem by distributing timeslices on a
|
|
||||||
per-CPU basis, without having any global synchronization or
|
|
||||||
recalculation.
|
|
||||||
|
|
||||||
- batch scheduling. A significant proportion of computing-intensive tasks
|
|
||||||
benefit from batch-scheduling, where timeslices are long and processes
|
|
||||||
are roundrobin scheduled. The new scheduler does such batch-scheduling
|
|
||||||
of the lowest priority tasks - so nice +19 jobs will get
|
|
||||||
'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
|
|
||||||
in essence SCHED_IDLE, from an interactiveness point of view.
|
|
||||||
|
|
||||||
- handle extreme loads more smoothly, without breakdown and scheduling
|
|
||||||
storms.
|
|
||||||
|
|
||||||
- O(1) RT scheduling. For those RT folks who are paranoid about the
|
|
||||||
O(nr_running) property of the goodness loop and the recalculation loop.
|
|
||||||
|
|
||||||
- run fork()ed children before the parent. Andrea has pointed out the
|
|
||||||
advantages of this a few months ago, but patches for this feature
|
|
||||||
do not work with the old scheduler as well as they should,
|
|
||||||
because idle processes often steal the new child before the fork()ing
|
|
||||||
CPU gets to execute it.
|
|
||||||
|
|
||||||
|
|
||||||
Design
|
|
||||||
======
|
|
||||||
|
|
||||||
The core of the new scheduler contains the following mechanisms:
|
|
||||||
|
|
||||||
- *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
|
|
||||||
array and an 'expired' array. The active array contains all tasks that
|
|
||||||
are affine to this CPU and have timeslices left. The expired array
|
|
||||||
contains all tasks which have used up their timeslices - but this array
|
|
||||||
is kept sorted as well. The active and expired array is not accessed
|
|
||||||
directly, it's accessed through two pointers in the per-CPU runqueue
|
|
||||||
structure. If all active tasks are used up then we 'switch' the two
|
|
||||||
pointers and from now on the ready-to-go (former-) expired array is the
|
|
||||||
active array - and the empty active array serves as the new collector
|
|
||||||
for expired tasks.
|
|
||||||
|
|
||||||
- there is a 64-bit bitmap cache for array indices. Finding the highest
|
|
||||||
priority task is thus a matter of two x86 BSFL bit-search instructions.
|
|
||||||
|
|
||||||
the split-array solution enables us to have an arbitrary number of active
|
|
||||||
and expired tasks, and the recalculation of timeslices can be done
|
|
||||||
immediately when the timeslice expires. Because the arrays are always
|
|
||||||
access through the pointers in the runqueue, switching the two arrays can
|
|
||||||
be done very quickly.
|
|
||||||
|
|
||||||
this is a hybride priority-list approach coupled with roundrobin
|
|
||||||
scheduling and the array-switch method of distributing timeslices.
|
|
||||||
|
|
||||||
- there is a per-task 'load estimator'.
|
|
||||||
|
|
||||||
one of the toughest things to get right is good interactive feel during
|
|
||||||
heavy system load. While playing with various scheduler variants i found
|
|
||||||
that the best interactive feel is achieved not by 'boosting' interactive
|
|
||||||
tasks, but by 'punishing' tasks that want to use more CPU time than there
|
|
||||||
is available. This method is also much easier to do in an O(1) fashion.
|
|
||||||
|
|
||||||
to establish the actual 'load' the task contributes to the system, a
|
|
||||||
complex-looking but pretty accurate method is used: there is a 4-entry
|
|
||||||
'history' ringbuffer of the task's activities during the last 4 seconds.
|
|
||||||
This ringbuffer is operated without much overhead. The entries tell the
|
|
||||||
scheduler a pretty accurate load-history of the task: has it used up more
|
|
||||||
CPU time or less during the past N seconds. [the size '4' and the interval
|
|
||||||
of 4x 1 seconds was found by lots of experimentation - this part is
|
|
||||||
flexible and can be changed in both directions.]
|
|
||||||
|
|
||||||
the penalty a task gets for generating more load than the CPU can handle
|
|
||||||
is a priority decrease - there is a maximum amount to this penalty
|
|
||||||
relative to their static priority, so even fully CPU-bound tasks will
|
|
||||||
observe each other's priorities, and will share the CPU accordingly.
|
|
||||||
|
|
||||||
the SMP load-balancer can be extended/switched with additional parallel
|
|
||||||
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
|
|
||||||
can be supported easily by changing the load-balancer. Right now it's
|
|
||||||
tuned for my SMP systems.
|
|
||||||
|
|
||||||
i skipped the prev->mm == next->mm advantage - no workload i know of shows
|
|
||||||
any sensitivity to this. It can be added back by sacrificing O(1)
|
|
||||||
schedule() [the current and one-lower priority list can be searched for a
|
|
||||||
that->mm == current->mm condition], but costs a fair number of cycles
|
|
||||||
during a number of important workloads, so i wanted to avoid this as much
|
|
||||||
as possible.
|
|
||||||
|
|
||||||
- the SMP idle-task startup code was still racy and the new scheduler
|
|
||||||
triggered this. So i streamlined the idle-setup code a bit. We do not call
|
|
||||||
into schedule() before all processors have started up fully and all idle
|
|
||||||
threads are in place.
|
|
||||||
|
|
||||||
- the patch also cleans up a number of aspects of sched.c - moves code
|
|
||||||
into other areas of the kernel where it's appropriate, and simplifies
|
|
||||||
certain code paths and data constructs. As a result, the new scheduler's
|
|
||||||
code is smaller than the old one.
|
|
||||||
|
|
||||||
Ingo
|
|
Загрузка…
Ссылка в новой задаче