WSL2-Linux-Kernel/kernel/sched
Mathias Krause 512e21c150 sched/fair: Prevent dead task groups from regaining cfs_rq's
[ Upstream commit b027789e5e ]

Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                           :
 (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                    :
 (2)  list_del_rcu(&tg->siblings);                         :
      :                                                    tg_unthrottle_up():
      unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                    :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
        :                                                    :
        :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
 (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
      :                                                      :
      :                                                    :
      :                                                  :
      :                                                :
      :                                              :
 (4)  :                                              rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

  [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
  [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25 09:48:32 +01:00
..
Makefile sched: Trivial core scheduling cookie management 2021-05-12 11:43:31 +02:00
autogroup.c sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-25 09:48:32 +01:00
autogroup.h
clock.c
completion.c
core.c sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-25 09:48:32 +01:00
core_sched.c sched: prctl() core-scheduling interface 2021-05-12 11:43:31 +02:00
cpuacct.c sched: Wrap rq::lock access 2021-05-12 11:43:26 +02:00
cpudeadline.c
cpudeadline.h
cpufreq.c
cpufreq_schedutil.c cpufreq: schedutil: Use kobject release() method to free sugov_tunables 2021-08-06 15:34:55 +02:00
cpupri.c
cpupri.h
cputime.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
deadline.c sched/deadline: Fix missing clock update in migrate_task_rq_dl() 2021-08-06 14:25:24 +02:00
debug.c sched/fair: Null terminate buffer when updating tunable_scaling 2021-10-01 13:57:57 +02:00
fair.c sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-25 09:48:32 +01:00
features.h
idle.c sched/idle: Make the idle timer expire in hard interrupt context 2021-09-09 10:36:16 +02:00
isolation.c sched/isolation: Reconcile rcu_nocbs= and nohz_full= 2021-05-13 14:12:47 +02:00
loadavg.c sched: Make multiple runqueue task counters 32-bit 2021-05-12 21:34:17 +02:00
membarrier.c
pelt.c
pelt.h Merge branch 'sched/urgent' into sched/core, to resolve conflicts 2021-06-18 11:31:25 +02:00
psi.c Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2021-07-01 17:22:14 -07:00
rt.c sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-25 09:48:32 +01:00
sched-pelt.h
sched.h sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-25 09:48:32 +01:00
smp.h
stats.c
stats.h sched: Introduce task_is_running() 2021-06-18 11:43:07 +02:00
stop_task.c sched: Introduce sched_class::pick_task() 2021-05-12 11:43:28 +02:00
swait.c
topology.c sched/topology: Skip updating masks for non-online nodes 2021-08-20 12:32:57 +02:00
wait.c rq-qos: fix missed wake-ups in rq_qos_throttle try two 2021-06-08 15:12:57 -06:00
wait_bit.c